Introduction to Hadoop HDFS (and writing to it with node.js)

The core Hadoop project solves two problems with big data - fast, reliable storage and batch processing.
For this first post on Hadoop we are going to focus on the default storage engine and how to integrate with it using its REST API. Hadoop is actually quite easy to install, so let’s see what we can do in 15 minutes. I’ve assumed some knowledge of the Unix shell but hopefully it’s not too difficult to follow - the software versions are listed in the previous post.

If you’re completely new to Hadoop three things worth knowing are...

  • The default storage engine is HDFS - a distributed file system with directories and files (ls, mkdir, rm, etc)
  • Data written to HDFS is immutable - although there is some support for appends
  • HDFS is suited for large files - avoid lots of small files

If you think about batch processing billions of records, large and immutable files make sense. You don’t want the disk spending time doing random access and dealing with fragmented data if you can stream the whole lot from beginning to end. Files are split into blocks so that nodes can process files in parallel using map-reduce. By default a Hadoop cluster will replicate each file block to 3 nodes, and each block can hold up to the configured block size (64MB by default) - a 200MB file, for example, is stored as three full 64MB blocks plus one 8MB block, each replicated three times.

Starting up a local Hadoop instance for development is pretty simple, and even easier here as we’re only going to start half of it - just HDFS, not the map-reduce daemons. The only setting that’s needed is the host and port where the HDFS master ‘namenode’ will live, but we’ll add a property for the location of the filesystem too.

After downloading and unpacking Hadoop, add the following between the <configuration> tags in core-site.xml...

conf/core-site.xml:

<property>
  <name>fs.default.name</name>
  <value>hdfs://localhost:9000</value>
</property>
<property>
  <name>hadoop.tmp.dir</name>
  <value>/home/${user.name}/hdfs-filesystem</value>
</property>

Add your Hadoop bin directory to the PATH

export PATH=$PWD/hadoop-1.0.4/bin:$PATH

The only other step before starting Hadoop is to format the filesystem...

hadoop namenode -format

Hadoop normally runs with a master and many slaves. The master 'namenode' tracks the location of file blocks and the files they represent, while the slave 'datanodes' just store the blocks. To start with we’ll run both a master and a slave on the same machine...

# start the master namenode
hadoop-daemon.sh start namenode
# start a slave datanode
hadoop-daemon.sh start datanode

At this point we can write some data to the filesystem...

hadoop dfs -put /etc/hosts /test/hosts-file
hadoop dfs -ls /test

You can check that the Hadoop daemons are running correctly by running jps (Java ps). They can be stopped cleanly with hadoop-daemon.sh stop namenode and hadoop-daemon.sh stop datanode, or more quickly with killall java - do this now.

To add data we’ll be using the WebHDFS REST API with node.js as both are simple but still fast.

First we need to enable the WebHDFS and Append features in HDFS. Append has some issues and has been disabled in 1.1.x so make sure you are using 1.0.4. It should be back in 2.x and should be fine for our use - this is what Hadoop development is like! Add the following properties...

conf/hdfs-site.xml:

<property>
  <name>dfs.webhdfs.enabled</name>
  <value>true</value>
</property>
<property>
  <name>dfs.support.append</name>
  <value>true</value>
</property>

Restart HDFS...

killall java
hadoop-daemon.sh start namenode && hadoop-daemon.sh start datanode

Before loading data we need to create the file that will store the JSON. We’ll append all incoming data to this file...

hadoop dfs -touchz /test/feed.data
hadoop dfs -ls /test

If you see the message "Name node is in safe mode" then just wait for a minute as the namenode is still starting up.

Next download node.js (http://nodejs.org/download/) - if you’re using Unix you can ‘export PATH’ in the same way we did for Hadoop.

export PATH=$PWD/node-v0.10.0-linux-x64/bin/:$PATH
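
With node on the PATH we can already talk to WebHDFS using nothing but the built-in http module. As a quick sanity check, something like the sketch below should print the metadata of the feed file we just created - it assumes the namenode web interface is on its default port of 50070, and the FileStatus in the response includes the blockSize and replication mentioned earlier. Drop it in a file and run it with node...

// ask the namenode for the status of /test/feed.data via the WebHDFS REST API
var http = require("http");

var url = "http://localhost:50070/webhdfs/v1/test/feed.data?op=GETFILESTATUS&user.name=" + process.env.USER;

http.get(url, function (res) {
  var body = "";
  res.on("data", function (chunk) { body += chunk; });
  res.on("end", function () {
    // prints JSON containing a FileStatus with length, blockSize, replication, owner, ...
    console.log(body);
  });
}).on("error", console.log);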

Scripting in node.js is very quick thanks to the large number of packages developed by users. Obviously the quality can vary but for quick prototypes there always seems to be a package for anything. All we need to start is an empty directory where the packages and our script will be installed. I’ve picked three packages that will help us...

mkdir hdfs-example
cd hdfs-example
npm install node-webhdfs
npm install ntwitter
npm install syncqueue

The webhdfs and twitter packages are obvious, but I’ve also used the syncqueue package so that only one append request is in flight at a time - JavaScript I/O is asynchronous and HDFS only allows a single writer per file. To use these, create a file named twitter.js and add...

var hdfs = new (require("node-webhdfs")).WebHDFSClient({ user: process.env.USER, namenode_host: "localhost", namenode_port: 50070 });
var twitter = require("ntwitter");
var SyncQueue = require("syncqueue");
var hdfsFile = "/test/feed.data";

// make appending synchronous
var queue = new SyncQueue();

// get your developer keys from: https://dev.twitter.com/apps/new
var twit = new twitter({
  consumer_key: "keykeykeykeykeykey",
  consumer_secret: "secretsecretsecretsecret",
  access_token_key: "keykeykeykeykeykey",
  access_token_secret: "secretsecretsecretsecret"
});

// stream tweets matching the filter and append each one to the file in HDFS
twit.stream("statuses/filter", {"track":"hadoop,big data"}, function(stream) {
  stream.on("data", function (data) {
    // queue the append so only one request is in flight at a time
    queue.push(function(done) {
      console.log(data.text);
      // append the tweet as a line of JSON so the file can be split on newlines later
      hdfs.append(hdfsFile, JSON.stringify(data) + "\n", function (err, success) {
        if (err instanceof Error) { console.log(err); }
        done();
      });
    });
  });
});

And run...

node twitter.js
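
Once it’s running you can confirm that tweets are actually landing in the file - hadoop dfs -cat /test/feed.data will show them, or, sticking with the REST API, a sketch using the OPEN operation will stream the file back. The namenode answers an OPEN with a 307 redirect to a datanode that holds the blocks, so we follow the Location header ourselves as http.get doesn’t follow redirects...

// read /test/feed.data back over WebHDFS - the namenode redirects to a datanode,
// which then streams the file content
var http = require("http");

var url = "http://localhost:50070/webhdfs/v1/test/feed.data?op=OPEN&user.name=" + process.env.USER;

http.get(url, function (res) {
  res.resume(); // the redirect response has no useful body
  if (!res.headers.location) { return console.log("no redirect - is the file still empty?"); }
  http.get(res.headers.location, function (dataRes) {
    dataRes.pipe(process.stdout);
  }).on("error", console.log);
}).on("error", console.log);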

Now sit back and watch the data flow - here we’re filtering on “hadoop,big data” but you might want to choose a different query or even a different source - e.g. tail a local log file, call a web service, or run a web server.
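
As a sketch of the log file idea, node’s built-in readline module can read lines from stdin, so something like tail -f /var/log/syslog | node tail.js would feed a log into the same append pattern (the file name tail.js and the choice of readline are just one way to do it - the HDFS client and queue are exactly as above)...

// tail.js - append each line arriving on stdin to the feed file in HDFS
var hdfs = new (require("node-webhdfs")).WebHDFSClient({ user: process.env.USER, namenode_host: "localhost", namenode_port: 50070 });
var SyncQueue = require("syncqueue");
var readline = require("readline");

var hdfsFile = "/test/feed.data";

// as before, make the appends synchronous so only one is in flight at a time
var queue = new SyncQueue();

var rl = readline.createInterface({ input: process.stdin, terminal: false });

rl.on("line", function (line) {
  queue.push(function (done) {
    hdfs.append(hdfsFile, line + "\n", function (err, success) {
      if (err instanceof Error) { console.log(err); }
      done();
    });
  });
});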