Introduction to Hadoop HDFS (and writing to it with node.js)

March 22nd, 2013

The core Hadoop project solves two problems with big data – fast, reliable storage and batch processing.
For this first post on Hadoop we are going to focus on the default storage engine and how to integrate with it using its REST API. Hadoop is actually quite easy to install, so let’s see what we can do in 15 minutes. I’ve assumed some knowledge of the Unix shell but hopefully it’s not too difficult to follow – the software versions are listed in the previous post.

If you’re completely new to Hadoop three things worth knowing are…

  • The default storage engine is HDFS – a distributed file system with directories and files (ls, mkdir, rm, etc.)
  • Data written to HDFS is immutable – although there is some support for appends
  • HDFS is suited for large files – avoid lots of small files

If you think about batch processing billions of records, large and immutable files make sense. You don’t want the disk spending time doing random access and dealing with fragmented data if you can stream the whole lot from beginning to end.
Files are split into blocks so that nodes can process them in parallel using map-reduce. By default a Hadoop cluster replicates each block to three nodes, and each block can hold up to the configured block size (64MB by default).

Starting up a local Hadoop instance for development is pretty simple, and even easier here as we’re only going to start half of it (HDFS, without map-reduce). The only setting that’s strictly needed is the host and port where the HDFS master ‘namenode’ will listen, but we’ll add a property for where the filesystem data is stored too.

After downloading and unpacking Hadoop add the following under the <configuration> tags in core-site.xml…

conf/core-site.xml:
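
Something along these lines is enough for a single-node setup – fs.default.name tells clients where the namenode lives and hadoop.tmp.dir controls where the filesystem data ends up on the local disk. The port and directory below are just placeholder choices:

    <property>
      <name>fs.default.name</name>
      <value>hdfs://localhost:9000</value>
    </property>
    <property>
      <name>hadoop.tmp.dir</name>
      <value>/path/to/hdfs-data</value>
    </property>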

Add your Hadoop bin directory to the PATH
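
For example, assuming Hadoop was unpacked into your home directory (adjust the path to match wherever you put it):

    export PATH=$PATH:~/hadoop-1.0.4/bin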

The only other step before starting Hadoop is to format the filesystem…
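
With the bin directory on the PATH this is a one-liner – it only needs to be run once, before the first start:

    hadoop namenode -format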

Hadoop normally runs with a master and many slaves. The master ‘namenode’ tracks the location of file blocks and the files they represent and the slave ‘datanodes’ just store file blocks. To start with we’ll run both a master and a slave on the same machine…
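
One simple way (and the one that makes the ctrl-c shutdown mentioned below work) is to run each daemon in the foreground in its own terminal; bin/start-dfs.sh would do much the same job as background daemons:

    hadoop namenode
    hadoop datanode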

At this point we can write some data to the filesystem…
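
For example, copy a local file in and list it back out – the paths here are just arbitrary examples:

    hadoop fs -mkdir /user/hadoop
    hadoop fs -put /etc/hosts /user/hadoop/hosts.txt
    hadoop fs -ls /user/hadoop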

You can check that the Hadoop daemons are running correctly by running jps (Java ps). Shutting them down can be done quickly with a ctrl-c or killall java – do this now.
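
For example:

    jps            # should list a NameNode and a DataNode process
    killall java   # quick way to stop everything before the config changes below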

To add data we’ll be using the WebHDFS REST API with node.js, as both are simple but still fast.

First we need to enable the WebHDFS and Append features in HDFS. Append has some issues and has been disabled in 1.1.x so make sure you are using 1.0.4. It should be back in 2.x and should be fine for our use – this is what Hadoop development is like! Add the following properties…

conf/hdfs-site.xml:
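
These should be the right property names for 1.0.4; they go inside the <configuration> tags just like core-site.xml:

    <property>
      <name>dfs.webhdfs.enabled</name>
      <value>true</value>
    </property>
    <property>
      <name>dfs.support.append</name>
      <value>true</value>
    </property>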

Restart HDFS…
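
Having stopped the daemons earlier, just start them again as before:

    hadoop namenode
    hadoop datanode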

Before loading data we need to create the file that will store the JSON. We’ll append all incoming data to this file…
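
One way to do this is hadoop fs -touchz, which creates an empty file; the path is only an example, but it’s the one the node.js script further down would need to point at:

    hadoop fs -touchz /user/hadoop/tweets.json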

If you see the message “Name node is in safe mode” then just wait for a minute as the namenode is still starting up.

Next download node.js (http://nodejs.org/download/) – if you’re using Unix you can ‘export PATH’ in the same way we did for Hadoop.

Scripting in node.js is very quick thanks to the large number of packages developed by users. Obviously the quality can vary but for quick prototypes there always seems to be a package for anything. All we need to start is an empty directory where the packages and our script will be installed. I’ve picked three packages that will help us…
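
From inside that empty directory the three packages (described in the next paragraph) can be pulled in with npm:

    npm install webhdfs twitter syncqueue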

The webhdfs and twitter packages are obvious, but I’ve also used the syncqueue package so that only one append command is sent at a time – JavaScript is asynchronous. To use these, create a file named twitter.js and add…
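
The listing below is a sketch of the idea rather than a definitive script: the Twitter credentials, the /user/hadoop/tweets.json path and the webhdfs client options (createClient()/appendFile() against the namenode’s default WebHDFS port, 50070) are assumptions, and in place of the syncqueue package it inlines a few lines that do the same job – making sure only one append is in flight at a time.

    // twitter.js – stream tweets and append them to a file in HDFS over WebHDFS
    var WebHDFS = require('webhdfs');
    var Twitter = require('twitter');

    // Assumed webhdfs client options – user, host and port are the usual
    // single-node defaults (WebHDFS listens on the namenode's HTTP port, 50070).
    var hdfs = WebHDFS.createClient({
      user: 'hadoop',
      host: 'localhost',
      port: 50070,
      path: '/webhdfs/v1'
    });

    // Fill in your own Twitter application credentials here.
    var twitter = new Twitter({
      consumer_key: 'XXX',
      consumer_secret: 'XXX',
      access_token_key: 'XXX',
      access_token_secret: 'XXX'
    });

    // A tiny serial queue standing in for the syncqueue package:
    // jobs run one at a time, so only a single append is ever in flight.
    var pending = [];
    var busy = false;

    function enqueue(job) {
      pending.push(job);
      next();
    }

    function next() {
      if (busy || pending.length === 0) return;
      busy = true;
      var job = pending.shift();
      job(function () {
        busy = false;
        next();
      });
    }

    // Stream tweets matching the filter and append each one to HDFS as a line of JSON.
    twitter.stream('statuses/filter', { track: 'hadoop,big data' }, function (stream) {
      stream.on('data', function (tweet) {
        enqueue(function (done) {
          hdfs.appendFile('/user/hadoop/tweets.json', JSON.stringify(tweet) + '\n', function (err) {
            if (err) console.error('append failed:', err);
            done();
          });
        });
      });
      stream.on('error', function (err) {
        console.error('stream error:', err);
      });
    });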

Then run it with node twitter.js.

Now sit back and watch the data flow – here we’re filtering on “hadoop,big data” but you might want to choose a different query or even a different source – e.g. tail a local log file, call a web service, or run a webserver.


Comments

  1. Jullin Egbuji Says:

    I see Hadoop listed in NQSConfig.ini as a DB Dynamic Link Library. Are you going to demo this connectivity or blog about it so we can see more use via the RPD? Will you be at ODTUG in New Orleans? We have Hadoop and OBIEE but have been wanting to see whether it is possible to have Hadoop as a data source.

  2. Mark Rittman Says:

    Jullin,

    Hadoop was in there, as you say, with 11.1.1.6 but it wasn’t supported by Oracle, or working fully. It is supported and working in 11.1.1.7 though, which just came out today – so watch this space for our thoughts on this initial implementation within OBIEE.

    Mark

  3. Rosemary Says:

    I have installed a Hadoop multi-node cluster. After formatting the namenode I start DFS, but the namenode is not listed in jps and has not started. Please help me.

  4. Alexandru Cobuz Says:

    Thanks for this tutorial, but can I run parallel/async calls against the same Hadoop instance from node.js?
