The other day, I woke up with what I thought would be a simple task: "Let's figure out how hard it would be to store the edit log of the Hadoop NameNode in a relational database." After all, I knew the NameNode already had multiple options for storing its edit log, either by putting it on disk itself, or by using the Quorum Journal Manager. This should be a task of just finding the right class and either slipping in a new subclass or implementing some interface.
Of course, nothing is ever that simple.
We'll return to the relational database storage at some point in the future (and yes, I'm aware of KTHFS). Today I want to write about a bigger problem.
Finding good information on the Hadoop internals is hard, which might seem strange given how much Hadoop information is out there. Just to be clear, there are a couple of things that I'm explicitly not looking for.
Obviously, with this much hype around Big Data, there's a lot written about what Hadoop is and where it's going in the press and in blogs. Some of it's really good, and some of it is overblown filler about how Hadoop and Big Data are eventually going to drive my car, pick out my clothes in the morning, find me friends to hang out with that night, replace every part of my computing ecosystem, and also give me a unicorn. This isn't my focus.
I'm also not looking for material on how to use Hadoop. For that, you really should begin and end your search with Tom White's book Hadoop: The Definitive Guide and Eric Sammer's book Hadoop Operations. If you want to go a little further, there are any number of Hadoop tutorials or blog posts on "how to write WordCount in MapReduce." (And if you're thinking about writing yet another "how to write WordCount in MapReduce" blog post, please reconsider and ask whether there isn't something better you can do. There's certainly room for more "how do I implement algorithm X in MapReduce", so aim a bit higher.)
Further cluttering the search space are any number of websites which, as near as I can tell, exist solely to run Javadoc and stick the results up on the web, which really just pollutes the search results. (Similarly, how many bloody ways do we need to get the Hadoop mailing lists onto the web?) These sites seem to offer little over the Apache site and I wish they'd vanish from the search results.
So, now that I've spent a bunch of time explaining what I don't want, I should focus on what I do want. I'm looking for prose that explains the code of the Hadoop daemons and libraries and how it all fits together. What messages are exchanged when a file is opened? How does the scheduler track running jobs, and what classes turn the logic of the operation into something actionable? Dropping someone onto the Javadoc of the various classes doesn't help if they don't know how the control flows through the code.
All of this information obviously exists, of course - if nothing else, you can always start at $HADOOP_COMMON_HOME/bin/hadoop-daemon.sh and, with a little bit of knowledge of how shell scripts and Java programs are invoked, trace the code from its start on the command line until the operation you want to understand is hit.
There are also many great JIRAs or mailing list entries that have these details, but they're scattered all across the web. I'm looking for somewhere that brings this information together and weaves it into a coherent story.
What you'll find in this blog
I know enough about the Hadoop codebase that I rarely need to start at the command line and trace forward, but there are still many parts of the code that I don't know well, and I'm going to need to explore the code and the web to try to understand them.
As I do, I'm going to try to turn my notes into blog posts that show how the parts fit together, in the hopes that someone finds them useful.
I'm someone who doesn't like and can't accept magic. I need to have some idea of how the computer does what we tell it to do. Ruby on Rails, for example, drives me crazy trying to understand just how it works.
I'm a big fan of blog posts and articles that take away that magic and demonstrate how software works. Some things that I really enjoyed:
RESTful routing in Rails - a 10-part series on how Rails actually handles requests, constructed from the bottom up.
Closer to Hadoop, Chris Zheng's Hadoop RPC Client walkthrough is good, though unfortunately already dated.
Taking on a larger scope, Robert Love's Linux Kernel Development book is a good example of what I wish existed for Hadoop.
So, gentle reader, a few things to know before we get too far into this.
First, I will swear frequently in my posts.
Second, I fucking hate Java. Steve Yegge is spot-on. Not only do I not like the language, I dislike most of the tools. Eclipse is slow and clunky, and I have no idea why Maven downloads as much stuff as it does.
Third, I don't post on any sort of a regular schedule, and I'm likely to cover only some parts of Hadoop. I'm interested in the data storage parts of Hadoop - the NameNode, the DataNodes, the RPC systems. I'm not especially interested in the process execution side like MapReduce and YARN. So, while I'd love for there to be a Hadoop Internals book, I don't expect that I'll ever have enough material to be the guy who writes it.
Finally, I only care about the trunk of Hadoop. Cloudera did an April Fools' Day blog post on Hadoop versioning that might have hit a little too close to home for some of the Hadoop developers. To minimize confusion, I'm going to stick to the trunk of the Hadoop source code and not try to get into the various distributions.
At the moment, this blog is hosted as static pages on GitHub.
The static site generator is Pelican.
The theme is a port of the default Octopress theme to Pelican by Maurizio Sambati. It also includes patches from Jake Vanderplas, as well as a few shortcuts from his Makefile for publishing and his Pelican plugins to support his patches. Jake's work on integrating IPython notebooks is what sold me on Pelican in the first place, so I'm especially grateful for it.
Several posts will use Gist-it by Sudar Muthu, which is an awesome tool that runs on Google App Engine and lets you embed specific sections of files on GitHub (say lines 30 to 50 of foo.java) into a document, without having to copy that code into a stand-alone gist. At the moment it only links to files, and not to specific versions of a file, which might make things problematic later as the linked code changes, but should be an easy fix if and when that day comes.