HDFS Benchmarks

Asking the Right Questions

I'm interested in benchmarking the HDFS NameNode, and in understanding the impact of different mixes of operations and workloads. As part of this, I've been reviewing Hadoop benchmarks. I've tried to limit my focus to benchmarks that make HDFS itself their primary focus, as opposed to benchmarks that try to be representative of a MapReduce workload.

What follows in this post are a few thoughts on the challenges of trying to focus on the NameNode in an HDFS benchmark, a quick survey of about 10 different benchmarks, and some ideas as to what could be included in a new benchmark for a NameNode or its equivalent in a distributed file system.

On Benchmarking

HDFS was originally designed as a storage system that could coordinate a large number of commodity servers, and provide bulk streaming reads and writes to many hosts in parallel. This is still the primary use case, and fits well with the MapReduce programming model. When someone wants to "benchmark HDFS", they typically want the benchmark to run some bulk streaming IO scenario so they can use the benchmark numbers as a baseline to compare against another workload that may be more complicated to run.

Perhaps put another way, many HDFS benchmarks are meant to identify the bottleneck or saturation points in the various components and interconnects in the end to end system. If a simple identity map MapReduce job takes N seconds to process 500 Gigabytes of data, that can be used as a floor for understanding why a real MapReduce job takes N+M seconds.

In all of that data transfer, however, the NameNode's contribution can get lost. Instead of trying to put together a massive HDFS cluster to store legions of multi-megabyte blocks, a NameNode benchmark needs to ensure that the NameNode is in fact the component being stressed. Rather than creating 100 one-gigabyte files, a NameNode benchmark might better focus on creating many small or even empty files.

Benchmarks

DFSIO

TestDFSIO, found in hadoop-mapreduce-client-jobclient/src/test/j.o.a.h/fs, is the canonical example of a benchmark that attempts to measure HDFS's capacity for reading and writing bulk data. The test can measure the time taken to create a number of large files, and then turn around and use those files as inputs to measure the read performance an HDFS instance can sustain.

The test runs with many nodes, each running a "thread" to handle reading and writing to separate files in parallel. The test is structured as a MapReduce program to handle the simultaneous launch of those threads as Map tasks. Each map task runs an instance of the test, creating or reading a file. Each map task notes the time it starts and completes the operation, and the size of the data it transfers, and divides the size by the time to compute a rate.

The reduce task collects the results of the map tasks and reports a summary of the benchmark, with the number of files and total bytes processed. It also reports two performance numbers: the "average I/O" and the "throughput", both in megabytes per second. Now, you may wonder "aren't those basically the same thing", and the answer is yes. Michael Noll's blog post and the thread he links to describe the formulas used to compute them, but it turns out that the "average I/O" is the arithmetic mean of the rates each map task calculated, and the "throughput" is the harmonic mean of those same rates.
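In symbols, with r_i the rate reported by each of the n map tasks, the two numbers are just the two classical means:

```latex
\text{average I/O} = \frac{1}{n}\sum_{i=1}^{n} r_i
\qquad
\text{throughput} = \frac{n}{\sum_{i=1}^{n} 1/r_i}
```

Since the harmonic mean can never exceed the arithmetic mean, the reported "throughput" will always be at or below the "average I/O", and the gap widens as the per-task rates become more uneven.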

"Average I/O" and "throughput" are the means(averages) of performance individual execution slots in your cluster achieved. To understand the overall aggregate performance of your cluster, you must multiply those rates by the number of slots you have available for job execution. You also need to be sure that when you ran your benchmark, you had enough map tasks concurrently executing to fill those slots, so you can be sure that there is not a hidden bottleneck that running at full capacity would expose.

HiBench/DFSIO-E

DFSIO measurements are somewhat coarse: they report only a few summary statistics at the end, and leave some uncertainty around any aggregate measurement of the cluster's performance. To address this, Intel included an "Enhanced DFSIO" as part of its HiBench suite of Hadoop benchmarks. As they describe DFSIO-E in their paper from the Workshop on Big Data Benchmarking in 2012, DFSIO-E attempts to report an aggregate performance curve, rather than just a summary rate at the end. It does this by sampling performance frequently during each map task's execution instead of reporting a single number per map task. DFSIO-E assumes roughly synchronized clocks throughout the cluster, and uses them to order the different samples by timestamp and fit an overall rate curve to the samples. DFSIO-E is not part of the standard Hadoop test suite, but perhaps should be.
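My reading of that approach, as a sketch (not Intel's code): each map task periodically records a timestamped cumulative byte count alongside its normal I/O, and the reducer sorts the samples from all tasks by timestamp and differences them to produce an aggregate rate curve over time.

```java
import java.util.ArrayList;
import java.util.List;

public class ThroughputSampler {
  // One (wall-clock time, cumulative bytes) sample per interval, per map task.
  static final class Sample {
    final long timestampMs;
    final long cumulativeBytes;
    Sample(long t, long b) { timestampMs = t; cumulativeBytes = b; }
  }

  private final List<Sample> samples = new ArrayList<>();
  private long bytes;
  private long lastSampleMs;

  // Called from the map task's read/write loop after each buffer transfer.
  void record(int bytesTransferred) {
    bytes += bytesTransferred;
    long now = System.currentTimeMillis();
    if (now - lastSampleMs >= 1000) {   // sample roughly once a second
      samples.add(new Sample(now, bytes));
      lastSampleMs = now;
    }
  }

  // Emitted to the reducer, which merges samples from every task by timestamp.
  List<Sample> samples() { return samples; }
}
```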

NNBench and NNBenchWithoutMR

NNBench and its companion, NNBenchWithoutMR, are part of the Hadoop test suite, found in hadoop-mapreduce-client-jobclient/src/test/j.o.a.h/hdfs. We'll start with NNBenchWithoutMR.

NNBenchWithoutMR is a single-threaded Java program that uses the Hadoop HDFS client libraries to create, read, rename, and delete a collection of files, in that order. The size of the files created is configurable, though since the benchmark is meant to stress the NameNode, the file size is typically small. The actual data written (and in turn read) is all zeros, though the benchmark does not check that this is preserved.

The benchmark starts by creating a unique working directory in HDFS. Then it moves on to the first sub-benchmark: for some provided number of files N, it creates a file, writes out the file's contents, and closes the file before moving on to the next one. The files are not replicated in HDFS, i.e. they have a replication factor of '1'. All operations are blocking, and no calls are issued asynchronously. After writing N files, the benchmark can move on to the subsequent sub-benchmarks over those N files, in which it opens and reads the contents of each file, renames each file, and deletes each file. Again, in each sub-benchmark each file operation is completed before moving on to the next file, and each call is issued as a blocking call. At the end, the benchmark reports the total wall-clock time.
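As a rough sketch of the shape of that loop (my own illustration against the public client API, not the benchmark's actual code, with arbitrary sizes and counts):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class NNStressSketch {
  public static void main(String[] args) throws Exception {
    FileSystem fs = FileSystem.get(new Configuration());
    Path dir = new Path("/benchmarks/nnstress");    // unique working directory
    fs.mkdirs(dir);
    int numFiles = 1000;
    byte[] zeros = new byte[1024];                  // small files of all zeros

    long start = System.currentTimeMillis();
    // Phase 1: create and write, one blocking sequence of calls per file
    for (int i = 0; i < numFiles; i++) {
      FSDataOutputStream out =
          fs.create(new Path(dir, "file_" + i), true, 4096, (short) 1, 64L << 20);
      out.write(zeros);
      out.close();
    }
    // Phase 2: open and read back each file
    for (int i = 0; i < numFiles; i++) {
      FSDataInputStream in = fs.open(new Path(dir, "file_" + i));
      while (in.read(zeros) > 0) { /* discard */ }
      in.close();
    }
    // Phase 3: rename each file; Phase 4: delete each file
    for (int i = 0; i < numFiles; i++) {
      fs.rename(new Path(dir, "file_" + i), new Path(dir, "renamed_" + i));
    }
    for (int i = 0; i < numFiles; i++) {
      fs.delete(new Path(dir, "renamed_" + i), false);
    }
    System.out.println("wall-clock ms: " + (System.currentTimeMillis() - start));
  }
}
```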

A slightly more sophisticated version of the benchmark, NNBench, is available that runs as a MapReduce job, much like TestDFSIO. In this version, the same core sub-benchmarks are used, with an enhancement that the benchmark records the time taken for each call that manipulates the file system. (This is not something that requires MapReduce, and could easily be done in the non-MapReduce version.) The user may also specify the replication factor for the files created. The MapReduce framework is used to launch multiple instances of the benchmark, each operating in its own unique working directory in HDFS. Each map task runs the benchmark once, and the number of map tasks created is specified by the user. A reduce task collects the benchmark results from each map task and computes overall statistics.

Load on the NameNode can be increased by running more benchmarks, either by executing more instances of NNBenchWithoutMR, or by increasing the number of map tasks with NNBench. With the MapReduce version, there is no guarantee that the map tasks will all run concurrently as intended, though this typically happens in practice. Both versions of the benchmark can pause until a specified wall-clock time and coordinate with a "synchronize watches" style barrier, to try to ensure all load is placed on the NameNode at the same time.

NNBenchWithoutMR reports only the wall-clock time for the entire benchmark. The MapReduce version is instrumented to track the time individual operations between the client and the NameNode take, and could be modified to report not only the average time but a more detailed distribution of those times. Unfortunately, because it instruments around the client library and not, say, the RPC layer, it may not have an accurate view of what's really occurring at the NameNode. It measures not only the time for the message to get from the benchmark to the NameNode, for the NameNode to process the message, and for the response to return from the NameNode, but also the time the client library spends marshalling and unmarshalling the messages, plus any other processing that may occur at the client. The Hadoop client libraries make heavy use of threads, so non-determinism from queuing delay will also be included in the measurement of client calls. The client library may also mask failures by automatically retrying failed operations, creating the appearance of a single call that took twice as long (or longer), rather than two or more calls between the client and NameNode, each with its own accurately tracked time.

The NNBench benchmarks deterministically construct their environment in a way that requires them to track very little state yet still lets them quickly decide which files to operate on. Files are created in the first step, which allows the benchmark to be sure the files are available, and they follow a naming convention that allows the benchmark to easily compute the name of a file it can successfully read. Similarly, the inputs to the rename and delete operations are readily available with little more than a base filename and a counter in the benchmark.

S-Live

The Stress Test for Live Data Verification, or S-Live, is a more sophisticated benchmark than the NNBench set of benchmarks. It follows the same construction as DFSIO and NNBench, in that it runs a basic benchmark many times in parallel using MapReduce, with the core benchmark as the map task. The reduce tasks summarize the results of the benchmark runs conducted by the maps. It was developed by Konstantin Shvachko, lately of WanDisco.

Like NNBench, S-Live is designed more to test the file operations of HDFS than to measure the bulk transfer ability of an HDFS cluster. S-Live creates, reads, renames, and deletes files, but goes beyond the operations in NNBench and also appends to files, creates directories, and lists the contents of directories.

S-Live departs from NNBench in two major ways. First, the data written in the files is the output of a hash function over an easily computed sequence, which allows the benchmark to verify the data it reads from an earlier write step. Second, rather than each map task operating a benchmark in an orderly fashion and in isolation from the other map tasks by working in a subdirectory, in S-Live map tasks can and do interact with the same files in parallel. S-Live copes with the parallelism by embracing failure; an operation like a read might well fail because the file has already been deleted by another map task. S-Live does not attempt to coordinate operations between map tasks.

The S-Live benchmark, like NNBench, algorithmically creates its environment on the filesystem, which reduces the amount of state the benchmark needs to maintain. The benchmark is built around a function that can generate a path to a file or directory from a sequence number. When S-Live wants to operate on a file, it pulls a random number between 0 and the maximum number of files in the benchmark, and can instantly derive the full path that it should use to identify that file. S-Live does not have to walk the directory tree or otherwise interrogate the file system to choose which file to use, nor keep an in-memory mapping table of files available.
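A hypothetical version of such a function (the bucketing scheme below is invented for illustration and is not S-Live's actual layout):

```java
import java.util.Random;

public class PathFromSequence {
  // Map a sequence number deterministically onto a full path, so no directory
  // walk or in-memory table is needed to pick a target file.
  static String filePath(String root, long seq, int dirsPerLevel) {
    long d1 = seq % dirsPerLevel;
    long d2 = (seq / dirsPerLevel) % dirsPerLevel;
    return root + "/dir_" + d1 + "/dir_" + d2 + "/file_" + seq;
  }

  public static void main(String[] args) {
    Random rand = new Random();
    long maxFiles = 10_000;
    // Pick a random file in O(1), as the post describes S-Live doing.
    long seq = (long) (rand.nextDouble() * maxFiles);
    System.out.println(filePath("/slive", seq, 32));
  }
}
```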

Whereas NNBench ordered its operations as first all of the writes, then all of the reads, and so on, S-Live randomly selects which operation it will perform next. Each operation can be selected with equal likelihood, or the user may give each class of operation a different weight to set the ratio of operations performed. Because operations are selected at random, and the files to operate on are also selected at random, operations in S-Live frequently "fail" when they attempt to operate on files that do not yet exist, or that existed earlier but have been modified or deleted by another map task running the benchmark. To mitigate this, the distribution of operations over time can be biased. For example, the user may wish to have file creates and writes occur more frequently in the early parts of the benchmark, to make later reads and deletes more likely to succeed. In the extreme, an S-Live user may run the benchmark with a 100% write operation mix as a way to populate a filesystem before starting a second run of the benchmark with additional operations enabled.
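The selection step itself is simple weighted sampling; a minimal sketch, assuming user-supplied weights (the real S-Live configuration knobs differ):

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class WeightedOpPicker {
  enum Op { CREATE, READ, APPEND, RENAME, DELETE, MKDIR, LS }

  private final Map<Op, Double> weights;
  private final Random rand = new Random();

  WeightedOpPicker(Map<Op, Double> weights) {
    this.weights = new LinkedHashMap<>(weights);
  }

  // Pick the next operation with probability proportional to its weight.
  Op next() {
    double total = 0;
    for (double w : weights.values()) total += w;
    double r = rand.nextDouble() * total;
    for (Map.Entry<Op, Double> e : weights.entrySet()) {
      r -= e.getValue();
      if (r <= 0) return e.getKey();
    }
    return Op.READ;  // only reachable through floating-point rounding
  }
}
```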

S-Live is included in the mainline Hadoop code, found in hadoop-mapreduce-client-jobclient/src/test/j.o.a.h/fs.slive. WanDisco has a blog post with an example of how to run it, which should apply to just about any Hadoop distribution. Shvachko also has a more detailed design document as part of the JIRA. Hortonworks also uses it to test the performance of the NameNode.

LoadGenerator

The LoadGenerator tool can be used as a NameNode benchmark. It runs as a stand-alone tool, not as a MapReduce job, and stresses the NameNode through the DFS client libraries. LoadGenerator creates a number of threads. Each thread runs in a loop, randomly picking between read, write, and list file operations, with an optional probability mix for the operations. The benchmark keeps an in-memory list of the files and directories it can potentially operate on, discovered at benchmark startup time, and randomly selects one to act on at each step. For file creates, it randomly selects a directory in which to create the file, and then randomly chooses a file size, where the file size comes from a Gaussian distribution with an average of 2 blocks and a standard deviation of 1 block.
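That size selection amounts to something like the following (a sketch of the behavior described above, not the tool's code):

```java
import java.util.Random;

public class FileSizePicker {
  // Gaussian file size: mean of 2 blocks, standard deviation of 1 block,
  // clamped so we never ask for a negative size.
  static long pickSizeBytes(Random rand, long blockSizeBytes) {
    double blocks = 2.0 + rand.nextGaussian();
    return Math.max(0L, Math.round(blocks * blockSizeBytes));
  }
}
```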

LoadGenerator runs for a given wall clock time, and reports a synopsis of the performance of each of the different operation types at the end. The benchmark can also be given a script that specifies how the probabilities of the different operations should change over different intervals, so you can run a simulation that has different access patterns over time.

The code is found in hadoop-common/src/test/java/org/apache/hadoop/fs/loadGenerator/. It is likely not a more compelling benchmark than NNBench or S-Live, though the scriptability may be beneficial for some experiments.

NNThroughputBenchmark

One of the earliest NameNode benchmarks is the NNThroughputBenchmark. Also by Shvachko, NNThroughputBenchmark does not stress the NameNode through a MapReduce job, or even through the HDFS client libraries. Instead, NNThroughputBenchmark invokes operations on the NameNode directly, bypassing the networking and RPC interface. The core NameNode object runs in the same process as the benchmark; there is no client or server. Like NNBench, the NNThroughputBenchmark runs through a set of distinct phases where it tests the same operation over many different files. NNThroughputBenchmark benchmarks creating a file, opening a file, deleting a file, checking a file's status, and renaming a file.

NNThroughputBenchmark also exercises the block management functions of the NameNode. The benchmark simulates a number of DataNodes, and again directly invokes the functions inside the NameNode, rather than operating the full protocol between the NameNode and the DataNodes.

The main limitation of the NNThroughputBenchmark is that everything operates inside the same Java process. Therefore, the benchmark is limited to a single node, and that node must support both the benchmark driver operations and the NameNode object. The benchmark is threaded, so it can take advantage of multiple CPUs in a node to run multiple benchmark driver threads and offer more load to the NameNode object. As the benchmark code is relatively simple and the RPC layer is bypassed, it seems at least possible that the NameNode object is the bottleneck in the benchmark, and that adding additional nodes to the test would not result in additional throughput. (There was some work in a JIRA by Todd Lipcon to start down the path towards a multi-node version of the benchmark; however, that work has not been pursued.)

NNThroughputBenchmark is part of the standard Hadoop distributions, and is found in hadoop-hdfs/src/test/j.o.a.h/hdfs.server.namenode. There is recent work to make it use the standard Hadoop tool interface and be a little more user-friendly. Shvachko describes it briefly in his paper in USENIX ;login:.

TestEditLog

To a first approximation, the NameNode is really an in-memory database with mutations persistently stored in a write-ahead log. For many operations, performance will be dominated by how fast updates can be logged to disk.
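A crude way to see why this matters, independent of Hadoop, is to measure how many small synced appends per second the journal device will accept. The sketch below is my own, not TestEditLog; it gives a rough floor for reasoning about the NameNode's sustainable mutation rate on a given disk (the real NameNode batches multiple edits per sync, so its effective rate can be higher):

```java
import java.io.File;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;

public class EditLogSyncSketch {
  public static void main(String[] args) throws Exception {
    File log = File.createTempFile("editlog", ".bin");
    try (RandomAccessFile raf = new RandomAccessFile(log, "rw")) {
      FileChannel ch = raf.getChannel();
      ByteBuffer record = ByteBuffer.allocate(200);  // roughly edit-sized
      int ops = 10_000;
      long start = System.nanoTime();
      for (int i = 0; i < ops; i++) {
        record.clear();
        ch.write(record);
        ch.force(false);       // sync after every record: the worst case
      }
      double secs = (System.nanoTime() - start) / 1e9;
      System.out.printf("%.0f synced appends/sec%n", ops / secs);
    } finally {
      log.delete();
    }
  }
}
```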

Studying how the underlying datastore affects filesystem performance is somewhat common; for a recent example, see BabuDB, which is part of XtreemFS. For a Hadoop microbenchmark, Ivan Mitic offered a suggestion to use the TestEditLog test as a way to study just the performance of the write-ahead log.

TestEditLog is certainly not a nice, prepackaged benchmark. It's found in the hadoop-hdfs/src/test/j.o.a.h/hdfs.server.namenode package, and for the most part consists of a number of tests to recover from errors in the log. A NameNode benchmarking effort will only find it useful as a starting point.

QFS Benchmark

HDFS is the best-known of the filesystems that draw inspiration from the Google File System paper, but it is not the only one. Another well-known system was the Kosmos FileSystem, from Kosmix (now known as @WalmartLabs). The Kosmos FileSystem served as the basis for the Quantcast File System.

The Quantcast team has published some performance benchmarks comparing QFS to HDFS. We'll skip over the data throughput experiments, and instead focus on the metadata experiments. Quantcast makes the code for their 'mstress' application available on GitHub.

The mstress benchmark is meant to be run from multiple hosts simultaneously. Rather than a MapReduce job, mstress includes a number of Python scripts to help prepare and launch the benchmark from the different hosts with help from ssh. The same Python code launches benchmarks for either filesystem, but the benchmark for HDFS is written in Java, and the benchmark for QFS is written in C++.

Mstress does not assume that there are Hadoop DataNodes available (or ChunkServers, the QFS equivalent) and so it only tests directory operations like creating, listing, stating, and deleting, which do not require any blocks to be allocated. (At one point, HDFS did not permit empty files, but that hasn't been true for a long time. It may be that QFS does not permit them.)

The benchmark is given a few parameters that describe the directory tree on which it should operate, such as how many levels of directories there should be, and how many subdirectories per level. The benchmark does not need to maintain much state: the creation benchmark builds the entire tree, and the list-directory benchmark walks the tree and only keeps a count of directories encountered. The stat benchmark, which randomly selects some subset of the directories to examine, can generate the paths it wishes to examine on demand and does not need a list a priori. Benchmarks running on different hosts work in different subdirectories so nodes do not interfere with each other's operations.
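To illustrate how little state that requires, the entire tree can be enumerated from the two parameters alone (a sketch, not mstress's code):

```java
public class DirTreeSketch {
  // Recursively enumerate /root/d0/d1/... paths for `levels` levels with
  // `fanout` subdirectories per level, creating (or here just counting) each.
  static long walk(String parent, int level, int levels, int fanout) {
    if (level == levels) return 0;
    long count = 0;
    for (int i = 0; i < fanout; i++) {
      String dir = parent + "/d" + i;
      count += 1 + walk(dir, level + 1, levels, fanout);
    }
    return count;
  }

  public static void main(String[] args) {
    System.out.println(walk("/mstress/host1", 0, 3, 10) + " directories");
  }
}
```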

The mstress benchmark is a promising start, but would greatly benefit from operations beyond directory manipulation.

Ohio State InfiniBand benchmark

Recently, a group has come together calling itself the Big Data Benchmarking Community. As one would expect, "Big Data Benchmarking" is a fairly expansive term, and the community has been exploring just what is in and out of the scope of their common interests. The BDBC holds frequent conference calls, and has held a workshop about every 6 months since the beginning of 2012.

Hadoop has been well covered in the community. At the 2nd workshop, the Network-Based Computing Lab of the Ohio State University presented some of their work on HDFS over InfiniBand, including some microbenchmarking results. Their suite of microbenchmarks targets five different measurements: Sequential Write Latency, Sequential or Random Read Latency, Sequential Write Throughput, Sequential Read Throughput, and Sequential Read-Write Throughput.

It's hard to say exactly what each test does given only the slides, but the two latency tests appear to read or write a single large file to both 4 and 32 DataNodes, and the throughput experiments appear to use multiple DataNodes and multiple clients. The tests compare performance over 32 Gbps InfiniBand and 1 Gigabit Ethernet. InfiniBand always wins, but not by nearly as much as one would expect from an interconnect that is so much faster than Ethernet. (Each node appears to have only one disk, which is an odd balance and may be the bottleneck.)

The benchmark is not yet available. The Network-Based Computing Lab hopes to release it as part of their next release, sometime in early fall of 2013.

SWIM

Most of the benchmarks we have examined thus far have been HDFS focused, but we make an exception for the Statistical Workload Injector for MapReduce (SWIM) toolset. SWIM is designed to help create MapReduce benchmarks that are statistically similar to real workloads. It eschews attempting to reduce a MapReduce workload to any well-known distribution, and instead uses the set of traces from a production workload as the "model". When SWIM generates a benchmark, it ensures that the workload shares the same characteristics as the production workload, scaled for whatever size cluster on which the user wishes to run the benchmark.

SWIM operates in several phases. In the first phase, SWIM includes a tool to load the full set of logfiles of a target MapReduce workload. This data consists of the runtime of the job (both overall, and broken down into "MapSeconds" and "ReduceSeconds", which are the sums of all the Map Task running times and Reduce Task running times), the amount of data processed by the Map, Shuffle, and Reduce phases, as well as information about the job arrival rate. A second tool samples from this data set to produce a smaller workload that is similar to the overall data set over several dimensions.

The SWIM benchmark ships with summarized workloads originally derived from Facebook production workloads. Cloudera also internally uses summaries derived from their customer workloads; however, those have not yet been made available. (Yahoo reports that they are making several of their traces available, which could potentially provide additional example data for SWIM.)

In the second phase, SWIM can take the summarized workloads and produce a set of MapReduce jobs that "replay" the summary, potentially with a new cluster configuration. First, it creates enough synthetic input data that it can satisfy the data reads of every job. Then, it generates a set of MapReduce jobs, and a script to launch them with an inter-job arrival rate derived from the original workload. These generated MapReduce jobs read from the synthetic data, and pass along appropriately sized shuffle and reduce data.

To close the loop, the same tools used in the first phase which produce a summary of a MapReduce workload can be run on the raw output of the second phase, and in turn produce a summary that can be compared to the original summary.

Being able to build workloads that are representative of a user's actual workload is clearly desirable. Unfortunately, as an HDFS metadata benchmark, SWIM falls short. The log files of a MapReduce job contain limited information about the file system interactions. At best, the log files contain the directories used for inputs, and the total amount of data processed. There is no information about the characteristics of individual files and no way for SWIM to build a similar structure. If a MapReduce job reads a directory full of 1 million files, each one megabyte, SWIM stores that this was a job that read a terabyte. When it generates synthetic data, it may very well create many fewer but larger files, so long as a terabyte of data is available.

Towards a new benchmark

We believe that the time is right for a new benchmark to help understand the performance of filesystem metadata management in a distributed storage system. Targeting this benchmark against the Hadoop NameNode is one important case, but an ideal benchmark would not be tied explicitly to Hadoop. In this section, we describe what we see as important in a new benchmark, borrowing aspects and examples from the benchmarks listed above as appropriate.

Desired Quality #1: Portable and easily implementable

HDFS is a core staple of Hadoop. At the initial release, Hadoop was only a library for writing MapReduce-style programs, an execution engine to run those programs, and HDFS to organize data into files, directories, and blocks for MapReduce programs. Since that first release, the world around HDFS has grown more complicated, and some have proposed moving beyond HDFS with new file systems, or adapting Hadoop to use existing parallel file systems or storage systems.

To keep the focus somewhat grounded, we will only consider storage systems that "look like filesystems." Data can be managed in other forms, such as database tables, object graphs like Core Data, or iterators over streaming data, but we won't try to propose a benchmark that could be used to measure the metadata management of those systems.

To put this in concrete terms, we envision a benchmark with fundamental operations like creating files and directories, opening files for reading or writing, listing directories, adding data to files, renaming and deleting files.
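In code, the surface we have in mind is small. The hypothetical interface below is only illustrative; each method has a near-direct counterpart in org.apache.hadoop.fs.FileSystem (mkdirs, create, open, append, listStatus, rename, delete):

```java
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

// A hypothetical, minimal metadata-benchmark surface. It is deliberately
// close to what the HCFS contract already covers, so implementations can
// wrap org.apache.hadoop.fs.FileSystem or a system's native client.
public interface MetadataTarget {
  void mkdir(String path);
  OutputStream create(String path);
  InputStream open(String path);
  OutputStream append(String path);
  List<String> list(String path);
  void rename(String from, String to);
  void delete(String path);
}
```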

Programs using the Hadoop libraries can use the org.apache.hadoop.fs.FileSystem class to interface with many filesystems beyond just HDFS. The Hadoop Compatible FileSystem (HCFS) standard is an effort to more formally specify the semantics that this interface will provide. HCFS is tracking CassandraFS, CephFS, CleverSafe Object Storage, GlusterFS, MapR Filesystem, QFS, and the Symantec Veritas Cluster Filesystem, and should likely add Tachyon from UC Berkeley, and Azure Storage Vaults.

By staying as close as possible to a level that can be easily expressed using operations included in the HCFS contract, a benchmark can move between systems, using the FileSystem class in the worst case, but also potentially using an interface native to that system.

A benchmark may wish to target some operations that cannot be expressed through the FileSystem class. For example, the NameNode must process block information from the various DataNodes. We would like to be able to include this in a benchmark, but that interface is not exposed through the FileSystem class.

Desired Quality #2: Don't require the full stack

The NameNode is unique in Hadoop in that it is largely a self-contained black box. After a given sequence of calls, it will be in a known state. The identity of the originator of those calls is somewhat irrelevant: If a call says 'node 23 wants to store a block', the NameNode does not go to great lengths to verify that node 23 is in fact the one making the request. This idea is embraced by the NNThroughputBenchmark, which creates an entirely synthetic environment in which the NameNode can operate, entirely oblivious to the fact that there is no cluster.

This flexibility makes it easier to construct and operate benchmarks that demonstrate the core performance challenges, without requiring a large support infrastructure. Consider the TestEditLog example: if the question is how fast can the NameNode log data to disk, which will bound the number of operations per second the NameNode can sustain, we do not need a full HDFS cluster available to determine that number. Similarly, with systems like KTHFS, which uses a Relational Database Management System to store the state of a NameNode, being able to construct a benchmark that simply issues the appropriate query to the underlying database will also allow a benchmarker to understand the lower bound of what the worst case performance might be for the system, without the challenge of building the full system.

Desired Quality #3: Separate data generation from benchmark driver

One of the common themes in existing NameNode benchmarks is that the file and directory structure they interact with is relatively simple, and almost always computed on demand with little state used to track it. By reducing the amount of work the benchmark needs to do to compute the environment, there is less likelihood that there is any overhead polluting the benchmark's measurements.

However, there is no reason that the benchmark needs to combine the operation execution driver with the code that determines the operations to perform. In our ideal benchmark, the benchmark driver can optionally be given a pre-computed scenario of operations to perform, and be set forth on executing them as fast as the underlying system will accept them in an "operation execution" phase. By decoupling the two phases, the "scenario generation" phase of determining which operations to run, and on which files, can be computed ahead of time. Computing these operations off-line allows the benchmark to construct more complicated scenarios and use whatever resources and time it needs to compute the scenario.
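One possible shape for that split (the trace format and names here are assumptions, not an existing tool): the scenario generator writes a flat file of operation records, and the driver does nothing at run time but parse and issue them.

```java
import java.io.BufferedReader;
import java.io.FileReader;

public class ScenarioDriver {
  // One operation per line, e.g.:
  //   42 CREATE /bench/dir7/file_42
  //   43 RENAME /bench/dir7/file_42 /bench/dir7/file_42.r
  public static void main(String[] args) throws Exception {
    try (BufferedReader in = new BufferedReader(new FileReader(args[0]))) {
      String line;
      long ops = 0;
      long start = System.nanoTime();
      while ((line = in.readLine()) != null) {
        String[] fields = line.split(" ");
        execute(fields[1], fields);      // every decision was made offline
        ops++;
      }
      System.out.printf("%d ops in %.1f s%n", ops, (System.nanoTime() - start) / 1e9);
    }
  }

  static void execute(String op, String[] fields) {
    // Dispatch to the target system's client here (create/open/rename/...).
  }
}
```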

Desired Quality #4: Operate on a mix of input data

By decoupling the benchmark into a "scenario generation" phase and an "operation execution" phase, the scenarios to be benchmarked can come from a variety of sources.

There is little published on exactly what operations a NameNode encounters in a typical deployment, likely because there is not really anything like a "typical" HDFS use case. One of the few examples available is from Facebook in 2010, which classified the distribution of operations at the NameNode as:

  • stat a file or directory 47%
  • open a file for read 42%
  • create a new file 3%
  • create a new directory 3%
  • rename a file 2%
  • delete a file 1%

That does not mean that a benchmark should fix its operation frequency to match only that distribution. Instead, we want to support multiple scenarios. We may want to be able to replay a trace from an exact workload, or we may want to replay a trace synthesized through something like SWIM. We may also want to try a scenario that we know is unrealistic but that still shows some pathological behavior.

Desired Quality #5: Expose parallelism, but capture dependencies

The NameNode or any other metadata management system must be able to handle concurrent operations, and indeed, many of the performance issues of the NameNode come from handling the locking operations required to maintain its internal consistency. Therefore, any benchmark must issue some number of simultaneous operations.

The challenge is how to issue the operations in a realistic fashion. We expect that the NameNode will respect the C in ACID, but it is up to the application (and hence the benchmark) to ensure that the states the NameNode is moving between are consistent to begin with. For example, an application would not, in a normal setting, try to create a file before creating the parent directory for that file. Existing benchmarks cope with this in different ways. Some, like NNBench, carefully order their operations to ensure that each operation has the correct state to succeed. Others, like S-Live, are intentionally oblivious and in fact consider testing the error conditions to be an important part of the benchmark.

We believe a benchmark should take the middle ground. Scenarios to be executed will have dependency information included: a file create operation will depend on the creation of the parent directory. The benchmark will use the dependency information to create a serializable schedule of operations. It will be up to the benchmarker to ensure that the trace provides enough potential parallelism in the scenario to adequately exercise the system under test.
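A sketch of how a driver might use that dependency information (all names and structures below are assumptions, not an existing tool): each operation carries the IDs of the operations it depends on, and the driver only issues an operation once its dependencies have completed, keeping independent operations in flight concurrently.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

public class DependencyScheduler {
  // A scenario operation: what to do, plus the IDs of the operations that
  // must finish first (e.g. a CREATE depends on the MKDIR of its parent).
  static final class Op {
    final long id;
    final String action;
    final List<Long> dependsOn;
    Op(long id, String action, List<Long> dependsOn) {
      this.id = id; this.action = action; this.dependsOn = dependsOn;
    }
  }

  private final Set<Long> done = ConcurrentHashMap.newKeySet();
  private final ExecutorService pool = Executors.newFixedThreadPool(64);

  void run(List<Op> scenario) throws InterruptedException {
    Deque<Op> pending = new ArrayDeque<>(scenario);
    while (!pending.isEmpty()) {
      // One pass over the queue: issue everything whose dependencies are met.
      for (int i = pending.size(); i > 0; i--) {
        Op op = pending.poll();
        if (done.containsAll(op.dependsOn)) {
          pool.submit(() -> { execute(op); done.add(op.id); });
        } else {
          pending.add(op);               // not ready yet, re-queue for later
        }
      }
      TimeUnit.MILLISECONDS.sleep(1);    // crude back-off between passes
    }
    pool.shutdown();
    pool.awaitTermination(1, TimeUnit.HOURS);
  }

  void execute(Op op) {
    // Issue op.action against the target filesystem and record its latency.
  }
}
```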

Desired Quality #6: Deterministic

Benchmarks should be repeatable. For many of the existing benchmarks, this is accomplished by running the benchmark with the same parameters. Others may need to ensure the same seed is fed to the random number generator to produce the same sequence of events. In a trace-driven benchmark, much of the work of determinism is easily handled: there is only one set of decisions to be made about which operation to perform next.

However, to make it easier for the benchmark to stay deterministic while running at high speed, there are some small changes to Hadoop that could be helpful. Not every call to the NameNode returns the same result: allocating a new block gets a new block identifier from the NameNode, and the NameNode makes no guarantees about which nodes it may choose to place block replicas on.

Allowing the client to provide hints or outright requirements for what to call blocks or on which nodes to place blocks would make a trace replay much easier. For block identifiers, the NameNode would still be responsible for ensuring the ID is valid and not already in use. Then, rather than having to track what the NameNode decides to assign for a block identifier, the benchmark driver can just ensure that it got back what it expected. This extension is helpful for situations like HDFS Federation or Content-Addressable storage systems, where the ID of the block would not be under the control of the NameNode anyway.

As for host placement, the NameNode already allows for a pluggable API, the BlockPlacementPolicy interface, to decide where to place a block. Extending that API to allow the client to include hints is reasonable, and useful beyond a benchmark.

Wrapping up

I'll be updating this post as I find more benchmarks - please let me know what I've missed!