Friday, March 14, 2014

Compare & Contrast HDFS with GFS


Compare & Contrast HDFS with GFS


HDFS has been inspired from  the original work documented in Google File System (GFS) paper. The broader goals are Performance, Scalability, Reliability & Availability.The basic philosophy is that component failures are the norm rather than the exception. In a distributed system, there would be application bugs, operating system bugs, human errors, and the failures disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the file system.The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis. While most of the capabilities are similar, the goal of this article is to educate on the subtleties where they are different.

Some key guiding principals for bothsystems
  • Large data sets – usually Petabytes & Terabytes as opposed to conventional DBMS which are usually in GB ; a typical block size is 512MB as opposed to KB
  • Largely append only files which can be read sequentially - some subtleties & differences in GFS which we will cover later.
  • Data locality - to move the processing close to data.

Architecture
Both systems have very similar architecture . HDFS consists of a single master server called NameNode which stores file system metadata. Application data is stored on other servers called DataNodes. On the other hand, a GFS cluster consists of a single master and multiple chunk servers. Master includes the namespace, access control information, the mapping from files to data blocks (chunks in GFS), and the current locations of blocks. Master also also controls system–wide activities such as lease management, garbage collection of orphaned blocks, and migration between DataNodes (chunk servers) . The master periodically communicates with each node in heartBeat messages to give it instructions and collect its state.


The entire namespace metadata is kept in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system called the image. The persistent record of the image stored in the local host’s native files system is called a checkpoint. 

Master Replication

The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. 

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.A BackupNode on the other hand is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup . This further speeds up recovery and improves availability.

The BackupNode accepts the namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory.  If the NameNode fails, the BackupNode’s image in memory and the checkpoint on disk is a record of the latest namespace state.Since it already has an up-to-date namespace image in its memory, the recovery time in the order of minutes. 

GFS on the other hand calls the BackupNode as Shadow masters, which provide read–only access to the file system even when the primary master is down. To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. 

File System
Each of the FS consists of data blocks (chunks) . When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. Assuming treplication (3 replicas) ,  the DataNodes form a pipeline. Bytes are pushed to the pipeline as a sequence of packets.Each block replica on a DataNode is represented by two files in the local host’s native file system. The first file contains the data itself and the second file is block’s metadata including checksums for the block data and the block’s generation stamp.

Concurrent Readers 
Both HDFS & GFS clients that open a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked.The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If after the hard limit expires (one hour) and the client has failed to renew the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

Concurrent Writers

The key difference between the two systems is in the area of dealing with concurrent writers . HDFS is primarily an append only file system which means that it does not provide  the capability to have multiple writers to the file.Moreover, after the data is written to an HDFS file, HDFS does not provide any guarantee that data are visible to a new reader until the file is closed. If a user application needs the visibility guarantee, it can explicitly call the hflush operation. Then the current packet is immediately pushed to the pipeline, and the flush operation will wait until all DataNodes in the pipeline acknowledge the successful transmission of the packet. All data written before the hflush operation are then certain to be visible to readers.

GFS on the other hand provide ability to both mutate at a specific offset – similar to a traditional write – and an atomic append operation called record append. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients and the final result though consistent is non deterministic or undefined. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client.

GFS has certain distinct states to deal with write/appends - the state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
  1. A file region is considered consistent if all clients will always see the same data, regardless of which replicas they read from. 
  2. Further a region is considered defined if a mutation succeeds without interference from concurrent writers.
  3. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but could consists of mingled fragments from multiple mutations. 
  4. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.



Most of the applications use record append then overwrite at an offset. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.


Pipelining
The other big difference between the two systems is the approach taken towards data replication.

When a new block is created, HDFS places the first replica on the node where the writer is located, the second and the third replicas on two different nodes in a different rack, and the rest are placed on random nodes with restrictions that no more than one replica is placed at one node and no more than two replicas are placed in the same rack.When a new block is being written and all target nodes are selected, nodes are organized as a pipeline in the order of their proximity to the first replica. Data are pushed to nodes in this order. For reading, the NameNode first checks if the client’s host is located in the cluster. Both the control and data flows from primary to secondary replicas



In GFS , each mutation is performed at all the chunk’s replicas. Leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. So in this regard the control flow is similar to how HDFS works. Where GFS is different is how the data flows to the replicas.




Unlike HDFS , the client simultaneously pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, GFS improves performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. By decoupling the flow of data from the flow of control GFS uses the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunk servers in a pipelined fashion.