Wednesday, May 28, 2014

Google's Big Data Systems - Megastore vs Spanner


Megastore and Spanner both blend the traditional RDBMS and NoSQL worlds by providing highly available, scalable and a globally distributed system. Both are used primarily for online interactive services. This article highlights few notable features in each along with key differences.

Megastore

Lets first start with Megastore, which is used extensively within Google by several hundred applications. High availability is achieved through a synchronous, fault-tolerant log replicator optimized for cross data-centers writes. Scalability is achieved by partitioning the data into small databases each with its own NoSQL data store and replicated log stored in per replica.


Deployment
To scale throughput, it partitions the data into a collection of entity groups, which independently and synchronously are replicated over a wide area. Each entity group has entities, which are abstractions for rows in a traditional database tables.

Megastore supports single single-phase ACID transactions within an entity group while two-phase commit for atomic updates across entity groups through asynchronous messaging using queues A transaction writes its mutations into the entity group’s write-ahead log, and then the mutations are applied to the data.
It provides a single, consistent view of the data stored in its underlying replicas. Reads and writes can be initiated from any replica, and ACID semantics are preserved regardless of what replica a client starts from. Replication is done per entity group by synchronously replicating the group’s transaction log to quorum of replicas.

Data Model
Megastore provides a semi-relational data model that abstract tuples of an RDBMS with column storage of NoSQL – it uses Bigtable as its storage subsystem. Each schema has a set of tables, each containing a set of entities, which in turn contain a set of properties. Properties are named and typed values. A sequence of properties forms the primary key of the entity. Keys are chosen to cluster entities that will be read together.

Storage and Layout
Megastore uses Bigtable as its scalable datastore while adding richer primitives such as ACID transactions, indexes, and queues. Each entity is mapped into a single Bigtable row; the primary key values are concatenated to form the Bigtable row key, and each remaining property occupies its own Bigtable column. The column name is a concatenation of the Megastore table name and the property name, allowing entities from different Megastore tables to be mapped into the same Bigtable row. Bigtable provides the ability to store multiple values in the same row/column pair with different timestamps, which is used by Megastore to implement multiversion concurrency control (MVCC).

Applications control the latency & throughput by placement of data through the selection of Bigtable instances and specification of locality within an instance. To minimize latency, applications keep data near users and replicas near each other. Entity groups are assigned close to the region or continent from which it is accessed most. Within that region it can assign a triplet or quintuplet of replicas to datacenters with isolated failure domains. To maximize throughput, the data for an entity group are held in contiguous ranges of Bigtable rows. The schema language lets applications control the placement of hierarchical data, storing data that is accessed together in nearby rows or denormalized into the same row.



Paxos
The Paxos algorithm is a way to reach consensus among a group of replicas on a single value. Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log. New values are written to the log at the position following the last chosen position.

The Paxos algorithm demands multiple rounds of communication with high network latency. Writes usually require at least two round trips before consensus is achieved: a round of prepares with a subsequent round of accepts. Reads require at least one round of prepares to determine the last chosen value. Real world systems built on Paxos reduce the number of round trips required to make it a practical algorithm.

Megastore uses Paxos to replicate primary user data across datacenters on every write. To minimize latency, many distributed systems use a dedicated master to which all reads and writes are directed. Reliance on a master limits flexibility for reading and writing. Megastore unlike dedicated master systems allows current reads against any replica. This gives it better utilization and low latencies.


Spanner

Lets turn out attention to Spanner which combines the familiar, easy-to-use, semi-relational interface, transactions, and an SQL-based query language with scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide area distribution.

Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time. Spanner supports general-purpose transactions, and provides an SQL-based query language. Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store.

Similar to Megastore, the replication configurations for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other, and how many replicas are maintained - to control durability, availability, and read performance. Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage across data centers.

Deployment
A Spanner deployment is called a universe. It is organized as a set of zones. Zones are the unit of administrative deployment which be added or removed tied to the corresponding datacenters.

Fig 3 : Spanner Server Configuration


A zone has one zonemaster with several thousand spanservers. The intermediate location proxies are used by clients to locate the spanservers assigned to serve their data. The universe master and the placement driver are singletons. Universe master provides universe manageability and debuggability while placement driver is for data replication. Each spanserver in turn has thousands of instances of tablet(s). A tablet is similar to Bigtable’s tablet.

(key:string, timestamp:int64) → string

Each spanserver implements a single Paxos state machine on top of each tablet. The Paxos state machines are used for replication. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group.


At every replica that is a leader, each spanserver implements a

1)       Lock table for concurrency control. The lock table contains the state for two-phase locking: it maps ranges of keys to lock states.

2)       A transaction manager to support distributed transactions. The transaction manager has a participant leader while the other replicas are participant slaves. If a transaction involves only one Paxos group it bypasses the transaction manager while if transaction involves more than one Paxos group, those groups’ leaders coordinate to perform two- phase commit. With one of the participant groups is chosen as the coordinator.


Figure 4: Spanserver Software Stack


Data Model
Applications create one or more databases in a universe. Each database can contain unlimited number tables. Tables look like relational-database tables, with rows, columns, and versioned values that can be queries through SQL like language.

Spanner’s data model is not purely relational. Every table is required to have one or more primary-key columns. The primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns.

Storage & Layout
Spanner organizes data into Directory (ies), which is a unit of data placement. All data in a directory has the same replication configuration. Paxos group may contain multiple directories implies that a Spanner tablet is different from the Bigtable tablet – the latter is lexicographically contiguous partition of the row space. Directory is also the smallest unit whose placement can be configured by an application.

Time Synchronization
A unique capability of Spanner is that it has an ingenious implementation for clock synchronization with a strong TrueTime API.  This enables it to support globally commit timestamps for distributed transactions. The timestamps reflect serialization order with external consistency, in brief also called linearizability.  
TrueTime API is implemented by a set of time master servers per datacenter and a timeslave daemon per node. The majority of masters have GPS receivers with dedicated antennas. The remaining masters have atomic clocks. The masters not only compare their time references against each other, but also crosscheck the rate at which its reference advances time against its own local clock.

With the above background lets enumerate the similarities and differences between the systems.

Similarities


Spanner
Megastore
Multi Data Center
Yes
Yes
Consistency
Strong
Strong
Replication
Application controlled
Application controlled
State management
Paxos
Paxos
Transactions
MVCC
MVCC



Differences


Spanner
Megastore
Storage
Custom
Bigtable
Linearizability
Yes
No
Globally Consistent Transactions
Yes
No
Schema
Complex

Semi-relational
Query Interface
SQL like
Unknown
Storage
Custom
BigTable
Time Synchronization
TrueTime API
N/A
Deployment
Universe; Zones and
Spanservers
Entity Groups; BigTable
Data Storage
Directories/Placements
Entity groups
Long Lived Transactions
Yes
No
Performance
High
Relatively lower – lower throughput if there are several writes per paxos group
Applications
F1/AdWords
300 - Gmail, Picasa, calendar, Android, Market and AppEngine
  


Friday, March 14, 2014

Compare & Contrast HDFS with GFS


Compare & Contrast HDFS with GFS


HDFS has been inspired from  the original work documented in Google File System (GFS) paper. The broader goals are Performance, Scalability, Reliability & Availability.The basic philosophy is that component failures are the norm rather than the exception. In a distributed system, there would be application bugs, operating system bugs, human errors, and the failures disks, memory, connectors, networking, and power supplies. Therefore, constant monitoring, error detection, fault tolerance, and automatic recovery must be integral to the file system.The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis. While most of the capabilities are similar, the goal of this article is to educate on the subtleties where they are different.

Some key guiding principals for bothsystems
  • Large data sets – usually Petabytes & Terabytes as opposed to conventional DBMS which are usually in GB ; a typical block size is 512MB as opposed to KB
  • Largely append only files which can be read sequentially - some subtleties & differences in GFS which we will cover later.
  • Data locality - to move the processing close to data.

Architecture
Both systems have very similar architecture . HDFS consists of a single master server called NameNode which stores file system metadata. Application data is stored on other servers called DataNodes. On the other hand, a GFS cluster consists of a single master and multiple chunk servers. Master includes the namespace, access control information, the mapping from files to data blocks (chunks in GFS), and the current locations of blocks. Master also also controls system–wide activities such as lease management, garbage collection of orphaned blocks, and migration between DataNodes (chunk servers) . The master periodically communicates with each node in heartBeat messages to give it instructions and collect its state.


The entire namespace metadata is kept in RAM. The inode data and the list of blocks belonging to each file comprise the metadata of the name system called the image. The persistent record of the image stored in the local host’s native files system is called a checkpoint. 

Master Replication

The NameNode in HDFS, in addition to its primary role serving client requests, can alternatively execute either of two other roles, either a CheckpointNode or a BackupNode. 

The CheckpointNode periodically combines the existing checkpoint and journal to create a new checkpoint and an empty journal. It downloads the current checkpoint and journal files from the NameNode, merges them locally, and returns the new checkpoint back to the NameNode.A BackupNode on the other hand is capable of creating periodic checkpoints, but in addition it maintains an in-memory, up-to-date image of the file system namespace that is always synchronized with the state of the NameNode.The checkpoint is in a compact B-tree like form that can be directly mapped into memory and used for namespace lookup . This further speeds up recovery and improves availability.

The BackupNode accepts the namespace transactions from the active NameNode, saves them to its own storage directories, and applies these transactions to its own namespace image in memory.  If the NameNode fails, the BackupNode’s image in memory and the checkpoint on disk is a record of the latest namespace state.Since it already has an up-to-date namespace image in its memory, the recovery time in the order of minutes. 

GFS on the other hand calls the BackupNode as Shadow masters, which provide read–only access to the file system even when the primary master is down. To keep itself informed, a shadow master reads a replica of the growing operation log and applies the same sequence of changes to its data structures exactly as the primary does. 

File System
Each of the FS consists of data blocks (chunks) . When there is a need for a new block, the NameNode allocates a block with a unique block ID and determines a list of DataNodes to host replicas of the block. Assuming treplication (3 replicas) ,  the DataNodes form a pipeline. Bytes are pushed to the pipeline as a sequence of packets.Each block replica on a DataNode is represented by two files in the local host’s native file system. The first file contains the data itself and the second file is block’s metadata including checksums for the block data and the block’s generation stamp.

Concurrent Readers 
Both HDFS & GFS clients that open a file for writing is granted a lease for the file; no other client can write to the file. The writing client periodically renews the lease by sending a heartbeat to the NameNode. When the file is closed, the lease is revoked.The lease duration is bound by a soft limit and a hard limit. Until the soft limit expires, the writer is certain of exclusive access to the file. If the soft limit expires and the client fails to close the file or renew the lease, another client can preempt the lease. If after the hard limit expires (one hour) and the client has failed to renew the lease. The writer's lease does not prevent other clients from reading the file; a file may have many concurrent readers.

Concurrent Writers

The key difference between the two systems is in the area of dealing with concurrent writers . HDFS is primarily an append only file system which means that it does not provide  the capability to have multiple writers to the file.Moreover, after the data is written to an HDFS file, HDFS does not provide any guarantee that data are visible to a new reader until the file is closed. If a user application needs the visibility guarantee, it can explicitly call the hflush operation. Then the current packet is immediately pushed to the pipeline, and the flush operation will wait until all DataNodes in the pipeline acknowledge the successful transmission of the packet. All data written before the hflush operation are then certain to be visible to readers.

GFS on the other hand provide ability to both mutate at a specific offset – similar to a traditional write – and an atomic append operation called record append. Concurrent writes to the same region are not serializable: the region may end up containing data fragments from multiple clients and the final result though consistent is non deterministic or undefined. In a record append, however, the client specifies only the data. GFS appends it to the file at least once atomically (i.e., as one continuous sequence of bytes) at an offset of GFS’s choosing and returns that offset to the client.

GFS has certain distinct states to deal with write/appends - the state of a file region after a data mutation depends on the type of mutation, whether it succeeds or fails, and whether there are concurrent mutations.
  1. A file region is considered consistent if all clients will always see the same data, regardless of which replicas they read from. 
  2. Further a region is considered defined if a mutation succeeds without interference from concurrent writers.
  3. Concurrent successful mutations leave the region undefined but consistent: all clients see the same data, but could consists of mingled fragments from multiple mutations. 
  4. A failed mutation makes the region inconsistent (hence also undefined): different clients may see different data at different times.



Most of the applications use record append then overwrite at an offset. Record append allows multiple clients to append data to the same file concurrently while guaranteeing the atomicity of each individual client’s append.


Pipelining
The other big difference between the two systems is the approach taken towards data replication.

When a new block is created, HDFS places the first replica on the node where the writer is located, the second and the third replicas on two different nodes in a different rack, and the rest are placed on random nodes with restrictions that no more than one replica is placed at one node and no more than two replicas are placed in the same rack.When a new block is being written and all target nodes are selected, nodes are organized as a pipeline in the order of their proximity to the first replica. Data are pushed to nodes in this order. For reading, the NameNode first checks if the client’s host is located in the cluster. Both the control and data flows from primary to secondary replicas



In GFS , each mutation is performed at all the chunk’s replicas. Leases to maintain a consistent mutation order across replicas. The master grants a chunk lease to one of the replicas. The primary picks a serial order for all mutations to the chunk. All replicas follow this order when applying mutations. Thus, the global mutation order is defined first by the lease grant order chosen by the master, and within a lease by the serial numbers assigned by the primary. So in this regard the control flow is similar to how HDFS works. Where GFS is different is how the data flows to the replicas.




Unlike HDFS , the client simultaneously pushes the data to all the replicas. A client can do so in any order. Each chunkserver will store the data in an internal LRU buffer cache until the data is used or aged out. By decoupling the data flow from the control flow, GFS improves performance by scheduling the expensive data flow based on the network topology regardless of which chunkserver is the primary. By decoupling the flow of data from the flow of control GFS uses the network efficiently. While control flows from the client to the primary and then to all secondaries, data is pushed linearly along a carefully picked chain of chunk servers in a pipelined fashion.