Wednesday, May 28, 2014

Google's Big Data Systems - Megastore vs Spanner


Megastore and Spanner both blend the traditional RDBMS and NoSQL worlds by providing a highly available, scalable, and globally distributed system. Both are used primarily for online interactive services. This article highlights a few notable features of each, along with the key differences.

Megastore

Let's first start with Megastore, which is used extensively within Google by several hundred applications. High availability is achieved through a synchronous, fault-tolerant log replicator optimized for cross-datacenter writes. Scalability is achieved by partitioning the data into many small databases, each with its own NoSQL data store and its own replicated log stored per replica.


Deployment
To scale throughput, Megastore partitions the data into a collection of entity groups, each of which is independently and synchronously replicated over a wide area. Each entity group contains entities, which are abstractions for rows in traditional database tables.

Megastore supports single-phase ACID transactions within an entity group, and two-phase commit for atomic updates across entity groups; asynchronous messaging using queues is also available for communication between entity groups. A transaction writes its mutations into the entity group's write-ahead log, and the mutations are then applied to the data.
Megastore provides a single, consistent view of the data stored in its underlying replicas. Reads and writes can be initiated from any replica, and ACID semantics are preserved regardless of which replica a client starts from. Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas.
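A rough sketch of that write path (hypothetical structures, not Megastore internals): a transaction appends all of its mutations to the entity group's write-ahead log as one atomic record, and only then are the mutations applied to the data.

```python
# Hypothetical sketch of an entity group's write-ahead log: a
# transaction appends all of its mutations as a single log record,
# then the mutations are applied to the data store. In Megastore the
# log entry itself is agreed on via Paxos replication.

class EntityGroup:
    def __init__(self):
        self.log = []    # replicated write-ahead log (one entry per txn)
        self.store = {}  # the entity group's data

    def commit(self, mutations):
        # Step 1: the whole transaction becomes one atomic log entry.
        self.log.append(dict(mutations))
        # Step 2: apply the mutations to the data.
        self.store.update(mutations)

g = EntityGroup()
g.commit({"User/42.name": "Ada"})
# One log entry; the store now reflects the mutation.
```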

Data Model
Megastore provides a semi-relational data model that blends the tuples of an RDBMS with the column storage of NoSQL – it uses Bigtable as its storage subsystem. Each schema has a set of tables, each containing a set of entities, which in turn contain a set of properties. Properties are named and typed values. A sequence of properties forms the primary key of the entity. Keys are chosen to cluster entities that will be read together.

Storage and Layout
Megastore uses Bigtable as its scalable datastore while adding richer primitives such as ACID transactions, indexes, and queues. Each entity is mapped into a single Bigtable row; the primary key values are concatenated to form the Bigtable row key, and each remaining property occupies its own Bigtable column. The column name is a concatenation of the Megastore table name and the property name, allowing entities from different Megastore tables to be mapped into the same Bigtable row. Bigtable provides the ability to store multiple values in the same row/column pair with different timestamps, which is used by Megastore to implement multiversion concurrency control (MVCC).
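As a rough illustration of this layout (the function and names below are hypothetical, not Megastore's actual API), an entity's primary-key values can be concatenated into a Bigtable row key, with every remaining property stored under a column named `table.property`:

```python
# Hypothetical sketch of mapping a Megastore entity onto a Bigtable
# row: primary-key values form the row key, and each remaining
# property becomes a column named "<table>.<property>", so entities
# from different Megastore tables can share one Bigtable row.

def entity_to_bigtable_row(table, key_props, entity):
    # Row key: primary-key values joined in schema order.
    row_key = "/".join(str(entity[p]) for p in key_props)
    # Columns: all non-key properties, prefixed with the table name.
    columns = {
        f"{table}.{name}": value
        for name, value in entity.items()
        if name not in key_props
    }
    return row_key, columns

row_key, cols = entity_to_bigtable_row(
    "User", ["user_id"], {"user_id": 42, "name": "Ada", "country": "UK"}
)
# row_key == "42"; cols == {"User.name": "Ada", "User.country": "UK"}
```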

Applications control latency and throughput through the placement of data: the selection of Bigtable instances and the specification of locality within an instance. To minimize latency, applications keep data near users and replicas near each other. Each entity group is assigned to the region or continent from which it is accessed most; within that region, the application can assign a triplet or quintuplet of replicas to datacenters with isolated failure domains. To maximize throughput, the data for an entity group is held in contiguous ranges of Bigtable rows. The schema language lets applications control the placement of hierarchical data, storing data that is accessed together in nearby rows or denormalizing it into the same row.



Paxos
The Paxos algorithm is a way to reach consensus among a group of replicas on a single value. Databases typically use Paxos to replicate a transaction log, where a separate instance of Paxos is used for each position in the log. New values are written to the log at the position following the last chosen position.

The Paxos algorithm requires multiple rounds of communication, which is costly over high-latency wide-area networks. Writes usually require at least two round trips before consensus is achieved: a round of prepares followed by a round of accepts. Reads require at least one round of prepares to determine the last chosen value. Real-world systems built on Paxos reduce the number of round trips required to make it a practical algorithm.
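The two round trips can be seen in a toy single-decree Paxos over in-memory acceptors (an illustrative sketch under simplifying assumptions, not a production protocol implementation):

```python
# Toy single-decree Paxos: a prepare round, then an accept round.
# A value is chosen once a quorum of acceptors accepts it.

class Acceptor:
    def __init__(self):
        self.promised = 0          # highest ballot promised
        self.accepted = (0, None)  # (ballot, value) last accepted

    def prepare(self, ballot):
        if ballot > self.promised:
            self.promised = ballot
            return True, self.accepted
        return False, self.accepted

    def accept(self, ballot, value):
        if ballot >= self.promised:
            self.promised = ballot
            self.accepted = (ballot, value)
            return True
        return False

def propose(acceptors, ballot, value):
    quorum = len(acceptors) // 2 + 1
    # Round 1: prepare. Learn any value a quorum may already have chosen.
    replies = [a.prepare(ballot) for a in acceptors]
    promises = [acc for ok, acc in replies if ok]
    if len(promises) < quorum:
        return None
    prior = max(promises, key=lambda bv: bv[0])
    if prior[1] is not None:
        value = prior[1]           # must re-propose the earlier value
    # Round 2: accept.
    acks = sum(a.accept(ballot, value) for a in acceptors)
    return value if acks >= quorum else None

acceptors = [Acceptor() for _ in range(5)]
chosen = propose(acceptors, ballot=1, value="log-entry-1")
# chosen == "log-entry-1"
```

A later proposer with a higher ballot learns the previously accepted value during its prepare round and must re-propose it, which is what makes the chosen value stable.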

Megastore uses Paxos to replicate primary user data across datacenters on every write. To minimize latency, many distributed systems use a dedicated master to which all reads and writes are directed, but reliance on a master limits flexibility for reading and writing. Megastore, unlike dedicated-master systems, allows current reads against any replica, giving it better utilization and lower latencies.


Spanner

Let's turn our attention to Spanner, which combines a familiar, easy-to-use, semi-relational interface, transactions, and an SQL-based query language with scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution.

Spanner has evolved from a Bigtable-like versioned key-value store into a temporal multi-version database. Data is stored in schematized semi-relational tables; data is versioned, and each version is automatically timestamped with its commit time. Spanner supports general-purpose transactions, and provides an SQL-based query language. Unlike Bigtable, Spanner assigns timestamps to data, which is an important way in which Spanner is more like a multi-version database than a key-value store.

Similar to Megastore, the replication configuration for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage.

Deployment
A Spanner deployment is called a universe, organized as a set of zones. Zones are the unit of administrative deployment; they can be added or removed as the corresponding datacenters are brought into or taken out of service.

Fig 3 : Spanner Server Configuration


A zone has one zonemaster and several thousand spanservers. Intermediate location proxies are used by clients to locate the spanservers assigned to serve their data. The universe master and the placement driver are singletons: the universe master provides universe manageability and debuggability, while the placement driver handles automated movement of data across zones. Each spanserver in turn manages hundreds to a thousand instances of a data structure called a tablet. A tablet is similar to Bigtable's tablet and implements a bag of mappings of the form:

(key:string, timestamp:int64) → string
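The mapping above can be modeled as a versioned map. The sketch below (hypothetical, not Spanner code) stores multiple timestamped versions per key and serves a read as of a given timestamp by returning the newest version at or below it:

```python
import bisect

# Hypothetical sketch of a Spanner-style tablet: a bag of
# (key, timestamp) -> value mappings, read as-of a timestamp.

class Tablet:
    def __init__(self):
        self.versions = {}  # key -> sorted list of (timestamp, value)

    def write(self, key, timestamp, value):
        bisect.insort(self.versions.setdefault(key, []), (timestamp, value))

    def read(self, key, timestamp):
        # Newest value with commit timestamp <= `timestamp`, else None.
        rows = self.versions.get(key, [])
        i = bisect.bisect_right(rows, (timestamp, chr(0x10FFFF)))
        return rows[i - 1][1] if i else None

t = Tablet()
t.write("user/42", 10, "v1")
t.write("user/42", 20, "v2")
# read at 15 sees "v1"; read at 25 sees "v2"; read at 5 sees nothing
```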

Each spanserver implements a single Paxos state machine on top of each tablet. The Paxos state machines are used for replication. The key-value mapping state of each replica is stored in its corresponding tablet. Writes must initiate the Paxos protocol at the leader; reads access state directly from the underlying tablet at any replica that is sufficiently up-to-date. The set of replicas is collectively a Paxos group.


At every replica that is a leader, each spanserver implements:

1)       A lock table for concurrency control. The lock table contains the state for two-phase locking: it maps ranges of keys to lock states.

2)       A transaction manager to support distributed transactions. The transaction manager has a participant leader, while the other replicas are participant slaves. If a transaction involves only one Paxos group, it bypasses the transaction manager; if it involves more than one Paxos group, those groups' leaders coordinate to perform two-phase commit, with one of the participant groups chosen as the coordinator.
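That routing rule can be sketched as follows (hypothetical names, not Spanner's implementation): a single-group transaction commits through its own Paxos log, while a multi-group transaction runs two-phase commit with one participant group acting as coordinator.

```python
# Hypothetical sketch: single-group transactions bypass the
# transaction manager; multi-group transactions run two-phase commit,
# with one participant group chosen as the coordinator.

class Group:
    """A stand-in for a Paxos group's participant leader."""
    def __init__(self):
        self.state = "idle"

    def paxos_commit(self, mutations):   # single-group fast path
        self.state = "committed"
        return True

    def prepare(self, mutations):        # phase 1: acquire locks, log PREPARE
        self.state = "prepared"
        return True

    def commit_prepared(self):           # phase 2: log COMMIT and apply
        self.state = "committed"

    def abort(self):
        self.state = "aborted"

def commit(groups, mutations):
    if len(groups) == 1:
        # One Paxos group: commit through its own replicated log.
        return groups[0].paxos_commit(mutations)
    # Several groups: the first participant acts as coordinator.
    everyone = groups
    if not all(g.prepare(mutations) for g in everyone):
        for g in everyone:
            g.abort()
        return False
    for g in everyone:
        g.commit_prepared()
    return True

groups = [Group(), Group()]
ok = commit(groups, {"k": "v"})
# ok is True and every group ends in the "committed" state
```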


Figure 4: Spanserver Software Stack


Data Model
Applications create one or more databases in a universe. Each database can contain an unlimited number of tables. Tables look like relational-database tables, with rows, columns, and versioned values that can be queried through an SQL-like language.

Spanner’s data model is not purely relational. Every table is required to have one or more primary-key columns. The primary keys form the name for a row, and each table defines a mapping from the primary-key columns to the non-primary-key columns.

Storage & Layout
Spanner organizes data into directories, the unit of data placement. All data in a directory has the same replication configuration. A Paxos group may contain multiple directories, which implies that a Spanner tablet is different from a Bigtable tablet: the latter is a lexicographically contiguous partition of the row space. A directory is also the smallest unit whose placement can be configured by an application.

Time Synchronization
A unique capability of Spanner is its ingenious implementation of clock synchronization, exposed through the TrueTime API. This enables it to assign globally meaningful commit timestamps to distributed transactions. The timestamps reflect serialization order with external consistency, also known as linearizability.
The TrueTime API is implemented by a set of time master servers per datacenter and a timeslave daemon per machine. The majority of masters have GPS receivers with dedicated antennas; the remaining masters have atomic clocks. The masters not only compare their time references against each other, but also cross-check the rate at which each reference advances time against the master's own local clock.
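TrueTime exposes time as an explicit uncertainty interval rather than a single value. A simplified sketch (hypothetical class names and an assumed uncertainty bound, not the real API) of the interval and the resulting "commit wait" rule:

```python
import time
from dataclasses import dataclass

# Hypothetical sketch of TrueTime's interval-based clock: now()
# returns an interval [earliest, latest] assumed to contain the true
# absolute time, with epsilon modeling the clock uncertainty bound.

@dataclass
class TTInterval:
    earliest: float
    latest: float

class TrueTime:
    def __init__(self, epsilon=0.004):  # e.g. 4 ms of uncertainty
        self.epsilon = epsilon

    def now(self):
        t = time.time()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        # True only when t has definitely passed on every clock.
        return self.now().earliest > t

tt = TrueTime()
# Commit wait: pick commit timestamp s = now().latest, then delay
# until after(s) holds before making the write visible, so s is
# guaranteed to be in the past for all observers (external consistency).
s = tt.now().latest
while not tt.after(s):
    time.sleep(0.001)
```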

With the above background, let's enumerate the similarities and differences between the two systems.

Similarities


Feature           | Spanner                | Megastore
Multi Data Center | Yes                    | Yes
Consistency       | Strong                 | Strong
Replication       | Application controlled | Application controlled
State management  | Paxos                  | Paxos
Transactions      | MVCC                   | MVCC



Differences


Feature                          | Spanner                         | Megastore
Storage                          | Custom                          | Bigtable
Linearizability                  | Yes                             | No
Globally Consistent Transactions | Yes                             | No
Schema                           | Complex                         | Semi-relational
Query Interface                  | SQL-like                        | Unknown
Time Synchronization             | TrueTime API                    | N/A
Deployment                       | Universe; zones and spanservers | Entity groups; Bigtable
Data Storage                     | Directories/placements          | Entity groups
Long-Lived Transactions          | Yes                             | No
Performance                      | High                            | Relatively lower; throughput drops if there are several writes per Paxos group
Applications                     | F1/AdWords                      | ~300 apps, including Gmail, Picasa, Calendar, Android Market and AppEngine