Megastore and Spanner both blend the traditional RDBMS and NoSQL worlds by providing a highly available, scalable, and globally distributed system. Both are used primarily for online interactive services. This article highlights a few notable features of each, along with the key differences.
Megastore
Let's start with Megastore, which is used extensively within Google by several hundred applications. High availability is achieved through a synchronous, fault-tolerant log replicator optimized for cross-datacenter writes. Scalability is achieved by partitioning the data into many small databases, each with its own NoSQL datastore and a replicated log stored per replica.
Deployment
To scale throughput, Megastore partitions the data into a collection of entity groups, which are independently and synchronously replicated over a wide area. Each entity group contains entities, which are abstractions for rows in traditional database tables.
Megastore supports single-phase ACID transactions within an entity group, and atomic updates across entity groups through two-phase commit or asynchronous messaging using queues. A transaction writes its mutations into the entity group's write-ahead log, and the mutations are then applied to the data.
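As a rough illustration of that commit flow, here is a minimal Python sketch; the class and method names are hypothetical, and in the real system the log append is replicated via Paxos rather than kept in local memory.

# Illustrative sketch of a single-entity-group commit: mutations are first
# appended to the group's write-ahead log, then applied to the data store.
class EntityGroup:
    def __init__(self):
        self.write_ahead_log = []   # ordered list of committed log entries
        self.data = {}              # entity key -> property map

    def commit(self, mutations):
        """Commit a transaction: log first, then apply."""
        log_position = len(self.write_ahead_log)
        self.write_ahead_log.append(mutations)      # durable, replicated append in practice
        for key, properties in mutations.items():   # apply mutations to the data
            self.data[key] = properties
        return log_position


group = EntityGroup()
pos = group.commit({("User", 42): {"name": "alice"}})
print("committed at log position", pos, group.data)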
Megastore provides a single, consistent view of the data stored in its underlying replicas. Reads and writes can be initiated from any replica, and ACID semantics are preserved regardless of which replica a client starts from. Replication is done per entity group by synchronously replicating the group's transaction log to a quorum of replicas.
Data Model
Megastore provides a semi-relational data model that combines the abstract tuples of an RDBMS with the column storage of NoSQL; it uses Bigtable as its storage subsystem. Each schema has a set of tables, each containing a set of entities, which in turn contain a set of properties. Properties are named and typed values. A sequence of properties forms the primary key of an entity, and keys are chosen to cluster entities that will be read together.
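A small sketch of this model, with illustrative names only: a table declares which sequence of properties forms the primary key, and a shared key prefix (such as a user id) clusters entities that are read together.

# Hypothetical sketch of the data model described above: tables of entities,
# each entity a set of named properties, with the primary key formed from a
# designated sequence of properties.
from dataclasses import dataclass

@dataclass
class Table:
    name: str
    key_properties: tuple      # ordered property names forming the primary key

    def primary_key(self, entity: dict) -> tuple:
        # A shared key prefix (e.g. user_id) clusters entities read together.
        return tuple(entity[p] for p in self.key_properties)


photo = Table("Photo", key_properties=("user_id", "photo_id"))
entity = {"user_id": 7, "photo_id": 101, "tag": "vacation"}
print(photo.primary_key(entity))   # (7, 101)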
Storage and Layout
Megastore uses Bigtable as its scalable datastore while adding richer
primitives such as ACID transactions, indexes, and queues. Each entity is
mapped into a single Bigtable row; the primary key values are concatenated to
form the Bigtable row key, and each remaining property occupies its own
Bigtable column. The column name is a concatenation of the Megastore table name
and the property name, allowing entities from different Megastore tables to be
mapped into the same Bigtable row. Bigtable provides the ability to store
multiple values in the same row/column pair with different timestamps, which is
used by Megastore to implement multiversion concurrency control (MVCC).
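The mapping can be sketched as follows; the function and the row-key encoding are illustrative, not Megastore's exact encoding.

# Sketch of the entity-to-Bigtable layout described above: the Bigtable row key
# is the concatenation of primary-key values, each remaining property becomes a
# column named "<table>.<property>", and each cell keeps timestamped versions,
# which is what enables MVCC.
def to_bigtable_cells(table_name, key_properties, entity, commit_timestamp):
    row_key = "/".join(str(entity[p]) for p in key_properties)
    cells = {}
    for prop, value in entity.items():
        if prop in key_properties:
            continue                       # key values live only in the row key
        column = f"{table_name}.{prop}"    # table name prefixed onto the property
        cells[(row_key, column, commit_timestamp)] = value
    return cells


print(to_bigtable_cells("Photo", ("user_id", "photo_id"),
                        {"user_id": 7, "photo_id": 101, "tag": "vacation"},
                        commit_timestamp=1700000000))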
Applications control latency and throughput through the placement of data: the selection of Bigtable instances and the specification of locality within an instance. To minimize latency, applications keep data near users and replicas near each other. Each entity group is assigned to the region or continent from which it is accessed most; within that region, a triplet or quintuplet of replicas is assigned to datacenters with isolated failure domains.
To maximize throughput, the data for an entity group are held in contiguous
ranges of Bigtable rows. The schema language lets applications control the
placement of hierarchical data, storing data that is accessed together in
nearby rows or denormalized into the same row.
Paxos
The Paxos algorithm is a way to reach consensus among a group of
replicas on a single value. Databases typically use Paxos to replicate a
transaction log, where a separate instance of Paxos is used for each position
in the log. New values are written to the log at the position following the
last chosen position.
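As a toy illustration of this log-per-position structure, the sketch below proposes each new value at the position after the last chosen one; the paxos_decide stub stands in for the real prepare/accept rounds among replicas, and all names are illustrative.

def paxos_decide(position, proposed_value):
    # Placeholder for a full Paxos instance (prepare + accept rounds).
    return proposed_value

class ReplicatedLog:
    def __init__(self):
        self.chosen = []                      # values chosen so far, by position

    def append(self, value):
        position = len(self.chosen)           # position after the last chosen one
        decided = paxos_decide(position, value)
        self.chosen.append(decided)
        return position


log = ReplicatedLog()
print(log.append("txn-1"), log.append("txn-2"))   # 0 1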
The Paxos algorithm demands multiple rounds of communication, which incurs high latency over wide-area networks. Writes usually require at least two round trips before consensus is achieved: a round of prepares followed by a round of accepts. Reads require at least one round of prepares to determine the last chosen value. Real-world systems built on Paxos reduce the number of round trips required to make it a practical algorithm.
Megastore uses Paxos to replicate primary user data across datacenters on every write. To minimize latency, many distributed systems use a dedicated master to which all reads and writes are directed, but reliance on a master limits flexibility for reading and writing. Megastore, unlike dedicated-master systems, allows current reads against any replica, which gives it better utilization and lower latencies.
Spanner
Let's turn our attention to Spanner, which combines the familiar, easy-to-use, semi-relational interface, transactions, and an SQL-based query language with scalability, automatic sharding, fault tolerance, consistent replication, external consistency, and wide-area distribution.
Spanner has evolved from a Bigtable-like versioned key-value store into
a temporal multi-version database. Data is stored in schematized
semi-relational tables; data is versioned, and each version is automatically
timestamped with its commit time. Spanner supports general-purpose
transactions, and provides an SQL-based query language. Unlike Bigtable,
Spanner assigns timestamps to data, which is an important way in which Spanner
is more like a multi-version database than a key-value store.
Similar to Megastore, the replication configuration for data can be dynamically controlled at a fine grain by applications. Applications can specify constraints to control which datacenters contain which data, how far data is from its users (to control read latency), how far replicas are from each other (to control write latency), and how many replicas are maintained (to control durability, availability, and read performance). Data can also be dynamically and transparently moved between datacenters by the system to balance resource usage.
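To make this concrete, here is an illustrative sketch of the kind of placement constraints an application might express; the field names are invented for the example and are not Spanner's actual configuration options.

placement_policy = {
    "num_replicas": 3,                      # durability, availability, read performance
    "allowed_datacenters": ["us-east", "us-central", "eu-west"],
    "max_user_read_latency_ms": 50,         # keep data close to its users
    "max_inter_replica_latency_ms": 100,    # keep replicas close to each other
}

def satisfies(policy, candidate_datacenters):
    """Toy check of a candidate replica set against the policy."""
    return (len(candidate_datacenters) == policy["num_replicas"]
            and all(dc in policy["allowed_datacenters"] for dc in candidate_datacenters))


print(satisfies(placement_policy, ["us-east", "us-central", "eu-west"]))   # True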
Deployment
A Spanner deployment is called a universe. It is organized as a set of zones. Zones are the unit of administrative deployment; they can be added or removed as the corresponding datacenters are brought into or taken out of service.
Fig 3: Spanner Server Configuration
A zone has one zonemaster and several thousand spanservers. Clients use per-zone location proxies to locate the spanservers assigned to serve their data. The universe master and the placement driver are singletons: the universe master provides universe manageability and debuggability, while the placement driver handles the automated movement of data across zones. Each spanserver, in turn, is responsible for hundreds to a thousand instances of a data structure called a tablet. A tablet is similar to Bigtable's tablet and implements a bag of mappings of the form:
(key:string, timestamp:int64) → string
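A minimal Python sketch of such a timestamped mapping follows; the Tablet class and its methods are illustrative, not Spanner's implementation. A read at timestamp t returns the latest version written at or before t.

class Tablet:
    def __init__(self):
        self.cells = {}                      # key -> list of (timestamp, value)

    def write(self, key, timestamp, value):
        self.cells.setdefault(key, []).append((timestamp, value))
        self.cells[key].sort()               # keep versions ordered by timestamp

    def read(self, key, timestamp):
        versions = [v for ts, v in self.cells.get(key, []) if ts <= timestamp]
        return versions[-1] if versions else None


t = Tablet()
t.write("users/7", 100, "alice")
t.write("users/7", 200, "alice-v2")
print(t.read("users/7", 150))   # "alice"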
Each spanserver implements a single Paxos state machine on top of each
tablet. The Paxos state machines are used for replication. The key-value
mapping state of each replica is stored in its corresponding tablet. Writes
must initiate the Paxos protocol at the leader; reads access state directly
from the underlying tablet at any replica that is sufficiently up-to-date. The
set of replicas is collectively a Paxos group.
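A minimal sketch of that read/write split, assuming a hypothetical Replica class (these names are not Spanner's): writes must initiate at the leader, and a read at timestamp t succeeds only on a replica that has applied state up to at least t.

class Replica:
    def __init__(self, is_leader=False):
        self.is_leader = is_leader
        self.applied_up_to = 0          # highest commit timestamp applied locally
        self.state = {}                 # key -> (timestamp, value)

    def write(self, key, value, commit_ts):
        if not self.is_leader:
            raise RuntimeError("writes must initiate Paxos at the leader")
        # The real leader would replicate the write through Paxos before applying it.
        self.state[key] = (commit_ts, value)
        self.applied_up_to = commit_ts

    def read(self, key, read_ts):
        if self.applied_up_to < read_ts:
            raise RuntimeError("replica is not sufficiently up-to-date for this read")
        return self.state.get(key)


leader = Replica(is_leader=True)
leader.write("users/7", "alice", commit_ts=100)
print(leader.read("users/7", read_ts=100))   # (100, 'alice')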
At every replica that is a leader, each spanserver additionally implements:
1) A lock table for concurrency control. The lock table contains the state for two-phase locking: it maps ranges of keys to lock states.
2) A transaction manager to support distributed transactions. The transaction manager has a participant leader, while the other replicas are participant slaves. If a transaction involves only one Paxos group, it bypasses the transaction manager; if a transaction involves more than one Paxos group, those groups' leaders coordinate to perform two-phase commit, with one of the participant groups chosen as the coordinator (see the sketch below).
Figure 4: Spanserver Software Stack
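The routing decision described in item 2 can be sketched as follows; the group objects and their paxos_commit/prepare/apply_commit methods are hypothetical stand-ins, not Spanner's actual interfaces.

def commit(groups, mutations):
    # A transaction touching a single Paxos group commits directly through that
    # group's Paxos log and never involves the transaction manager.
    if len(groups) == 1:
        groups[0].paxos_commit(mutations)
        return True

    # Otherwise one participant group acts as coordinator and the groups run
    # two-phase commit.
    coordinator, participants = groups[0], groups[1:]
    if not all(g.prepare(mutations) for g in participants):    # phase 1: prepare
        for g in participants:
            g.abort()
        return False
    coordinator.paxos_commit(mutations)                        # phase 2: commit decision
    for g in participants:
        g.apply_commit(mutations)
    return True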
Data Model
Applications create one or more databases in a universe. Each database can contain an unbounded number of tables. Tables look like relational-database tables, with rows, columns, and versioned values that can be queried through an SQL-like language.
Spanner’s data model is not purely relational. Every table is required
to have one or more primary-key columns. The primary keys form the name for a
row, and each table defines a mapping from the primary-key columns to the
non-primary-key columns.
Storage and Layout
Spanner organizes data into directories, the unit of data placement. All data in a directory has the same replication configuration. A Paxos group may contain multiple directories, which implies that a Spanner tablet is different from a Bigtable tablet: the latter is a lexicographically contiguous partition of the row space. A directory is also the smallest unit whose placement can be configured by an application.
Time Synchronization
A unique capability of Spanner is that it has an ingenious
implementation for clock synchronization with a strong TrueTime API. This
enables it to support globally commit timestamps for distributed transactions.
The timestamps reflect serialization order with external consistency, in brief
also called linearizability.
The TrueTime API is implemented by a set of time master servers per datacenter and a timeslave daemon per node. The majority of masters have GPS receivers with dedicated antennas; the remaining masters have atomic clocks. The masters not only compare their time references against each other, but each also cross-checks the rate at which its reference advances time against its own local clock.
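The sketch below mirrors the shape of an interval-based clock in the spirit of TrueTime: now() returns an interval intended to bound the true time, and commit_wait() holds a commit timestamp back until it is definitely in the past, which is what yields externally consistent ordering. The uncertainty constant is an arbitrary placeholder, not Spanner's measured bound.

import time

UNCERTAINTY_S = 0.007   # hypothetical +/- 7 ms bound; the real bound comes from the masters

def tt_now():
    # Returns (earliest, latest): an interval intended to contain the true time.
    t = time.time()
    return (t - UNCERTAINTY_S, t + UNCERTAINTY_S)

def tt_after(ts):
    # True only if ts has definitely passed, even under clock uncertainty.
    earliest, _ = tt_now()
    return ts < earliest

def commit_wait(commit_ts):
    # Hold back a commit until its timestamp is definitely in the past, so that
    # any later transaction observes a strictly larger timestamp.
    while not tt_after(commit_ts):
        time.sleep(UNCERTAINTY_S / 2)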
With the above background, let's enumerate the similarities and differences between the two systems.
Similarities
                  | Spanner                | Megastore
Multi Data Center | Yes                    | Yes
Consistency       | Strong                 | Strong
Replication       | Application controlled | Application controlled
State management  | Paxos                  | Paxos
Transactions      | MVCC                   | MVCC
Differences
                                 | Spanner                         | Megastore
Storage                          | Custom                          | Bigtable
Linearizability                  | Yes                             | No
Globally Consistent Transactions | Yes                             | No
Schema                           | Complex                         | Semi-relational
Query Interface                  | SQL-like                        | Unknown
Time Synchronization             | TrueTime API                    | N/A
Deployment                       | Universe; zones and spanservers | Entity groups; Bigtable
Data Storage                     | Directories/placements          | Entity groups
Long-Lived Transactions          | Yes                             | No
Performance                      | High                            | Relatively lower; throughput drops if there are several writes per Paxos group
Applications                     | F1/AdWords                      | ~300 apps, e.g. Gmail, Picasa, Calendar, Android Market, and AppEngine