Spanner: Google's Globally-Distributed Database
James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford
Google, Inc.
Figures taken from the paper and Alex Lloyd's presentation at Berlin Buzzwords 2012

Agenda
Outline and Key Features
System Architecture
Software Stack
Data Model
TrueTime
Evaluation
Case Study

Outline
The next step after Bigtable on the path toward an RDBMS, with strong time semantics

Key Features
Temporal multi-version database
Externally consistent global write transactions with synchronous replication
Transactions across datacenters
Lock-free read-only transactions
Schematized, semi-relational (tabular) data model
SQL-like query interface
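
The "temporal multi-version" and "lock-free read-only" features listed above can be pictured with a small sketch. This is only an illustration, not Spanner's storage code: the class name `MultiVersionStore`, its methods, and the in-memory layout are all hypothetical. Each key keeps every version it was ever written with, tagged by timestamp, so a read at a past timestamp never has to block a writer.

```python
import bisect


class MultiVersionStore:
    """Toy multi-version map (hypothetical): every write adds a new version."""

    def __init__(self):
        # key -> (sorted list of timestamps, values in the same order)
        self._versions = {}

    def write(self, key, timestamp, value):
        """Record a new version of key at the given timestamp."""
        stamps, values = self._versions.setdefault(key, ([], []))
        idx = bisect.bisect_right(stamps, timestamp)
        stamps.insert(idx, timestamp)
        values.insert(idx, value)

    def read(self, key, timestamp):
        """Return the newest value written at or before timestamp, else None."""
        stamps, values = self._versions.get(key, ([], []))
        idx = bisect.bisect_right(stamps, timestamp)
        return values[idx - 1] if idx > 0 else None


store = MultiVersionStore()
store.write("user/1", 10, "alice")
store.write("user/1", 20, "alice-v2")
print(store.read("user/1", 15))  # alice
print(store.read("user/1", 25))  # alice-v2
print(store.read("user/1", 5))   # None
```

Because old versions are never overwritten, a reader that fixes its timestamp up front sees a consistent snapshot without taking locks.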

Key Features cont.
Auto-sharding, auto-rebalancing, and automatic failure response
Exposes control of data replication and placement to the user/application
Enables transaction serialization via global timestamps
Acknowledges clock uncertainty and guarantees a bound on it
Uses the novel TrueTime API for concurrency control
Uses GPS receivers and atomic clocks to get accurate time

System Architecture

Software Stack

Software Stack cont.
Tablet mapping: (key:string, timestamp:int64) → string
Back end: Colossus (successor to GFS)
A Paxos state machine on top of each tablet stores the tablet's metadata and log
One replica in each Paxos group is chosen as leader, and all write requests for that group are initiated at the leader
The transaction leader is the Paxos leader if the transaction involves only one Paxos group

Software Stack cont.
A directory is analogous to a bucket in Bigtable
Smallest unit of data placement
Smallest unit for defining replication properties
A directory may in turn be sharded into fragments if it grows too large

Data Model
A Spanner universe supports one or more databases
A database can contain an unlimited number of schematized tables
Not purely relational
Requires rows to have names
A name is simply a set (possibly a singleton) of primary-key columns
In effect, it is a key-value store that maps primary keys to non-key columns

TrueTime
Novel API behind Spanner's core innovation
Leverages hardware such as GPS receivers and atomic clocks
Implemented via the TrueTime API
The key method is now(), which returns not only the current time but also a value (ε) giving the maximum uncertainty in that time
A set of time master servers per datacenter and a time slave daemon per machine
The majority of time masters are fitted with GPS receivers; a few others are fitted with atomic clocks (Armageddon masters)
Each daemon polls a variety of masters and reaches a consensus on the correct timestamp

TrueTime Cont.
TrueTime uses both GPS and atomic clocks because they have different failure modes
Two other Boolean methods in the API:
  After(t) returns TRUE if t has definitely passed
  Before(t) returns TRUE if t has definitely not arrived
Spanner uses these methods for concurrency control and to serialize transactions
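
The three methods described above can be sketched as follows. This is a toy model under stated assumptions, not Google's implementation: it represents now() as an interval [t - ε, t + ε] (the paper's TT.now() returns such an interval), uses a fixed, made-up ε of 5 ms, and takes an injectable `clock` parameter purely so the example is deterministic. In the real system ε is derived from the time masters rather than set by hand.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    """Time interval guaranteed (by assumption) to contain the true time."""
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: a local clock plus an assumed fixed uncertainty bound."""

    def __init__(self, epsilon=0.005, clock=time.time):
        self.epsilon = epsilon  # assumed uncertainty in seconds (illustrative)
        self._clock = clock     # injectable clock, hypothetical, for testing

    def now(self):
        """Return an interval of width 2 * epsilon around the local clock."""
        t = self._clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t):
        """True only if t has definitely passed."""
        return t < self.now().earliest

    def before(self, t):
        """True only if t has definitely not arrived."""
        return t > self.now().latest


# Deterministic demo: freeze the local clock at 100.0 s.
tt = TrueTime(epsilon=0.005, clock=lambda: 100.0)
print(tt.after(99.99))    # True: 99.99 is older than the earliest bound
print(tt.after(100.0))    # False: 100.0 lies inside the uncertainty window
print(tt.before(100.0))   # False: 100.0 may already have arrived
print(tt.before(100.01))  # True: 100.01 is later than the latest bound
```

Note that for a timestamp inside the uncertainty window both after() and before() return False; callers that need a definite answer have to wait until the window has moved past the timestamp.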

TrueTime Cont.
After() is used for Paxos leader leases
A leader checks after(Smax) to confirm that Smax has passed before it abdicates
Paxos leaders cannot assign timestamps (Si) greater than Smax to transactions (Ti), and clients cannot see the data committed by transaction Ti until after(Si) is true
Replicas maintain a timestamp tsafe, the maximum timestamp at which the replica is up to date

Transactions
Read-write: requires locks
Read-only: lock-free
  Must be declared as read-only before the transaction starts
  Reads information that is up to date
Snapshot read: reads information from the past
  The user specifies either a specific past timestamp or a timestamp bound, and data as of that point is read

Evaluation
Evaluated for replication, transactions, and availability
Results on the ε (epsilon) of TrueTime
Benchmarked on a Spanner system with 50 Paxos groups and 2500 directories
Clients (applications) and zones are within a network distance of less than 1 ms

Evaluation - Availability

Evaluation - Epsilon

Case Study
Spanner is currently in production, used by Google's advertising backend, F1
F1 previously used MySQL because it requires strong transactional semantics, which made NoSQL database solutions impractical
Spanner provides synchronous replication and automatic failover for F1

Case Study cont.
Enabled F1 to specify data placement via Spanner directories, based on its needs
F1 operation latencies measured over 24 hours

Thank You