A large-scale distributed table

Paul Krzyzanowski

April 24, 2020

Goal: How can we build an ultra-high performance, low-latency storage service for large-scale structured and semi-structured data?


Traditional relational databases present a view of multiple tables, each containing rows and named columns. Queries, mostly performed in SQL (Structured Query Language) allow one to extract specific columns from a row where certain conditions are met (e.g., a column has a specific value). Moreover, one can perform queries across multiple tables (this is the “relational” part of a relational database). For example, a table of students may include a student’s name, ID number, and contact information. A table of grades may include a student’s ID number, course number, and grade. We can construct a query that extracts a grades by name by searching for the ID number in the student table and then matching that ID number in the grade table.

With traditional databases, we expect ACID guarantees: that transactions will be atomic, consistent, isolated, and durable. As we saw when we studied distributed transactions, it is not possible to guarantee consistency while providing high availability and network partition tolerance. We often choose to give up consistency in order to ensure high availability and partition tolerance. This makes ACID databases unattractive for highly distributed environments and led to the emergence of alternate data stores that are targeted to high availability and high performance.

Finally, most databases often do not do well with tables containing huge amounts of columns or fields that contain huge amounts of data. Adding additional fields to a column (changing the schema of the database) can be a time-consuming task.

Here, we will look at the structure and capabilities of Bigtable. It is not a relational database; it is just a table but it is designed to work on a huge scale.


Bigtable is a distributed storage system that is structured as a large table: one that may be petabytes in size and distributed among tens of thousands of machines. It is designed for storing items such as billions of URLs, with many versions per page; over 100 TB of satellite image data; hundreds of millions of users; and performing thousands of queries a second.

Bigtable was developed at Google in has been in use since 2005 in dozens of Google services. An open source version, HBase, was created by the Apache project on top of the Hadoop core. Apache Cassandra, first developed at Facebook to power their search engine, is similar to Bigtable with a tunable consistency model and no master (central server).

Bigtable is designed with semi-structured data storage in mind. It is a large map that is indexed by a row key, column key, and a timestamp. Each value within the map is an array of bytes that is interpreted by the application. Every read or write of data to a row is atomic, regardless of how many diferent columns are read or written within that row.

It is easy enough to picture a simple table. Let us look at a few characteristics of Bigtable:

A map is an associative array; a data structure that allows one to look up a value to a corresponding key quickly. Bigtable is a collection of (key, value) pairs where the key identifies a row and the value is the set of columns.
The data is stored persistently on disk.
Bigtable’s data is distributed among many independent machines. At Google, Bigtable is built on top of GFS (Google File System). The Apache open source version of Bigtable, HBase, is built on top of HDFS (Hadoop Distributed File System) or Amazon S3. The table is broken up among rows, with groups of adjacent rows managed by a server. A row itself is never distributed.
The table is sparse, meaning that different rows in a table may use vastly different columns (there could be millions), with many – or even most – of the columns empty for a particular row.
Most associative arrays are not sorted. A key is hashed to a position in a table. Bigtable sorts its data by keys. This helps keep related data close together, usually on the same machine — assuming that one structures keys in such a way that sorting brings the data together. For example, if domain names are used as keys in a Bigtable, it makes sense to store them in reverse order to ensure that related domains are close together. For example:
A table is indexed by rows. Each row contains one or more named column families. Column families are defined when the table is first created. Within a column family, one may have one or more named columns. All data within a column family is usually of the same type. The implementation of Bigtable usually compresses all the columns within a column family together. Columns within a column family can be created on the fly. Rows, column families and columns provide a three-level naming hierarchy to identify data. For example:
	"edu.rutgers.cs" : {    // row
		"users" : {    // column family
			"bjs" : "Bart",       // column
			"lsimpson" : "Lisa",  // column
			"homer" : "Homer"     // column
		"sysinfo" : {   // another column family
		    "os" : "Linux 3.12",  // column
		    "cpu" : "Xeon E5-2698"   // column

To get data from Bigtable, you need to provide a fully-qualified name in the form column-family:column. For example, users:homer or sysinfo:cpu

Time is another dimension in Bigtable data. Every column family may keep multiple versions of column family data. If an application does not specify a timestamp, it will retrieve the latest version of the column family. Alternatively, it can specify a timestamp and get the latest version that is earlier than or equal to that timestamp.

Columns and column families

Let’s look at a sample slice of a table that stores web pages (this example is from Google’s paper on Bigtable). The row key is the page URL. For example, com.cnn.www.

Figure 1. Bigtable column families and columns
Figure 1. Bigtable column families and columns

Various attributes of the page are stored in column families. A contents column family contains page contents (there are no columns within this column family). A language column family contains the language identifier for the page. Finally, an anchor column family contains the text of various anchors from other web pages. The column name is the URL of the page making the reference. These three column families underscore a few points.

A column may be a single short value, as seen in the language column family. This is our classic database view of columns. In Bigtable, however, there is no type associated with the column. It is just a bunch of bytes.

The data in a column family may also be large, as in the contents column family.

The anchor column family illustrates the extra hierarchy created by having columns within a column family. It also illustrates the fact that columns can be created dynamically (one for each external anchor), unlike column families.

Finally, it illustrates the sparse aspect of Bigtable. In this example, the list of columns within the anchor column family will likely vary tremendously for each URL.

In all, we may have a huge number (e.g., hundreds of thousands or millions) of columns but the column family for each row will have only a tiny fraction of them populated. While the number of column families will typically be small in a table (at most hundreds), the number of columns is unlimited.

Rows and partitioning

A table is logically split among rows into multiple subtables called tablets. A tablet is a set of consecutive rows of a table and is the unit of distribution and load balancing within Bigtable. Because the table is always sorted by row, reads of short ranges of rows are efficient: one typically communicates with a small number of machines. Hence, a key to ensuring a high degree of locality is to select row keys properly (as in the earlier example of using domain names in reverse order).


Each column family cell can contain multiple versions of content. For example, in the earlier example, we may have several timestamped versions of page contents associated with a URL. Each version is identified by a 64-bit timestamp that either represents real time or is a value assigned by the client. Reading column data retrieves the most recent version if no timestamp is specified or the latest version that is earlier than a specified timestamp.

A table is configured with per-column-family settings for garbage collection of old versions. A column family can be defined to keep only the latest n versions or to keep only the versions written since some time t.


Bigtable comprises a client library (linked with the user’s code), a master server that coordinates activity, and many tablet servers. Tablet servers can be added or removed dynamically.

The master assigns tablets to tablet servers and balances tablet server load. It is also responsible for garbage collection of files in GFS and managing schema changes (table and column family creation).

Each tablet server manages a set of tablets (typically 10–1,000 tablets per server). It handles read/write requests to the tablets it manages and splits tablets when a tablet gets too large. Client data does not move through the master; clients communicate directly with tablet servers for reads/writes. The internal file format for storing data is Google’s SSTable, which is a persistent, ordered, immutable map from keys to values.

Bigtable uses the Google File System (GFS) for storing both data files and logs. A cluster management system contains software for scheduling jobs, monitoring health, and dealing with failures.


Chubby is a highly available and persistent distributed lock service that manages leases for resources and stores configuration information. The service runs with five active replicas, one of which is elected as the master to serve requests. A majority must be running for the service to work. It uses a Paxos distributed consensus algorithm to keep the replicas consistently synchronized. Chubby provides a namespace of files & directories. Each file or directory can be used as a lock.

In Bigtable, Chubby is used to:

  • ensure there is only one active master
  • store the bootstrap location of Bigtable data
  • discover tablet servers
  • store Bigtable schema information
  • store access control lists

Startup and growth

Figure 2. Bigtable indexing hierarchy
Figure 2. Bigtable indexing hierarchy

A table starts off with just one tablet. As the table grows, it is split into multiple tablets. By default, a table is split at around 100 to 200 MB.

Locating rows within a Bigtable is managed in a three-level hierarchy. The root (top-level) tablet stores the location of all Metadata tablets in a special Metadata tablet. Each Metadata table contains the location of user data tablets. This table is keyed by node IDs and each row identifies a tablet’s table ID and end row. For efficiency, the client library caches tablet locations.

A tablet is assigned to one tablet server at a time. Chubby keeps track of tablet servers. When a tablet server starts, it creates and acquires an exclusive lock on a uniquely-named file in a Chubby servers directory. The master monitors this directory to discover new tablet servers. When the master starts, it:

  • Grabs a unique master lock in Chubby (to prevent multiple masters from starting)
  • Scans the servers directory in Chubby to find live tablet servers
  • Communicates with each tablet server to discover what tablets are assigned to each server
  • Scans the Metadata table to learn the full set of tablets
  • Builds a set of unassigned tablet servers, which are eligible for tablet assignment. These will be assigned by choosing a tablet server and sending it a tablet load request.

Fault tolerance and replication

Some of the fault tolerance for Bigtable is provided by Google and Chubby. GFS, for example, provides configurable levels of replication of file data and Chubby’s cell of replicated servers minimizes its downtime.

A master is responsible for detecting when a specific tablet server is not functioning. It does this by asking the tablet server for status of its lock (recall that Chubby grants locks). If the tablet server cannot be reached or has lost its lock, the master attempts to grab that server’s lock. If it succeeds, then it surmises the tablet server is dead or cannot contact Chubby. In this case, the master moves the tablets that were previously assigned to that server into an unassigned state.

When a master’s Chubby lease expires, it kills itself. This does not change the assignment of tablets to servers, however. Google’s cluster management system periodically checks for the liveness of a master. If it detects a non-responding master, it starts one up, which grabs a lock from Chubby. The new master contacts Chubby to find all the live servers goes through the startup phase described earlier.

A Bigtable can be configured for replication onto multiple Bigtable clusters in different data centers to ensure availability. Data propagation is asynchronous and results an eventually consistent model.


This is an updated version to one that was originally published in November 2011.

Last modified April 7, 2021.
recycled pixels