Scaling read operations for SQL databases often involves read replicas: read-only copies of the primary database with dedicated resources. Replication of changes from the primary to the replicas is typically asynchronous, which introduces replication latency. That indeterminate delay, however small, introduces eventual consistency, which can cause subtle and hard-to-reproduce errors in systems expecting the usual read-after-write guarantees. This article details a method of allowing a reader to choose the consistency level it needs without sacrificing the use of read replicas.
The Issue With Eventual Consistency
To help keep things grounded, we’ll work with a simple but fictional system that manages "widgets". I’ll give schema and query examples for MySQL and PostgreSQL but the techniques should be applicable to any classic SQL database[1].
MySQL:

    -- Minimal widget schema for the examples; column types are illustrative.
    CREATE TABLE widgets (
        id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL
    );

PostgreSQL:

    -- Minimal widget schema for the examples; column types are illustrative.
    CREATE TABLE widgets (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name TEXT NOT NULL
    );
The widget database then powers some API or application that writes and reads widget data.
In this scenario, the system has grown to a throughput of read operations that exceeds the capacity of the database. Let’s assume that we’ve already exhausted the opportunities to vertically scale the database. It is no longer viable or reasonable to continue adding more system resources to the primary database in order to meet the read operation throughput. The next option, then, is to scale horizontally by adding read replicas over which we can distribute the operational load.
For now, let’s assume we’re using the database’s built-in replication protocol to populate read replicas. Let’s also assume that the replication is asynchronous, preserves the atomicity of transactions, and applies transactions in the same order as they are applied to the primary database. These properties are common for built-in replication protocols but they may require specific configuration. For example, PostgreSQL’s streaming replication[2] and MySQL’s GTID mode[3] can be configured to work this way.
Before this system used read replicas we could generally expect that any write operation committed before a read operation started would be visible to that read operation. That is, if we created a widget and then selected all widgets we’d expect to see the new widget in the resulting list. A guarantee for read operations to see all previously committed write operations is called strong consistency[4]. However, the use of asynchronously populated read replicas means we can now be subject to timing conditions that break strong consistency.
In this scenario the API queries a randomly selected read replica after writing a new widget to the primary database. However, the write has only replicated to one of the replicas by the time the API issues the query, and the random choice lands on a different replica. The new widget record does eventually arrive at the replica the API read from, but there is no way to identify which replica is up to date at any given time. This indeterminate gap between a write and the replication of that write is a source of eventual consistency[5] and is the core problem to manage when introducing read replicas.
In my experience, there are a surprising number of cases where eventual consistency is acceptable and unnoticeable even in systems that were built without intentional support for it. There are some cases, though, where strong consistency is required to solve a problem either by the nature of that problem or by the complexity and challenges of changing the consistency expectations in an existing system. The goal from here is to adapt the data management to allow certain callers or use cases to opt back in to strong consistency even when eventually consistent read replicas are in use.
Client Selected Consistency Levels
One way to integrate eventually consistent read replicas into a system is to statically assign read operations to either the primary or a replica based on consistency needs. For example, we might audit all the read operations and divert those known to be safe for eventual consistency to a read replica. Alternatively, we might default all read operations to a replica and re-assign operations to the primary as we identify issues caused by the eventual consistency. This technique can be manageable when the expectations of all possible read operations are known and have distinct code paths. This may be true for some monolithic applications[6] but heavily modularized code[7] and contemporary service oriented[8] systems often have varied or even unknown expectations from read operations that go through the same code paths.
Let’s imagine that our widget API has the following interface:
service Widgets {
rpc Create(Widget) returns (WidgetResponse)
rpc Update(Widget) returns (WidgetResponse)
rpc Delete(Widget) returns (WidgetResponse)
rpc Read(ReadRequest) returns (WidgetResponse)
}
message Widget {
int64 id = 1;
string name = 2;
}
message ReadRequest {
int64 id = 1;
}
message WidgetResponse {
Widget widget = 1;
}
Whether this API represents a unit of code embedded within a larger application or is available over a network using some protocol such as HTTP/REST[9] or gRPC[10] is not material in this scenario. The important aspect of the API is that it encapsulates the database such that all access to widget data passes through it. This interface presents a single method for reading data that cannot be statically assigned to either the primary or a replica. Statically assigning the read method to the primary would negate the purpose of having replicas at all. Assigning the read method to one or a pool of replicas would mean forcing eventual consistency on all readers. The first step, then, is to introduce some request parameter that a reader can use to indicate their desired consistency level:
service Widgets {
rpc Create(Widget) returns (WidgetResponse)
rpc Update(Widget) returns (WidgetResponse)
rpc Delete(Widget) returns (WidgetResponse)
rpc Read(ReadRequest) returns (WidgetResponse)
}
// ...
message ReadRequest {
int64 id = 1;
ConsistencyLevel level = 2;
}
enum ConsistencyLevel {
FASTEST = 0;
STRONG = 1;
}
In this case, a request for fastest indicates that the client wants the
fastest possible response even if that means being subject to eventual
consistency. These requests are routed to a replica. A request for strong
indicates that the response must be from a strongly consistent source. These
requests are routed to the primary. This type of API allows the client to select
the appropriate consistency level for a given use case. The API code can then
use the parameter to select an appropriate backend based on the consistency
needs.
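As a sketch, the routing decision inside the API might look like the following. The ReadRouter class and the string backend names are hypothetical stand-ins for real connection handling:

```python
import random

class ReadRouter:
    """Selects a database backend for a read based on the requested
    consistency level."""

    def __init__(self, primary, replicas):
        self.primary = primary
        self.replicas = replicas

    def select(self, level):
        # STRONG reads must see every committed write, so they are
        # pinned to the primary.
        if level == "STRONG":
            return self.primary
        # FASTEST reads accept eventual consistency and go to a random
        # replica, falling back to the primary if none are configured.
        if self.replicas:
            return random.choice(self.replicas)
        return self.primary

router = ReadRouter(primary="primary-db", replicas=["replica-1", "replica-2"])
```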
This represents the simplest form of client selected, tunable consistency for
read operations. However, this design doesn’t result in optimal replica usage
for common usage patterns. An initial or context-free read of widgets through
the API can likely use the fastest option without issues. Once an API user
performs a write operation, though, they are likely to require the next read to
be strongly consistent so that it doesn’t come from a source that does not yet
have the change they just made. That read-after-write workflow can be a common
one, which means API users may commonly opt for strong and thereby reduce the
usefulness of having replicas. A key question to ask is: does the average read
operation need to be strongly consistent across all potential changes to the
database, or only consistent with the last change made by the current user?
At-Least-As Consistency
The concept of strong consistency is oriented around the global state of the primary database and the order in which it applies transactions. However, most user-facing use cases typically associated with strong consistency are more accurately described in terms relative to a single user of the database. When a user makes a modification, they expect their next read operation to include that modification, but that expectation is usually irrespective of changes made by other users in that time. If it is sufficient to guarantee that a user’s last change is always visible in that same user’s next read, then the requirement is more accurately described as read-after-write consistency, which is not as strict as strong consistency.
A complication for read-after-write is that multiple users may be operating on the same records at or around the same time. For example, if widgets had an "updated at" timestamp embedded in the data model then a read-after-write by a single user should always see the timestamp increase. If there were no concurrent modifications to that record by other users then this user would read their own write. However, it’s possible for a second user to make a modification to the same record in between the first user’s write and read operations. In this case, the timestamp returned would be increased but the value may actually reflect the write of the second user. This nuance is better described with at-least-as consistency rather than read-after-write because reads include at least a user’s write but may also include more.
One way to express this concept in the API is to introduce some token that represents a point in time of the database that is returned to API users when they make modifications and can be provided by API users when they read data.
service Widgets {
rpc Create(Widget) returns (WidgetResponse)
rpc Update(Widget) returns (WidgetResponse)
rpc Delete(Widget) returns (WidgetResponse)
rpc Read(ReadRequest) returns (WidgetResponse)
}
message Widget {
int64 id = 1;
string name = 2;
}
message ReadRequest {
int64 id = 1;
ConsistencyLevel level = 2;
TransactionToken token = 3;
}
enum ConsistencyLevel {
FASTEST = 0;
STRONG = 1;
AT_LEAST_AS = 2;
}
message WidgetResponse {
Widget widget = 1;
TransactionToken token = 2;
}
message TransactionToken {
string value = 1;
}
The key changes here are that all responses now include some token value that
represents a point in time. That token can then be optionally given along with
the new at-least-as consistency level to request a read from any source that
is at least as up to date as the given token.
The benefit of the at-least-as option is that it provides the opportunity to
maximize the use of replicas while still offering a practical form of
read-after-write consistency to callers. Traffic can still be diverted away from
the primary and to a replica as long as that replica is at least as up to date
as the provided token. Replication latencies vary but contemporary disk and
network speeds can provide for low single or double digit millisecond
replication. Depending on the speed and behavior of API clients, this can result
in the majority of queries being served by a replica without breaking client
expectations.
Transaction Identity Tokens
The key to supporting at-least-as consistency with read replicas is managing transaction tokens. A transaction token is a value that uniquely identifies a transaction applied to the primary database. One of the simplest forms of a transaction token that can be implemented across all classic SQL databases is an auto-incrementing sequence:
MySQL:

    CREATE TABLE transaction_tokens (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    );

PostgreSQL:

    CREATE TABLE transaction_tokens (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    );
The auto-increment and identity column types provide a strictly increasing integer value which satisfies the uniqueness requirement of a transaction token. The size of the integer type determines the maximum number of transactions that can be applied to the database over its entire lifetime. A standard 32-bit integer supports at least two billion transactions, which is likely enough for most use cases. A 64-bit integer is effectively boundless. Note that the guarantees of uniqueness and strictly increasing values for the entire lifetime of the database may depend on the exact version and engine in use. For example, using a MySQL version earlier than 8 with InnoDB means that auto-increment values can be re-used if the largest value in the table is deleted and the database is then restarted.
To use these tokens, the database client (our widget API) starts a new database
transaction, inserts a row into the transaction_tokens table, applies any
modifications such as inserting or updating widgets, and commits the
transaction. The resulting transaction token value can then be returned to API
users for future read operations. For example:
MySQL:

    BEGIN;
    -- Generate the transaction token first.
    INSERT INTO transaction_tokens () VALUES ();
    SET @token = LAST_INSERT_ID();
    -- Then apply any modifications in the same transaction.
    INSERT INTO widgets (name) VALUES ('example');
    COMMIT;
    -- @token is returned to the API user.

PostgreSQL:

    BEGIN;
    -- Generate the transaction token first; RETURNING yields the
    -- value that is handed back to the API user.
    INSERT INTO transaction_tokens DEFAULT VALUES RETURNING id;
    -- Then apply any modifications in the same transaction.
    INSERT INTO widgets (name) VALUES ('example');
    COMMIT;
The result is that each transaction is associated with a unique value in the transaction tokens table. The replication protocol in use operates on transactions so the insert into the token table and any other modifications made within that same transaction are atomically applied to replicas. This means that the state of a replica can be determined by querying for the presence of the token value in the transaction tokens table.
MySQL:

    -- Returns 1 on a replica that has applied the transaction.
    SELECT EXISTS (
        SELECT 1 FROM transaction_tokens WHERE id = @token
    );

PostgreSQL:

    -- Returns true on a replica that has applied the transaction;
    -- $1 is the token value being checked.
    SELECT EXISTS (
        SELECT 1 FROM transaction_tokens WHERE id = $1
    );
The ability to query a replica for its current state enables a database client, like our API, to check whether a replica is capable of resolving a query with at-least-as consistency based on any given transaction token value.
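In application code the check reduces to a small helper. SQLite again stands in for a real replica connection and has_token is a name chosen for this sketch:

```python
import sqlite3

def has_token(replica, token):
    """Return True if the replica has applied the transaction
    identified by `token`."""
    row = replica.execute(
        "SELECT EXISTS (SELECT 1 FROM transaction_tokens WHERE id = ?)",
        (token,),
    ).fetchone()
    return bool(row[0])

# A replica that has applied tokens 1 and 2 but has not yet seen 3.
replica = sqlite3.connect(":memory:")
replica.execute("CREATE TABLE transaction_tokens (id INTEGER PRIMARY KEY)")
replica.executemany("INSERT INTO transaction_tokens VALUES (?)", [(1,), (2,)])
```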
An important quality of this type of transaction token is that tokens do not appear in replicas according to their sort order, nor do they become visible in the primary according to their sort order. These tokens do not reflect the order in which transactions are actually committed. They only provide transactions with a unique identity. The guarantee of at-least-as consistency is that the database contains, at least, the given token but it makes no guarantees about ordering of commits or whether any other commits were applied before or after the token. This means that any comparison of tokens other than equality, such as greater than or less than, is not valid when using these types of "transaction identity tokens".
Generally, this method of using auto-increment values solves for at-least-as consistency checks when replication is transactional. At-least-as consistency, especially when combined with fastest and strong consistency options, provides for the majority of consistency needs in applications while maximizing the use of read replicas.
Exactly-As Consistency
At-least-as consistency is good enough for the vast majority of use cases but it leaves a potential gap: reading at exactly the point in time of a token. These queries target a specific snapshot of the data that exactly matches the state of the database at the point in time at which a transaction was applied. In contrast to at-least-as, these queries can be described as requiring exactly-as consistency: the response must contain a specified change and all previous changes but must not contain any later changes.
Exactly-as consistency isn’t a native feature of most classic SQL databases and represents an advanced use case due to the unusual consistency requirement. The coordination between application logic and custom table schema required to support exactly-as is more complex than at-least-as and the queries are often difficult to optimize. My recommendation is to avoid exactly-as unless you are building a specialized system that clearly requires this kind of query such as some systems built for the security or finance domains.
A system that supports exactly-as queries must:
Implement soft deletes
Model data as either immutable or keep a history of modifications to data
Use transaction tokens associated with commit time, not transaction identity
Soft Deletes
An exactly-as query needs access to all records that existed at the point in time represented by a transaction token. Record existence has two natural boundaries: creation time and deletion time. Adding a column to a table to track the token representing the creation time is easy enough. However, delete operations generally result in information loss because the records are no longer present in the database. This creates a need for soft-deletes where deleted records are marked as deleted but left in the database. Continuing the widget example, the table schema would need to become something like this:
MySQL:

    CREATE TABLE widgets (
        id BIGINT NOT NULL AUTO_INCREMENT PRIMARY KEY,
        name VARCHAR(255) NOT NULL,
        created BIGINT UNSIGNED NOT NULL,
        -- The maximum unsigned 64 bit value means "not deleted".
        deleted BIGINT UNSIGNED NOT NULL DEFAULT 18446744073709551615
    );

PostgreSQL:

    CREATE TABLE widgets (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY,
        name TEXT NOT NULL,
        created BIGINT NOT NULL,
        -- The maximum signed 64 bit value means "not deleted".
        deleted BIGINT NOT NULL DEFAULT 9223372036854775807
    );
In this example, the created column is always set when inserting a new record
and contains the transaction token for that point in time. The deleted column
defaults to the max token value possible which is equivalent to indicating that
the record is not deleted. Deleting a record then becomes an update statement:
MySQL:

    UPDATE widgets SET deleted = @token WHERE id = ?;

PostgreSQL:

    UPDATE widgets SET deleted = $1 WHERE id = $2;
The value of the deleted token is always set to the current transaction token associated with the change when the delete is issued. An exactly-as query, then, can select for only values that were created but not deleted at the given token value.
MySQL:

    SELECT id, name FROM widgets
    WHERE created <= @token AND deleted > @token;

PostgreSQL:

    SELECT id, name FROM widgets
    WHERE created <= $1 AND deleted > $1;
A query for all records created at or before a token that are marked deleted strictly after that token represents a snapshot of all records that existed at that point in time.
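The existence predicate is simple enough to restate as a function, which can help when reasoning about the boundaries; existed_at is a name used only for this sketch:

```python
def existed_at(created, deleted, token):
    """True when a record with these soft-delete markers existed in
    the snapshot identified by `token`: it must have been created at
    or before the token and deleted strictly after it."""
    return created <= token < deleted
```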
Immutable Or Historical Versions
Determining which records existed at a point in time is only one part of the
snapshot process unless all columns other than deleted are immutable. If any
of the other columns can be modified then the snapshot must also rebuild the
exact state of every record that existed at that point in time.
There aren’t many database system designs that allow for immutable data. Append-only logs and ledgers are the most common for immutable data but they aren’t, themselves, particularly common designs. Systems where all columns of data are used to define the identity of a record also tend to have a natural immutability but an entire database schema of key-only tables is also uncommon. More commonly, systems are designed around mutable records. The way to bridge this gap is to copy the full state of every record to a historical versioning table each time the row is modified, including the initial creation. For example, consider the following addition to the widget schema:
MySQL:

    -- One row per modification of a widget; the table name is illustrative.
    CREATE TABLE widget_versions (
        id BIGINT NOT NULL,
        name VARCHAR(255) NOT NULL,
        token BIGINT UNSIGNED NOT NULL,
        deleted BOOLEAN NOT NULL DEFAULT FALSE,
        PRIMARY KEY (id, token)
    );

PostgreSQL:

    -- One row per modification of a widget; the table name is illustrative.
    CREATE TABLE widget_versions (
        id BIGINT NOT NULL,
        name TEXT NOT NULL,
        token BIGINT NOT NULL,
        deleted BOOLEAN NOT NULL DEFAULT false,
        PRIMARY KEY (id, token)
    );
If every INSERT, UPDATE, and DELETE operation on the main table also
results in an insertion into the version table then the system maintains a
complete history of changes. This property is equivalent to immutability for the
purposes of exactly-as because it enables querying for the exact state of rows
that existed at any given token value using window functions[11].
MySQL (8.0+):

    WITH ranked AS (
        SELECT id, name, deleted,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY token DESC
               ) AS rn
        FROM widget_versions
        WHERE token <= @token
    )
    SELECT id, name FROM ranked WHERE rn = 1 AND NOT deleted;

PostgreSQL:

    WITH ranked AS (
        SELECT id, name, deleted,
               ROW_NUMBER() OVER (
                   PARTITION BY id ORDER BY token DESC
               ) AS rn
        FROM widget_versions
        WHERE token <= $1
    )
    SELECT id, name FROM ranked WHERE rn = 1 AND NOT deleted;
These queries work by generating a ranked or ordered list of all historical versions up to the given token value, selecting only the most recent version, and then dropping any record that was deleted in the most recent change. The resulting set is then a snapshot of the database at the point in time represented by the transaction token.
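The ranking logic can be mirrored in plain code, which may make the SQL easier to reason about. This sketch takes version rows as tuples and returns the surviving records for a token:

```python
def snapshot(versions, token):
    """Compute the set of live records at `token`.

    `versions` is an iterable of (record_id, name, version_token,
    deleted) tuples, one per historical modification.
    """
    latest = {}
    # Walk versions in token order so later entries overwrite earlier
    # ones, leaving the most recent version at or before the token.
    for record_id, name, version_token, deleted in sorted(
        versions, key=lambda v: v[2]
    ):
        if version_token <= token:
            latest[record_id] = (name, deleted)
    # Drop records whose most recent change was a delete.
    return {
        record_id: name
        for record_id, (name, deleted) in latest.items()
        if not deleted
    }
```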
Transaction Commit Tokens, Not Identity Tokens
Soft deletes and versioning provide the foundation for generating a snapshot. However, the transaction identity tokens presented for at-least-as consistency aren’t sufficient for exactly-as. The key issue is that identity tokens represent a point in time within a transaction, such as the transaction start time, rather than the time of commit. This means that identity tokens cannot be compared with each other because they do not represent the order of committed changes in a way that results in a consistent snapshot. For example, using identity tokens with exactly-as can result in the following critical fault:
In this scenario, two API users create a widget around the same time. The transaction that starts first is assigned a token value lower than the transaction that starts second, as expected. However, the second transaction commits first due to an arbitrary timing condition. As a result, the larger transaction token is committed to the primary and replicated before the smaller token value. This results in a condition where two subsequent exactly-as queries using the same transaction token return different results because the snapshot selection is based on comparisons of transaction tokens that aren’t ordered according to when they were actually applied to the primary database.
This demonstrates that exactly-as consistency is impossible when using transaction identity tokens because they are not sortable in a way that enables a consistent snapshot selection. This creates a challenging situation because soft deletes and row versioning both require setting token values within a transaction where the final commit order cannot yet be known. One solution to this problem is to use transaction identity tokens as before but combine them with a commit ordered token value that is generated by serializing writes to a secondary token table.
MySQL:

    CREATE TABLE transaction_commit_tokens (
        id BIGINT UNSIGNED NOT NULL AUTO_INCREMENT PRIMARY KEY
    );

    ALTER TABLE transaction_tokens
        ADD COLUMN commit_id BIGINT UNSIGNED NULL;

PostgreSQL:

    CREATE TABLE transaction_commit_tokens (
        id BIGINT GENERATED ALWAYS AS IDENTITY PRIMARY KEY
    );

    ALTER TABLE transaction_tokens
        ADD COLUMN commit_id BIGINT;
In this example, the identity and commit token tables have the same primary key structure and generate primary key values using the same auto-incrementing feature. The major difference is that the identity token table now has a column for a commit order value that is valid for greater-than and less-than comparisons and can be used to generate snapshots.
Generating the commit order tokens, then, is the key behavior needed. The commit order cannot be known from within a transaction but you can control the commit order using locks to serialize access to the commit token table.
MySQL:

    -- At the very end of the transaction, just before COMMIT. The
    -- locking select blocks other inserts into the table until this
    -- transaction commits; the exact locking behavior depends on the
    -- isolation level and index layout in use.
    SELECT MAX(id) FROM transaction_commit_tokens FOR UPDATE;
    INSERT INTO transaction_commit_tokens () VALUES ();
    UPDATE transaction_tokens
        SET commit_id = LAST_INSERT_ID()
        WHERE id = @token;
    COMMIT;

PostgreSQL:

    -- At the very end of the transaction, just before COMMIT.
    -- EXCLUSIVE mode blocks other writers but still allows reads.
    LOCK TABLE transaction_commit_tokens IN EXCLUSIVE MODE;
    WITH commit_token AS (
        INSERT INTO transaction_commit_tokens DEFAULT VALUES RETURNING id
    )
    UPDATE transaction_tokens
        SET commit_id = (SELECT id FROM commit_token)
        WHERE id = $1;
    COMMIT;
The exact locking mode will vary by SQL implementation but must have two specific qualities. The first is that the lock provides exclusive write access to the commit token table. The second is that the lock is not released until the transaction is committed. These qualities guarantee that the token value created at the end of the transaction represents the largest token value in the table at the time of commit. This means that the token accurately represents the point in time of when the changes were committed with respect to all other transactions and can be used without risking the out of order replica condition shown earlier.
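The ordering property the lock provides can be demonstrated with a process-local analogy. Here the counters stand in for the auto-increment columns and the mutex stands in for the table lock; this illustrates the ordering behavior and is not database code:

```python
import itertools
import threading

identity_seq = itertools.count(1)  # assigned when a transaction starts
commit_seq = itertools.count(1)    # assigned just before commit
commit_lock = threading.Lock()     # stands in for the table lock

# Transaction A starts before transaction B...
a_identity = next(identity_seq)
b_identity = next(identity_seq)

# ...but B reaches its commit first. Because the lock serializes the
# commit token generation, the commit tokens record commit order
# rather than start order.
with commit_lock:
    b_commit = next(commit_seq)
with commit_lock:
    a_commit = next(commit_seq)
```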
The token value returned from a modification now becomes the commit order token rather than the identity token. The snapshot generation process can then leverage it by joining records to the transaction identity table:
MySQL (8.0+):

    WITH ranked AS (
        SELECT v.id, v.name, v.deleted,
               ROW_NUMBER() OVER (
                   PARTITION BY v.id ORDER BY t.commit_id DESC
               ) AS rn
        FROM widget_versions AS v
        JOIN transaction_tokens AS t ON t.id = v.token
        WHERE t.commit_id <= @commit_token
    )
    SELECT id, name FROM ranked WHERE rn = 1 AND NOT deleted;

PostgreSQL:

    WITH ranked AS (
        SELECT v.id, v.name, v.deleted,
               ROW_NUMBER() OVER (
                   PARTITION BY v.id ORDER BY t.commit_id DESC
               ) AS rn
        FROM widget_versions AS v
        JOIN transaction_tokens AS t ON t.id = v.token
        WHERE t.commit_id <= $1
    )
    SELECT id, name FROM ranked WHERE rn = 1 AND NOT deleted;
These query examples work identically to the previous examples from the versioning section except that they rank and compare versions by their actual commit order rather than their identity order. This combines with the lock-based serialization of writes to the commit token table to provide consistent snapshot selection. If a replica contains a given transaction commit token then it necessarily contains all transactions with a lower value due to the write serialization.
A drawback of this approach is that locks can present a complication for performance in high throughput systems. In effect, the locks serialize all transaction commit token generation by causing concurrent attempts to generate tokens to enqueue. The impact is reduced by placing the lock only at the end of a transaction and only covering the commit token table insert with the lock. The impact on reads can also be reduced if the database supports a lock that grants write exclusivity but still allows reads. However, serializing the token generation so that the values reflect commit order is the only general purpose way to provide safety against false exactly-as results from replicas if consistency checks use greater than or less than comparisons of tokens.
Operational Considerations
The schema and query examples I’ve shown so far are intended to be functional demonstrations of the concepts but I’ve intentionally avoided diving too deep into operational or performance optimizations for any specific kind of database. Ultimately, the specific query optimizations, index definitions, and maintenance configurations will need to be tailored to your specific database and data. Even so, there are some generic operational conditions that you are likely to run into regardless of the database in use.
Avoid Load Balanced Replica Pools
A common practice I’ve seen for read replicas is to bundle a series of replicas under a load balanced endpoint or connection pool. In this model, there’s an indeterminate number of replicas that present as a single interface so that the routing or load balancing is invisible to the application that is built on the database. This practice can work well if the only consistency options are strong or fastest because those can be cleanly mapped to either the primary or a random replica. However, the at-least-as and exactly-as consistency options require inspection of individual replicas and pinning to a qualified replica based on a given transaction token.
My recommendation for this is to wrap individual replica connections in the same interface as the primary and implement a custom routing protocol based on a given consistency requirement and token. To illustrate, here’s a modified version of the running API example that separates the read and write methods and includes a new API method for checking the token state of a database instance:
service WidgetReader {
rpc Read(ReadRequest) returns (WidgetResponse)
rpc HasToken(TransactionToken) returns (Empty)
}
service WidgetWriter {
rpc Create(Widget) returns (WidgetResponse)
rpc Update(Widget) returns (WidgetResponse)
rpc Delete(Widget) returns (WidgetResponse)
}
message Widget {
int64 id = 1;
string name = 2;
}
message ReadRequest {
int64 id = 1;
ConsistencyLevel level = 2;
TransactionToken token = 3;
}
enum ConsistencyLevel {
FASTEST = 0;
STRONG = 1;
AT_LEAST_AS = 2;
}
message WidgetResponse {
Widget widget = 1;
TransactionToken token = 2;
}
message TransactionToken {
string value = 1;
}
The connections to the primary and each individual replica can then be wrapped in the read interface and managed by a router that selects which instance to use based on the given consistency parameter and token value.
The strong and fastest consistency options can easily be pinned to the primary or a random replica, respectively. At-least-as and exactly-as queries can then be routed by checking for the presence of a token in a random replica or selection of replicas.
In this scenario, the router handling an at-least-as consistency query selects a random quorum of replicas and checks if they have the target token. The first replica to confirm that it has the token is given the query to process and the results are proxied to the user. This is certainly not the only way that the router can fairly select or load balance queries. It’s also possible to implement routing logic using direct database connections rather than encapsulating them in some kind of API or custom interface. The specific details will be unique to your implementation but the concept of treating replica connections individually is an important operational quality to consider. Each replica may synchronize with the primary at different rates so they each need to be treated as an individual, stateful system.
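A routing sketch for the at-least-as case might look like the following. The has_token callable is assumed to issue the token presence query against a replica; the names and the quorum size are illustrative:

```python
import random

def route_at_least_as(replicas, primary, token, has_token, quorum=2):
    """Pick a backend capable of serving an at-least-as read.

    Checks a random sample of replicas for the token and returns the
    first that has applied it, falling back to the primary, which is
    always at least as up to date as any issued token.
    """
    sample = random.sample(replicas, min(quorum, len(replicas)))
    for replica in sample:
        if has_token(replica, token):
            return replica
    return primary
```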
Managing Additional Storage For At-Least-As
The transaction token examples for at-least-as use either 32 or 64 bit integers, which may represent a negligible amount of storage even over the entire lifetime of some systems. However, high activity systems using indexed, unsigned, 64 bit integers can see effectively unbounded storage growth if they retain tokens forever.
One way to manage at-least-as token storage is to periodically clear tokens beyond a certain age from the token table. Any deleted tokens are then no longer available for at-least-as checks. The tokens, themselves, aren’t comparable or naturally sorted in a way that could support deleting tokens below a certain numerical value so you’ll need to augment the token table with a timestamp to represent the creation time.
MySQL:

    ALTER TABLE transaction_tokens
        ADD COLUMN created_at TIMESTAMP NOT NULL
            DEFAULT CURRENT_TIMESTAMP;

PostgreSQL:

    ALTER TABLE transaction_tokens
        ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now();
From here you might have a periodic process that deletes all tokens older than a certain offset from the current time, such as a day or a week. This controls the token storage but does have the side-effect of the system failing at-least-as checks for deleted tokens. Once a token is deleted then all at-least-as queries using that token will either fail or be diverted to the primary depending on how you’ve implemented the fallback rules.
If you want to detect and reject deleted tokens with a specific message then one technique is to structure the format of transaction tokens your system gives to callers as some form of tuple that includes both the token and the creation timestamp. When returned for a query, the timestamp can then be compared to the minimum timestamp in the token table or a threshold value stored separately by the periodic process that handles the token deletion.
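One possible encoding packs the token and its creation time into an opaque string. The format here, JSON wrapped in URL-safe base64, is an arbitrary choice for this sketch, as are the function names:

```python
import base64
import json

def encode_token(token_id, created_at):
    """Pack a token id and creation timestamp into an opaque string."""
    payload = json.dumps({"id": token_id, "ts": created_at})
    return base64.urlsafe_b64encode(payload.encode()).decode()

def decode_token(value):
    payload = json.loads(base64.urlsafe_b64decode(value.encode()))
    return payload["id"], payload["ts"]

def is_purged(value, oldest_retained_ts):
    """True when the token predates the oldest retained token, so it
    can be rejected with a specific error instead of a failed lookup."""
    _, ts = decode_token(value)
    return ts < oldest_retained_ts
```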
Managing Additional Storage For Exactly-As
The example design for exactly-as relies on both persistent storage of transaction tokens and a complete history of changes for all records. The need to store all historical versions of mutable records and to retain deleted records easily creates a storage requirement that is larger than the primary data stored in the system. A high activity system likely must place some limitation on the effective lifetime of a token and the total amount of historical data.
One way to manage retention is to associate each token with the time at which it is created and run a periodic process to purge consistency related records that are older than a given time. For example, the token tables can be augmented with a timestamp field and indicator of cleanup:
MySQL:

    -- The identity token table can be augmented the same way.
    ALTER TABLE transaction_commit_tokens
        ADD COLUMN created_at TIMESTAMP NOT NULL
            DEFAULT CURRENT_TIMESTAMP,
        ADD COLUMN removed BOOLEAN NOT NULL DEFAULT FALSE;

PostgreSQL:

    -- The identity token table can be augmented the same way.
    ALTER TABLE transaction_commit_tokens
        ADD COLUMN created_at TIMESTAMPTZ NOT NULL DEFAULT now(),
        ADD COLUMN removed BOOLEAN NOT NULL DEFAULT false;
A periodic task can then:
calculate some threshold value, such as a day or a week prior to the current time
select the maximum token value below that threshold
delete all historical versions of records deleted at or before that token
delete all historical versions of active records that are at or below that token and are not the latest versions
finally, mark the token as removed
The queries for selecting records to delete will be of a similar complexity to the exactly-as snapshot selection and allow the system to retain only the minimum historical versions needed to respond to future queries.
One complication to manage here is that deleting tokens from the token table in the example design will leave any active records that were last modified at or before the deleted token orphaned in a way that breaks snapshotting. In the example schema and queries, each historical version is joined to the transaction identity token table in order to access the commit order token. This join is no longer possible once the token records are deleted. One way to resolve this complication is to adjust the token value associated with live records at or below the deleted threshold so they match the minimum token that remains in the system.
Exactly-As And Locking
The exactly-as consistency concept requires a commit order value in order to produce valid and consistent snapshots of historical points in time. The only way to generate guaranteed commit order values in classic SQL is to use locking. Locks in a high activity system can have a large and negative effect on system performance. Unfortunately, there are few ways to mitigate this impact.
If you are starting a new project that requires exactly-as consistency in addition to high activity, especially a high amount of concurrent writes, then you should consider some of the SQL compatible but distributed databases that have been developed. These databases are sometimes described as NewSQL[12] and include systems like Spanner[13] or CockroachDB[14]. These systems are designed to provide seamless horizontal scaling of read and write traffic while preserving the standard strong consistency guarantees of a SQL database.
Some NewSQL databases offer lock-free alternatives to generating commit order
values. For example, Spanner’s spanner.commit_timestamp() function has the
unique property of producing monotonic values associated with commit
order[15] and CockroachDB can similarly order timestamp generation for
certain, overlapping transactions[16]. Leveraging these behaviors
instead of locks, combined with the scaling characteristics of these systems,
can better support exactly-as consistency in high activity systems.
Summary
I’ve heard variations of "we can’t read from a replica here because we need read-after-write" enough times to be convinced that tunable consistency is a generally useful framework. Shifting the choice of consistency from a statically bound code path or API endpoint to a client facing parameter allows for greater adaptation to new or different use cases as they present.
Existing systems built on read replicas may already have internal concepts that are equivalent to the "strong" or "fastest" consistency options because they are natural reflections of the primary/replica split. The at-least-as consistency concept I’ve written about here, though, is the missing key to directing the majority of read operations to a replica without sacrificing the expectations of a user. The implementation for at-least-as is also lightweight. Adding a single table with a single column and inserting into it at the beginning of each transaction is enough to enable at-least-as consistency. No other changes to existing schema are required which makes it potentially easy to retrofit a live system for tunable consistency.
Highly regulated or specialized use cases can also incorporate an exactly-as consistency option that allows clients to query specific snapshots in time. Unlike at-least-as, the implementation of exactly-as is intrusive and likely operationally challenging. I don’t recommend exactly-as unless absolutely necessary but it’s possible to build on nearly any classic SQL database.