Six Options for Generating Distributed Unique IDs

Overview

Identifying unique entities is a frequent requirement in software development. For instance, assigning a unique ID to each Ad impression enables us to link related events for billing and analysis. However, generating unique IDs becomes challenging when dealing with large distributed systems. In this survey, we explore various options and discuss their suitability. We also introduce the snowflake-id service, a distributed system for generating unique IDs based on the snowflake algorithm and utilizing the etcd.

Option 1: UUID

A Universally Unique Identifier (UUID) is a 128-bit number commonly represented and stored as a string of 36 hexadecimal digits, such as 123e4567-e89b-12d3-a456-426614174000. UUIDs can be generated within the application process without the need for an ID server, and the probability of generating duplicate IDs is extremely low. However, UUIDs are not suitable for distributed ID generation due to certain limitations.

  1. Inefficient for storage. The 36-byte string is 4.5X the size of int64, a common data type used for ID. Not only storage of the UUID itself is big, but indexes of UUID are also big.
  2. Inefficient for relational databases. UUIDs are considered “random” and do not offer the advantage of data stored in a particular order. This becomes problematic when creating B-tree indexes on UUIDs in relational databases because consecutive UUIDs are likely to be far apart when ordered.

Option 2: UUID v7

UUIDs are out of order because the most significant bits (MSB) is a random number, while the timestamp is stored in the lower bits. However, in the proposed new UUID format, UUIDv7, the timestamp is placed in the most significant bits. This arrangement ensures that UUIDv7 is ordered according to time. Despite UUIDv7-specific methods not being widely supported in most databases at present, it is OK to store and utilize UUIDv7 in databases as it retains the same number of bits and string representations as previous versions.

Option 3: Auto-Increment ID from databases

RDBMS supports auto-increment entities. For example, we can use a sequence in Postgres like this.

1create sequence seq_guid as bigint
2increment by 1
3start with 1;

When the application needs an ID, it calls Postgres select nextval('seq_guid');. If you uses Redis, you can use the INCR command. For example, incr seq_guid. Since RDBMS and Redis are often already integrated into the existing infrastructure, option 3 does not require additional DevOps work. Also, because sequence IDs are monotonic increasing, option 3 satisfies specific use cases that require monotonicity. However, there are some limitations of Option 2.

  1. Dependency on single writer node. In most cases, DB runs on a single server. When the DB becomes unavailable, the application cannot get new IDs. Availability of DB is often increased by having an active standby that replicates updates from the active server. However, the replication is asynchronous. Under high load, duplicate IDs will be returned if there is a lag in the standby replica.
  2. Not performing well because of high concurrency. Be it the Postgres sequence object or the Redis key, it is a single object. Each auto-increment ID is a write operation. That means, concurrent ID requests contend for the same resource and all requests are served in serial.

Option 4. Auto-Increment ID from multiple database objects.

We can scale Option 3 by using multiple database objects instead of one. For example, use 2 sequences: seq_guid_odd and seq_guid_even to generate odd and even IDs respectively. However, we don’t want the applications to worry about which sequence to use. Thus, we need to set up a service that hides the sequences from the application. IDs from different sequences do not guarantee the monotonicity. That is acceptable because, in most cases, IDs need to be in some order but not strictly monotonic order.

Option 4 reduces contention compared to Option 3, but it brings new scalability problem. When we need to scale out from 2 sequences to 3 sequences. Possible scaling steps are:

  1. Get the max ID generated, say x.
  2. Add a big constant C to x. C is a number that we are sure we’d generate less than C IDs during the time of scale.
  3. Create a new sequence and set its start to x+C and its increment step to 3.
  4. For the existing two sequences, change their starts to x+C and increment step to 3.

The scale-out becomes more complicated if we find out the single DB server cannot handle the load and we have to scatter sequences among different DB servers.

Option 5. Auto-increment by asking for a range of IDs from databases

We can enhance Option 3 by batching ID requests. Rather than requesting individual IDs, we can optimize the process by requesting a range of IDs and incrementing within that range. This approach, similar to Option 3, necessitates the implementation of a service to separate the application from the underlying database operations.

To prevent latency spikes during database queries, the service can employ a prefetching mechanism. This means that when the current range of IDs is 70% consumed, the service can proactively fetch the next range in advance. Read the segment mode of Meituan-Leaf for details.

Option 6. Distributed stateless ID service using the Snowflake algorithm

Option 3 and its variants rely on a centralized database for ensuring correctness, availability, and scalability. However, it is possible to generate IDs using a set of ID generating servers.

  1. Each server has an unique identity.
  2. Different servers generate different range of IDs.
  3. Servers collaborate with each other through a centralized coordinators to to ensure that no two servers simultaneously use the same identity.

What algorithm does each server run?

We choose the Snowflake ID algorithm. The algorithm generates increasing 64 bits integer in the following way.

  1. bit 1 (most significant bit): set to 0, indicate the ID is a positive number.
  2. bit 2-42: Unix epoch timestamp in milliseconds.
  3. bit 43-52: 10 bits to represent the server ID. The server id uniquely identify a process that generates Snowflake ID.
  4. bit 53-64: 12 bits. Sequence number that increment from 0 to 4095 for each millisecond.

With the settings above, each server can generate up to 4 million IDs per second until the year 2039. If we tweak step 2 to use a later epoch instead of the Unix epoch, we can generate IDs up to 69 years from today.

How to coordinate servers?

We use the etcd to coordinate servers. At a high level, the coordination works in the following way.

  1. There are two types of keys for each server. Assuming the server ID is x, there are two keys for the server.
    • the key that indicates whether a server is live: servers/x/live.
    • the key that records timestamp: servers/x/last_reported_ts.
  2. The liveness key is associated with a lease and the key is deleted when the lease expires. Periodically, the server refreshes the lease and updates the last_reported_ts.
  3. When a server starts, it searches for an unused liveness key. If it finds one and the corresponding last_reported_ts is smaller than the server’s timestamp, the server takes over both the liveness key and the timestamp key, subsequently joining the cluster.”

There are nuances to consider. For example, how to handle scenarios when a server remains disconnected for an extended period time and reconnects later? Check snowflake-id server for a complete implementation.

Which Option should you use?

Which option to use depends on two factors: the requirement of the ID and the engineering resources available. An ID is just that, an identifier. As long as we don’t encode application state into the ID, it should be easy to switch to a different option later on when we have more available resources. So, let’s look at a few scenarios with different ID requirements.

  1. If you just need unique IDs and is not concerned about ordering, use Option 1, UUID.
  2. If you need unique IDs that directionally increasing, and you are OK with UUID format, use Option 2, UUIDv7.
  3. If you need unique, monotonic IDs and your load is mild (thousands per second), use Option 3.
  4. If you need unique, monotonic IDs and your load is high, use Option 4 or 5.
  5. If you need unique IDs that are that directionally increasing, and your load is super high, and you not OK with UUID format, use Option 6.