UUIDs are 128-bit labels that usually become an alternative to serial IDs for identifying resources. Since the risk of generating duplicated UUIDs is negligible and their generation does not require a centralized authority, their use is widespread in distributed systems.
Over the years, I have seen UUIDs used in multiple different ways:
- As primary keys used to identify resources.
- As external IDs, shared with third parties, on tables that already have a sequential, primary key.
- As alternative IDs, used to prevent information disclosure (a URL like
/orders/3413/track
more often than not reveals how many orders have been placed in the platform already). - As alternative IDs, used instead of sequential IDs to add entropy to URLs and avoid implementing a proper authentication mechanism. Please, just don’t.
- As parameter-digests to guarantee idempotency in certain operations.
Regardless of the use case, quite often we need to keep these identifiers in persistent storage. Using the right approach can be critical to guarantee the stability and performance of a system.
UUIDs in Postgres
The postgres documentation states the following about the UUID type:
The data type uuid stores Universally Unique Identifiers (UUID) as defined by RFC 4122, ISO/IEC 9834-8:2005, and related standards. (Some systems refer to this data type as a globally unique identifier, or GUID, instead.) This identifier is a 128-bit quantity that is generated by an algorithm chosen to make it very unlikely that the same identifier will be generated by anyone else in the known universe using the same algorithm. Therefore, for distributed systems, these identifiers provide a better uniqueness guarantee than sequence generators, which are only unique within a single database.
However, I have seen them stored as VARCHAR
columns more often than I’d like to admit. In the early days, before I actually knew what a UUID really is, I made this mistake as well, so nobody is to be blamed. When I realized there is a UUID type in Postgres, I started using it, just because it felt like the right tool to store this type of data. However, today I wanted to measure what is the impact of this mistake.
In my opinion, from the application perspective, there is a very obvious risk to be addressed: UUID columns that don’t actually (or not always) store UUIDs. Why would someone do that? Most of the time it is not intentional. An application is designed to store UUIDs, then there is a bug that accidentally stores integer IDs to the VARCHAR
column (which of course is very willing to keep this data), and when we realize, it is too late (or expensive) to repair the inconsistency of the column.
I have also seen applications implementing their own interpretation of UUIDs (as if languages did not have built-in functions to do so), resembling UUIDs but without actually being valid UUIDs according to RFC 4122. All sorts of compatibility nightmares should be expected from this approach.
Experiment
From the database perspective, I have conducted a little experiment to understand the magnitude of the problem. I have created two tables with one column each: a primary key named id
. This column is a VARCHAR
in one case, and a UUID
in another case:
CREATE EXTENSION IF NOT EXISTS "uuid-ossp";
CREATE TABLE uuid_number(id UUID);
CREATE TABLE uuid_varchar(id VARCHAR);
Then, I inserted 1M rows in each table, with randomly-generated UUIDs:
INSERT INTO uuid_number (id)
SELECT uuid_generate_v4() from generate_series(1, 1000000);
INSERT INTO uuid_varchar (id)
SELECT uuid_generate_v4() from generate_series(1, 1000000);
We can easily compare the sizes of the tables:
SELECT pg_size_pretty( pg_table_size('uuid_number')) AS uuid_number,
pg_size_pretty( pg_table_size('uuid_varchar')) AS uuid_varchar;
which, not shockingly, reports more space for the VARCHAR
table (65MB) than for the UUID
table (42MB).
Then I added an index to these columns, and compared the sizes:
CREATE INDEX ON uuid_number(id);
CREATE INDEX ON uuid_varchar(id);
select pg_size_pretty(pg_indexes_size('uuid_number')) AS index_number,
pg_size_pretty(pg_indexes_size('uuid_varchar')) AS index_varchar;
In this case, the size of the index was 56MB in the VARCHAR
table VS 30MB in the UUID
column.
From these results, the immediate conclusion I get (without being a Postgres expert) is that using VARCHAR
columns to store UUIDs takes significantly more space, impacting performance. The more bytes Postgres has to move around, read, compare, transform, etc., the more resources (time, memory, and CPU included) it is going to take.
Is this a critical issue? Well, every situation is different. This may not be relevant in a low-volume database that has been running for ages and nobody is actually evolving. This may be critical in a codebase actively maintained by dozens of developers. This may be critical in a highly stressed database that has to deal with this overhead due to bad decisions during the early stages of development. It’s up to you and your organization to determine what is worth working on.
Bonus track
I started this blog very recently (this is my second article already ✌️) and since the very beginning, I want this to be a tool to learn and share. I force myself to read documentation, experiment, and put my thoughts together so I can explain to someone else. During this process, I reinforce some of my beliefs, prove myself wrong (I thought the performance difference between UUID
and VARCHAR
would be much bigger), and find new questions I don’t have an answer for.
This is not all that needs to be known about UUIDs and performance. I am sure that more skilled professionals will have a deeper understanding of the topic, and I encourage you all to keep researching and learning as I am doing now.
Happy coding!