In today’s article, I’d like to cover a fairly common situation in relational database design. Yep, we are talking about columns that should have been marked as NOT NULL, but they were not.

This is the first article of a series I plan to write about the maintenance of Postgres databases used in Rails applications. I am somewhat new to this topic, so I will be discovering this amazing world as I do my research to write this article. I will try to make all experiments reproducible, so any reader who is interested in the topic can play around with the examples and learn from doing, not just from reading.


Problem statement

One of the most common database design mistakes I have seen (and committed) is not marking a column as NOT NULL during table creation. This restriction should not be applied mechanically to all columns, but more often than not we will find columns where we don’t expect NULL values. Some day I may investigate how columns marked as NOT NULL improve the performance of queries, but for the time being, let’s point out a few benefits:

  1. Peace of mind. Postgres, constantly vigilant, will make sure that no NULL values get into your column. As developers, it is likely that at some point we will make a mistake and save some NULL value accidentally. Even if we are using a framework, there are always ways to bypass these restrictions.
  2. Follow Postgres recommendations. The official documentation is very clear, stating that “in most database designs the majority of columns should be marked not null”.
  3. Help the query planner. Postgres will have additional information it can use to skip certain operations or table scans if it knows upfront it will not find NULL values in certain columns (see the sketch below).
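
A minimal sketch of how you could explore the last point yourself, using the users table built later in this article; whether and how the plan changes depends on the query and on the Postgres version:

-- Run EXPLAIN on the same query before and after marking the column NOT NULL,
-- and compare the plans.
EXPLAIN SELECT COUNT(*) FROM USERS WHERE EXTERNAL_ID IS NOT NULL;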

To do some experiments, I will create a new users table in my database. For the sake of simplicity, this table will just have an incremental primary key and an external ID, which is going to be the target of our investigation. Based on the data model of my application, this column should have been marked as NOT NULL, but for whatever reason it was not.

Any operation on a database is fast if the tables are small enough, but I want to simulate a real-world scenario where tables have several million records. To do so, I will create 1B records in the users table:

DROP TABLE IF EXISTS USERS;
CREATE TABLE USERS (ID SERIAL PRIMARY KEY, EXTERNAL_ID UUID);
INSERT INTO
  USERS (EXTERNAL_ID)
  SELECT
    GEN_RANDOM_UUID()
  FROM
    GENERATE_SERIES(1, 1000000000); -- 24m 39s
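
If you want to double-check the setup before moving on, here is a quick sanity check (the reported size will depend on your system, and the COUNT itself scans the whole table):

SELECT COUNT(*) FROM USERS; -- expect 1,000,000,000 rows
SELECT PG_SIZE_PRETTY(PG_TOTAL_RELATION_SIZE('users'));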

Our goal in this article is to revert the mistake of not marking the external_id column as NOT NULL.

Some notes on Postgres locks

I plan to write some deep articles on Postgres locks specifically, but for the purpose of this one, it is crucial to understand some basic concepts. When we execute SQL transactions in Postgres, the connection acquires one or more locks on the tables it needs to access. There are 8 types of table locks and 4 types of row locks (which are irrelevant in the scope of this article).

Depending on the operations that we are running on a table, Postgres will try to acquire one type of lock or another. For example, if we are just reading some rows from a table it will most likely acquire an AccessShareLock, but it will need a RowExclusiveLock if we are inserting rows. Other lock types apply to operations like TRUNCATE TABLE, VACUUM, etc. Note that even read-only operations acquire some type of lock.

Using a lock does not necessarily mean that other connections cannot use the table. Actually, that is not the case most of the time. When a connection needs to acquire a lock on one table, Postgres first checks if there is already another granted lock that has a conflict with the one being requested. If there is no conflict, the lock is granted straight away, but if there is a conflict, the later operation will be blocked until the former finishes and the conflicting lock is released.

The official documentation includes a table with conflicting lock modes. Let’s review a couple of examples to understand how locks work:

  1. A connection is doing an INSERT into a table (requires RowExclusiveLock) and another connection is reading information from the same table (requires AccessShareLock). Since there is no conflict between those locks, both operations can happen simultaneously.
  2. A connection is creating an index non-concurrently (requires ShareLock). Data can still be read (requires AccessShareLock), but INSERT or UPDATE operations (which require RowExclusiveLock) will be blocked.
  3. A connection is truncating a table (requires AccessExclusiveLock), and no other operation can use the table at the same time. This kind of lock is the most restrictive, being the only one that blocks SELECT operations. It is reproduced in the sketch below.
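
As a quick illustration of the third example, here is a minimal two-session sketch you can reproduce in psql once the users table exists; LOCK TABLE acquires the same AccessExclusiveLock that TRUNCATE would, without destroying any data:

-- Session 1: acquire an AccessExclusiveLock and keep the transaction open
BEGIN;
LOCK TABLE USERS IN ACCESS EXCLUSIVE MODE;
-- (do not commit yet)

-- Session 2: even a simple read now blocks until session 1 commits or rolls back
SELECT COUNT(*) FROM USERS;

-- Session 1: release the lock
ROLLBACK;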

Postgres can provide information about the locks that have been requested, and which of those have been granted. Locks that have not been granted are blocked by a conflicting lock on the same table. The following SQL statement will give us the locks on the users table:

SELECT RELNAME,
       MODE,
       GRANTED
FROM PG_LOCKS
INNER JOIN
  (SELECT OID,
          RELNAME
   FROM PG_CLASS) RELATIONS ON RELATIONS.OID = PG_LOCKS.RELATION
WHERE RELNAME = 'users';

During the insertion of the 1B rows, we will see that the users table has a granted RowExclusiveLock on it.
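
A complementary view, assuming Postgres 9.6 or later, is to ask which sessions are currently waiting on a lock and which sessions are blocking them:

SELECT PID,
       PG_BLOCKING_PIDS(PID) AS BLOCKED_BY,
       STATE,
       QUERY
FROM PG_STAT_ACTIVITY
WHERE CARDINALITY(PG_BLOCKING_PIDS(PID)) > 0;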

Fixing the problem

Option 1: Alter table and don’t look back

The first approach that comes to mind for most developers is running the following migration, which on my laptop took 1 minute and 7 seconds:

ALTER TABLE USERS ALTER COLUMN EXTERNAL_ID SET NOT NULL; -- 1m 7s

Postgres needs to scan the entire table to make sure all rows are already compliant with the new condition EXTERNAL_ID NOT NULL. If one of the rows happened to have NULL external ID, the command would fail. This unavoidable scan is the reason why this instruction requires some time on big tables.
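
Before attempting this in production, you may want to verify that no row would violate the constraint. Note that this check also scans the table, so run it when that is acceptable:

SELECT COUNT(*) FROM USERS WHERE EXTERNAL_ID IS NULL; -- must return 0 for SET NOT NULL to succeed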

This command acquires an AccessExclusiveLock, which conflicts with any other operation on the table. Other operations will be delayed until the instruction above is done. Actually, right after starting the process I ran a SELECT statement from another connection and checked the locks on the table. I found two locks:

  • An AccessExclusiveLock, which was granted.
  • An AccessShareLock, which was not granted.

Only when the ALTER TABLE command had finished was the second connection granted the AccessShareLock.

It is usually quite hard to predict how long certain statements will take to execute in the production environment. There are multiple factors involved that can significantly increase or decrease the execution time. Just to name a few:

  1. The physical resources of the database server. Very often, developers will not have a production-like server to test the performance of certain migrations.
  2. The activity of the database server. We not only have to consider the operations that will be blocked as a consequence of our migration. Our migration can also be blocked by other connections that have already acquired a conflicting lock. For example, if there is a long-running SELECT statement that takes 30 minutes to finish, our migration to add a new column will have to wait (the lock_timeout sketch after this list shows one way to cap that wait).
  3. The size of the table. Most operations are executed in the blink of an eye on small tables. However, operations that need to scan a table with hundreds of millions of rows will certainly be slower. More often than not, the tables in the development environments of those writing the migrations are several orders of magnitude smaller. Just because a migration was fast in the development environment doesn’t mean it will be fast in the production environment as well.

My piece of advice would be:

  1. Identify which migrations are likely to take a significant amount of time. The documentation can provide some hints to understand when a table needs to be scanned. Some operations are unexpectedly slow, though. The strong_migrations gem is very useful in Rails applications.
  2. If you are unsure, try to reproduce a similar operation on a completely separate database in your local environment. Make sure the size of the table is significant. This is pretty much what I do when I prepare these articles.
  3. Understand which locks are needed to execute the migration. The documentation is pretty clear on this topic, but you can also use the tools from the prior section to reproduce the situation locally and check which locks are acquired.
  4. Use the lock compatibility table to discover which locks are incompatible with the ones the migration will need. That will tell you which operations will be blocked during the migration.
  5. Analyze your specific scenario to understand if your application can afford to have certain kinds of queries blocked for a significant amount of time. If you are uncertain or cannot safely estimate how long they will take, don’t do it or schedule a long enough downtime.
  6. Look for a workaround that provides a similar result without holding aggressive locks for long periods of time. This article is an example of how to do exactly this.
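
Regarding the waiting problem described in point 2 of the factors above, one common mitigation is to cap how long the migration may wait for its locks. A minimal sketch (the 5-second timeout is just an example value):

BEGIN;
SET LOCAL lock_timeout = '5s'; -- applies only to the current transaction
ALTER TABLE USERS ALTER COLUMN EXTERNAL_ID SET NOT NULL;
COMMIT;
-- If the lock cannot be acquired within 5 seconds, the ALTER TABLE fails with a
-- lock timeout error and can be retried later, instead of sitting in the queue
-- and blocking every query that arrives behind it.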

Option 2: Play safe and avoid surprises

Provided that we cannot afford any downtime in our application, we are forced to achieve the same result in a different way:

  1. Add a CHECK constraint enforcing that the column is not NULL, created with the NOT VALID option. By using NOT VALID, Postgres will start checking the column for new inserts and updates but will not scan the existing rows, which makes the instruction very fast. Even though this operation uses an AccessExclusiveLock, it is released very quickly, avoiding a big impact on your application.
  2. Validate the constraint, which basically means that Postgres scans the table to certify there is indeed no NULL value in that column.
  3. Replace the constraint check with a NOT NULL condition in the definition of the column. Since Postgres (12 and later) can use the validated constraint to prove there are no NULL values in the column, it will not need to rescan the entire table.

The steps above would be reflected in the following SQL statements:

ALTER TABLE USERS 
  ADD CONSTRAINT EXTERNAL_ID_NOT_NULL 
  CHECK(EXTERNAL_ID IS NOT NULL) 
  NOT VALID; -- 30ms

ALTER TABLE USERS 
  VALIDATE CONSTRAINT EXTERNAL_ID_NOT_NULL; -- 1m 17s

ALTER TABLE USERS
  ALTER COLUMN EXTERNAL_ID
  SET NOT NULL; -- 33ms

The validation of the constraint requires a ShareUpdateExclusiveLock instead of the original AccessExclusiveLock, which will allow concurrent reads and writes on the table. Other operations that change the structure of the table (adding columns, indexes, etc.) will also be blocked, but in most cases, these are planned migrations we can schedule for a different moment.
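
As an optional cleanup, once the column itself is marked NOT NULL, the CHECK constraint becomes redundant and can be dropped. This is a quick operation:

ALTER TABLE USERS
  DROP CONSTRAINT EXTERNAL_ID_NOT_NULL;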

Using this approach, we may be avoiding an outage of a certain feature or the entire application!

Summary and learnings

I found writing this article very instructive, as I was forced to understand the incompatibilities between lock types. I also learned how to get the locks that have been granted on a table, and that may be blocking operations we would expect to be running.

The problem addressed in this article is very much something we find on a daily basis in many applications, which makes the solution easily applicable to real-world scenarios.

Finding answers to my initial questions led to additional questions I still need to find an answer to. What other processes can have a workaround to prevent excessive blocking?

Hopefully, these and other questions will find an answer in future articles of this series. In the meantime, happy coding!