ActiveRecord is, in my opinion, one of the most powerful tools bundled with Rails. Dealing with ruby objects and collections instead of plain rows and SQL queries is very convenient in most cases. The query interface is just wonderful. However, we sometimes find a conflict between what we would like to do in an object-oriented language and what the database demands to stay efficient.

In today’s article, we will tweak ActiveRecord to define a read-only attribute that is calculated as an aggregated function on another table.

Original scenario

Let’s define our pet project as a very simple invoice application. There is nothing super special about the setup, so we will not stop at the details. The database schema is very simple: we have an invoices table and a payments table where we store the payments associated with an invoice:

ActiveRecord::Schema[7.0].define(version: 2023_10_01_130449) do
  # These are extensions that must be enabled in order to support this database
  enable_extension "plpgsql"

  create_table "invoices", force: :cascade do |t|
    t.text "summary"
    t.string "name", null: false
    t.datetime "supplied_at", null: false
    t.datetime "invoiced_at", null: false
    t.integer "total_in_cents", null: false
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.check_constraint "total_in_cents > 0", name: "positive_amount"
  end

  create_table "payments", force: :cascade do |t|
    t.bigint "invoice_id", null: false
    t.integer "amount_in_cents", null: false
    t.datetime "created_at", null: false
    t.datetime "updated_at", null: false
    t.index ["invoice_id"], name: "index_payments_on_invoice_id"
  end

  add_foreign_key "payments", "invoices"
end

Our Invoice model has_many :payments and our Payment model belongs_to :invoice. Nothing fancy so far.

Defining a seeds file is very convenient to play around with our application without needing to create test data every time:

# db/seeds.rb

Payment.delete_all
Invoice.delete_all

100.times do |i|
  invoice = Invoice.create!(
    summary: 'Invoice summary',
    name: "Customer #{i}",
    total_in_cents: 100 * rand(50..200),
    supplied_at: Time.zone.now,
    invoiced_at: Time.zone.now
  )

  invoice.payments.create!(amount_in_cents: 30_00)
  invoice.payments.create!(amount_in_cents: 20_00)
  invoice.payments.create!(amount_in_cents: invoice.total_in_cents - 50_00)

  invoice.payments.create!(amount_in_cents: 10_00) if (i % 7).zero?
end

We have also defined a method in our Invoice model to find out its status. Four possible statuses are considered:

paid when the sum of the payments matches the total amount.
unpaid when there is no payment associated with the invoice.
underpaid when the sum of the payments is less than the total amount.
overpaid when the sum of the payments is more than the total amount.

Our Invoice#status looks like this:

# app/models/invoice.rb

def status
  case payments.sum(:amount_in_cents)
  when total_in_cents
    :paid
  when 0
    :unpaid
  when 1...total_in_cents
    :underpaid
  else
    :overpaid
  end
end

And we can use it like:

3.2.0 :001 > Invoice.first.status
  Invoice Load (0.4ms)  SELECT "invoices".* FROM "invoices" ORDER BY "invoices"."id" ASC LIMIT $1  [["LIMIT", 1]]
  Payment Sum (0.4ms)  SELECT SUM("payments"."amount_in_cents") FROM "payments" WHERE "payments"."invoice_id" = $1  [["invoice_id", 1]]
 => "overpaid" 

Refactoring target

The status method certainly works, and that is the least we can ask of a method. We even have unit tests to make sure it works as expected. Everything is all right. However, after some years with the application running, we found the following piece of code was using an abnormal amount of memory:

def overpayments
  Invoice.all.select{ |invoice| invoice.status == "overpaid" }.first(10)
end

This method selects the first ten invoices that are overpaid. However, it is loading all of them into memory just to take a few of them.

Having lost all context about how the model is built, my first instinct was to replace the select filter with a where(status: :overpaid) clause, and the first(10) by a limit(10). This way, the database would take care of filtering the records for us, and the application would get exactly what it needs. However, I failed to realize that we are filtering the records in memory because the status of the invoice is evaluated in memory — it is not persisted in the table. We cannot apply the limit clause to the query, as we don’t know how many records are going to be filtered out.

I had the strong feeling that this was not something we should be doing in memory. But I also didn’t have a straightforward way to solve this. A couple of ideas came to mind:

Create a view that aggregates the payments associated with the invoice and returns the status. While this is the most straightforward, we would not be able to use our ActiveRecord models to query the view. Besides that, we don’t need the invoice status every time we load an object, so loading objects from the view by default would be overkill. We would also have some code duplication, as the status definition would be present in the ruby code as well as in the view definition.
Dynamically define the status attribute on the query result by using something like Invoice.select("*", "'foo' as status"). The Invoice objects returned by this query will respond to the status method just as if it were another column in the table. The status definition would still be duplicated in Ruby and SQL, though.
Pretend I did not see that code and move forward, which looked appealing for a split second.

None of these ideas solved the entire problem, so I decided to adapt the second one trying to remove that definition duplication.

Defining the status attribute with a scope

Since I wanted to evaluate the case-when structure in the database, I thought it would be a good idea to define a function using Pl/pgSQL. That would make the application code easier to understand, so I created the following migration:

class DefineInvoiceStatusFunction < ActiveRecord::Migration[7.0]
  def up
    command = <<~SQL
      CREATE OR REPLACE FUNCTION invoice_status(total_in_cents integer, total_paid bigint) RETURNS varchar AS $$
      BEGIN
        RETURN CASE
          WHEN total_paid = total_in_cents THEN 'paid'
          WHEN total_paid = 0 THEN 'unpaid'
          WHEN total_paid > total_in_cents THEN 'overpaid'
          ELSE 'underpaid'
        END;
      END;
      $$ LANGUAGE plpgsql;
    SQL

    ActiveRecord::Base.connection.execute(command)
  end

  def down
    ActiveRecord::Base.connection.execute('DROP FUNCTION IF EXISTS invoice_status(integer, bigint);')
  end
end

We can easily test our function with plain old SQL:

SELECT
  invoice_status(10, 0),
  invoice_status(10, 10),
  invoice_status(10, 5),
  invoice_status(10, 20)

-- unpaid, paid, underpaid, overpaid

The next step was defining a scope that defines a new attribute with the total amount paid per policy.

# app/models/invoice.rb

class Invoice < ApplicationRecord
  PAYMENTS_SUBQUERY = Payment.group('invoice_id')
                             .select(:invoice_id, 'SUM(amount_in_cents) as total_paid')
                             .to_sql
                             .freeze

  has_many :payments

  scope :with_total_paid, lambda {
    select('invoices.*')
      .joins("LEFT OUTER JOIN (#{PAYMENTS_SUBQUERY}) subq ON invoices.id = subq.invoice_id")
      .select('subq.total_paid as total_paid')
  }
  
  # ... 
end

I used a subquery to get the total amount paid by invoice. Then I used a LEFT OUTER JOIN with the invoices table to include this new column in the result. It is important to notice how this scope selects all columns in the invoices table (to create the regular Invoice object) and then selects an additional value from the subquery and returns it as total_paid. This attribute can be accessed as any other method in our model:

3.2.0 :001 > Invoice.with_total_paid.limit(2).map(&:total_paid)
 => [6800, 12600]

We are still not getting our payment status, but we are getting closer. With this scope available, we can easily define a new scope that evaluates invoices_status to give us what we are expecting. I had to remove the original status definition, as this method is now defined dynamically when the result is returned. I also included an optional parameter to filter by one or more statuses:

# app/models/invoice.rb

scope :with_status, lambda { |statuses = nil|
  query = with_total_paid.select('invoice_status(total_in_cents, total_paid) as status')

  if statuses
    query = query.where(
      'invoice_status(total_in_cents, total_paid) IN (?)',
      Array(statuses)
    )
  end

  query
}

This new scope can be used like:

3.2.0 :001 > Invoice.with_status.limit(2).map(&:status)
 => ["overpaid", "paid"] 
3.2.0 :002 > Invoice.with_status(:overpaid).limit(2).map(&:status)
 => ["overpaid", "overpaid"]
3.2.0 :003 > Invoice.with_status([:overpaid, :paid]).limit(2).map(&:status)
 => ["overpaid", "paid"]

This solution solved one of our issues and allowed us to do the filter and limit in the database. The status is retrieved on demand (when the scope is used) in the same query as the other fields in the model. However, we have not solved the code duplication issue yet, as the status instance method is no longer defined in the model.

Reusing the status scope from an instance method

We have a way to load the status when we are retrieving a collection of objects, but our codebase may be loading the invoice objects from somewhere else (relations, direct load, etc.). We need to provide a replacement for the original status method in a way that we reuse the scope we have defined for the collection.

The first approach I tried was just defining the method as:

# app/models/invoice.rb

def status
  self.class.with_status.find_by(id:).status
end

However, we get into an infinite loop due to recursion. ActiveRecord will only define the status method introduced in the scope if the method is not defined already. Thus, by explicitly defining the status method in the model we broke the dynamic status method defined by the scope.

We have one last trick though. We need the status method to be defined for objects returned by the scope, and we want the status method to be defined differently for objects loaded without the scope. There is a little trick we can use to achieve our goal:

# app/models/invoice.rb

class Invoice
  # ...

  private

  def method_missing(*args)
    return super unless args.first == :status
    return super if respond_to?(:status)

    self.class.with_status.find(id).status
  end

We are using the method_missing method to intercept the status method calls on objects that don’t have this method defined. ActiveRecord uses a similar approach to define the status method dynamically, so we need to find out whether the method is going to be defined for this object by ActiveRecord or not. If it does not, we should define our implementation.

We can now use the status method as a regular instance method:

3.2.0 :001 > Invoice.first.status
 => "overpaid"

Final comments

If I had to implement from scratch a feature to get a list of invoices filtered by status, I would probably create a custom query object for that specific use case and not mess up with the existing model. However, the approach discussed in this article can be useful in those cases where the method has been defined for ages and is used all over the place.

The status-related parts of the invoices model (scopes and status instance method) may be refactored to a separate module to hide implementation details, which will enable us to easily define multiple aggregated attributes in the same model.

Happy coding!

Defining aggregated attributes with ActiveRecord