Spend less time in the database, and more time outdoors!
“In today’s episode, Rick shares active_record_context, a small rails plugin extracted straight from Lighthouse that implements a simplistic ActiveRecord object caching system.”
If you were fortunate enough to catch DHH’s Railsconf 2007 Keynote, you would have heard about David and the Core Team’s (sounds like an indie band) plans for Rails 2.0. No major paradigm shifts this time, just tweaks, fixes, and omissions to make Rails a better framework. One of the new features is the automatic query caching. Rails will now cache the results of specific SQL SELECT queries for the duration of the request. If you happen to access User.find(1) twice, it’ll return a cache hit the second time. Why is this cool? First, a history ActiveRecord bloopers.
ActiveRecord gets you a lot of functionality “for free,” meaning a lot of the basic grunt work is taken care of for you if you follow the simple conventions. Here’s a totally contrived example that might be used for something resembling the Lighthouse project overview timeline:
@events = Event.find(:all, :order => 'created_at desc', :limit => 50)
@events.each do |event|
puts event.title
end
That’s great, but each event references a specific object, such as a Ticket that was just created:
# define the target association
class Event < ActiveRecord::Base
belongs_to :ticket
end
# display the recent events
@events = Event.find(:all, :order => 'created_at desc', :limit => 50)
@events.each do |event|
puts event.ticket.title
end
This works well, but Tickets are stored in another table. This means that when each event is being displayed, we’re making another query to the database to fetch the associated ticket. This was a pretty common problem when I started with Rails. At the time, “piggy back” queries were frequently used:
@events = Event.find(:all,
:select => 'events.*, tickets.title as ticket_title',
:joins => 'inner join tickets on events.ticket_id = tickets.id',
:order => 'created_at desc', :limit => 50)
@events.each do |event|
puts event.ticket_title
end
This was pretty fast, but required a lot of extra configuration. Now we had to specify an SQL statement, as well as list which extra fields specifically that we wanted, so that ActiveRecord could create dynamic attribute accessors for them. Luckly, eager loading was introduced:
@events = Event.find(:all, :include => :ticket, :order => 'created_at desc', :limit => 50)
@events.each do |event|
puts event.ticket.title
end
This constructed the join statement for you, and rebuilt the retrieved ticket object for you. The nice thing was, this only took one simple option, and the rest of the timeline code stays the same. However, ActiveRecord is notorious for generating some gnarly OUTER JOIN sql queries. Also, it spends extra time with the results and converting them to ActiveRecord objects. This isn’t a big deal since we’re now performing only 1 query, instead of n + 1 (51 in this case because we’re selecting 50 events).
This approach doesn’t work with Lighthouse, however. Events instead have a polymorphic “target” join that can be a Changeset, Message, Ticket, etc. So, there’s no easy way to eagerly load this content without joining to all of the tables. If you’ve seen some of the eagerly loaded SQL that ActiveRecord generates you’ll know why a four-table OUTER JOIN isn’t the best idea. The ActiveRecord query cache was designed to help out in situations with a lot of repetitive queries like this.
class Event < ActiveRecord::Base
belongs_to :target, :polymorphic => true
end
@events = Event.find(:all, :order => 'created_at desc', :limit => 50)
@events.each do |event|
puts event.target.title
end
Just like before, calling event.target will hit the database. However, query caching will make this more efficient, especially if you have a lot of events referencing the same targets. But, what if you have a lot of records, with a lot of associated records? That’s the situation I ran into with Lighthouse, and was the catalyst for writing the active_record_context plugin. The main benefit I get from this approach is to seed the cache before looping, so I can fetch all the targets at once.
@events = Event.find(:all, :order => 'created_at desc', :limit => 50)
# this fetches all the messages that are attached to at least 1 event
# since we're just seeding the cache, there's no need to store this value in a local or instance variable
Message.find_all_by_id(@events.select { |e| e.target_type == 'Message' }.collect(&:target_id).uniq)
# seed the cache for the other possible polymorphic models
@events.each do |event|
puts event.target.title
end
This pastie shows a before/after snippet of the development log as I applied this to the admin overview code for Mephisto (which is fairly similar to Lighthouse’s overview code). Look at how large and nasty the original query is, clocking in at 0.02 seconds (this was run on the main mephistoblog.com data. After the active_record_context code is added, the three queries take a total of .007 seconds to run. The key here is that the three queries are very simple with no joins, taking advantage of the existing primary key index of the tables.
So, what’s all this talk about “object caching being hard?” I happen to sidestep some of the more difficult use cases and questions that may come up. For instance, how should object caching handle this:
User.find 1
User.find :first, :conditions => "id = 1"
It’s pretty obvious from looking at it that we’ll be selecting the same record. But, there’s no simple way for the plugin to know that, short of parsing SQL itself. So, I lay down a simple ground rule so that no stale objects are harmed: if the query is not a simple fetch-by-ID (it uses eager includes and/or custom conditions), then don’t bother checking the cache. I’ve found that sticking to the ID helps out in enough places to be useful.
Bonus points if you catch the other main difference: The users table gets an added condition checking on the deleted_at attribute. Yes, I’m using acts_as_paranoid in Mephisto, and this looks like a possible bug since the events query didn’t seem to pick up on the deleted_at clause.
Sorry, comments are closed for this article.



Discussion
This worked a treat for me, although when optimizing like this make sure you actually check your log for which queries are running slow – I actually got a much bigger speed improvement adding an index on the “created_at” column that I was sorting by. But their powers combine…
So I’m a little confused, is the “seeding” a manual process like the example, or was that just a peek at the internals?
BTW, my DataMapper project supports true object uniquing, so the last example is covered as well. There’s also no need to specify either an :include option, or manually seed the cache. It’s all magic, performant-by-default.
Unfortunately, it’s not robust enough yet to replace ActiveRecord in many scenarios, but I’m always open to contributors. :-)
Either way, it’s good to see other people working to make the 1+N Query Problem a thing of the past, nice work. I could definitely use this in the day job.
Sam: Seeding is a manual process, the plugin just provides the caching bit.
This plugin breaks ruby_fork_client (http://blog.segment7.net/articles/2006/04/08/speeding-up-test-runs-with-fork), if you’re using that to run your tests quickly. Haven’t investigated much, so not sure why or which part is responsible, but any DB access results in:
SystemStackError: stack level too deep /Users/xavier/Code/redbubble-trunk/vendor/plugins/active_record_context/lib/technoweenie/active_record_context.rb:20:in `find_one_without_context’ /Users/xavier/Code/redbubble-trunk/vendor/plugins/active_record_context/lib/technoweenie/active_record_context.rb:22:in `find_one’ /Users/xavier/Code/redbubble-trunk/vendor/rails/activerecord/lib/active_record/base.rb:1014:in `find_from_ids’
Yes, the seeding is manual. Lighthouse’s actual code is a bit more complex since I’m pulling in several models. It could be made more automatic (Coda Hale mentioned he’s doing something similar with a :preload option on #find) though.
Also, always index your database. From what I’ve heard, it’s a very common cause of slow rails apps.