Ewout blogt.

Web development, ruby, and rails

Testing in Seoul

When dealing with time in a rails application, there are three time zones to be concerned with.

  1. UTC
  2. Time.zone
  3. Local time zone

By convention, all times are stored in the database as UTC. ActiveRecord abstracts this by converting timestamps from UTC to Time.zone when they are read from the database and doing the opposite when serializing them. That only works, of course, if you don’t forget to annotate each Time object that enters the system with the correct zone.

The local time zone is the one you currently happen to live in; it should not matter at all in the application. Yet, by default, Time objects in ruby use the local time zone. Time.now returns the current local time, which is completely useless and even harmful in development.

I live in Belgium (CET), and most of our clients do too. For years, I have been testing with Time.zone = 'Brussels'. From now on, my testing Time.zone becomes 'Seoul'.

Why Seoul?

The choice of a testing time zone is somewhat arbitrary, but it should not be equal to the local time zone. When UTC, Time.zone and the local time zone are different, it is easier to spot a bug where local time is used instead of the current Time.zone.

I also think it is important that the time zone offset is far from my local one; it makes me think more about time zones instead of just assuming that ActiveRecord will do its job.
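
A minimal sketch of what this looks like in practice ('Seoul' is the ActiveSupport zone name for Asia/Seoul; the rest assumes a standard rails setup):

# config/environments/test.rb -- run the suite in a zone that differs
# from both UTC and the developer's local zone.
config.time_zone = 'Seoul'

# Application code should always go through Time.zone:
Time.zone.now                        # current time, annotated with the Seoul zone
Time.zone.parse('2011-06-01 10:00')  # parsed in Seoul, not in local time
Time.now                             # local (CET) time -- the bug to watch out for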

iCloud POP access, a developer’s way

So I got an email from Apple stating it’s time to upgrade my MobileMe account to iCloud. Working in IT for a few years made me trade my early-adopter enthusiasm for an if-it-ain’t-broken-don’t-fix-it mentality. But mighty Apple said it’s time now, so I pushed the upgrade button.

I opened PowerMail and discovered that it did not receive my new iCloud mail. Apparently, iCloud does not provide POP access anymore. So I had two options:

  1. abandon an email client I’ve been using happily for ten years
  2. be creative

Naturally, I chose the second option and devised a cunning plan.

The idea is to periodically fetch emails from the iCloud imap server and store them in a local inbox, which can be accessed over POP using a local mail server. Software of choice: getmail for fetching, cron for periodical execution, dovecot as mail server.

In the remainder of this article, I briefly explain how to configure this. Familiarity with unix and the terminal is required.

Installation

It is not my intention to describe the software installation in detail, as this may depend on your specific system and preference. I installed getmail as described on its website and dovecot using MacPorts. Cron is part of Mac OS X and any half-decent unix distribution.

Getmail

The configuration file for getmail is .getmail/getmailrc in your home directory. Replace the values below (username, password, paths) with your own settings.

[retriever]
type=SimpleIMAPSSLRetriever
server=imap.mail.me.com
port=993
username=user@me.com
password=password
mailboxes=("INBOX",)
[destination]
type=Mboxrd
path=/Users/username/.getmail/mbox
[options]
read_all=false

Once this configuration file is created, we can run the getmail command to check if it’s working. (This is a test run on my system; your output may vary.)

$ getmail
getmail version 4.25.0
Copyright (C) 1998-2009 Charles Cazabon.  Licensed under the GNU GPL version 2.
SimpleIMAPSSLRetriever:user@me.com@imap.mail.me.com:993:
  0 messages (0 bytes) retrieved, 168 skipped

Cron

To run the getmail command every 10 minutes, we need to install a cron job. Run crontab -e in the terminal and add the following line.

*/10 * * * * /usr/local/bin/getmail > /dev/null 2>&1

Dovecot

The dovecot configuration file resides in /opt/local/etc/dovecot/dovecot.conf when installed through MacPorts. Mine looks as follows:

protocols = pop3
disable_plaintext_auth = yes
ssl = no
mail_location = mbox:/Users/username/.getmail:INBOX=/Users/username/.getmail/mbox
protocol pop3 {
  listen = 127.0.0.1:11000
}
auth default {
  mechanisms = plain
  passdb passwd-file {
    args = /Users/username/.getmail/dovecot-passwd
  }
  passdb pam {
  }
  userdb passwd {
  }
  user = root
}

Dovecot also needs a password file, which we configured above to live at .getmail/dovecot-passwd in your home directory. You can choose the local password yourself; it is the password local clients will use to access your local mailbox.

username:{plain}local-password::::/Users/username/.getmail::userdb_mail=mbox:~/mbox

Dovecot should be restarted after configuration.

Configuring your mail client

You can use your favorite POP3 email client now to access your local inbox.

  • Incoming mail server: localhost
  • Port: 11000
  • Username: username
  • Password: local-password

Local scope bug in IE9 javascript evaluator

I recently experienced a what-the-fuck moment when testing some javascript code on Internet Explorer. All ran well on IE6 up to IE8, but not on IE9. After a bit of fiddling, I was able to create a small failing example.

function fail() {
  var p = {};
  var c = p.c = [];            // c and p.c reference the same array
  c.push({r: (c = [])});       // push target is resolved before the argument reassigns c
  c.push('lala');              // pushes onto the new array assigned above
  alert(JSON.stringify(p.c));  // should be [{"r":["lala"]}], but is [] in IE9
}
fail();

Looks like a bug to me. Just putting it out here, since I did not find a place on the Microsoft site to report bugs for IE9.

Do not store database backups in git

Some people use revision control systems like git or svn for storing and managing backups. The idea sounds appealing, because consecutive backups only differ slightly and revision control systems can optimize the space required to store them. The following paragraphs will explain why this is just plain wrong.

Data retention

An RCS is built for infinite data retention, as writing source code is expensive and storage is cheap. Backups lose their value with age; nobody cares about the daily backups of 5 years ago.

Performance

An RCS is built for tracking a large number of small, interdependent files. Backup files are large and backups from multiple applications are unrelated. Git does not handle large files well and repositories can become unusably slow or use an insane amount of memory when pushing, pulling or checking out.

Corruption

When an RCS repository gets corrupted, chances are that all backups stored inside become inaccessible. Svn stores revisions as deltas relative to the previous revision; when one delta file becomes unreadable, all future revisions are affected. Git does not just store deltas and is more defensive against corruption with built-in SHA-1 hashing. Furthermore, git repositories can easily be replicated with full history, so the chances of corruption are slim. But even with the best RCS tools, there is an extra non-trivial layer between the filesystem and the data, and this layer is a liability.

Yet another tool

Every machine that uses backups requires the RCS tool to be installed. This is only a minor inconvenience, but why use yet another tool when the standard unix tools work just fine?

What to use then?

My advice for storing database backups is simple: create timestamped sql dumps periodically and compress them with bzip2. Keep them around as long as your data retention policy requires it or until you run out of space, which will be a long, long time in an era where hard disk space is measured in terabytes.
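
A minimal sketch of such a backup script in ruby, meant to be run from cron; the database name and backup directory are placeholders:

#!/usr/bin/env ruby
# Dump the database to a timestamped, bzip2-compressed file.
# 'mydb' and /var/backups are placeholders for your own setup.
timestamp = Time.now.utc.strftime('%Y%m%d-%H%M%S')
dump = "/var/backups/mydb-#{timestamp}.sql.bz2"
system("mysqldump --single-transaction mydb | bzip2 > #{dump}") or abort "backup of mydb failed"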

Optimize GROUP BY a ORDER BY b in mysql

The following query took 150 ms on a dataset of a few thousand rows, even though there were indexes on companies.id, companies.name and people.company_id.

SELECT companies.id, count(people.id)
FROM companies
LEFT JOIN people ON companies.id = people.company_id
GROUP BY companies.id
ORDER BY companies.name

EXPLAIN revealed “Using index; Using temporary”. Turns out that mysql can use only a single index for both grouping and sorting. When sorting and grouping on different columns, a temporary table needs to be created and sorted after grouping.

The solution? GROUP BY companies.name, companies.id. The query now takes under 10 ms.
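
For completeness, the rewritten query. Since companies.id is unique, grouping on (name, id) produces exactly the same groups, but grouping and ordering now agree on companies.name and the temporary table disappears:

SELECT companies.id, count(people.id)
FROM companies
LEFT JOIN people ON companies.id = people.company_id
GROUP BY companies.name, companies.id
ORDER BY companies.name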

Selenium integration testing: optimize login

Integration tests for highly dynamic web applications are currently bound to happen in the browser. There exist in-memory solutions that execute javascript, but none of them seem to have a fully accurate DOM implementation at the moment. Testing in the browser is slow and every optimization is welcome.

A logged in user is a prerequisite for many integration tests. Performing this step in the browser requires going to the login page, filling out the form, submitting and waiting for the initial page. Testing the login procedure is of course crucial, but it should not be tested a thousand times.

When using the cookie session store in rails, a logged in user will have a signed cookie containing his user id. On every page load, this cookie is sent back to the web application, where the signature is verified and the user is assumed to be logged in. Integration tests can be optimized by storing the contents of that cookie, and putting it back in the browser when a logged in user is required.

We implemented this optimization in our cucumber integration tests which use selenium. The code is below, hope it helps.

Given /^I am authenticated$/ do
  u = Factory(:user, :id => 1, :login => "uname", :password => "pass")
  c_name = Rails.configuration.action_controller.session[:key]
  if cookie = Thread.current[:selenium_cookie]
    # Reuse the signed session cookie from a previous login.
    selenium.create_cookie("#{c_name}=#{cookie}", :path => '/')
  else
    # Log in through the browser once and remember the session cookie.
    visit "/"
    selenium.wait_for_page 5
    fill_in("login", :with => "uname")
    fill_in("password", :with => "pass")
    click_button("Inloggen")   # "Inloggen" / "Uitloggen" are Dutch for log in / log out
    selenium.wait_for_page 5
    response.should contain("Uitloggen")
    Thread.current[:selenium_cookie] = selenium.cookie(c_name)
  end
end

JDBC access with distributed ruby

JDBC, Java Database Connectivity, is the standard database driver interface on the java platform. Since java is ubiquitous, most database vendors provide JDBC drivers to access their data. When a ruby application needs to use a legacy data source, sometimes the only option is going through JDBC.

The database toolkit Sequel can use JDBC data sources, but only when running on JRuby. Although JRuby is compatible with ruby 1.8.7, not every application can be run on it, especially when it depends on gems that define C extensions.

Fortunately, distributed ruby (DRb) exists. It allows a server to expose an object which its clients can use like any local ruby object. A JRuby server can expose the Sequel object, which other ruby clients can then use to access JDBC data sources.

Both the server-side and the client-side code are pretty straightforward. The client should also depend on the sequel gem, since only objects, not their class definitions, can be marshaled over distributed ruby.

# server side
require 'drb'
require 'java'
require 'rubygems'
require 'sequel'
DRb.start_service 'druby://localhost:20000', Sequel
# It might be needed to instantiate the driver here,
# so it is available when a connection string is given.
DRb.thread.join

# client side
require 'drb'
require 'rubygems'
require 'sequel'
DRb.start_service
sequel = DRbObject.new nil, 'druby://localhost:20000'
db = sequel.connect("jdbc connection string here")
db.from('table').each {|r| puts r[:id]} # play!

NativeException

Inside the JDBC driver, native java exceptions can be raised. JRuby wraps these exceptions in a NativeException class, so ruby can rescue them and provide a stack trace. Distributed ruby provides stack traces for exceptions raised in a remote ruby, but it cannot handle NativeException because the class does not exist in MRI ruby. In short, when an exception is raised by java, the following cryptic error message will appear.

DRb::DRbUnknownError: NativeException

To fix this and get a full stack trace of NativeExceptions, the class needs to be defined in the client.

class ::NativeException < RuntimeError ; end
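
With the class defined, java-side errors can be rescued like any other ruby exception and come with a usable backtrace (the table name below is made up):

begin
  db.from('no_such_table').all
rescue NativeException => e
  puts e.message   # the message and backtrace of the java exception
end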

Using arel with rails 2.3.4+

Arel, a relational algebra library, allows expressing complex relational (SQL) queries in ruby. It is used underneath ActiveRecord 3 for generating queries, where it is underappreciated in my opinion. Arel is extremely powerful and should become a weapon of choice for query manipulation in ruby. Although it is still in beta, it is very stable and usable.

There is one flaw: it depends on ActiveSupport 3 and has a hidden dependency on ActiveRecord 3 for the database drivers. Although the latter will be resolved by releasing the database drivers as a separate gem, or inside arel itself, the ActiveSupport dependency prevents using arel in earlier versions of rails. It turns out this dependency is really artificial: arel plays nice with rails 2.3.4 and up. I forked the arel project and added some minor modifications so it can be installed and used in these older versions of rails.

Usage

# config/environment.rb
config.gem 'arel-compat', :lib => 'arel'

Now, to integrate arel with the models at a low level, add an initializer.

# config/initializers/arel_integration.rb
class ActiveRecord::Base
  class << self
    delegate :[], :to => :arel_table

    def arel_table
      @arel_table ||= Arel::Table.new(table_name, :engine => arel_engine)
    end

    def arel_engine
      # Not correct when working with multiple connections.
      @@arel_engine ||= Arel::Sql::Engine.new(ActiveRecord::Base)
    end
  end
end

After that, the fun starts. Note that arel is low-level: when executing an arel query, an arel result is returned instead of model objects. However, it is easy to use the sql generated by arel to select model objects.

arel = Person.arel_table.where(
  Person[:first_name].matches("test%").and(
  Person[:last_name].eq(nil)))
Person.find_by_sql(arel.to_sql)

Known issues

One spec fails on the mysql driver: offset without a limit.

Person.arel_table.skip(10)
=> SELECT `people`.* FROM `people`

When a limit is specified, it will behave correctly.

Person.arel_table.skip(10).take(5)
=> SELECT `people`.* FROM `people` LIMIT 10, 5

Since offset is seldom used without a limit, we did not bother to patch our arel fork.

Postscript

With the rails ecosystem growing, dependencies become an important issue. A medium-sized application can easily depend on 50 gems. Bundler solves the gem resolution problem so an application has a compatible set of gems. However, bundler can only resolve gems whose declared dependencies are compatible. When adding arel to a rails 2.3.4 project, bundler fails.

That is why the other half of the problem lies with the gem developers themselves. Arel should not depend on ActiveSupport, period. When presented as a “framework to build ORM frameworks”, it should not bring in a massive dependency that is incompatible with some environments.

HTTP Basic Authentication with Devise

Devise is becoming a popular gem for adding modular authentication to a rails application. It builds on top of warden, which provides a pluggable architecture for multiple authentication strategies at the rack level.

Since the cool kids are using it, we did not want to be left out and ported our main application from restful_authentication. Aside from being cool, the switch will make porting to rails 3 easier, since the latest devise is compatible with it.

The application uses basic http authentication for private RSS feeds and ical subscriptions. This is pretty common at the service level of an application; machines do not like login forms. Devise works with basic authentication out of the box, but only when the authentication headers are already present in the request. When they are not, devise returns a 302 redirect to the login form, and the RSS reader gives up.

The solution is to create a new devise strategy in config/initializers/devise.rb.

class HttpAuthenticatableNonHtml < Devise::Strategies::HttpAuthenticatable
  # Also try this strategy for non-html requests (feeds, ical) that
  # do not carry authentication headers yet.
  def valid?
    not request_format.html? or super
  end

  # Return an empty string instead of nil, so authentication is
  # attempted (and answered with a 401) instead of being skipped.
  def http_authentication
    super or ''
  end
end
Warden::Strategies.add(:http_auth_non_html, HttpAuthenticatableNonHtml)

Warden then needs to be instructed to use the strategy, inside the Devise.config block:

config.warden do |manager|
  manager.default_strategies.unshift :http_auth_non_html
end

The strategy will return a 401 with authentication realm when accessing a protected resource that is not html.

How warden and devise work in rails

Figuring out this solution required diving into the warden and devise code, which is quite intimidating at first. I created a diagram that hopefully makes the basic workings of the authentication stack easier to understand.

  1. The HTTP request enters the rack stack.
  2. Warden gets the request and forwards it in the rack stack, adding an environment variable “warden” that points to an authentication proxy.
  3. The request gets dispatched to the rails controller, which may call authenticate_user! from a filter (see the sketch after this list). This is an alias for request.env['warden'].authenticate!(:scope => :user).
  4. The warden proxy picks an authentication strategy. Any strategy for which valid? returns true is tried.
  5. When authentication succeeds, a user object is returned to the controller. When it fails, the symbol :warden is thrown down the stack, and caught by the warden rack application. The latter will return a response, which is a redirect to the login page by default. This can be overridden by calling warden.custom_response!.
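
A minimal controller sketch of step 3; the FeedsController is hypothetical, while authenticate_user! and current_user are provided by devise:

class FeedsController < ApplicationController
  # Triggers request.env['warden'].authenticate!(:scope => :user)
  # before every action in this controller.
  before_filter :authenticate_user!

  def show
    # current_user is guaranteed to be set here;
    # render the private RSS feed for that user.
  end
end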

Generic deep merge for ActiveRecord

A while ago, a client asked us for a way to find and remove duplicate companies from his database. The mysql database behind the rails application contained over 10,000 companies, each having related contacts, phone numbers, email addresses, notes, … The data was imported from multiple sources and inevitably contained a lot of duplicates. We agreed on a two-fold solution:

  • Automatically merge companies that match certain conditions (same name, legal form, vat no). This would remove 90% of the duplicates without human intervention.
  • Construct an interface to find companies with one or more fields in common (name, email, phone, www, …) and merge them set by set. The remaining 10% could be merged under supervision of the user this way.

The core implementation problem was merging the company records together. Even though the client talked about removing duplicates, he did not want to lose any related information from a duplicate. We had a feeling the client would soon ask us to do the same with other record types, and therefore decided to implement it generically. The code is on github.

Merging attributes

Since a table row can only have one value for each column, some attributes of duplicate objects will need to be discarded. To control the attribute values on the merged object, the objects to merge need to be ordered. The first object gets priority; when it contains blank attributes, they are looked up on the remaining objects in order.
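
A hypothetical illustration of the priority rule (the attribute values are made up; merge! is the API described below):

# a has priority over b
a = Company.find(1)   # :name => "Acme",      :vat_no => nil
b = Company.find(2)   # :name => "Acme BVBA", :vat_no => "BE0123456789"
a.merge!(b)
a.name     # => "Acme"          (the master keeps its own value)
a.vat_no   # => "BE0123456789"  (blank on the master, filled in from b)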

Merging associations

belongs_to

Belongs to associations are backed with a foreign key attribute. When merging the attributes, belongs_to associations are already covered.

has_one

Suppose we are merging company A and company B and Company has_one :financial_info.

  • Financial info present on company A and not on B => use the one from company A
  • Financial info present on company B and not on A => use the one from company B
  • Financial info present on both => merge the two

has_many, has_and_belongs_to_many

When merging company A with 2 phone numbers and company B with 1 phone number, the resulting company should have all 3 phone numbers. That is, unless the phone number of company B is already on company A as well. Associated objects should be compared and duplicates merged recursively. Comparison may differ per model: phone numbers are best compared by flattening them into a string of digits only, so the separators do not mess up the comparison.
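
A sketch of such a model-specific comparison, using the merge_equal? hook described below; the Phone model is hypothetical:

class Phone < ActiveRecord::Base
  # Compare numbers by their digits only, so "02/345 67 89"
  # and "023456789" count as the same number.
  def merge_equal?(other)
    number.gsub(/\D/, '') == other.number.gsub(/\D/, '')
  end
end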

has_many :through

These associations can be left alone, since they depend on another has_many association that can be merged.

The API

company.merge!(duplicate1, duplicate2, ...)

The object on which merge! is called becomes the master object; duplicates are merged into this object and destroyed afterwards. The order in which the duplicates are passed to the merge function matters, since it determines the priority for merging attributes. When the attribute alpha_code is nil on the master, it will get the value from duplicate1. When not present on duplicate1, it will get the value from duplicate2, and so on.

Hooks

merge_equal?(object)

Compares self with the object and returns true if they can be considered the same. When records with a has_many association are merged, associated objects are compared and duplicates are destroyed.

merge_attribute_names

The names of the attributes that should be merged. Defaults to all attributes minus id, timestamps and other metadata.

merge_exclude_associations

The names of associations that should not be merged. Can be used to exclude irrelevant or duplicate associations.
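
A hypothetical example of both hooks on Company (the legacy_code column and the audit_logs association are made up):

class Company < ActiveRecord::Base
  has_many :audit_logs

  # Merge all attributes except the id, the timestamps and a legacy column.
  def merge_attribute_names
    attribute_names - ['id', 'created_at', 'updated_at', 'legacy_code']
  end

  # Audit logs describe the original record and should stay out of the merge.
  def merge_exclude_associations
    [:audit_logs]
  end
end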

Known issues

Currently, the merge algorithm does not take cycles in the associations into account. Since the reverse belongs_to associations are never considered, this should not be a problem for most ActiveRecord models. An infinite loop may occur when Company has_many :companies and a company points to itself.

To make sure the merge does not leave invalid foreign keys behind, referential integrity constraints can be enforced at the database level.