A while ago, a client asked us for a way to find and remove duplicate companies from his database. The mysql database behind the rails application contained over 10,000 companies, each having related contacts, phone numbers, email addresses, notes, … The data was imported from multiple sources and inevitably contained a lot of duplicates. We agreed on a two-fold solution
- Automatically merge companies that match certain conditions (same name, legal form, vat no). This would remove 90% of the duplicates without human intervention.
- Construct an interface to find companies with one or more fields in common (name, email, phone, www, …) and merge them set by set. The remaining 10% could be merged under supervision of the user this way.
The core implementation problem was merging together the company records. Even though the client talked about removing duplicates, he did not want to lose any related information on a duplicate. We had a feeling the client would soon ask for doing the same with other record types and therefore decided to implement it generically. The code is on github.
Since a table row can only have 1 value for each column, some attributes of duplicate objects will need to be discarded. To control the attribute values on the merged object, the objects to merge need to be ordered. The first object gets priority, when it contains blank attributes, they can be looked up on the remaining objects in order.
Belongs to associations are backed with a foreign key attribute. When merging the attributes, belongs_to associations are already covered.
Suppose we are merging company A and company B and Company has_one :financial_info.
- Financial info present on company A and not on B => use the one from company A
- Financial info present on company B and not on A => use the one from company B
- Financial info present on both => merge the two
When merging company A with 2 phone numbers and company B with 1 phone number, the resulting company should have all 3 phone numbers. That is, if the phone number of company B is not already on company A as well. Associated objects should be compared and duplicates merged recursively. Comparison may differ for some models. Phone numbers are best compared by flattening them into a string with only numbers, this way the separators do not mess up the comparison.
These associations can be left alone, since they depend on another has_many association that can be merged.
company.merge!(duplicate1, duplicate2, ...)
The object on which merge is called becomes the master object, duplicates are merged into this object and destroyed afterwards. The order in which the duplicates are passed to the merge function matters, since this will determine the priority for merging attributes. When the attribute alpha_code is nil on the master, it will get the value on duplicate1. When not present on duplicate1, it will get the value of duplicate2, and so on.
Compares self with the object and returns true if they can be considered the same. When records with a has_many association are merged, associated objects are compared and duplicates are destroyed.
The names of the attributes that should be merged. Defaults to all attributes minus id, timestamps and other meta data.
The names of associations that should not be merged. Can be used to exclude irrelevant or duplicate associations.
Currently, the merge algorithm does not take cycles in the associations into account. Since the reverse belongs_to associations are never considered, this should not be a problem for most ActiveRecord models. An infinite loop may occur when Company has_many :companies and a company points to itself.
To make sure the merge does not leave invalid foreign keys, referential integrity can be used on the database.