ChoiceMaker Technologies
ChoiceMaker
by David M. Raab
DM News
October, 2003
Most direct marketers probably assume that software to match names and
addresses originated with merge/purge systems. But there is
actually a long history of earlier matching technology. For
example, the original Soundex algorithm, designed to overcome
spelling variations by building phonetic name indexes, was
patented in 1918 and used extensively in setting up the original
Social Security system files in the 1930s.
Government agencies have continued to develop matching
systems independently of direct marketers. Apart from the
prosaic reason that the two groups had little contact, there is
also a subtle but significant difference in their requirements.
Merge/purge systems were primarily developed to pool names for
mailing lists. Since the price of an error was the cost of a
duplicate mail piece, accuracy could be compromised to gain
speed and efficiency. Government systems were typically used to
search for individuals in a single existing file, whether of
criminal suspects, immigration documents, or tax records.
Accuracy and real-time response were much more important;
handling big batch jobs with disparate sources was not.
Applications such as enterprise-wide customer relationship
management actually require functions more similar to
governmental search systems than to traditional merge/purge. So
it's not surprising to see an increasing number of commercial
products with origins in government applications, as well as an
increasing number of products with both types of users. Nor,
given the higher priority placed on accuracy, is it surprising
to see technical innovations that promise more reliable results.
ChoiceMaker 2.1 (ChoiceMaker Technologies,
646-336-4441,
www.choicemaker.com) illustrates these trends nicely. The
system was originally developed to help the New York City
Department of Health find duplicates in its registry of
children's immunization records. It has since been sold to
commercial as well as other government clients. And it employs
technology that, according to the vendor, has proven more
accurate than competing products in several head-to-head tests.
In fact, ChoiceMaker combines several innovative
technologies. At the lowest level, the system is written in
Java, which lets it run on nearly any hardware and connect to
nearly any data source. Inputs are defined with a schema that
not only identifies the available fields, but can also specify
relationships across data tables, incorporate validity checks,
parse entries into separate elements, and create derived values
such as Soundex codes. Processing rules are written either in
Java or in ChoiceMaker's own ClueMaker language. ClueMaker
extends Java with specialized matching functions such as field
swaps (e.g. comparing first name in one record against last name
in another record) and data stacking (allowing multiple values
in a field, such as old and new address). ClueMaker statements
are automatically converted into Java for execution.
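The field swap and data stacking ideas can be illustrated in plain
Java. This is only a minimal sketch: the Person class and both
method names are hypothetical, and it shows neither ClueMaker syntax
nor the Java that ClueMaker actually generates.

```java
import java.util.Arrays;
import java.util.List;

final class FieldSwapSketch {
    // Hypothetical record layout for illustration; not ChoiceMaker's schema.
    static final class Person {
        final String firstName;
        final String lastName;
        final List<String> addresses; // "stacked" values, e.g. old and new address

        Person(String firstName, String lastName, List<String> addresses) {
            this.firstName = firstName;
            this.lastName = lastName;
            this.addresses = addresses;
        }
    }

    // True when the names agree directly or with first and last name swapped.
    static boolean namesMatchAllowingSwap(Person a, Person b) {
        boolean direct = a.firstName.equalsIgnoreCase(b.firstName)
                && a.lastName.equalsIgnoreCase(b.lastName);
        boolean swapped = a.firstName.equalsIgnoreCase(b.lastName)
                && a.lastName.equalsIgnoreCase(b.firstName);
        return direct || swapped;
    }

    // True when any stacked address on one record equals any on the other.
    static boolean anyAddressMatches(Person a, Person b) {
        for (String x : a.addresses) {
            for (String y : b.addresses) {
                if (x.equalsIgnoreCase(y)) {
                    return true;
                }
            }
        }
        return false;
    }

    public static void main(String[] args) {
        Person p = new Person("Nelson", "Maria", Arrays.asList("12 Oak St"));
        Person q = new Person("Maria", "Nelson",
                Arrays.asList("40 Elm Ave", "12 Oak St"));
        System.out.println(namesMatchAllowingSwap(p, q));
        System.out.println(anyAddressMatches(p, q));
    }
}
```

In this sketch the two records agree on name only because of the
swap, and on address only because the second record carries both an
old and a new address.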
ChoiceMaker uses this technology to read, parse and
standardize input in fairly conventional fashion. The processed
data is then stored in a reference table. When a new record is
presented for matching, the system selects records from this
table for comparison. Like other systems, ChoiceMaker limits
this selection to records that are similar enough to be
potential matches. ChoiceMaker adjusts the selection based on
the distinctiveness of the input: for an unusual name like
Guardado, all records with the same name may be returned; for a
common name like Nelson, the selection might be restricted to
matches on name plus Zip code. The fields to use in these
selections and the maximum number of names to return for each
search are specified during system setup. The determination of
how many selection criteria are needed is made automatically by
the system, using precalculated statistics on the frequency of
different values within the reference table. A handful of other
matching systems use similar techniques, but most matching
software is much less advanced.
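The frequency-based selection described above might look roughly
like the following. It is a sketch under stated assumptions, not
ChoiceMaker's implementation: the Rec class, the in-memory list
standing in for the reference table, and the rule "add Zip code when
the name alone would exceed the limit" are all hypothetical.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

final class BlockingSketch {
    // Hypothetical reference-table row.
    static final class Rec {
        final String lastName;
        final String zip;

        Rec(String lastName, String zip) {
            this.lastName = lastName;
            this.zip = zip;
        }
    }

    private final List<Rec> reference;
    private final Map<String, Integer> nameCounts = new HashMap<>();
    private final int maxCandidates; // set once, during setup

    BlockingSketch(List<Rec> reference, int maxCandidates) {
        this.reference = new ArrayList<>(reference);
        this.maxCandidates = maxCandidates;
        // Precalculate how often each last name occurs in the reference data.
        for (Rec r : reference) {
            nameCounts.merge(r.lastName, 1, Integer::sum);
        }
    }

    // Select by name alone when the name is rare; otherwise require Zip too.
    // A fuller version would keep adding criteria until under the limit.
    List<Rec> candidates(Rec input) {
        boolean nameIsRare =
                nameCounts.getOrDefault(input.lastName, 0) <= maxCandidates;
        List<Rec> out = new ArrayList<>();
        for (Rec r : reference) {
            if (!r.lastName.equals(input.lastName)) {
                continue;
            }
            if (nameIsRare || r.zip.equals(input.zip)) {
                out.add(r);
            }
        }
        return out;
    }
}
```

The name counts are computed once when the sketch is constructed,
which mirrors the article's point that the statistics are calculated
ahead of time rather than at search time.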
Once the candidate records are returned, ChoiceMaker matches
these against the input. This is the most unusual, and
sophisticated, aspect of ChoiceMaker. The system first evaluates
"clues" that indicate whether records match or differ: same
first name, phonetically similar last name, different birth
years, and so on. These clues are written in ClueMaker and can
be quite complex--for example, checking whether a pair of
records contains one address in the Midwest or Northeast and
another in Florida or Arizona, to find people who head south for
the winter. A clue may yield "match", "differ" or, if the
appropriate data is not available, no result. Where some gradation is
appropriate--such as degree of near match or match on common
name vs. match on unusual name--separate clues are created for
each level. This is part of the reason a typical installation
uses about 200 clues against many fewer data elements.
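A clue with three possible outcomes can be sketched in Java as
follows. The Outcome enum, the Clue interface and the two sample
clues are hypothetical names chosen for illustration; real clues are
written in ClueMaker, whose syntax is not shown here.

```java
final class ClueSketch {
    // The three results the article describes for a clue.
    enum Outcome { MATCH, DIFFER, NONE }

    // Hypothetical record with two fields; null means the value is missing.
    static final class Rec {
        final String firstName;
        final Integer birthYear;

        Rec(String firstName, Integer birthYear) {
            this.firstName = firstName;
            this.birthYear = birthYear;
        }
    }

    // A clue inspects a pair of records and reports one outcome.
    interface Clue {
        Outcome evaluate(Rec a, Rec b);
    }

    static final Clue FIRST_NAME = (a, b) -> {
        if (a.firstName == null || b.firstName == null) {
            return Outcome.NONE; // data not available
        }
        return a.firstName.equalsIgnoreCase(b.firstName)
                ? Outcome.MATCH : Outcome.DIFFER;
    };

    static final Clue BIRTH_YEAR = (a, b) -> {
        if (a.birthYear == null || b.birthYear == null) {
            return Outcome.NONE; // data not available
        }
        return a.birthYear.equals(b.birthYear)
                ? Outcome.MATCH : Outcome.DIFFER;
    };
}
```

Graded evidence, such as exact match versus near match, would be
expressed here as additional Clue instances rather than as extra
outcome values, which is one way to read the article's remark that
separate clues are created for each level.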
The system must combine the individual clue results to reach
a final decision. ChoiceMaker does this by assigning statistical
weights to the clues and comparing the combined weights of the
"match" clues vs. "differ" clues. Record pairs with a clear
result are classified automatically; others can be flagged for
manual review.
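One simple way to picture this step is to add the weights of the
clues that said "match", subtract the weights of those that said
"differ", and compare the total with two cutoffs. The sketch below
assumes exactly that; the clue names, weights and thresholds are
invented for illustration, and the vendor's actual scoring formula
is not described in this article.

```java
import java.util.Map;

final class DecisionSketch {
    enum Outcome { MATCH, DIFFER, NONE }

    enum Decision { MATCH, DIFFER, REVIEW }

    // Sum of "match" weights minus sum of "differ" weights.
    // Clues with no result, or with no known weight, contribute nothing.
    static double score(Map<String, Outcome> results,
                        Map<String, Double> weights) {
        double total = 0.0;
        for (Map.Entry<String, Outcome> e : results.entrySet()) {
            double w = weights.getOrDefault(e.getKey(), 0.0);
            if (e.getValue() == Outcome.MATCH) {
                total += w;
            } else if (e.getValue() == Outcome.DIFFER) {
                total -= w;
            }
        }
        return total;
    }

    // Clear results are classified; anything in between goes to a person.
    static Decision decide(double score, double lower, double upper) {
        if (score >= upper) {
            return Decision.MATCH;
        }
        if (score <= lower) {
            return Decision.DIFFER;
        }
        return Decision.REVIEW;
    }
}
```

The gap between the two hypothetical cutoffs controls how much work
is sent to manual review: widening it trades reviewer time for fewer
automatic errors.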
The weights are determined using a machine learning technique
called "maximum entropy modeling". This involves submitting
several thousand records with matches already marked; the system
then automatically derives the set of weights that most closely
predict the marked matches. Such automated training is highly
unusual in the world of matching software: even the most
advanced systems typically rely on users to manually refine
match rules by looking at missed or false matches and making
adjustments.
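To give a feel for what deriving weights from marked examples means,
the sketch below fits one weight per clue by gradient ascent on the
likelihood of the marked pairs. For a two-way match/differ decision
with on/off clue features, a maximum entropy model takes the same
form as logistic regression, which is what this code implements.
The class, the tiny data layout and the training settings are
assumptions for illustration; the article does not describe
ChoiceMaker's actual training procedure.

```java
final class MaxEntSketch {
    static double sigmoid(double z) {
        return 1.0 / (1.0 + Math.exp(-z));
    }

    // Modeled probability that a pair is a match, given which clues fired.
    static double probability(double[] weights, int[] fired) {
        double z = 0.0;
        for (int j = 0; j < weights.length; j++) {
            z += weights[j] * fired[j];
        }
        return sigmoid(z);
    }

    // features[n][j] is 1 if clue j fired on marked pair n, else 0.
    // labels[n] is 1 for a marked match, 0 for a marked non-match.
    static double[] train(int[][] features, int[] labels,
                          int iterations, double rate) {
        double[] weights = new double[features[0].length];
        for (int it = 0; it < iterations; it++) {
            double[] gradient = new double[weights.length];
            for (int n = 0; n < features.length; n++) {
                // Difference between the marked answer and the prediction.
                double error = labels[n] - probability(weights, features[n]);
                for (int j = 0; j < weights.length; j++) {
                    gradient[j] += error * features[n][j];
                }
            }
            for (int j = 0; j < weights.length; j++) {
                weights[j] += rate * gradient[j];
            }
        }
        return weights;
    }
}
```

A clue that fires mostly on marked matches ends up with a positive
weight, and one that fires mostly on marked non-matches ends up with
a negative weight, without anyone adjusting rules by hand. The sketch
omits refinements such as a bias term and regularization.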
Of course, ChoiceMaker still requires significant human
effort: to define input data, specify parsing and
standardization rules, build new clues, create test cases, and
review results. The vendor says it takes about two weeks of
labor to set up a sophisticated matching process. Whether this
is more or less than other systems require depends on the
circumstances: for unusual matching problems, ChoiceMaker would
probably have an advantage. The system includes several tools to
help with development, but considerable expertise is still
required.
ChoiceMaker was originally developed in 1998 and has several
current installations. Pricing depends on the application and
can range from $7,500 for a development license to hundreds of
thousands of dollars for a large implementation.
David M. Raab is president of ClientXClient, a consulting
and software firm specializing in customer value optimization.
He can be reached at
draab@clientxclient.com.
Copyright 2003 CLIENTxCLIENT Contact:
info@CLIENTxCLIENT.com
Call: 908.350.3012