Customer Matching Systems
by David M. Raab
DM Review
June-October, 2001
Underlying every grand plan for customer relationship management is a
centralized customer database--one that consolidates information
about each customer from sources throughout the company. This
complete picture of each customer is the foundation for
understanding what actions will get the greatest value from the
relationship.
No one with meaningful CRM experience would minimize the
effort needed to build such a database. In fact, most
practitioners quickly acknowledge that it is by far the greatest
technical challenge involved in a new CRM project. (Most, but
not all: vendors of integrated front office systems often assume
their standard operational database will be the central customer
repository. But this breezy confidence typically crumbles when
they are told they must integrate data from back office
operations and from whatever touchpoints--there are always a
few--that run outside of the main front office system.) Still,
once the difficulty of building the database is solemnly noted,
attention usually swings to more exciting tasks like choosing
vendors and fighting political battles. The nuts and bolts of
customer data consolidation are set aside as problems to resolve
during implementation. The unspoken assumption is that the
available tools all give roughly equivalent performance, so there
is no point in assessing them in detail.
In an IT industry where people have strong opinions on all
conceivable, and a few incomprehensible, technical issues, this
relaxed indifference is a remarkable anomaly. Though it's
tempting to see it as evidence of hitherto-unsuspected reserves
of human rationality, it more likely reflects ignorance of the
issue at hand. Inept marketing by vendors who have failed to
effectively distinguish their products could play a role as
well.
In fact, there are significant differences among customer
data consolidation tools and techniques. These relate to the
very specific task of matching customer names and addresses--an
esoteric process that is unfamiliar to most corporate IT groups,
even if they have experience with the types of consolidation
required for non-customer data. The difference is that most
non-customer consolidation revolves around exact matches,
conversion rules, or translation tables. These processes are by
no means trivial: researching, building and maintaining them is
a major task in a company with many complex systems. But they do
ultimately produce a set of rules that determine unambiguously
whether or not two records match (at least in most cases).
Name and address matches are inherently less certain. There
is often no way to know when looking at two customer records
whether or not they refer to the same person. The names are
spelled similarly--is it a real difference or a data entry
error? City names differ within the same postal code--is the
code wrong or is one city name a colloquial variation or vanity
address? Women with two different last names share the same
address: are they separate people or one woman with her married
and unmarried names? The same name appears at two different
addresses: is it two people, one person who moved, or one person
with two addresses? Two dissimilar records have the same phone
number: are they the same person, or has one moved and the
number been reassigned? Even a unique identifier like Social
Security Number can be misreported, miskeyed or just plain
missing. As privacy concerns and regulations accumulate,
identifiers like telephone and Social Security number will be
less available, so using them as a matching shortcut will be
even less useful. And as life becomes generally more
complicated, people have more non-matching attributes: how many
phone numbers do you have? How many e-mail addresses? Do you
receive mail at a Post Office box for privacy or business
reasons?
And let's not even get started on households or business
matching.
In short, there is no way to create a straightforward
mechanical process for name and address matching. But systems to
provide approximate matches do exist. In fact, there are three
levels of such systems, each building on the foundation of its
predecessors.
The most basic matching systems were built to identify
(merge) and remove (purge) duplicate names on mailing files.
Major vendors are Group 1 Software and FirstLogic i.d.Centric
(Postalsoft); other competitors include Sagent and SAS's
DataFlux subsidiary. Low-end, PC-based alternatives are
available from Mailer's Software and Peoplesmith.
What these systems essentially do is compare one record with
another. But it would be horribly ineffective to simply treat
each record as one large string and compare the strings to each
other: there are so many variations in how addresses may be
formatted that legitimate matches would be rejected because the
strings didn't line up. So merge/purge systems first split the
input records into standard fields, such as first name, last
name, street, city, and state. There are usually about a dozen
such categories, including things like title (Mr., Mrs.),
generation (Jr., Sr., III), street type (St., Ave., Blvd.),
directional (North, South) and apartment number.
Often the input record is already split into such
fields--hopefully, the data was captured that way in the first
place, which is by far the most effective approach. If not, the
merge/purge system will parse the record into components,
looking for key words, standard formats (e.g. a 5 digit string
is probably a Zip code), positions within the record (usually
the name line comes first, then the street line, then the
city/state/Zip) and positions within each line (usually the
first name comes before the last name). Parsing is not perfect,
particularly when records in the same file have been entered in
different formats (e.g., a mix of last-name-first and
first-name-first) or when business addresses are involved. But
most systems can get most records parsed correctly.
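The parsing clues just described (key words, standard formats, position within the record and position within the line) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the tiny key word sets, the field names and the three-line input layout are all hypothetical, and commercial parsers rely on far larger tables and many more rules.

```python
import re

# Hypothetical key word tables; real products ship far larger ones.
TITLES = {"mr", "mrs", "ms", "dr"}
GENERATIONS = {"jr", "sr", "ii", "iii"}
STREET_TYPES = {"st", "ave", "blvd", "rd", "ln"}
DIRECTIONALS = {"n", "s", "e", "w", "north", "south", "east", "west"}


def _norm(token):
    """Lower-case a token and drop trailing punctuation for table lookups."""
    return token.strip(".,").lower()


def parse_record(lines):
    """Split a name / street / city-state-Zip record into labeled fields."""
    name_line, street_line, city_line = lines  # position within the record
    fields = {}

    # Name line: peel off title and generation by key word, then rely on
    # position within the line (first name before last name).
    tokens = name_line.split()
    if tokens and _norm(tokens[0]) in TITLES:
        fields["title"] = tokens.pop(0)
    if tokens and _norm(tokens[-1]) in GENERATIONS:
        fields["generation"] = tokens.pop()
    if tokens:
        fields["first_name"] = tokens[0]
        fields["last_name"] = tokens[-1] if len(tokens) > 1 else ""

    # Street line: leading digits are the house number; key words mark the
    # directional and the street type.
    tokens = street_line.split()
    if tokens and tokens[0].isdigit():
        fields["house_number"] = tokens.pop(0)
    if tokens and _norm(tokens[0]) in DIRECTIONALS:
        fields["directional"] = tokens.pop(0)
    if tokens and _norm(tokens[-1]) in STREET_TYPES:
        fields["street_type"] = tokens.pop()
    fields["street_name"] = " ".join(tokens)

    # Last line: a 5 digit string at the end is probably a Zip code.
    m = re.match(r"^(.*?),?\s+([A-Za-z]{2})\s+(\d{5})(?:-\d{4})?$",
                 city_line.strip())
    if m:
        fields["city"], fields["state"], fields["zip"] = m.groups()
    return fields


record = parse_record(["Mr. John Smith Jr.", "123 N Main St.",
                       "Springfield, IL 62701"])
```

A last-name-first record or a business address would defeat this sketch, which is exactly the kind of imperfection noted above.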
The second preparation step usually involves standardization.
This mostly involves looking up words and their equivalents in
huge translation tables. Some standardization changes nicknames
and variations to a standard name: so Elizabeth, Liz, Beth and
Betty are all changed to Elizabeth. Titles might also be
standardized to change Mister to Mr. In addition, postal
standardization is applied to ensure street and city names are
spelled consistently and, when possible, to ensure that the
postal code matches the rest of the address. This requires more
than simple table translations; in fact, postal standardization
involves complicated parsing, string matching and validation
processes, which are generally embedded in systems outside the
merge/purge product. Postal standardization is sometimes run
before the merge/purge process begins: the output would be a
file in which the postal elements were parsed and standardized,
although the merge/purge system would still parse and
standardize the name line.
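The table-lookup part of standardization is simple enough to show directly. The sketch below assumes two tiny, hypothetical translation tables; it deliberately leaves out postal standardization, which needs the heavier parsing and validation described above.

```python
# Hypothetical translation tables; production tables hold many thousands of entries.
NICKNAMES = {"LIZ": "ELIZABETH", "BETH": "ELIZABETH", "BETTY": "ELIZABETH",
             "BILL": "WILLIAM", "BOB": "ROBERT"}
TITLES = {"MISTER": "MR", "MISSUS": "MRS"}


def standardize(value, table):
    """Return the table's standard form, or the cleaned value if none is listed."""
    key = value.strip().rstrip(".").upper()
    return table.get(key, key)


first_names = [standardize(n, NICKNAMES) for n in ("Elizabeth", "Liz", "Beth", "Betty")]
```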
Once the data is parsed and standardized, the merge/purge
system sorts it to bring together records that are likely to be
matches. This avoids having to compare every record to every
other record, which would be cost-prohibitive. Most systems
generate a sort key based on components of selected
elements--say the first three characters of the Zip code, first
three letters of the street name and house number. Some systems
generate multiple keys and run through the file several times in
the different sequences; this avoids missing matches because of
a flaw in one element of the key, such as a bad Zip code.
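A rough sketch of the sort key idea follows. The first key uses the components mentioned above; the second key, the field names and the sample records are assumptions made for illustration only.

```python
def zip_street_key(rec):
    """Sort key: first three Zip digits, first three street letters, house number."""
    return (rec["zip"][:3], rec["street_name"][:3].upper(), rec["house_number"])


def name_street_key(rec):
    """Hypothetical second key that does not depend on the Zip code at all."""
    return (rec["last_name"][:4].upper(), rec["street_name"][:3].upper(),
            rec["house_number"])


def sorted_passes(records, key_funcs=(zip_street_key, name_street_key)):
    """Yield the file once per key, sorted into that key's sequence."""
    for key_func in key_funcs:
        yield sorted(records, key=key_func)


records = [
    {"id": "a", "last_name": "Smith", "street_name": "Main", "house_number": "123", "zip": "62701"},
    {"id": "b", "last_name": "Jones", "street_name": "Oak", "house_number": "9", "zip": "62702"},
    # Same person as "a", but with a miskeyed Zip code.
    {"id": "c", "last_name": "Smith", "street_name": "Main", "house_number": "123", "zip": "92701"},
]
orders = [[r["id"] for r in one_pass] for one_pass in sorted_passes(records)]
```

In the first pass the bad Zip code pushes record "c" away from record "a"; in the second pass the two land next to each other, which is the point of running more than one sequence.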
In older merge/purge systems, the matches themselves were
often identified by a match key that was essentially an extended
version of the sort key: for example, it might have the first
three characters of the Zip code, first three letters of the
street name, house number, and first, third and fourth letters
of the last name. Records with the same key were assumed to be
actual matches. While very efficient from a processing
viewpoint, this method is not very accurate--it rejects records
because of minor differences and accepts records that are
obviously different simply because their keys are identical.
Changing the composition of the key usually means trading false
matches against missed matches without reaching a satisfactory
level of both. This method is no longer used by major
merge/purge products, although it still sometimes appears in
less sophisticated systems.
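Both failure modes are easy to reproduce. The function below builds the example key described above; the field names and the three sample records are hypothetical.

```python
def match_key(rec):
    """Zip (3) + street (3) + house number + 1st, 3rd and 4th letters of last name."""
    last = rec["last_name"].upper().ljust(4)  # pad very short names
    return (rec["zip"][:3] + rec["street_name"][:3].upper()
            + rec["house_number"] + last[0] + last[2] + last[3])


address = {"street_name": "Main", "house_number": "123", "zip": "62701"}
smith = dict(address, last_name="Smith")
smyth = dict(address, last_name="Smyth")   # probably the same person
saito = dict(address, last_name="Saito")   # clearly a different name

missed_match = match_key(smith) != match_key(smyth)
false_match = match_key(smith) == match_key(saito)
```

Smith and Smyth get different keys and are missed, while Smith and Saito share the letters S, I and T in the sampled positions and are treated as duplicates.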
Today's standard approach is to move through the sorted file
comparing groups of records. One method is to compare all
records within a certain distance of each other (e.g., up to ten
records away). Another is to compare all records within a "break
group", which is a set of records sharing key elements, such as
the same Zip code and street name. The break group method is
more flexible, since it will look at all similar records even
when there are many sharing a particular set of values. But some
systems place a limit on the size of the break group itself, in
which case even adjacent records may not be compared if they fall
on different sides of an arbitrary intra-group split.
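The difference between the two methods shows up in which candidate pairs each one generates. A minimal sketch, assuming the file is already sorted and using hypothetical field names:

```python
from itertools import combinations, groupby


def window_pairs(records, window):
    """Pair each record with the next `window` records in the sorted file."""
    for i in range(len(records)):
        for j in range(i + 1, min(i + window + 1, len(records))):
            yield (i, j)


def break_group_pairs(records, group_key):
    """Pair every record with every other record in the same break group.

    Assumes the file is sorted so that each break group is contiguous.
    """
    positions = range(len(records))
    for _, group in groupby(positions, key=lambda i: group_key(records[i])):
        yield from combinations(list(group), 2)


# Four records on Main Street followed by one on Oak Street.
records = [{"zip": "62701", "street": "MAIN"}] * 4 + [{"zip": "62701", "street": "OAK"}]
by_window = set(window_pairs(records, window=2))
by_group = set(break_group_pairs(records, lambda r: (r["zip"], r["street"])))
```

With a window of two, the first and fourth Main Street records are never compared, while the break group compares all four with each other and ignores the Oak Street record.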
The comparisons themselves measure the degree of similarity
between each pair of elements within the parsed records and then
apply a business rule to determine whether this combination of
similarities constitutes a match. Similarity is determined by
comparing the strings of text within each element. Of course,
exact equality is easy to find, so the challenge is to identify
and rank near matches. Systems may calculate the number of
characters that are the same, check for sequences that are the
same, adjust for transpositions, or even allow for letters that
are adjacent on a typewriter keyboard. Some systems look for
phonetic equivalents. Some preprocess the string by removing
vowels or double letters. Some treat numbers differently.
Different matching methods are applied to different field types:
it makes no sense to apply phonetic matching to a Zip code.
Sometimes the user can define the algorithm that applies to each
field; in other cases, the algorithm is predetermined. Generally
the algorithms produce a score that indicates the closeness of
the match between the two strings. Sometimes the system
automatically combines the element scores into a record-level
score, and the only job of the user is to decide what
record-level score will count as a match. In other cases, users
specify what scores count as element matches and how these are
combined to qualify for a match. For example, a rule may be that
the system must match on last name, street address, and at least
five of eight other elements. Even systems that allow precise
user control will provide default settings that the majority of
users accept in practice.
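As an illustration of element scores feeding a business rule, the sketch below uses Python's difflib.SequenceMatcher as a stand-in similarity measure. No vendor's actual algorithm is implied, and the threshold, field names and reduced element counts are assumptions.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Score from 0.0 to 1.0; a missing value never counts as a match."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.upper(), b.upper()).ratio()


def is_match(rec_a, rec_b, required, optional, min_optional, threshold=0.75):
    """Required elements must all match, plus at least `min_optional` others."""
    def element_matches(field):
        return similarity(rec_a.get(field, ""), rec_b.get(field, "")) >= threshold

    if not all(element_matches(f) for f in required):
        return False
    return sum(element_matches(f) for f in optional) >= min_optional


a = {"last_name": "Smith", "street_name": "Main", "first_name": "Elizabeth",
     "zip": "62701", "city": "Springfield"}
b = dict(a, last_name="Smyth", city="Springfeild")
c = dict(a, last_name="Jones")

required = ["last_name", "street_name"]
optional = ["first_name", "zip", "city"]
same_person = is_match(a, b, required, optional, min_optional=2)
different_person = is_match(a, c, required, optional, min_optional=2)
```

The rule quoted above would pass eight optional elements with min_optional set to five; the sketch uses three and two only to keep the sample records short.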
The trial-and-error labor needed to fine-tune match settings
is rarely worthwhile in the world of direct mail list
processing, where many different lists are processed together
and time is usually short. Companies building a customer
database may be more willing to invest in improving long-term
results by tailoring the system to the quirks of their own
source files.
Once the matches are identified, a merge/purge system is
designed to choose one record to keep and to discard the others.
In mailing list preparation, the selection often depends on
which source would charge least to use the record--so
merge/purge systems generally let users specify list priorities,
and sometimes have distribution functions to randomly allocate
matches across sets of source files. These systems also have
standard reports to show duplication across input systems, again
to help with direct mail analysis and list rental payments.
Matching for in-house customer databases has a different set of
concerns. Users are more likely to require tools to consolidate
data from the different sources and to pick the best information
where conflicting data appears. Such functions are not
necessarily available in merge/purge systems. But they are
important capabilities of the more advanced customer matching
systems that will be discussed next month.
* * *
The first article in this series described the most basic
type of customer matching software, merge/purge systems. These
parse incoming addresses into elements such as first name, last
name, house number, street name, city, state and postal code.
They then standardize these elements, correcting for variations
such as misspellings, nicknames, and alternate place names.
Finally, they compare the elements in pairs of records,
calculate a similarity score, and flag as matches any pair
scoring above a user-specified level.
Merge/purge systems are relatively fast, cheap, and easy to
set up. But applying the same scoring formula to all records
inherently fails to take into account significant differences in
particular situations. For example, matching an uncommon last
name should count for more than matching a common one. The
second class of customer matching software is able to take such
differences into account.
These systems work by looking for patterns in the input
records and applying different rules to different patterns.
Patterns are applied at two levels: to identify data elements
and to determine treatment of record pairs. Pattern-based
element identification is particularly good at working with
complex name lines, such as "John Smith and Jane Doe", "Jane Doe
Smith" and "Mr. and Mrs. John Smith". A simple parsing routine
would look at the first and last word on the line, and come up
with first and last names of "John Doe", "Jane Smith" and either
"John Smith" or "Mr. Smith". That is, it would conclude each
name is significantly different, and miss the presence of two
individuals altogether.
A pattern-based parser would recognize common first names,
last names, titles and conjunctions, look at the patterns these
are forming, and apply rules to identify the elements correctly.
Such a parser would also adjust for generational indicators such
as Jr., Sr. and III, industry terms identifying relationships
such as "ITF" for "In Trust For", and business aliases such as
"John Smith dba Smith Supplies".
As with most standardization and parsing processes, this
approach relies heavily on key word tables that identify how
different words are commonly used in different contexts. The
scope and variety of these tables are critical to the accuracy
of the parsing process. Most pattern-based systems let users modify
these tables to reflect conditions in their particular files,
such as specialized industry terms, company-standard
abbreviations or local geography. The pattern tables themselves
can also be modified to accommodate known input peculiarities,
such as a practice of flagging the last name with a special
character ("Henry @James"). In effect, key word and pattern
tables provide the knowledge that a human reviewer would
intuitively bring to bear. Since they have greater memory
capacity and behave consistently regardless of personality or
fatigue, the tables are in some ways superior to human
reviewers, particularly on routine processes. (But where
accuracy is critical, most firms still rely on manual review and
research to resolve ambiguous cases.)
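To make the interplay of key word tables and pattern tables concrete, here is a minimal sketch. The tables, the one-letter token codes and the output field names are all hypothetical; a real product would carry thousands of key words and far more patterns.

```python
# Hypothetical key word tables.
TITLES = {"mr", "mrs", "ms", "dr"}
CONJUNCTIONS = {"and", "&"}
FIRST_NAMES = {"john", "jane", "henry", "elizabeth"}


def classify(token):
    """Map a word to a token type: T(itle), C(onjunction), F(irst name), N(ame)."""
    word = token.strip(".,").lower()
    if word in TITLES:
        return "T"
    if word in CONJUNCTIONS:
        return "C"
    if word in FIRST_NAMES:
        return "F"
    return "N"  # unknown words are presumed to be surnames


# Hypothetical pattern table: token-type pattern -> rule that builds individuals.
PATTERNS = {
    "F N": lambda t: [{"first": t[0], "last": t[1]}],
    "F N N": lambda t: [{"first": t[0], "last": t[2], "alt_last": t[1]}],
    "F N C F N": lambda t: [{"first": t[0], "last": t[1]},
                            {"first": t[3], "last": t[4]}],
    "T C T F N": lambda t: [{"title": t[0], "first": t[3], "last": t[4]},
                            {"title": t[2], "last": t[4]}],
}


def parse_name_line(line):
    """Return the individuals found on a name line, or [] if no pattern applies."""
    tokens = line.split()
    pattern = " ".join(classify(t) for t in tokens)
    rule = PATTERNS.get(pattern)
    return rule(tokens) if rule else []  # unrecognized lines go to manual review


couple = parse_name_line("Mr. and Mrs. John Smith")
```

Adding a company-specific convention, such as the "@" flag for last names, would mean adding one more token type and one more pattern rather than rewriting the parser.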
Pattern-based matching rules rely on elements identified at
the parsing step, and apply different rules to different element
patterns. These patterns may look at the sequence of element
types: for example, a pattern that identified a female first
name followed by two possible last names ("Jane Doe Smith")
might trigger a rule to treat the middle name as a potential
last name for matching purposes. Or rules might take into
account which elements are present--for example, giving higher
weight to a matching first name if there is also a matching
middle initial.
Different systems take different approaches to how rules and
patterns are defined. Some are highly structured, offering fixed
elements, match types (e.g. perfect, close or none), and outcome
classes (e.g. accept, reject, or ambiguous); in this case, the
user must only determine how to classify each of the large but
finite number of possible combinations of element match types.
Other systems let users write rules in a scripting language that
defines what to look for and how to react; this gives almost
total flexibility. Whatever process a vendor applies, nearly all
systems provide a default set of patterns and rules to help get
started. Because users can identify exactly which rule was
applied to accept or reject a given match, it is relatively easy
to modify the default rules by reviewing outcomes and making
adjustments over time.
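The highly structured style can be pictured as a lookup table keyed on the combination of element match types. In this sketch the three elements, the table entries and the similarity cutoff are assumptions, and difflib.SequenceMatcher again stands in for whatever comparison a given product uses.

```python
from difflib import SequenceMatcher

ELEMENTS = ("last_name", "first_name", "street")

# Hypothetical rule table: one outcome per combination of element match types.
# Combinations that are not listed fall through to "reject".
RULE_TABLE = {
    ("perfect", "perfect", "perfect"): "accept",
    ("close", "perfect", "perfect"): "accept",
    ("perfect", "close", "perfect"): "accept",
    ("perfect", "perfect", "close"): "accept",
    ("close", "close", "perfect"): "ambiguous",
    ("perfect", "none", "perfect"): "ambiguous",
}


def match_type(a, b, close_threshold=0.75):
    """Classify one element comparison as perfect, close or none."""
    a, b = a.upper(), b.upper()
    if a == b:
        return "perfect"
    ratio = SequenceMatcher(None, a, b).ratio()
    return "close" if ratio >= close_threshold else "none"


def decide(rec_a, rec_b):
    """Return the combination that fired and its outcome class."""
    combo = tuple(match_type(rec_a[e], rec_b[e]) for e in ELEMENTS)
    return combo, RULE_TABLE.get(combo, "reject")


a = {"last_name": "Smith", "first_name": "Elizabeth", "street": "Main"}
b = dict(a, last_name="Smyth")
c = dict(a, first_name="Robert")
outcomes = [decide(a, b), decide(a, c)]
```

Because decide returns the combination along with the outcome, a reviewer can see exactly which table entry produced a result and change that one entry.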
The rule-based approach also lets users apply additional
processing only to ambiguous matches--thus allowing a more
detailed review of the available data when needed, without
performing unnecessary processing on simple cases. One
application of such processing is to resolve cases of
"chaining"--where record A matches record B and record B matches
record C, but records A and C do not match each other. Users may
define rules to determine when to accept such matches and when
to reject them. This sort of incremental processing combined
with the greater inherent accuracy of pattern-based matching
lets pattern-based systems find 90% to 95% of possible matches,
compared with rates of 50% to 70% for merge/purge systems. Of
course, your mileage may vary.
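Chaining can be detected mechanically. The sketch below groups records with a standard union-find structure and flags any group in which some pair did not match directly; what to do with a flagged group is the user-defined rule, which is left out here. The function names are illustrative only.

```python
from itertools import combinations


def clusters(n, matched_pairs):
    """Group record numbers 0..n-1 into sets connected by direct matches."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in matched_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return [g for g in groups.values() if len(g) > 1]


def is_chained(cluster, matched_pairs):
    """True if any two records in the cluster did not match each other directly."""
    direct = {frozenset(p) for p in matched_pairs}
    return any(frozenset(p) not in direct for p in combinations(cluster, 2))


# A matches B and B matches C, but A and C were never matched directly.
pairs = [(0, 1), (1, 2)]
found = clusters(4, pairs)
needs_review = [is_chained(c, pairs) for c in found]
```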
On the other hand, merge/purge systems run faster: multiple
millions of records per hour, compared with one million or fewer
per hour for pattern-based matching. These figures are crude
guidelines, since speed varies greatly for all types of matching
software depending on the hardware and algorithms involved.
Pattern-based and merge/purge systems also differ in ways
other than matching techniques themselves. Because the
pattern-based systems were designed primarily to match customer
records, they maintain persistent customer identifiers from one
update to the next. This is unnecessary in a merge/purge system,
which is built largely to remove duplicates from a group of
lists that are rented for one-time use. Maintaining a persistent
customer ID is relatively straightforward, since it largely
involves appending the ID to the input records in each matching
session and carrying it through to the output. But it does
involve some nuances, such as ensuring that the same ID is
applied if a customer vanishes for a few cycles and then
reappears, or if the customer moves and a record later shows up
at the old address. When IDs are applied to households as well
as individuals, things get more complicated still--now you need
rules to handle household mergers such as weddings, and
household splits such as divorces or children leaving for
college. In fact, household definition is often a very
contentious part of the database development process, since
different users have different definitions that make sense for
their own purposes. Multiple household definitions, each with
its own set of IDs, are quite common in large consumer marketing
databases.
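A toy registry shows the two individual-level nuances mentioned above: an ID that survives a customer's absence, and an ID that follows a move while the old address stays on file. The class and method names are hypothetical, keys are assumed to be already-standardized (name, address) pairs, and a real system would locate the existing entry by fuzzy matching rather than exact lookup.

```python
import itertools


class IdRegistry:
    """Keeps every key ever seen so that IDs persist from one update to the next."""

    def __init__(self):
        self._ids = {}                   # key -> persistent ID; never deleted
        self._next = itertools.count(1)

    def assign(self, key):
        """Return the existing ID for this key, or issue a new one."""
        if key not in self._ids:
            self._ids[key] = next(self._next)
        return self._ids[key]

    def record_move(self, old_key, new_key):
        """Carry the ID to the new address while keeping the old key on file.

        Simplification: if new_key already had a different ID it is overwritten.
        """
        customer_id = self.assign(old_key)
        self._ids[new_key] = customer_id
        return customer_id


registry = IdRegistry()
old = ("ELIZABETH SMITH", "123 MAIN ST")
new = ("ELIZABETH SMITH", "9 OAK AVE")
first_id = registry.assign(old)
registry.assign(("JOHN DOE", "1 ELM RD"))   # other customers come and go
registry.record_move(old, new)
late_record_id = registry.assign(old)       # a record shows up at the old address
```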
The desire to build a permanent customer database also leads
pattern-matching vendors to include extensive facilities for
data consolidation. These range from simple functions to
aggregate values such as purchases recorded in different billing
systems, to complex rules to select the "best" version of an
element such as a Social Security Number or primary address.
Although this sort of consolidation does not rely directly on
pattern-based matching, it may use the system's assessment of
the quality of different input records to help determine which
record to treat as most reliable.
Pattern-based systems are also much more likely than
merge/purge software to provide an API for real-time processing
of individual records. This is commonly used to integrate the
matching process with operational systems such as order entry or
customer service, to quickly identify individuals as existing or
new customers. Most operational systems provide their own,
simple matching routines, but it makes sense to leverage an
advanced pattern matching system if the enterprise has already
purchased one. This provides results that are both more accurate
and more consistent than the operational system would provide by
itself, as well as ensuring that searches are made against the
entire customer universe rather than only the customer records
residing in a particular operational silo.
Major vendors of pattern-based matching systems include
Harte-Hanks Trillium Software, Innovative Systems Inc., Group 1,
and Postalsoft i.d.Centric. The latter two vendors also sell
merge/purge systems, but their pattern-matching software uses
different technology. Vality also sells pattern-based matching
software, but relies on users to build their own keyword and
pattern tables--a major undertaking that the other vendors avoid
by providing users with prebuilt tables and rules. A newcomer to
the market is DataMentors, which draws on its founders'
experience building pattern-based matching systems at pioneering
marketing database vendor OKRA Marketing.
* * *
The purpose of name and address matching software is to
identify sets of records that refer to the same person. The
simplest matching systems do this by directly comparing the
records to each other. Certainly this is the most obvious
approach. But as matching software evolved, developers found
that external data can help the process considerably. Even basic
merge/purge systems rely on tables of names, business terms,
cities, and other information for parsing and standardization.
Address standardization in particular relies not simply on
tables of common terms and spellings, but on files that list all
known valid addresses. In the U.S. and many other countries such
files are prepared by the local postal service. Sometimes they
must be gathered or updated through other means.
The main advantage of a fixed reference table is accuracy. It
provides a way to determine whether two similar records really
refer to the same entity: if the closest match for both is the
same reference record, they can be assumed to be the same. Of
course, there are limits to this approach, since the reference
table itself may be missing a valid entry or the input record
may be so badly mangled that no reasonably close match is found.
So most systems allow input records without a near match on the
reference table to retain a separate identity. Sometimes these
records are added to the reference table itself with a special
code to indicate their origin. That way if a similar record
appears again, the system will at least recognize it as matching
the previous record.
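The logic of this paragraph can be sketched with the standard library's difflib.get_close_matches as the "closest match" step. The class name, the cutoff value, the origin codes and the sample addresses are assumptions for illustration, and real address matching is far more structured than a whole-string comparison.

```python
from difflib import get_close_matches


class ReferenceTable:
    """Maps each address to an origin code: 'reference' or 'client-added'."""

    def __init__(self, addresses):
        self.origin = {a: "reference" for a in addresses}

    def resolve(self, raw, cutoff=0.85):
        """Return the closest reference entry, adding the input if none is close."""
        candidate = " ".join(raw.upper().split())
        hits = get_close_matches(candidate, list(self.origin), n=1, cutoff=cutoff)
        if hits:
            return hits[0]
        # No near match: keep a separate identity, flagged with a special code.
        self.origin[candidate] = "client-added"
        return candidate


table = ReferenceTable(["123 MAIN ST SPRINGFIELD IL 62701",
                        "9 OAK AVE SHELBYVILLE IL 62565"])
first = table.resolve("123 Main St Springfeild IL 62701")
second = table.resolve("123 Main Str Springfield IL 62701")
unknown = table.resolve("77 Elm Rd Capital City IL 62999")
```

The two misspelled inputs resolve to the same reference entry and can therefore be treated as the same address; the third is added to the table, so a later record with the same address will at least match it.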
Reference tables can also yield significant processing
economies, particularly if the same table is shared across
multiple installations. It is obviously more efficient to build
a comprehensive address table once and then share the copies,
than for each firm to assemble an address table on its own.
Similarly, it is more efficient for a service bureau to run the
records of many clients against the same reference table than to
load a separate reference table for each client. This is true
even if the client-specific reference tables, which would
presumably be limited to that client's customers and prospects,
were each smaller than the single common reference table.
Running against a common reference table also lets the service
bureau keep that table loaded constantly rather than loading and
unloading the individual client tables on, say, a monthly basis.
This means each client's records can be processed more
often--nightly or perhaps even in real time. In addition, the
common reference table could itself be updated continuously with
new and corrected data, so each client would get the benefit of
the most current information.
But there is a fly in the reference table ointment.
Processing records against an address reference table alone will
not identify duplicates among individuals. This requires
comparing names as well as addresses. If name-level matching is
needed, then a name-level reference table is needed as well.
Even merge/purge and pattern-based matching systems that use
address reference tables must still load the client's own
customer and prospect tables for name matching. So the full
advantages of reference-based matching are not available to
these systems.
Over the past few years, a handful of vendors including
Acxiom, Experian and Donnelley Marketing/InfoUSA have introduced
name-level reference table matching. The challenge in developing
these systems is to build the reference table itself: after all,
this involves nothing less than a database with every individual
in the country. No government agency provides such a file in the
U.S. Thus each vendor needed to assemble its own database from a
variety of sources. These include public records such as
telephone directories, voter registrations and real estate
listings, as well as private sources such as catalog merchants
and financial institutions. While this is a costly and
complicated process, it is certainly possible with today's
technology.
The basic process is that each vendor runs the records from
its various sources through a conventional matching process.
Records identified as belonging to a unique individual are
assigned a fixed ID. The reference table thus consists of all
significant variations among input records: where several
versions exist for the same individual, there will be several
reference table records with the same ID. When clients submit
their own files, these are matched against the master table and
the system returns the original record plus the matching
standard ID. The reference table itself never leaves the custody
of the vendor, and clients see only the information they provide
plus the ID the vendor has assigned. This contrasts with address
reference tables, which are frequently installed on in-house
systems.
Because the master table may contain several records
describing the same individual in different formats, an input
record using any of these formats can be matched directly. This
reduces the amount of processing-intensive "near match" logic,
providing faster and more efficient performance. Even real-time
processing of individual records is possible, although most
reference-based matching still runs in batch.
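In outline, a name-level reference table is a lookup from every known variation of a record to a fixed ID. The sketch below is a toy version: the ID values, the sample people and the normalization step are invented for illustration and do not describe any vendor's table.

```python
def normalize(name, address):
    """Collapse case, periods and spacing so stored variations compare exactly."""
    text = f"{name} {address}".upper().replace(".", "")
    return " ".join(text.split())


# Several variations of one individual share the same (hypothetical) ID.
REFERENCE_IDS = {
    normalize("Elizabeth Smith", "123 Main St"): 1001,
    normalize("Liz Smith", "123 Main St"): 1001,
    normalize("Elizabeth Jones", "9 Oak Ave"): 1001,  # earlier name and address
    normalize("John Smith", "123 Main St"): 1002,
}


def append_ids(client_records):
    """Return each client record plus its standard ID (None if not on the table)."""
    return [(name, address, REFERENCE_IDS.get(normalize(name, address)))
            for name, address in client_records]


coded = append_ids([("Liz Smith", "123 Main St."),
                    ("ELIZABETH JONES", "9 Oak Ave"),
                    ("Pat Doe", "1 Elm Rd")])
```

Because each stored variation is an exact dictionary key, the lookup needs no near-match scoring at all, and two records with different names and addresses can still come back with the same ID.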
Both the mix of inputs and the processing techniques vary from
vendor to vendor, so results from the different reference-based
systems are not necessarily the same. But all vendors report
significant improvements--often, nearly double the match rates
of conventional merge/purge or pattern-based systems. On a
reasonably well maintained file, this might translate into two
to eight additional duplicates per hundred records input. A
reference-based system also eliminates the much smaller number
of false duplicates that occur when two records are similar
enough to match but actually refer to different individuals.
Why do reference-based systems find so many additional
duplicates? There is more involved than greater precision in
matching. Specifically, the reference tables can include a
history of the same individual at different addresses or under
different names (e.g. before and after a marriage). These
connections, derived from change of address transactions, legal
records, financial institutions and similar sources, cannot
possibly be made by comparing name and address records directly.
While some false connections are inevitable, each vendor has
tuned its rules to keep errors at what it considers an
acceptable minimum. Users with different preferences cannot
change these rules directly, although most vendors let clients
apply their own splitting or combination rules after the
standard processing. This contrasts with merge/purge and
pattern-based matching systems, which let clients tighten or
loosen matching rules to meet their individual purposes. The
reference-based matching vendors argue this is unnecessary
because their standard processes yield such accurate results.
Clients can also propose corrections to the reference tables,
although not all clients are willing to share such information
and the vendors decide whether or not to accept a proposed
change. When corrections are made, vendors can notify clients by
publishing the list of affected IDs. Because the vendors keep
track of which IDs have matched to each client's input, they can
send each client only the list of relevant IDs.
In addition to providing greater accuracy and operational
efficiency, reference-based systems hugely simplify the sharing
of data among different companies. The standard ID is the key.
When two list owners wish to combine information on common
customers, they need only compare their lists of IDs--an easier
and more accurate process than conventional matching, and one
that does not require sharing actual names and addresses. In
practice, such comparisons would be done by the reference table
vendor rather than the companies themselves, because license
agreements forbid sharing the standard IDs with outside firms.
Standard IDs provide similar efficiencies for appending data
from third-party sources to in-house lists--again, the
third-party data list is coded with the standard IDs and these
are matched against the IDs provided by the list owner. This
sort of matching could be done on a periodic basis, or list
owners could be notified when any interesting data appeared
about one of their customers. This opens up some intriguing, if
Orwellian, marketing possibilities.
In fact, the privacy implications of reference-based matching
have received relatively little public discussion. The vendors
argue these systems enhance privacy because they yield more
accurate matches and, by linking all related records, make it
easier to comply with opt-out requests. But widespread use of
the same reference table also means that any errors in that
table will be propagated widely rather than limited to a single
company's internal systems. Easier and cheaper cross-company
matching also encourages firms to share data more widely,
leading to more comprehensive customer profiles that could
easily be misused by the inept or abused by the malevolent.
Because the reference-based systems are technically designed for
matching rather than data sharing, they do not appear to be
governed by existing privacy regulations. They are affected
indirectly, however, as reduced access to data such as credit
records makes the tables themselves potentially less accurate.
As such systems are more widely understood, they may eventually
be subject to the same rules as other lists for individual
disclosure, review and opt-out. But, at least in the U.S., it's
hard to imagine any regulations being passed that significantly
diminish these systems' effectiveness.
In sum, reference-based matching is often more accurate, more
efficient and easier to deploy than merge/purge or pattern-based
matching systems. On the other hand, prices are higher than for
other technologies and some enterprises may balk at sending
their customer list to an outside vendor. But where
circumstances permit, reference-based matching is an option well
worth exploring.
* * *
Let's assume you've decided to invest some serious effort in
choosing a customer matching system. How do you go about it?
You'll start with technical specifications, like hardware,
operating systems and integration methods. These may eliminate
some contenders, but today most systems run on all the common
platforms. You might try to narrow the field further by
considering just one of the three classes of matching systems
described in previous articles--string-based, pattern-based or
reference-based. But while it's generally true that
reference-based are the most accurate and string-based are the
least accurate, simply knowing this does not mean that one class
of product is more appropriate than another. This is because the
difference in performance depends on the applications and
specific data involved. For example, a power utility's list of
current customers is likely to be quite accurate, while a list
of inactive catalog buyers will contain many duplicate accounts
and outdated addresses. If a file is highly accurate to begin
with, moving to a more powerful system may not increase
performance enough to justify the higher acquisition and
operating costs. And even if you did limit yourself to a single
class of systems, there are still significant differences among
the products within each group.
In short, there is no way to make a really sound decision
without testing each product against your own data. The process
has three main steps:
- assemble test data. This is often the hardest part of
the project because the data is not readily available and IT
resources to assemble it are scarce. Ideally, the test data
would include complete files from each system that will
eventually provide inputs. This would test the matching
system's ability to handle data gathered through different
processes and stored in different formats. It would also
provide the highest possible number of duplicates to detect.
In fact, the test data should really include several sets of
input from each system, taken at different dates. This would
ensure the data contains old and new versions of customers
who have moved, changed their names, opened or closed
accounts, and gone through other transformations the
matching system may be intended to detect.
Alas, comprehensive data is rarely available. Even if it is,
the volume is likely to be greater than the matching software
vendors are willing to include in a test. So some form of
sampling is usually necessary.
Constructing a sample for a matching test is unusually
tricky. The statistician's usual instinct is to take a random or
Nth sample--but this is exactly the worst thing to do for
matching tests. These methods tend to remove adjacent records,
which are the most likely to be duplicates or members of the
same household. A better approach is to select all names in a
limited geographic area, such as a state or metropolitan region.
A relatively large area will also catch many people who have
moved, although those who entered or left the region will be
missed. More than one geographic region should be chosen to get
a mix of urban and rural areas and to include any regional
differences. This is particularly important in companies where
different areas are served by different operational systems--a
common situation at firms that have grown by acquisition. For
these companies, using multiple regions ensures that inputs from
all those systems are represented.
If the volume remains too high even when the sample is
limited to a handful of regions, it may be further reduced by
selecting on last name--say, all names beginning with A through
F. This will still include most duplicates, although it will
likely lose women who have changed their name after marriage or
divorce.
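As a rough illustration of the sampling approach described above, the following Python sketch keeps every record from a chosen set of regions and then narrows by last-name initial. The field names (`state`, `last_name`, `id`) and the sample records are hypothetical; real files will have their own layouts.

```python
def select_sample(records, regions, first_initial="A", last_initial="F"):
    """Keep all records in the chosen regions whose last name starts
    with a letter in the given range. Nothing is sampled at random,
    so adjacent (likely duplicate) records stay together."""
    sample = []
    for rec in records:
        initial = rec["last_name"].strip()[:1].upper()
        if rec["state"] in regions and first_initial <= initial <= last_initial:
            sample.append(rec)
    return sample


# Hypothetical input records, for illustration only.
records = [
    {"id": 1, "last_name": "Adams", "state": "NJ"},
    {"id": 2, "last_name": "Adams", "state": "NJ"},  # probable duplicate of 1
    {"id": 3, "last_name": "Baker", "state": "MT"},
    {"id": 4, "last_name": "Young", "state": "NJ"},  # outside A-F
    {"id": 5, "last_name": "Evans", "state": "CA"},  # outside chosen regions
]

sample = select_sample(records, regions={"NJ", "MT"})
print([r["id"] for r in sample])
```

Because both filters are deterministic, the two adjacent "Adams" records survive together, which a random or Nth sample would not guarantee.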
It is also worth inserting records known to contain special
situations, such as tricky parsing problems, name changes,
frequent movers, household splits, or multiple generations
(i.e., Sr., Jr. and III). These can be fictional records to test
string- and pattern-based matching, but should be real people
when testing reference-based systems. To avoid having such
records stand out during processing, they should be physically
mixed in with the other data and in exactly the same format.
This may require constructing plausible values for fields that
are populated in other records in the same file, such as account
ID or telephone number. The number of such fields should be
limited, since data not used for matching should be removed from
the test file to reduce security risks and processing costs. Any
individual or household link that comes from a system that would
be replaced by the new matching software should also be removed.
Such links should not be discarded, however, since they can
later be compared with links created by the new systems.
Each record should include a source system indicator and file
date, since the matching system might need different rules for
records from different sources or from the same source at
different times. Every record should also be assigned a unique
identifier to simplify later analysis of how the matching
systems performed.
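One way to attach these fields is sketched below in Python. The field names `source`, `file_date` and `test_id`, and the sample extracts, are assumptions made for the example rather than a required layout.

```python
from itertools import count


def tag_records(files, start=1):
    """files: iterable of (source_system, file_date, records).
    Returns copies of the records with a source indicator, a file
    date, and a unique identifier that runs across all inputs."""
    next_id = count(start)
    tagged = []
    for source, file_date, records in files:
        for rec in records:
            tagged.append({
                **rec,
                "source": source,
                "file_date": file_date,
                "test_id": next(next_id),
            })
    return tagged


# Hypothetical extracts from two systems, one of them taken twice.
files = [
    ("billing", "2001-03-31", [{"name": "J Smith"}, {"name": "Ann Lee"}]),
    ("billing", "2001-06-30", [{"name": "John Smith"}]),
    ("catalog", "2001-06-30", [{"name": "A. Lee"}]),
]

test_file = tag_records(files)
for row in test_file:
    print(row["test_id"], row["source"], row["file_date"], row["name"])
```

The identifier is assigned once, before the file goes to any vendor, so that every vendor's output can later be joined back to the same record.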
The final step in test file preparation is creation of record
layouts and counts needed to help load the data into the
matching system itself. Some users prepare two test files: one
for initial system setup and tuning, and the other to generate
test results. This is analogous to the standard approach of
predictive modelers, who build a model on one data sample and
then validate it against a separate data set. In both modeling
and matching, the purpose is to ensure the system is not
generating unrealistic results by tuning itself to anomalies
in the test data. This is generally not an issue for matching
systems, however, so split test files are rarely used.
- run the tests. In most cases, the tests will actually be
run by the vendor. This is faster and easier than installing
the software in-house. But you will still need to provide
instructions regarding matching rules and household
definition. You also want to get some idea of the effort
involved in setting up the system. It may not be practical
to watch the vendor's staff set up your particular job,
because the work is performed in small steps by different
people over several days or weeks. But it should be possible
to walk through the operation, seeing each task performed on
whatever data happens to be active. This will give some idea
of the system features and staff skills involved. It should
also be possible to get statistics on the computer resources
and staff time consumed in working on your job.
- compare the results. Each system will have its own
standard reports. Data conversion, standardization and
parsing will generate statistics on missing data elements,
address corrections, postal coding, and similar items.
Individual records are sometimes coded to show the exact
changes that were applied. This makes it easy to find
records that had specific types of changes, so you can
verify their accuracy. The matching portion of the system
will show the number of records input, number of unique
individuals identified, and (usually) number of unique
households. Most systems also classify the matches, either by
certainty level or by the reason they were considered to
match. The systems should also provide listings of records
that were matched, again typically grouped by category.
Visual inspection is very useful for string- and
pattern-based matching, but less helpful when
reference-based systems bring together records that are
superficially unrelated.
While the most obvious statistic to compare across systems is
the number of matches found, it is important to realize that
matches may be incorrect--so a higher match rate is not
necessarily a better result. In fact, there are three statistics
to balance: correct matches, incorrect matches, and missed
matches. Unfortunately, the "truth" is usually not known for all
matches on a file, with the important exception of test cases
inserted for this very reason. So the primary method of
comparing systems is to look for situations where one system has
identified a match and another system hasn't, and to determine
which system is correct. This misses situations where all
systems have made the same error. But it does allow a meaningful
comparison of the different systems to each other.
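For the inserted test cases, where the truth is known, the three statistics can be counted directly. The Python sketch below treats each pair of linked records as one match. That is one reasonable convention, not the only one, and the record IDs and group labels are invented for the example.

```python
from collections import defaultdict
from itertools import combinations


def matched_pairs(assignments):
    """assignments: dict of record ID -> group ID.
    Returns every pair of record IDs that share a group."""
    groups = defaultdict(list)
    for rec_id, group_id in assignments.items():
        groups[group_id].append(rec_id)
    pairs = set()
    for members in groups.values():
        pairs.update(combinations(sorted(members), 2))
    return pairs


def score(truth, system):
    """Count correct, incorrect and missed matches, looking only at
    records whose true grouping is known."""
    known = {k: v for k, v in system.items() if k in truth}
    t, s = matched_pairs(truth), matched_pairs(known)
    return {"correct": len(t & s), "incorrect": len(s - t), "missed": len(t - s)}


# Hypothetical seeded test cases: records 1-2 and 3-4 are true matches.
truth = {1: "a", 2: "a", 3: "b", 4: "b", 5: "c"}
# Hypothetical vendor output: links 1-2 (right) and 3-5 (wrong), misses 3-4.
system = {1: "x", 2: "x", 3: "y", 4: "z", 5: "y", 6: "q"}

result = score(truth, system)
print(result)
```

Record 6 is ignored because it is not a seeded test case, so nothing can be said about whether its links are right or wrong.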
Identifying the disagreements among systems requires getting
a file from each vendor with the original data plus whatever
individual and household IDs have been assigned to link records
that match. Since each record will also contain its original
unique ID, the files can be joined to allow comparison. The
comparison report takes a bit of work to create, although some
matching vendors have written programs to do it automatically.
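A minimal version of such a comparison is sketched below in Python. It assumes each vendor returns rows carrying the original unique ID (called `test_id` here) and an assigned `household_id`. Both field names and the sample rows are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def linked_pairs(rows, link_field="household_id"):
    """Return every pair of test IDs that a vendor placed in the
    same group. Group labels differ by vendor, so pairs are the
    common currency for comparison."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[link_field]].append(row["test_id"])
    return {pair for ids in groups.values()
            for pair in combinations(sorted(ids), 2)}


def disagreements(rows_a, rows_b):
    """Pairs linked by one vendor but not by the other."""
    a, b = linked_pairs(rows_a), linked_pairs(rows_b)
    return sorted(a - b), sorted(b - a)


# Hypothetical output files from two vendors for the same six records.
vendor_a = [
    {"test_id": 1, "household_id": "H1"}, {"test_id": 2, "household_id": "H1"},
    {"test_id": 3, "household_id": "H2"}, {"test_id": 4, "household_id": "H3"},
    {"test_id": 5, "household_id": "H4"}, {"test_id": 6, "household_id": "H4"},
]
vendor_b = [
    {"test_id": 1, "household_id": "X9"}, {"test_id": 2, "household_id": "X9"},
    {"test_id": 3, "household_id": "X7"}, {"test_id": 4, "household_id": "X7"},
    {"test_id": 5, "household_id": "X5"}, {"test_id": 6, "household_id": "X6"},
]

only_a, only_b = disagreements(vendor_a, vendor_b)
print("Linked only by A:", only_a)
print("Linked only by B:", only_b)
```

The pairs in each list are the ones to pull for review. Pairs that both vendors linked, such as records 1 and 2 here, drop out of the report, which is why errors shared by all systems go unseen.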
Except when the correct answer is known because of test cases
or pre-researched linkages, judging which system is correct
about any given match is a challenge of its own. Users mostly
rely on a visual comparison, particularly where string- and
pattern-based systems are involved. In some situations, users
actively research the questionable matches via telephone calls
or other validation methods.
Once the relative accuracy of the different systems has been
established, there is still a business analysis to be done. This
weighs the costs of the different systems against the values of
found, missed and false matches. These values depend on the
business situation--a false match has little cost when sending a
clothing catalog, but could cause a lawsuit where financial
accounts are concerned. Such priorities should be discussed with
vendors in advance, since most systems can be tuned to adjust
the balance between false hits and misses.
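The arithmetic behind this analysis is simple, as the Python sketch below suggests. Every number in it is invented for illustration: the match counts, the per-match values and the system cost would all come from the buyer's own test results and business estimates.

```python
def net_value(counts, values, system_cost):
    """Value of found matches, less the cost of false and missed
    matches, less the cost of the system itself."""
    return (counts["correct"] * values["found"]
            - counts["incorrect"] * values["false"]
            - counts["missed"] * values["missed"]
            - system_cost)


# Hypothetical test results: A is tuned aggressively, B conservatively.
systems = {
    "A": {"correct": 900, "incorrect": 100, "missed": 100},
    "B": {"correct": 800, "incorrect": 10, "missed": 200},
}
system_cost = 1000  # assumed equal for both systems

# Hypothetical per-match values for two business situations.
scenarios = {
    "catalog":   {"found": 2, "false": 1,   "missed": 2},
    "financial": {"found": 2, "false": 100, "missed": 2},
}

best = {}
for name, values in scenarios.items():
    totals = {s: net_value(c, values, system_cost) for s, c in systems.items()}
    best[name] = max(totals, key=totals.get)
    print(name, totals, "best:", best[name])
```

With these made-up figures the aggressive system comes out ahead for the catalog mailing and the conservative one for financial accounts, even though the match counts are identical in both scenarios.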
While accuracy and business value will be the primary factors
in selecting a matching system, they are not the only ones. Some
buyers reject reference-based systems because they require
off-site processing and on-going service relationships. Some
focus on processing speed, or computer resource consumption, or
the staff effort required. Some care deeply about the quality of
reports, options to review and override questionable matches, or
control over matching rules and reference tables. Some need to
handle international data or perform complex transformations.
Nearly every decision is affected by salesmanship, customer
service and vendor background.
Systems differ significantly along all these dimensions.
Unfortunately, too many buyers focus on these other issues and
neglect to test the performance of the software itself. Given
the major differences in accuracy among the different products,
this can be a big mistake.
David M. Raab is president of ClientXClient, a consulting
and software firm specializing in customer value optimization.
He can be reached at draab@clientxclient.com.
Copyright 2001 CLIENTxCLIENT Contact:
info@CLIENTxCLIENT.com
Call: 908.350.3012