Batch vs. Real-Time Technologies
by David M. Raab
Relationship Marketing Report
December, 1999
The last two columns in this series have looked at ways to segment the
universe of marketing-related systems. Although no fully
satisfactory scheme has emerged, one distinction was present in
nearly every attempt: batch vs. real-time systems. The
general argument was that the technologies needed for these two
types of systems are so radically different that they need to be
treated separately.
This proposition is worth closer examination--both to
understand the nature of the technical differences, and to see
how some systems manage to bridge the gap.
First, let's get the definitions straight. Batch systems
execute a sequence of steps without external inputs, while
real-time systems wait for user input between steps in a
transaction. Batch systems typically apply the same
process--such as calculating a model score or assigning a
customer segment--to many records in a single job, while
real-time systems typically execute a process against a single
record per job.
These differences in function result in different goals for
system design. For a batch system, the key goal is to move
through a large data set as efficiently as possible. The goal
for a real-time system is to retrieve and update individual
records with minimum delay.
Although batch systems usually process large numbers of
records, they generally work with one record at a time: they
read the record and its associated data, process it, store the
outcome, and then repeat the process for the next record.
Efficiency is determined primarily by the time it takes to
assemble all the data needed to process each record. In a flat
file system, this is done by either combining data from multiple
sources into a single record before the process begins, or by
sorting multiple files in the same sequence so the system can
step through them in parallel without extensive searching. This
sort of sequential processing is especially well suited to files
stored on tapes rather than disk drives, since it allows the
system to physically read the records in the sequence they
appear on the tape. If the processing were not sequential, then
the system would have to search for each set of records from one
end of the tape to the other. (Remember all those images of spinning
tapes from TV shows and movies in the 1960s and '70s? That's
what was going on.)
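To make the "step through them in parallel" idea concrete, here is a minimal sketch in Python. It assumes two inputs already sorted by customer ID; the record layouts and the name merge_sorted are hypothetical, chosen only for illustration. Each input is read once, front to back, with no searching.

```python
def merge_sorted(customers, orders):
    """Yield (customer, matching_orders), reading both inputs once.

    Both inputs must already be sorted by customer ID, which is the
    first element of every tuple (an assumed layout).
    """
    order_iter = iter(orders)
    pending = next(order_iter, None)
    for cust in customers:
        matched = []
        # Skip orders that belong to no customer in the file.
        while pending is not None and pending[0] < cust[0]:
            pending = next(order_iter, None)
        # Collect every order for the current customer.
        while pending is not None and pending[0] == cust[0]:
            matched.append(pending)
            pending = next(order_iter, None)
        yield cust, matched


customers = [(1, "Ann"), (2, "Bob"), (3, "Cy")]
orders = [(1, 10.0), (1, 5.0), (3, 7.5)]
for cust, cust_orders in merge_sorted(customers, orders):
    print(cust[1], sum(amount for _, amount in cust_orders))
```

Because neither input is ever rewound, the same logic works whether the data sits on tape, on disk, or in a stream.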
In contrast, a relational database is explicitly designed not
to place records in a specific sequence. Instead, relational
systems rely on indexes to link the related data and typically
load the data itself onto disk drives that can quickly access
records that are not physically adjacent. Still, because
sequential access is inherently more efficient than random access
on even the fastest disk drive, many of the largest-volume batch
systems create an ordered extract that is then processed like a
flat file.
Relational systems also often improve efficiency by
"denormalizing" the data, which means storing the same piece of
information in more than one record. This violates a cardinal
rule of relational database design, which says each item should
be stored only once. The rule exists to ensure data consistency
and speed updates. But violating it will reduce the number of
tables that must be searched and read to process a record. This
can yield major performance gains.
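The trade-off can be illustrated with a small Python sketch. The tables, field names and function names below are hypothetical; plain dictionaries and lists stand in for database tables. The normalized version must consult a second table for every order, while the denormalized version reads only one.

```python
# Normalized: the region is stored once, on the customer record.
customers = {
    101: {"name": "Ann", "region": "East"},
    102: {"name": "Bob", "region": "West"},
}
orders_normalized = [
    {"customer_id": 101, "amount": 40.0},
    {"customer_id": 102, "amount": 10.0},
    {"customer_id": 101, "amount": 25.0},
]

# Denormalized: the region is copied onto every order record.
orders_denormalized = [
    {"customer_id": 101, "region": "East", "amount": 40.0},
    {"customer_id": 102, "region": "West", "amount": 10.0},
    {"customer_id": 101, "region": "East", "amount": 25.0},
]


def sales_by_region_normalized(orders, customers):
    totals = {}
    for order in orders:
        # Extra read against a second table for every order.
        region = customers[order["customer_id"]]["region"]
        totals[region] = totals.get(region, 0.0) + order["amount"]
    return totals


def sales_by_region_denormalized(orders):
    totals = {}
    for order in orders:
        # Everything needed is already on the order record.
        totals[order["region"]] = totals.get(order["region"], 0.0) + order["amount"]
    return totals
```

The cost shows up on update: if Ann moves to a different region, the normalized design changes one record, while the denormalized design must change every one of her orders or become inconsistent.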
Batch systems can get away with denormalization and
sequential processing because they are not subject to the same
constraints as real-time systems. Most real-time systems don't
know which record will be needed next, because they are reacting
to unpredictable events such as which customer will place an
order or call for service. Therefore the real-time systems need
search mechanisms like indexes on account numbers, which allow
them to find any particular record quickly. By contrast, a batch
system will eventually process all records in its set, so it has no
particular need to locate a specific record first. Real-time
systems also must be kept internally consistent at all times,
since two transactions relating to the same account might occur
almost simultaneously, and different kinds of transactions might
occur in different sequences. This makes it much more dangerous
for real-time systems to violate the relational principle of
"normalization"--storing each piece of information only
once--than for batch systems, which exist in a much more
controlled environment. Similarly, real-time systems are also
more focused on the update speed that normalized designs
provide.
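A rough Python sketch of the difference in access patterns follows. The record layout and function names are assumptions made for illustration, and a dictionary stands in for a real database index. The scan touches records in order until it finds a match; the index goes straight to the record's position.

```python
accounts = [
    {"account": "A-100", "balance": 50.0},
    {"account": "A-200", "balance": 75.0},
    {"account": "A-300", "balance": 20.0},
]


def find_by_scan(records, account):
    """Batch-style access: walk the records in order until a match."""
    for record in records:
        if record["account"] == account:
            return record
    return None


def build_index(records):
    """Map each account number to its position in the record list."""
    return {record["account"]: pos for pos, record in enumerate(records)}


def find_by_index(records, index, account):
    """Real-time-style access: jump directly to the record."""
    pos = index.get(account)
    return records[pos] if pos is not None else None


index = build_index(accounts)
print(find_by_index(accounts, index, "A-300"))
```

The index costs extra storage and must be maintained on every insert, which is overhead a batch job that reads every record anyway has little reason to pay.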
So, to oversimplify a bit, batch systems use sequential
processing and denormalized data structures (few tables with
some redundant data), while real-time systems use indexes,
random access and normalized structures (many tables with no
redundant data). While it's possible for one system to do both,
most software is optimized for one or the other. This is why the
distinction is so fundamental when attempting to classify
different marketing products.
Specifically, traditional data warehouses and database
marketing systems tend to use batch processing techniques--after
all, most queries are looking for patterns or segments in the
entire database, not picking out a single customer or account.
By contrast, front-office systems for customer service, sales
automation or contact management are real-time systems that must
be designed to work with one customer at a time.
The problem, of course, is that today's goal is to merge the
back-end marketing database with the front-office customer
contact system. This lets users define customer strategies in
the back-end system--which has the rich history data and
analytical capabilities--and execute the strategies in the
front-office system during the real-time interactions. So
designers are being asked to make one system handle both batch
and real-time processing.
As with most computer processing challenges, there are two
basic solutions: brute force and elegant design. Given the
continued drop in hardware costs, brute force is often the best
approach. But in some situations, elegant design is still worth
the effort.
In dealing with real-time marketing systems, the classic
application of brute force is parallel processing. This involves
systems that split a single batch job into many smaller jobs and
run them all simultaneously. IBM's SP2 and NCR's Teradata are
the most common examples of massively parallel systems, although
other vendors have products as well.
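The structure of "split a single batch job into many smaller jobs" can be sketched in Python as follows. This is only an illustration of the pattern, not of how any particular vendor's product works: the scoring formula and all names are hypothetical, and a thread pool on one machine stands in for separate processors.

```python
from concurrent.futures import ThreadPoolExecutor


def score_record(record):
    # Hypothetical toy score; a real model would go here.
    return record["id"], record["total_spend"] / 100.0


def score_chunk(chunk):
    return [score_record(record) for record in chunk]


def split(records, n_chunks):
    """Cut the record list into at most n_chunks contiguous pieces."""
    size = max(1, -(-len(records) // n_chunks))  # ceiling division
    return [records[i:i + size] for i in range(0, len(records), size)]


def run_parallel(records, n_workers=4):
    chunks = split(records, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        # map() returns results in the same order as the chunks.
        results = list(pool.map(score_chunk, chunks))
    return [pair for chunk_result in results for pair in chunk_result]


records = [{"id": i, "total_spend": i * 100.0} for i in range(10)]
print(run_parallel(records, n_workers=3))
```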
Massively parallel systems do have the ability to give high
performance on both batch and real-time jobs. But the hardware
is expensive and developers must usually tune the application
software and data structure for optimum performance.
This tuning is costly and time-consuming, which is bad
enough. But it also means that the resulting system may perform
poorly when faced with unanticipated demands. For example, one
common tactic in parallel system design is to store data from
different date ranges on separate hard drives (each served by
its own processor). This works great when queries look across
all date ranges, since the different processors can work on the
different date ranges simultaneously. But if queries suddenly
focus on a single date range, the system will slow considerably
because only one processor can access the necessary data.
(Reality is a bit less grim, since parallel systems can usually
give several processors access to the same data if necessary.
But performance will still suffer.)
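The date-range example can be sketched in a few lines of Python. The quarterly partition boundaries and the function name are assumptions for illustration; the point is only to count how many partitions, and therefore how many processors, a given query can keep busy.

```python
from datetime import date

# Hypothetical layout: one partition (and one processor) per quarter.
PARTITIONS = [
    (date(1999, 1, 1), date(1999, 3, 31)),
    (date(1999, 4, 1), date(1999, 6, 30)),
    (date(1999, 7, 1), date(1999, 9, 30)),
    (date(1999, 10, 1), date(1999, 12, 31)),
]


def partitions_touched(start, end, partitions=PARTITIONS):
    """Return the indexes of partitions that overlap the query range."""
    return [
        i
        for i, (p_start, p_end) in enumerate(partitions)
        if p_start <= end and start <= p_end
    ]


# A full-year query spreads across all four processors.
print(partitions_touched(date(1999, 1, 1), date(1999, 12, 31)))
# A single-month query lands on just one of them.
print(partitions_touched(date(1999, 5, 1), date(1999, 5, 31)))
```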
A newer brute force approach involves "main memory"
databases, which essentially move the underlying data from a
disk drive into high speed, random access memory. Specialized
database management systems that do this include TimesTen (www.timesten.com)
and Angara Data Server (www.angara.com). These systems can
access records ten to twenty times faster than if the data were
stored on a disk drive; they can also employ specialized indexes
that reduce the performance impact of bringing together related
records from many different tables. The most important current
application of this technology is managing Internet
interactions, where systems may need to access huge volumes of
data in real time. But the fast access provided by the main
memory systems allows them to complete batch processes extremely
quickly as well.
For companies that are unable or unwilling to apply brute
force solutions, the alternative is a system design based on
conventional technology. Since the same conventional data tables
generally cannot provide adequate performance for both real-time
and batch tasks, this usually involves maintaining separate data
tables for the two types of applications, and somehow
synchronizing them. The simplest approach is to first load all
data into a conventional marketing database--structured for
batch processing--and periodically create extracts that are
structured for access by real-time systems or feed data into the
real-time systems' own tables. The problem with this method is
that batch processes are used to update the conventional
database and to generate the extracts. This means the marketing
system cannot feed adjusted information as a transaction occurs.
So the marketing feed itself is something less than real-time.
A slightly more sophisticated approach is to update the table
that supports the real-time systems at the same time that the
main marketing database is updated. This avoids the lag due to
batch extracts, but the real-time table must still wait for the
batch updates of the main database. The only way to avoid this second lag is to
update the real-time table directly, rather than filtering data
through the main marketing system first. Some
systems--particularly those designed for Internet marketing--do
maintain a profile database that is updated in real time in this
fashion. In addition to simply capturing the new transaction,
such a system might recalculate derived values such as
cumulative purchases and model scores, and use the adjusted
values in managing the interaction. The new data would be
periodically added to the main marketing database during its
regular batch update. This sort of synchronization is about the
best that can be done with conventional technology.
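A minimal Python sketch of such a profile database follows. The class, its field names and the scoring formula are all hypothetical; an in-memory dictionary stands in for the real-time table and a plain list stands in for the main marketing database. Each transaction updates the derived values immediately and is also queued for the next batch load.

```python
def score(profile):
    # Hypothetical toy model score based on cumulative purchases.
    return min(1.0, profile["total"] / 1000.0)


class ProfileStore:
    def __init__(self):
        self.profiles = {}  # real-time table, keyed by customer ID
        self.pending = []   # transactions awaiting the batch update

    def record_purchase(self, customer_id, amount):
        """Capture a transaction and refresh derived values at once."""
        profile = self.profiles.setdefault(
            customer_id, {"purchases": 0, "total": 0.0}
        )
        profile["purchases"] += 1
        profile["total"] += amount
        profile["score"] = score(profile)
        self.pending.append((customer_id, amount))
        return profile

    def flush_to_batch(self, main_db):
        """Hand queued transactions to the main database's batch load."""
        count = len(self.pending)
        main_db.extend(self.pending)
        self.pending.clear()
        return count


store = ProfileStore()
main_db = []
store.record_purchase("c1", 250.0)
print(store.record_purchase("c1", 250.0))
print(store.flush_to_batch(main_db))
```

Between flushes the two stores disagree, which is the synchronization gap described above; the sketch narrows the lag for the front-office system but does not remove it for the main database.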
As marketers continue to integrate real-time front-office
systems with batch-oriented marketing databases, vendors will
face increasing pressure to combine batch and real-time
processing in a single system. As we've seen, this is a
difficult task using today's standard (relational) technologies.
Buyers looking for an integrated system should look carefully at
each vendor's approach to this challenge, to ensure the system
they purchase will meet both current and future needs.
David M. Raab is president of ClientXClient, a consulting
and software firm specializing in customer value optimization.
He can be reached at
draab@clientxclient.com.
Copyright 1999 CLIENTxCLIENT Contact:
info@CLIENTxCLIENT.com
Call: 908.350.3012