ELEVEN STEPS TO BUILDING A PREDICTIVE MODEL
By Randall T. Smith
Executive Vice President
Copyright ©2001 All rights reserved
Summary
In today's competitive marketplace, companies are constantly
trying to better understand their customers and predict their needs in
order to build, increase, and maintain market share. Some organizations
have databases filled with information about their customers, and if
they do not have the data in-house they can purchase customer data from
third parties, such as Acxiom, MetroMail, Polk, CACI, and others. With
all this data readily available, modeling can be utilized to sift
through all this data to better understand customers and their
behaviors. Marketers utilize data modeling to predict attributes such as
customer satisfaction, purchasing levels, attrition, loyalty, response,
etc. This paper describes the main steps to follow when building a
model. The intent of this document is not to describe the one-stop,
instant solution because unfortunately there isn't one. There are
libraries filled with books on how to data mine and model. Instead, the
intent of this document is to provide a better understanding of what is
involved in the modeling process, what steps should be taken to properly
implement a model, and how to ensure that the predictive model will meet
your business needs.
Constructing a Model in 11 Steps
- 1. Define the business objective for building and applying a predictive model
- 2. Identify in-house data and consider purchasing third-party data
- 3. Identify the mining and modeling software required
- 4. Data standardization and hygiene
- 5. Identify observation period and outcome/dependent variable
- 6. Sampling
- 7. Identify key variables/attributes of individuals within the database
- 8. Determine which keys best predict a given behavior
- 9. Assess, measure, and evaluate predictive techniques
- 10. Apply the predictive methodologies to meet your overall business objectives
- 11. Measure results
Define the business objective for building and applying a predictive model
The first step in data mining and modeling is to clearly
define your business objectives and how the application will be
utilized. There are many different data mining and modeling techniques
and tools. Once you have finalized what you want to get out of the model
and how it will be used, you can better determine which tools you will
need. For example, when building a customer satisfaction index, do you
need to know if a customer is "satisfied" versus "not satisfied"
(dichotomous model), or do you need to know the degree to which a
customer is satisfied (continuous model)?
Some questions to consider before development include:
What is the budget for developing this application?
What is the overall timeline for completion?
How will it be used: to better understand the customer or to set up a targeted campaign?
Are you trying to predict a dichotomous or continuous outcome?
Are you trying to increase response rate to a campaign?
Are you trying to target a select population (e.g. loyal customers versus
prospects) for some type of intervention?
Are you trying to predict a customer action, such as attrition, cost, or response?
How often will the model score be applied: once, quarterly, real-time?
Is there a penalty for misclassifying a customer, such as, credit
scoring or fraud detection?
Once you have defined and outlined the goals of the modeling application, the
entire development process will run more smoothly, and your objectives will
be optimally met.
Identify in-house data and consider purchasing third-party data
Since data is the most critical element needed in
order to develop a predictive model, you need to take inventory and
determine what database elements you have or need to acquire. You can
also consider purchasing third-party area- and/or household-level data
from sources such as Polk, SMACs, CACI, Census, and InfoBase.
Some of the data sources to consider are:
- Customer Demographic (internal)
The following list of questions will help determine which
data elements to include:
What data element will be used as the
dependent/predicted outcome variable?
Are there enough data elements to substantiate the
building of a predictive model?
Will the data that is utilized be available after the
model is developed?
What internal data and elements are available?
What external, third-party data is available, and do you
have the budget, time, and internal data (i.e. name and full address,
DMA, Block Group, and/or Zip Code) necessary to append external data?
Are there erroneous or unusable variables that should
not be considered in the data mining process?
Is there an accurate data dictionary or someone who is
knowledgeable about the various data sets and their corresponding data elements?
How "clean" is the data?
How well are the data fields populated?
During what time period is the data available?
What common "keys" are available to merge the various
data sets together?
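As a sketch of that last question, two data sets that share a common key can be combined with a simple keyed join. The field names below (customer_id, age, purchases) are illustrative assumptions, not fields from any particular database:

```python
# Merge two customer data sets on a shared key (an inner join).
demographics = [
    {"customer_id": 1, "age": 34},
    {"customer_id": 2, "age": 58},
]
transactions = [
    {"customer_id": 1, "purchases": 7},
    {"customer_id": 3, "purchases": 2},
]

# Index one source by the common key, then join the other against it.
by_id = {row["customer_id"]: row for row in demographics}
merged = []
for txn in transactions:
    demo = by_id.get(txn["customer_id"])
    if demo is not None:          # keep only records present in both sources
        merged.append({**demo, **txn})
```

In practice the join key is whatever field is reliably populated across all sources, which is why the "common keys" question above matters.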
Identify the mining and modeling software you have
in-house or will need to purchase
If you do not already have a preferred mining and modeling software
in-house, this might be the most difficult step of all. There are
many different products on the market, and their cost can range from
$2,000 for a limited desktop tool to $500,000 for a robust server
application with a suite of mining modules. Two commonly used
vendors of modeling software are SAS and SPSS. There are many
elements to consider when purchasing an application from staffing to
processing. The following is a list of items to consider when deciding
on a data-modeling tool:
Manpower: Do you have the staff necessary to develop a
predictive model in-house, or will you have to hire additional staff or
outsource the project? Will your IT department have to test the
software before it can go live on your network?
Training: Will you have to train your staff on how to
use the software? How long will it take for a staff member to become
proficient at using the software?
Support: Who will provide technical support for the
software from installation to applications?
Consulting: Are consultants available that can help
you utilize the software?
Availability: Is the software available, and will
upgrades be available in the future?
Cost: How much do you want to spend on the software?
Accuracy: How sensitive is the software to missing
values, outliers, noise, and misclassification costs?
Scalability: Is the software limited to a certain number
of records or variables, or will it have room to grow as your database grows?
Platforms: Does the software company provide a version
of the software that will install on the platform you use (i.e. Unix, NT, etc.)?
Character Processing: Will the software application
process both numeric and character independent variables?
Performance: How fast and accurate is the software?
Will it take minutes to run and validate a model, or will it be an overnight process?
Memory Management: How much memory is needed to
support the software and how does the software manage its resources?
Tools/Modules: Does the software have the tools and
modules you will need in order to mine the data and construct a model
using various techniques? Is the software limited to one modeling
technique or will it support multiple?
Import/Export Data: Does the software have an easy
import and export engine in order to move the training data set into and
out of the modeling process?
Formats: Will the software support various data formats?
Data processing: Does the software allow you to
pre-process the data?
Data standardization and hygiene
One of the most time consuming processes of building a predictive model
is getting the data in a usable format from which you can mine and model
it. Most likely you will have to marry several databases together
from various platforms and formats into one harmonious training data
set. Transactional data may have to be summarized; erroneous values
should generally be recoded to unknown or missing/null; you may have to
re-classify entire data fields. In addition, some modeling techniques
will only process numeric and/or continuous attributes. In these cases
you may need to change a character gender variable from "F"emale, "M"ale,
and "U"nknown into three numeric variables, such as female (0,1), male
(0,1), and unknown (0,1), where the value 0 is no and 1 is yes. Other
techniques will process a combination of both numeric and character variables.
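The gender recoding described above can be sketched in a few lines; the helper name dummy_code_gender and the erroneous value "X" are illustrative assumptions:

```python
# Dummy-code a character gender field ("F", "M", "U") into three
# 0/1 numeric variables, where 0 means no and 1 means yes.
def dummy_code_gender(gender):
    return {
        "female": 1 if gender == "F" else 0,
        "male": 1 if gender == "M" else 0,
        # Anything other than "F" or "M" (including bad data) lands in unknown.
        "unknown": 1 if gender not in ("F", "M") else 0,
    }

records = ["F", "M", "U", "X"]   # "X" is an erroneous value being recoded
coded = [dummy_code_gender(g) for g in records]
```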
In addition to standardizing and cleaning the data
elements, you may want to create some of your own independent variables.
For example, you may decide to cross two variables, such as age and
gender, in order to form a new variable that further defines a
predictive crossing, such as old males, young males, old females, and
young females. Independently, age and gender may not be predictive in
defining an outcome, but a combined variable may prove effective (e.g.
older males and younger females may be likely to rate high on a customer
satisfaction index, while younger males and older females may be likely
to rate low).
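The age-and-gender cross described above might be sketched as follows; the cut-off of 40 years and the segment labels are arbitrary assumptions for illustration:

```python
# Cross age and gender into a single combined variable.
def age_gender_segment(age, gender):
    band = "old" if age >= 40 else "young"   # 40 is an arbitrary cut-off
    return f"{band}_{gender.lower()}"

customers = [(62, "M"), (25, "M"), (70, "F"), (19, "F")]
segments = [age_gender_segment(a, g) for a, g in customers]
# segments -> ["old_m", "young_m", "old_f", "young_f"]
```

Each combined level can then be dummy-coded like any other categorical variable before modeling.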
Identify observation period and outcome/dependent variable
Another critical step in the data mining and modeling process is to
determine the modeling observation and outcome periods. The observation
period is when a given behavior is observed and captured as a behavior
characteristic. The outcome period is the time when the outcome of
interest is observed. One important question that you might ask
yourself: "Is your performance variable seasonally influenced, or will
you be implementing the model within a certain period?" For example, if
you are trying to predict what population is likely to respond to an
e-mail campaign for a discount on toys, and you want to implement the
program during the Christmas season, you naturally would not want to use
only the Spring and Summer as your observation and performance window.
Shoppers have different buying patterns depending on the season. If,
however, you get too specific with your performance period, the model
might not be useful during other periods, leaving you with a model that
can only be implemented during one part of the year and sits idle for the rest.
Defining your outcome/dependent variable is another key
step in developing a model. For example, if you want to observe and
identify loyalty in your performance window, how should loyalty be
defined (e.g. a certain threshold of purchases or dollar amount)?
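One hedged sketch of such a loyalty definition, assuming a made-up threshold of five purchases in the outcome window:

```python
# Define a binary loyalty outcome from a purchase-count threshold.
LOYALTY_THRESHOLD = 5   # illustrative assumption, not a recommended value

def is_loyal(purchases_in_window):
    return 1 if purchases_in_window >= LOYALTY_THRESHOLD else 0

# Purchase counts observed in the outcome period for four customers.
outcome = [is_loyal(p) for p in [8, 2, 5, 0]]
```

Whatever threshold is chosen, it must be computable again at scoring time, or the model cannot be applied.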
Sampling
Due to the processing time of building and validating a model and some
of the limitations of certain software applications, large data sets can
be reduced to smaller, more manageable data sets through sampling.
For example, if you had a database of a million people and you were
trying to build a model that predicts customer loyalty, you should
definitely sample the million people down to a more reasonable size in
order to speed the model development and validation time. In addition,
if only 10,000 people are considered loyal and the remaining 990,000 are
not loyal, then you would be wasting resources by not sampling the
990,000 down to a more manageable size.
When sampling, you generally draw a purely random sample
that is representative of the entire database. To confirm,
you can perform various significance tests on key attributes to
determine if there are differences between the sample and the entire database.
In addition, there are times when a purely random sample
may not be effective. For example, you may want to predict
fraudulent activities, but the behavior is very rare when compared to
non-fraudulent activities. In cases like this, if you randomly sampled
from the entire population, you may end up with only a few cases of
fraud in your sample, leaving you nothing to predict. In this case, you
need to "over-sample" the fraudulent group for the training data set and
"weight" the segment to reflect their proportional presence within the
entire population. This technique is called stratified sampling.
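The over-sampling and weighting described above can be sketched as follows; the counts (100 fraud cases among 100,100 records) and the sample size of 1,000 are invented for illustration:

```python
import random

# Stratified sampling: keep every rare fraud case, down-sample the
# common non-fraud cases, and attach a weight so the training set
# still reflects the true population proportions.
random.seed(0)
fraud = list(range(100))             # 100 rare fraud cases
non_fraud = list(range(100_000))     # 100,000 non-fraud cases

sample_non_fraud = random.sample(non_fraud, 1_000)   # down-sample majority

# Weight per stratum = population count / sample count.
training = (
    [{"id": i, "fraud": 1, "weight": 1.0} for i in fraud]
    + [{"id": i, "fraud": 0, "weight": len(non_fraud) / len(sample_non_fraud)}
       for i in sample_non_fraud]
)
```

Summing the weights recovers the original population counts, which is what lets the model "see" the true fraud rate despite the over-sampling.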
Identify key variables/attributes of individuals within the database
The next step is to determine which variables are associated with the
outcome definition you established. When you have a training data set
with thousands of independent variables, you have to mine through all
the various variables to determine which ones should be further
considered for the modeling process. In addition, you should
examine which independent variables are associated with each other.
Listed below are a few of the data mining techniques that are often used:
- Correlation Tests
- Chi-Squared Tests
- Visualization (Graphs)
- Segmentation Analyses
- Information Values (Kullback)
- Factor Analysis
- Decision Tree
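As one example from the list, a chi-squared test on a 2x2 table can screen a candidate variable against the outcome. The counts below are invented for illustration:

```python
# Pearson chi-squared statistic for a 2x2 contingency table.
def chi_squared_2x2(table):
    (a, b), (c, d) = table
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    stat = 0.0
    for obs, row, col in ((a, row1, col1), (b, row1, col2),
                          (c, row2, col1), (d, row2, col2)):
        expected = row * col / n          # expected count under independence
        stat += (obs - expected) ** 2 / expected
    return stat

# Rows: candidate variable present / absent; columns: outcome yes / no.
stat = chi_squared_2x2([(30, 70), (10, 90)])   # stat = 12.5
```

A larger statistic suggests a stronger association, so variables can be rank-ordered by it before the modeling step.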
Determine which keys best predict a given behavior (optimization)
During an observation period you want to develop predictive variables
that will delineate the behavior observed within the outcome period.
There are a number of techniques that can be used to identify potential
predictive variables (e.g. chi-squared values, information values,
etc.). Once potential variables and attributes are identified, they can
then be incorporated into a model that predicts an outcome or value. The
purpose of the model is to "best" summarize all the potential variables
and attributes into a single solution that can easily be implemented.
Some modeling paradigms include Regression, Decision Tree, CART, CHAID,
Cluster, Neural Networks, and Genetic Algorithms.
The method you choose will depend on the software
application you are using, the experience of your staff, any time
constraints, and the outcome you are trying to predict. If possible, you
should develop several different models utilizing various techniques,
and then compare the results to determine which technique performs best.
Assess, measure, and evaluate predictive techniques
One of the most critical steps in the modeling process is to evaluate
and assess how the model will perform on the general population (i.e.
the population outside your training data set). One common error when
building a predictive model is to "over-fit" the model. This occurs when
you try too hard to put predictive characteristics in a model that are
just random variations within the training data set. So when you try to
use the model outside your training data set it becomes ineffective.
One way to validate a model is to hold out a random sample. This
"validation sample" is a random subset of the training data set that is
only used to evaluate how the model will perform on a population that
was not used in the development cycle. This validation sample will test
the overall robustness of the model. Also, if you build models using
different techniques, you can easily identify which ones work best with
a validation sample. In some cases, there may not be enough records
present in order to maintain a validation sample. In these cases, there
are other validation techniques, such as jackknifing and bootstrapping,
that can be incorporated. These techniques use the training data set and
unique algorithms to validate the model so that you do not have to
maintain a holdout sample.
Apply the predictive methodologies to meet your overall business objectives
Once your model is developed and validated, you will need to implement
the model to meet your objectives. For example, if the business
objective is to mail the top 100,000 likely responders with a direct
mailing, your model will need to be transformed into an algorithm to
allow the selection to occur. This algorithm is usually referred to as
"scoring" the data set. The scores generally rank order the likelihood
of an outcome (e.g. Member A is 90% likely to respond, while Member B is
only 20% likely).
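Scoring and rank-ordering can be sketched as follows; the member IDs, scores, and a cut of the top three stand in for the top 100,000 described above:

```python
# Each member has a model score: the likelihood of responding.
scored = [("A", 0.90), ("B", 0.20), ("C", 0.75), ("D", 0.40), ("E", 0.65)]

# Rank-order by likelihood of response, highest first.
ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)

# Select the top N for the mailing.
mailing_list = [member for member, _ in ranked[:3]]
# mailing_list -> ["A", "C", "E"]
```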
Another thing to consider is how often the scores should
be updated. The frequency of updating depends on the business
objective. For example, if you are trying to predict fraudulent
activities, you probably need to score out the data set in real-time,
while if you are trying to predict which customer is likely to respond
to a campaign, you may only need to update the scores monthly or
quarterly. Updating also depends on the frequency at which your database
changes. In addition, if your database changes drastically over
time, you may need to consider reevaluating and/or reconstructing the model.
Measure results
Once you decide to apply the model, you should consider designing
your business strategy so that you can measure "actual" results on the
back-end. The most common methodology for measuring results is to
configure a test and control design. For example, if you wanted to
test how well a model worked at identifying responders, you can mail a
randomly selected control population regardless of the model score, and
then mail your top model-selected population, and compare the
response rates of the two populations.
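The test-and-control comparison might be computed as follows; the mail quantities and response counts are invented, and the lift figure is what the back-end measurement delivers:

```python
# Back-end measurement of a test-and-control mailing.
control = {"mailed": 10_000, "responded": 150}   # random selection, no model
test = {"mailed": 10_000, "responded": 450}      # top model-scored population

control_rate = control["responded"] / control["mailed"]   # 1.5%
test_rate = test["responded"] / test["mailed"]            # 4.5%
lift = test_rate / control_rate                           # 3.0x over random
```

A lift well above 1.0 is the concrete evidence that the model is adding value over a random selection.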
Back-end measurement strategies are often overlooked, but they are critical to
understanding how well a model and/or program is working and whether or not
improvements can be made. If possible, always try to measure the
results of a model.