Peak Data Solutions, Inc.
Case Study

By Randall T. Smith

Executive Vice President

Copyright © 2001. All rights reserved.

In today’s competitive marketplace, companies are constantly trying to better understand their customers and predict their needs in order to build, increase, and maintain market share. Some organizations have databases filled with information about their customers, and those that do not have the data in-house can purchase customer data from third parties, such as Acxiom, MetroMail, Polk, CACI, and others. With all this data readily available, modeling can be used to sift through it to better understand customers and their behaviors. Marketers use data modeling to predict attributes such as customer satisfaction, purchasing levels, attrition, loyalty, and response. This paper describes the main steps to follow when building a model. The intent of this document is not to describe a one-stop, instant solution, because unfortunately there isn’t one; there are libraries filled with books on how to mine and model data. Instead, the intent is to provide a better understanding of what is involved in the modeling process, what steps should be taken to properly implement a model, and how to ensure that the predictive model will meet your business needs.

Constructing a Model in 11 Steps

1. Define the business objective for building and applying a predictive model
2. Identify in-house data and consider purchasing third-party data
3. Identify the mining and modeling software required
4. Data standardization and hygiene
5. Identify observation period and outcome/dependent variable
6. Sampling
7. Identify key variables/attributes of individuals within the database (Association)
8. Determine which keys best predict a given behavior (Optimization)
9. Assess, measure, and evaluate predictive techniques
10. Apply the predictive methodologies to meet your overlay business objectives
11. Measure results


Define the business objective for building and applying a predictive model
The first step in data mining and modeling is to clearly define your business objectives and how the application will be utilized. There are many different data mining and modeling techniques and tools; once you have finalized what you want to get out of the model and how it will be used, you can better determine which tools you will need. For example, when building a customer satisfaction index, do you need to know whether a customer is "satisfied" versus "not satisfied" (dichotomous model), or do you need to know the degree to which a customer is satisfied (continuous model)?

Some questions to consider before development include:

• What is the budget for developing this application?
• What is the overall timeline for completion?
• How will it be used: to better understand the customer or to set up a direct-to-customer intervention/interaction?
• Are you trying to predict a dichotomous or continuous outcome?
• Are you trying to increase response rate to a campaign?
• Are you trying to target a select population (e.g. loyal customers versus prospects) for some type of intervention?
• Are you trying to predict a customer action, such as attrition, cost, or purchases?
• How often will the model score be applied: once, quarterly, real-time?
• Is there a penalty for misclassifying a customer, as in credit scoring or fraud detection?

Once you have defined and outlined the goals of the modeling application, the entire development process will run more smoothly, and your objectives will be optimally met.
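The dichotomous-versus-continuous distinction above can be sketched in a few lines of Python; the 0-to-10 rating scale and the cut-off of 7 are hypothetical assumptions, not figures from this paper.

```python
# Sketch: the same raw satisfaction score can feed a dichotomous or a
# continuous model, depending on the business objective. The threshold
# of 7 is a hypothetical cut-off for illustration only.

def dichotomous_outcome(score, threshold=7):
    """Return 1 for 'satisfied', 0 for 'not satisfied'."""
    return 1 if score >= threshold else 0

def continuous_outcome(score):
    """Keep the full degree of satisfaction (e.g. a 0-10 rating)."""
    return float(score)

scores = [3, 7, 9]
print([dichotomous_outcome(s) for s in scores])  # [0, 1, 1]
print([continuous_outcome(s) for s in scores])   # [3.0, 7.0, 9.0]
```

The choice of outcome type then drives the choice of technique (e.g. a classification method for the dichotomous case versus a regression method for the continuous one).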


Identify in-house data and consider purchasing third-party data
Since data is the most critical element needed to develop a predictive model, you need to take inventory and determine which database elements you have or need to acquire. You can also consider purchasing third-party area- and/or household-level data, such as Polk, SMACs, CACI, Census, or InfoBase.

Some of the data sources to consider are:

• Outcome
• Transactional
• Customer Demographic (internal)
• Demographics (external)
• Outbound Interaction
• Inbound Interaction
• Survey
• Response
• (Web) Log
• Click Stream

The following list of questions will help determine which data elements to include:

• What data element will be used as the dependent/predicted outcome variable?
• Are there enough data elements to substantiate the building of a predictive model?
• Will the data that is utilized be available after the model is developed?
• What internal data and elements are available?
• What external, third-party data is available, and do you have the budget, time, and internal data (i.e. name and full address, DMA, Block Group, and/or Zip Code) necessary to append external data?
• Are there erroneous or unusable variables that should not be considered in the data mining process?
• Is there an accurate data dictionary or someone who is knowledgeable about the various data sets and their corresponding data elements?
• How "clean" is the data?
• How well are the data fields populated?
• During what time period is the data available?
• What common "keys" are available to merge the various data sets together?


Identify the mining and modeling software you have in-house or will need to purchase

If you do not already have preferred mining and modeling software in-house, this might be the most difficult step of all.  There are many different products on the market, and their cost can range from $2,000 for a limited desktop tool to $500,000 for a robust server application with a suite of mining modules.  Two commonly used modeling software vendors are SAS and SPSS.  There are many elements to consider when purchasing an application, from staffing to processing. The following is a list of items to consider when deciding on a data-modeling tool:


• Manpower: Do you have the staff necessary to develop a predictive model in-house, or will you have to hire additional staff or outsource the project?  Will your IT department have to test the software before it can go live on your network?

• Training: Will you have to train your staff on how to use the software? How long will it take for a staff member to become proficient at using it?

• Support: Who will provide technical support for the software from installation to applications?

• Consulting: Are consultants available that can help you utilize the software?


• Availability: Is the software available, and will upgrades be available in the future?

• Cost: How much do you want to spend on the software?

• Accuracy: How sensitive is the software to missing values, outliers, noise, and misclassification costs?

• Scalability: Is the software limited to a certain number of records or variables, or will it have room to grow as your database becomes larger?

• Platforms: Does the software company provide a version of the software that will install on the platform (i.e. Unix, NT, etc.) you need?

• Character Processing: Will the software application process both numeric and character independent variables?

• Performance: How fast and accurate is the software? Will it take minutes to run and validate a model or will it be an overnight process?

• Memory Management: How much memory is needed to support the software and how does the software manage its resources?

• Tools/Modules: Does the software have the tools and modules you will need in order to mine the data and construct a model using various techniques? Is the software limited to one modeling technique or will it support multiple?

• Import/Export Data: Does the software have an easy import and export engine in order to move the training data set in and out of the modeling process?

• Formats: Will the software support various formats of the data?

• Data processing: Does the software allow you to pre-process the data?


Data standardization and hygiene

One of the most time-consuming parts of building a predictive model is getting the data into a usable format that you can mine and model.  Most likely you will have to marry several databases together, from various platforms and formats, into one harmonious training data set. Transactional data may have to be summarized; erroneous values should generally be recoded to unknown or missing/null; entire data fields may have to be re-classified. In addition, some modeling techniques will only process numeric and/or continuous attributes. In these cases you may need to change a character gender variable with the values "F"emale, "M"ale, and "U"nknown into three numeric variables, such as female (0,1), male (0,1), and unknown (0,1), where the value 0 is no and 1 is yes. Other techniques will process a combination of both numeric and character variables.
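The gender recoding described above can be sketched as follows; the function name and the handling of unexpected codes are assumptions for illustration.

```python
# A minimal sketch of the recoding described above: a character gender
# field ("F", "M", "U") becomes three 0/1 indicator variables so that
# numeric-only modeling techniques can use it.

def encode_gender(value):
    """Map 'F'/'M'/'U' to female/male/unknown 0/1 indicators.
    Any erroneous or missing code is recoded to unknown."""
    value = value.strip().upper() if value else "U"
    if value not in ("F", "M"):
        value = "U"  # recode erroneous values to unknown
    return {
        "female": 1 if value == "F" else 0,
        "male": 1 if value == "M" else 0,
        "unknown": 1 if value == "U" else 0,
    }

print(encode_gender("F"))  # {'female': 1, 'male': 0, 'unknown': 0}
print(encode_gender("x"))  # {'female': 0, 'male': 0, 'unknown': 1}
```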

In addition to standardizing and cleaning the data elements, you may want to create some of your own independent variables. For example, you may decide to cross two variables, such as age and gender, to form a new variable that further defines a predictive crossing, such as old males, young males, old females, and young females. Independently, age and gender may not be predictive in defining an outcome, but a combined variable may prove effective (e.g. older males and younger females may be likely to rate high on a customer satisfaction index, while younger males and older females may be likely to rate low).
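As a hedged illustration, the age-by-gender crossing might look like this in Python; the cut-off of 40 separating "young" from "old" is an invented assumption.

```python
# Hypothetical sketch of crossing two variables into one combined
# variable, as described above; the age cut-off of 40 is an assumption.

def age_gender_cross(age, gender, cutoff=40):
    """Combine an age band and gender into a single crossed variable."""
    band = "old" if age >= cutoff else "young"
    return f"{band}_{gender.lower()}"

print(age_gender_cross(55, "Male"))    # old_male
print(age_gender_cross(25, "Female"))  # young_female
```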


Identify observation period and outcome / dependent variable

Another critical step in the data mining and modeling process is to determine the modeling observation and outcome periods. The observation period is when a given behavior is observed and captured as a behavior characteristic.  The outcome period is the time when the outcome of interest is observed. One important question to ask yourself is: "Is your performance variable seasonally influenced, or will you be implementing the model within a certain period?" For example, if you are trying to predict what population is likely to respond to an e-mail campaign for a discount on toys, and you want to implement the program during the Christmas season, you naturally would not want to use only the spring and summer as your observation and performance window.  Shoppers have different buying patterns depending on the season. If, however, you get too specific with your performance period, the model might not be useful during other periods, leaving you with a model that can only be implemented during one part of the year and remains furloughed for the rest.

Defining your outcome/dependent variable is another key step in developing a model. For example, if you want to observe and identify loyalty in your performance window, how should loyalty be defined (e.g. a certain threshold of purchases or dollar amount)?



Sampling

Due to the processing time of building and validating a model, and some of the limitations of certain software applications, large data sets can be reduced to smaller, more manageable data sets through sampling.  For example, if you had a database of a million people and you were trying to build a model that predicts customer loyalty, you should definitely sample the million people down to a more reasonable size in order to speed up model development and validation. In addition, if only 10,000 people are considered loyal and the remaining 990,000 are not, then you would be wasting resources by not sampling the 990,000 down to a more manageable size.

When sampling, you generally pick a purely random population that is representative of the entire database. To confirm, you can perform various significance tests on key attributes to determine if there are differences between the sample and the entire population.
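One such significance test can be sketched as a two-proportion z-test comparing a key attribute's rate in the sample with its rate in the full database; the counts below are made up for illustration.

```python
# Hedged sketch: a two-proportion z-test checking whether the rate of a
# key attribute (e.g. loyalty) differs between the sample and the full
# population. |z| < 1.96 means no significant difference at the 5% level.
import math

def two_proportion_z(x1, n1, x2, n2):
    """z statistic for the difference between two proportions."""
    p1, p2 = x1 / n1, x2 / n2
    p = (x1 + x2) / (n1 + n2)  # pooled proportion
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se

# e.g. 520 loyal customers in a 10,000-record sample vs 51,000 in 1,000,000
z = two_proportion_z(520, 10_000, 51_000, 1_000_000)
print(abs(z) < 1.96)  # True -> sample looks representative on this attribute
```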

In addition, there are times when a purely random sample may not be effective.  For example, you may want to predict fraudulent activities, but the behavior is very rare when compared to non-fraudulent activities. In cases like this, if you randomly sampled from the entire population, you might end up with only a few cases of fraud in your sample, leaving you nothing to predict. In this case, you need to "over-sample" the fraudulent group for the training data set and "weight" the segment to reflect its proportional presence within the entire population. This technique is called stratified sampling.
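A minimal sketch of this stratified over-sampling, assuming records are simple dicts with a fraud flag; the 10% sampling rate for the common class is an arbitrary choice for illustration.

```python
# Sketch of the over-sampling described above: keep every rare (fraud)
# record, take a random fraction of the common class, and attach a weight
# so the sample still reflects population proportions.
import random

def stratified_sample(records, common_rate=0.1, seed=42):
    rng = random.Random(seed)
    sample = []
    for rec in records:
        if rec["fraud"]:
            sample.append({**rec, "weight": 1.0})  # keep all rare cases
        elif rng.random() < common_rate:
            # each sampled common record stands in for 1/common_rate records
            sample.append({**rec, "weight": 1 / common_rate})
    return sample

population = [{"id": i, "fraud": i % 100 == 0} for i in range(10_000)]
sample = stratified_sample(population)
print(sum(r["fraud"] for r in sample))  # 100 -- every fraud case retained
```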


Identify key variables / attributes of individuals within the database (association)

The next step is to determine which variables are associated with the outcome that you defined. When you have a training data set with thousands of independent variables, you have to mine through them to determine which ones should be further considered for the modeling process.  In addition, you should examine which independent variables are associated with each other.  Listed below are a few of the data mining techniques that are often utilized:

• Correlation Tests
• Chi-Squared Tests
• Visualization (Graphs)
• Segmentation Analyses
• Information Values (Kullback)
• Factor Analysis
• Decision Trees
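As one concrete example from the list, a chi-squared test of association for a 2x2 attribute-by-outcome table can be computed directly; the counts below are hypothetical.

```python
# Hedged sketch of one mining technique from the list: the chi-squared
# statistic for a 2x2 table [[a, b], [c, d]] of attribute vs. outcome.
# A statistic above 3.84 indicates association at the 5% level (1 df).

def chi_squared_2x2(a, b, c, d):
    """Chi-squared statistic for the 2x2 table [[a, b], [c, d]]."""
    n = a + b + c + d
    num = n * (a * d - b * c) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# e.g. 30 of 100 promo recipients responded vs 10 of 100 non-recipients
stat = chi_squared_2x2(30, 70, 10, 90)
print(stat)         # 12.5
print(stat > 3.84)  # True -> the attribute is associated with the outcome
```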


Determine which keys best predict a given behavior (optimization)

During an observation period you want to develop predictive variables that will delineate the behavior observed within the outcome period.  There are a number of techniques that can be used to identify potential predictive variables (e.g. chi-squared values, information values, etc.). Once potential variables and attributes are identified, they can then be incorporated into a model that predicts an outcome or value. The purpose of the model is to "best" summarize all the potential variables and attributes into a single solution that can easily be implemented. Some modeling paradigms include Regression, Decision Tree, CART, CHAID, Cluster, Neural Networks, and Genetic Algorithms.

The method you choose will depend on the software application you are using, the experience of your staff, any time constraints, and the outcome you are trying to predict. If possible, you should develop several different models utilizing various techniques, and then compare the results to determine which technique provides the "best" solution.


Assess, measure, and evaluate predictive techniques

One of the most critical steps in the modeling process is to evaluate and assess how the model will perform on the general population (i.e. the population outside your training data set). One common error when building a predictive model is to "over-fit" it. Over-fitting occurs when you treat random variations within the training data set as predictive characteristics, so when you apply the model outside the training data set it becomes ineffective.

One way to validate a model is to hold out a random sample. This "validation sample" is a random subset of the training data set that is used only to evaluate how the model will perform on a population that was not used in the development cycle. The validation sample tests the overall robustness of the model. Also, if you build models using different techniques, you can easily identify which ones work best with a validation sample. In some cases, there may not be enough records to maintain a validation sample. In these cases, there are other validation techniques, such as jackknifing and bootstrapping, that can be incorporated. These techniques use the training data set and unique algorithms to validate the model so that you do not have to maintain a holdout sample.
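A minimal sketch of the holdout approach: reserve a random fraction of the training records as a validation sample. The 30% holdout fraction is an assumption, not a figure from the text.

```python
# Sketch of a random holdout split: the validation slice never touches
# model development and is used only to assess the finished model.
import random

def train_validation_split(records, holdout=0.3, seed=7):
    rng = random.Random(seed)
    shuffled = records[:]          # copy so the original order is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1 - holdout))
    return shuffled[:cut], shuffled[cut:]

data = list(range(1000))
train, valid = train_validation_split(data)
print(len(train), len(valid))  # 700 300
```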


Apply the predictive methodologies to meet your overlay business objectives
Once your model is developed and validated, you will need to implement it to meet your objectives. For example, if the business objective is to mail the top 100,000 likely responders with a direct mailing, your model will need to be transformed into an algorithm that allows the selection to occur. Applying this algorithm is usually referred to as "scoring" the data set. The scores generally rank-order the likelihood of an outcome (e.g. Member A is 90% likely to respond, while Member B is only 20% likely).
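Scoring and selecting the top records might be sketched as follows; the customer records and probabilities are invented, and in practice the score function would come from the fitted model.

```python
# Sketch of "scoring" a data set: attach a model score to each record,
# rank-order by score, and select the top N for the mailing.

def select_top(records, score_fn, n):
    """Return the n highest-scoring records under score_fn."""
    scored = [(score_fn(rec), rec) for rec in records]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [rec for _, rec in scored[:n]]

# hypothetical customers with model-assigned response probabilities
customers = [{"id": 1, "p": 0.90}, {"id": 2, "p": 0.20}, {"id": 3, "p": 0.55}]
top2 = select_top(customers, lambda r: r["p"], 2)
print([c["id"] for c in top2])  # [1, 3]
```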

Another thing to consider is how often the scores should be updated.  The frequency of updating depends on the business objective.  For example, if you are trying to predict fraudulent activities, you probably need to score the data set in real-time, while if you are trying to predict which customers are likely to respond to a campaign, you may only need to update the scores monthly or quarterly. Updating also depends on the frequency at which your database changes.  In addition, if your database changes drastically over time, you may need to consider reevaluating and/or reconstructing the predictive model.


Measure results

Once you decide to apply the model, you should consider designing your business strategy so that you can measure "actual" results on the back-end.  The most common methodology for measuring results is a test-and-control design.  For example, if you wanted to test how well a model worked at identifying responders, you could mail a randomly selected control population regardless of model score, mail your top model-selected population, and compare the response rates of the two populations.
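The test-and-control comparison can be sketched as a simple lift calculation; all counts here are made up for illustration.

```python
# Sketch of the back-end measurement: compare the response rate of the
# model-selected mailing against a random control mailing.

def response_rate(responders, mailed):
    """Fraction of mailed individuals who responded."""
    return responders / mailed

model_rate = response_rate(450, 10_000)    # model-selected mailing
control_rate = response_rate(150, 10_000)  # random control mailing
lift = model_rate / control_rate
print(round(lift, 2))  # 3.0 -- the model group responded at 3x the control rate
```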

Developing a measurement strategy is often overlooked, but it is critical to understanding how well a model and/or program is working and whether improvements can be made.  If possible, always try to measure the results of a model.


Request Brochures/More Information:
Please contact us at: 650-363-7236

© Peak Data Solutions, Inc., All Rights Reserved