
Predictive Modeling at the Transaction Level: A Simple Policy Component

Originally published November 3, 2009

The Case

This is a case study of a simple “policy component”: a small piece of plug-in software that implements institutional policy, in this case by applying predictive and risk models directly within the workflow to automate and streamline the work. The definition and description should become clear in the case study.

A large insurance company processes thousands of scanned documents in a day. Operators are trained to work specific kinds of transactions; many of these go into several exception queues before they are either completed or discarded. The work is partially automated, but there are many manual touch points because of technical difficulties with the documents, typographical errors, and incomplete policy information.

The insurance company operates as a servicing agent between mortgage holders, mortgage banks, and insurance companies of record, all of whom issue transactions either from branch offices or (a fourth element to the puzzle) independent agents. The incoming transactions must of course have accurate and conclusive identification of all parties. Operators are evaluated on volume and are expected to process more than one hundred transactions an hour.

There are many places where this complex work flow can break down. A common one, indeed the most common one, is at “payee code.” The payee code refers to the office, or merely to the department within the office, from or to which the transaction flows. Each mortgage bank refers to these offices with its own coding system; so the same office (for example a State Farm agent in Des Moines, Iowa) may be identified in numerous ways, depending on whether the mortgage banker is Fannie Mae, Wells Fargo, or Bank of America. Of course the office does not change, but the code itself may be entirely different depending on the banker.

The payee code is, therefore, a crucial but slippery bit of data required for a transaction to be completed. And very often this code does not come in with the transaction, or if it does, it may be incomplete or inaccurate. If the office issues the transaction, it may not know what code the banker uses to identify it. If the bank issues the transaction, it may simply neglect to enter the information or get it wrong. The payee code, for whatever variety of reasons, ends up being a major problem in pushing transactions efficiently through the system. The company could well spend several thousand dollars each week tracking down payee codes.

Most transactions come into the office as physical paper and are scanned into images and translated into text. The image is available to the operator, and the text is used by the software to prepare the transaction even further. For example, other missing or incorrect data elements frequently include the borrower’s loan or policy number. The borrower’s name, address, and social security number are, however, far more likely to be correct and complete. This information often makes it possible to query a database and fill in the loan and policy numbers. A loan or policy number acquired in this way may need to be verified, and the transaction is flagged accordingly for the operator to intervene as necessary.

The Policy Component

Software preparation and intervention are, in other words, already part of the established process. It turned out that similar help was possible for the payee code with a simple predictive model. Policy numbers sometimes have specific formats which may be very informative. For example, some carriers embed groups of letters in their policy numbers; the policy 6132HP200809 is likely to be identifiable by its pattern of digits and letters. Indeed, the policy number was quite plausibly designed just so it could be easily identified.

One problem in the bulk processing center is that there are so many such patterns flying around that it takes a very experienced and talented operator indeed to keep very many of them in mind, as other pressures are also being exerted. An expert operator might recognize the pattern, know the mortgage banker quite well, recall that this is not an EDI transaction, and realize that, therefore, the missing payee code is most likely 61074. But relying on that expert knowledge is not especially efficient if something better is available.

And something better is available. We have learned already that a policy number of four digits, followed by two letters, followed by six digits is for this mortgage banker either guaranteed or quite likely to be (specifically) Nationwide Mutual. This sort of knowledge is similar to the knowledge that a particular borrower name and address imply a particular loan and policy. The knowledge about the payee code, however, is shakier than the loan and policy number knowledge. But logically there is very little difference. Here is the response of a prototype installation for this policy number: 6132HP200809.

     2 scores for 6132HP200809

 Code   Score
 61074  486
 EDINA  340

The same office is involved in this case, but the distinction between manual and electronic transfer (which the different payee codes indicate) is not built into the model. This lack of certainty needed to be recognized and dealt with. It might well be that a policy pattern applies to several payees. It is also true that some patterns of policy numbers carry more “information” about the payee than others do. For example, a nine digit policy number with no letters or special characters (921228387) might apply to several carriers as well as numerous payee codes, as in this example:

     7 scores for 921228387     

Code   Office  Score
 60923  FI INS EXCHANGE  413
 60917  FARMERS INS CO INC  388
 60702  ALLSTATE INS CO  361
 62772  ALLSTATE INS  336
 61184  STATE FARM FI & CAS CO  133
 61178  STATE FARM FI & CAS CO  95

Here, a purely numeric policy leads to more ambiguous results, but it looks as though Farmers, Allstate, and State Farm are likely candidates. The operator may find this sort of information helpful or confusing. Sorting all of this out takes some effort and careful analysis.
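The mechanics of such a lookup are simple. Here is a minimal Python sketch, assuming a small in-memory pattern table shaped like the database records presented later in the article; the function name, table layout, and ranked-list return are illustrative, not the prototype's actual code:

```python
import re

# Illustrative pattern table, shaped like the prototype's database
# records: a regular expression, length bounds, payee code, and score.
PATTERNS = [
    {"payee": "61074", "regex": r"[0-9]{4}[^a-z]{2}[0-9]{6}",
     "min_len": 12, "max_len": 12, "score": 486},
    {"payee": "EDINA", "regex": r"[0-9]{4}[^a-z]{2}[0-9]{6}",
     "min_len": 12, "max_len": 12, "score": 340},
]

def suggest_payees(policy_number):
    """Return (payee code, score) candidates, best score first."""
    hits = [
        (p["payee"], p["score"])
        for p in PATTERNS
        if p["min_len"] <= len(policy_number) <= p["max_len"]
        and re.fullmatch(p["regex"], policy_number)
    ]
    return sorted(hits, key=lambda h: h[1], reverse=True)

print(suggest_payees("6132HP200809"))  # [('61074', 486), ('EDINA', 340)]
print(suggest_payees("921228387"))     # [] -- no 9-digit pattern in this tiny table
```

Keeping the table in memory, as the prototype does, makes per-transaction scoring effectively free.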

It might at first seem daunting to grapple with this sort of problem. But it merely required a quite straightforward bit of data mining:

  1. Make a collection of problem transactions and arrange by policy number and payee code.

  2. Inspect the policy numbers of the high volume exceptions and look for patterns.

  3. Prepare some pattern matching software in prototype to catch the policy patterns.

  4. Attempt to predict the payee code on a set of randomly selected transactions.

  5. Quantify the success rate as a confidence score.

  6. Iterate for optimal effect, with examples selected at random.

The resulting model employs simple statistics – percent of the pattern in the transaction population, percent of the payee in the transaction population, and percent of the pattern for the payee – to come up with a predictive score. All of this goes to a database which is maintained in memory and is instantly available to any transaction. Here are the two database records which provided the scores for the first example above:

Payee Code: 61074
Score: 486
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max. Length: 12
Pattern / Population: 0.021
Pattern / Payee: 0.8
Payee / Population: 0.017

Payee Code: EDINA
Score: 340
Regular Expression: [0-9]{4}[^a-z]{2}[0-9]{6}
Min. Length: 12
Max. Length: 12
Pattern / Population: 0.0298
Pattern / Payee: 1
Payee / Population: 0.0045
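The records carry three ratios apiece, but the article does not spell out how they combine into the score. A natural combination, offered here only as an assumption, is Bayes' rule, which turns the three ratios into P(payee | pattern); the published scores of 486 and 340 presumably apply further scaling that is not documented:

```python
def posterior(pattern_given_payee, payee_share, pattern_share):
    """Bayes' rule: P(payee|pattern) = P(pattern|payee) * P(payee) / P(pattern).
    This is an assumed reading of the three ratios, not the author's
    documented formula."""
    return pattern_given_payee * payee_share / pattern_share

# Record for payee 61074: Pattern/Payee 0.8, Payee/Population 0.017,
# Pattern/Population 0.021.
print(round(posterior(0.8, 0.017, 0.021), 3))    # 0.648
# Record for payee EDINA: 1.0, 0.0045, 0.0298.
print(round(posterior(1.0, 0.0045, 0.0298), 3))  # 0.151
```

On this reading, 61074 is the likelier payee, consistent with its higher score, although the ratio of the two posteriors does not match the ratio of the published scores, so the prototype evidently scales things differently.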

A policy component such as the payee code software is liable to require some handcrafting, but it was not in this case very labor-intensive, considering the potential payoff in improved transaction throughput and accuracy in a hectic and ambiguous work environment. The prototype in the background of this discussion required about two weeks’ work for a single person. Even if it had turned up exactly nothing, it might have been worth an investment of that size to discover where not to look further for relief.

The results, however, were a bit better than that. Whether they were worth production deployment was another question. Fortunately, the statistics which predict the effect of such a system are quite reliable. Unknown, and unknowable from the perspective of the prototype, is the actual effect on the business, since that depends on variables outside of the data mining and prototyping: volumes, training, and other environmental and economic factors. Decisions on the purely business questions had to come from the business managers, but the costliness of the problem made them more than willing to take them up.

This two-week effort, in other words, made it possible to create a reliable cost-benefit analysis of deploying the prototype into production. After the prototyping effort, nearly all risk involved with the technology had been surmounted. The deployment strategy dictated where and how to implement the new facility within the workflow. But the risk of ending up with incorrect or impractical software had essentially been eliminated by the conclusion of the prototype.

Cost-Benefit Analysis

The cost-benefit analysis for this project was not formal. It was, in fact, a proverbial slam dunk. Its numbers, however, are well worth reviewing here for two reasons: (1) readers of this description will not have the familiarity which the managers had, and (2) the consideration of ROI is a crucial piece of any policy component.

The plug-in nature of components is familiar to one and all as a matter of technology. Now consider the nature of the component as part of a business environment. On the one hand, there is a legacy system and process for transactions; on the other is a functional prototype which might plug in to the legacy system and which would address a quantifiable bottleneck in the existing system. Those of you who have managed IT projects, either from the technical or business side, know the difficulty of measuring scope, risk, and the impact on the overall process at the beginning of a six-month campaign of software development.

Contemplate the component: fabricated apart from the system by one or two people, essentially without need for management. All that is required with such a prototype in hand is to measure exactly what the component does and see how much that would help or hurt. In the case of the payee code, the help was considerable but not, alas, all that might have been hoped for. It was estimated that for a typical batch of transactions, the component could accurately predict the payee code approximately 8% of the time and offer useful suggestions 33% of the time.

These numbers suggested that perhaps the component might reduce the payee code exception queue by 10%, increase customer satisfaction and reduce error handling on 5,000 transactions a week, thereby speeding up the flow, creating goodwill and increasing competitive edge (all without quantification), but (a hard number) saving nearly 100 hours of labor.
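The hard number is easy to sanity-check. If roughly 5,000 transactions a week avoid manual payee-code handling, 100 hours corresponds to about 1.2 minutes saved per transaction; the per-transaction figure below is inferred for illustration, not given in the article:

```python
# Back-of-envelope check on the stated labor savings. The 1.2 minutes
# per transaction is an inferred, illustrative figure.
transactions_per_week = 5_000
minutes_saved_each = 1.2
hours_saved = transactions_per_week * minutes_saved_each / 60
print(hours_saved)  # 100.0
```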

The following screen from the prototype shows the results of running 100 randomly selected policies against the patterns and evaluating the performance:

Scored 100 transactions in 16 milliseconds
Percent scored: 49.00%
Absolute percent accurate top score: 12.00%
Absolute percent accurate secondary score: 9.00%
Percent accurate top score of those processed: 24.49%
Percent accurate secondary score of those processed: 18.37%

Average correct top score: 426.67
Average correct secondary score: 226.67

In this particular case, nearly 50% of the policies received some sort of score; the top prediction was accurate for 12% of all policies, and for another 9% the secondary prediction was the accurate one. The average score of the accurate top predictions is exactly 200 points higher than the average for the accurate secondary ones. A series of such random selections offers the sort of quantification needed to make a confident decision.
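The screen's percentages reduce to simple ratios over the run, which is worth seeing explicitly (the variable names are mine):

```python
# Recomputing the screen's figures from the underlying counts.
total = 100            # randomly selected transactions
scored = 49            # received at least one prediction
top_correct = 12       # top-scored prediction was the right payee code
secondary_correct = 9  # a lower-scored prediction was the right one

print(f"Percent scored: {scored / total:.2%}")                               # 49.00%
print(f"Accurate top score, absolute: {top_correct / total:.2%}")            # 12.00%
print(f"Accurate top score of processed: {top_correct / scored:.2%}")        # 24.49%
print(f"Accurate secondary of processed: {secondary_correct / scored:.2%}")  # 18.37%
```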

Meanwhile, the component has been completed and technical risk eliminated. Nevertheless, coding is necessary to install the component into the legacy system, to make changes to workflow and to provide documentation and training. Such things will vary of course and need not concern us, except to note that the estimates should be relatively accurate. The remaining build-out resembles a plumbing project more closely than it does advanced technology.


These results have genuine meaning only when the deployment strategy has been thought out. Other numbers might be more relevant to other deployment strategies. The following seemed to be the most straightforward way to effect deployment in this transaction environment:
  1. Subject all transactions to pattern matching for payee code during preprocessing. (The scoring process requires negligible computer resources, as is evident in the last screen shot, which averages slightly over 15 milliseconds to process 100 randomly selected policy numbers.)

  2. Write a database record for each prediction; there might be more than one per transaction, in which case they would be ordered by score.

  3. Predictions for each transaction, if they exist, are available to the operator.

  4. An icon or some unobtrusive signal is presented for selection if there are predictions.

  5. The operator may select the icon and receive a pop-up list which indicates the payee codes, the payee names, and their scores, sorted by highest score at the top.

  6. The operator considers these recommendations and chooses any one (or none) of them.

  7. Transaction logging reflects any of these choices in the history of the transaction for subsequent analysis and tuning.

This strategy may or may not be feasible or desirable in a given work environment. Deploying the prototype of a policy component requires a thorough review, and the strategy above is open to correction, emendation, and the like. A completely different strategy might require a different set of statistical results, which should not be difficult to produce. In any event, concrete “policy” is now available for decision support.
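Steps 2 and 5 above, the only ones with any real logic, can be sketched as follows. The record shape and names are illustrative, and the payee names in particular are placeholders, since the article does not give them:

```python
from dataclasses import dataclass

@dataclass
class Prediction:
    # Illustrative record shape for step 2: one row per prediction.
    transaction_id: str
    payee_code: str
    payee_name: str   # placeholder names below; not from the article
    score: int

def popup_rows(predictions, transaction_id):
    """Step 5: rows for the operator's pop-up, highest score first."""
    rows = [p for p in predictions if p.transaction_id == transaction_id]
    return sorted(rows, key=lambda p: p.score, reverse=True)

preds = [
    Prediction("T1", "EDINA", "SAME OFFICE (EDI)", 340),
    Prediction("T1", "61074", "SAME OFFICE (MANUAL)", 486),
]
for p in popup_rows(preds, "T1"):
    print(p.payee_code, p.payee_name, p.score)
```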


A final aspect of deployment involves maintenance. In this example, payee codes are quite volatile; the mortgage bankers are liable to change, add, and delete them frequently. When new mortgage bankers are added to the system, this subsystem needs to be maintained as well. The prototype implements its scoring by loading a database table with these columns:
  • A unique identifier for the mortgage banker

  • The payee code

  • The pattern as a regular expression

  • A minimum length

  • A maximum length

  • The resultant score

  • Some statistical numbers for internal use, generated by the software

Maintenance is therefore quite simple to effect physically. The “logical” maintenance is somewhat more difficult: it requires the statistical routines to be applied. The prototype has these in place, so running them on a schedule or as the situation demands is simple enough, but it does require some time, planning, budgeting, and decision-making. The implemented system, however, need never require coding changes to keep pace with environmental changes.
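The bullet list above maps directly onto a record type. A sketch of the in-memory table row, where the field names are mine and the values are those of the 61074 record shown earlier (the banker identifier is illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class PayeePatternRow:
    banker_id: str                 # unique identifier for the mortgage banker
    payee_code: str
    regex: str                     # the pattern as a regular expression
    min_length: int
    max_length: int
    score: int                     # the resultant score
    pattern_per_population: float  # internal statistics generated
    pattern_per_payee: float       # by the modeling software
    payee_per_population: float

# "Physical" maintenance is just adding, changing, or deleting rows and
# reloading the table.
row = PayeePatternRow("BANKER-01", "61074", r"[0-9]{4}[^a-z]{2}[0-9]{6}",
                      12, 12, 486, 0.021, 0.8, 0.017)
print(row.payee_code, row.score)  # 61074 486
```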

Policy Components in Summary

This example was selected for its simplicity. It starkly profiles a policy component. A policy component has the following features:
  1. Implements policy as predictive modeling at any place in the enterprise from point of sale to boardroom

  2. Gives results from specific analysis, often through data mining and statistical modeling

  3. Employs a RAD prototyping methodology

  4. Has a predictable ROI after prototype and prior to full implementation

  5. Is capable of nearly any statistical functionality

  6. Is “pluggable,” “embeddable” and extremely efficient

Although this example was selected primarily for its simplicity, policy components can support far greater complexity. For example, a mortgage banker implemented a policy component, which was actually deployed more as a full subsystem, to evaluate its entire portfolio for the likelihood of each of its loans becoming 30, 60, or 90 days delinquent. The statistical analysis for this component employed logistic and linear regression on eight variables, cluster analysis, and the transformation of the entire model into fuzzy sets. The scoring (which included reading and writing text files and the “persisting” of objects) averaged 20 milliseconds per loan. The prototyping phase for this project required six weeks for two people. The implementation employed a team of several developers for nearly four months, but the predicted ROI of over $1,000,000 per annum was easily realized and (more importantly) was very close to the prediction.

  • Terry Hipolito, Ph.D.
    Terry has several years’ experience with software development and architecture, statistical modeling, databases and project management; his education includes a Ph.D. from UCLA. Terry is now an independent consultant who specializes in the design, development and deployment of “policy components.” He is writing a book on this subject, complete with methodology, statistical theory and full examples. A subset of this content will soon be available on www.policybots.com. You may reach Terry via tahipolito@earthlink.net or by fax at (714) 993-3218.


