As the tumbleweeds drift through my lonely blog, I decided to visit here again before the wake of progress reclaims these lost bits into oblivion.  It's been interesting lately to see the uptake of data mining throughout our customer and partner base, especially given the questions that have come through on how to implement various solutions.

Last week (I believe) I had an interesting question from a consulting partner on how to create a model to predict whether or not someone's bank balance would go to zero in the next week.  Cool problem that can definitely be solved through DM.  Since I don't know who the bank is, or even if it's a real bank, I don't see any problem in sharing my response with the world here.

Edited Question

What concerns them is a customer's balance going to $0 within one month. They would like to understand the prior transaction behavior of these customers so that they can effectively predict which customers may be on the cusp of overdrafting and prevent it.

The data that I have is in two tables. One table contains the account balances for ~7800 customers. The second table contains transaction details categorized by transaction type (withdrawals, loan activity, contributions, etc.) for those customers over a two-year period.

What would be the best mining algorithm to use to create a model that can then be run against a larger dataset that I have with ~850 million transactions? Any specific approaches and insights would be very helpful.

My response:

I would create records that would represent a “customer transaction window” containing attributes about recent history plus a flag whether the user hit a $0 balance.

The Customer Transaction Window record would contain information about the customer and recent transactions.  For example, here are some columns that you may find predictive – there are many more, but these are some ideas to get started.

CustTransWindowId      

  • Average12moBalance – average balance over past 12 months
  • Average3moBalance – average balance over past 3 months
  • Average1moBalance – average balance over past 1 month
  • Average1wkBalance – average balance over past 1 week
  • Largest1wkTransaction – largest transaction in past week as % of average 1 wk balance
  • AverageTransactionsPerMonth12 – average # of transactions over last 12 months
  • AverageTransactionsPerMonth3 – average # of transactions over last 3 months
  • NumberTransactionsPastMonth – Number of transactions in the last month
  • NumberTransactionRatio12 – number of transactions in the last month divided by the 12-month monthly average (AverageTransactionsPerMonth12)
  • NumberTransactionRatio3 – number of transactions in the last month divided by the 3-month monthly average (AverageTransactionsPerMonth3)
  • Month – the current month of year
  • Week – the current week of month
  • HitZeroBalance – True if customer hit 0 balance this week
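
To make the record layout concrete, here is a minimal Python sketch of how one such window record could be computed for a single customer. Everything about the input shape is my assumption for illustration: `balances` is a hypothetical dict of daily balances keyed by date, `transactions` is a hypothetical list of (date, amount) pairs, and months are approximated as 30, 90 and 365 days. I also assume the history attributes only look at days before the week being labeled, so the label does not leak into the inputs.

```python
from datetime import date, timedelta
from statistics import mean


def avg_balance(balances, end, days):
    """Mean daily balance over the `days` days before `end` (exclusive)."""
    start = end - timedelta(days=days)
    values = [b for d, b in balances.items() if start <= d < end]
    return mean(values) if values else None


def count_transactions(transactions, end, days):
    """Number of transactions in the `days` days before `end` (exclusive)."""
    start = end - timedelta(days=days)
    return sum(1 for d, _ in transactions if start <= d < end)


def build_window_record(balances, transactions, week_start):
    """Build one customer transaction window for the week starting at week_start."""
    week_end = week_start + timedelta(days=7)
    avg_1wk = avg_balance(balances, week_start, 7)
    prior_week = [abs(a) for d, a in transactions
                  if week_start - timedelta(days=7) <= d < week_start]
    n_month = count_transactions(transactions, week_start, 30)
    per_month_12 = count_transactions(transactions, week_start, 365) / 12
    per_month_3 = count_transactions(transactions, week_start, 90) / 3
    return {
        "Average12moBalance": avg_balance(balances, week_start, 365),
        "Average3moBalance": avg_balance(balances, week_start, 90),
        "Average1moBalance": avg_balance(balances, week_start, 30),
        "Average1wkBalance": avg_1wk,
        # Largest transaction as a fraction of the average 1-week balance
        "Largest1wkTransaction": (max(prior_week) / avg_1wk
                                  if prior_week and avg_1wk else None),
        "AverageTransactionsPerMonth12": per_month_12,
        "AverageTransactionsPerMonth3": per_month_3,
        "NumberTransactionsPastMonth": n_month,
        "NumberTransactionRatio12": n_month / per_month_12 if per_month_12 else None,
        "NumberTransactionRatio3": n_month / per_month_3 if per_month_3 else None,
        "Month": week_start.month,
        "Week": (week_start.day - 1) // 7 + 1,  # week of month, 1-based
        # Label: did the balance reach zero (or below) during the labeled week?
        "HitZeroBalance": any(b <= 0 for d, b in balances.items()
                              if week_start <= d < week_end),
    }
```

At the real data volumes you would want to do this as set-based aggregation in the database rather than row by row in Python; the sketch is only meant to pin down what each column means.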

Creating such a dataset will be the expensive/difficult part.  Also, you should expect to have many, many more records with HitZeroBalance=False than HitZeroBalance=True.  You can deal with this in two ways.

  1. You can take all the weekly records, split them into cases that hit zero and those that don’t, and sample to balance the number of hits and non-hits.  This will give you a model that predicts zero balances across all customers.
  2. Only consider weekly records for customers who have hit zero at least once.  This will give a model that will be more tuned to discriminate between customers with somewhat risky behavior and when that behavior results in a zero balance.
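
Both options are easy to sketch once the weekly records exist. The Python below is a minimal illustration, assuming each record is a dict with the HitZeroBalance flag plus a hypothetical `CustomerId` key (not in the column list above) so that records can be grouped by customer. Option 1 is done here by downsampling the larger class, which is one way to balance; it is my choice for the sketch, not the only one.

```python
import random


def balance_by_downsampling(records, seed=0):
    """Option 1: keep every record of the rarer class and an equal-sized
    random sample of the more common class."""
    rng = random.Random(seed)  # fixed seed so the sample is repeatable
    hits = [r for r in records if r["HitZeroBalance"]]
    misses = [r for r in records if not r["HitZeroBalance"]]
    minority, majority = sorted((hits, misses), key=len)
    balanced = minority + rng.sample(majority, len(minority))
    rng.shuffle(balanced)
    return balanced


def only_customers_who_ever_hit_zero(records):
    """Option 2: keep all weekly records, hits and non-hits, but only for
    customers who hit zero at least once."""
    risky = {r["CustomerId"] for r in records if r["HitZeroBalance"]}
    return [r for r in records if r["CustomerId"] in risky]
```

Note that option 2 will usually still be imbalanced within the risky customers, so you may want to apply the sampling from option 1 on top of it.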

In either case, you can apply the resultant model to all customers to get a predictive result, or you can create both and apply them to each individual sub-group.  As far as algorithms go, your best bet is Decision Trees and Neural Nets.  Trees will be faster and are generally as accurate, but there are cases where NN will give higher accuracy.
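
If you are wondering why trees are fast, it helps to see what a tree does at each node: it looks for the single attribute and threshold that best separate hits from non-hits. Here is a toy Python sketch of that one step using Gini impurity. To be clear, this is a teaching illustration under my own assumptions (numeric attributes only, records as dicts), not the algorithm any particular product implements.

```python
def gini(labels):
    """Gini impurity of a list of True/False labels (0.0 means pure)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2 * p * (1 - p)


def best_split(records, features, target="HitZeroBalance"):
    """Return (feature, threshold, weighted_gini) for the best single split."""
    best = None
    for feature in features:
        values = sorted({r[feature] for r in records})
        # Candidate thresholds: midpoints between consecutive distinct values
        for low, high in zip(values, values[1:]):
            threshold = (low + high) / 2
            left = [r[target] for r in records if r[feature] <= threshold]
            right = [r[target] for r in records if r[feature] > threshold]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(records)
            if best is None or score < best[2]:
                best = (feature, threshold, score)
    return best
```

A full tree just repeats this search on each side of the split until the groups are pure enough or too small to split further.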

Once you have created the process for generating the records, you can easily apply the model to all the records for the current week.  Applying the model is extremely fast (many millions of records per hour); organizing the data into the correct input form will likely be more expensive.

Jobs

I did say jobs, right?  Currently we are actively looking for qualified people to be on our team - we are growing and need developers, testers, program managers, and even you marketing types to fill out our OLAP and Data Mining teams.  Don't be shy!  Be a part of the magic!