Overview
The purpose of this post is to run through an example of using scikit-learn to separate a signal from a background. There are plenty of examples and discussions of scikit-learn online, but long examples can be challenging for a newcomer. The point of this example is to start with a simple, understandable dataset, and to use machine learning to get a visually intuitive output: a plot where the two populations, and the boundary separating them, are easy to see.
To start, let's consider a toy dataset in which you have a thousand customers, with information about their credit (fico) score, their income, and whether or not they defaulted on their loan. You want to use this information in the future to determine whether or not a new customer will default on their loan. It is a straightforward idea, and the first few lines of the spreadsheet might look like this:
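(Illustrative rows, generated with the script at the end of this post; note that the first column, labeled customer, holds the repayment status.)
customer,fico,income
ontime,741.2,85103.4
default,598.7,46220.1
ontime,722.9,77560.3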
In just a few lines of Python, we can use this information to separate the defaulting customers from the customers who pay on time.
Line-by-line analysis
First we import numpy, which gives us the ability to manipulate multidimensional array datasets, with many features similar to Matlab or Octave. We also import pylab for plotting capabilities. For the learning itself, we use SVC (support vector classification) from sklearn.
import numpy,pylab; from sklearn.svm import SVC
Let’s assume that our spreadsheet was a comma-separated file. We will use numpy’s genfromtxt to read this file, loading everything as strings so the mixed text and number columns come in as one 2D array. The [1:] at the end tells numpy to ignore the first line and take everything after, effectively removing the title row of the spreadsheet and just leaving the real data.
DataTable = numpy.genfromtxt('data.csv',delimiter=',',dtype=str)[1:]
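A quick peek at the first couple of rows (illustrative values, matching the sample above) shows that everything was loaded as strings, which is why the next step converts the numeric columns explicitly:
print(DataTable[:2])
# [['ontime' '741.2' '85103.4']
#  ['default' '598.7' '46220.1']]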
Next we will make two arrays. The first is called “DataPoints”. It is the second two columns of the spreadsheet – the fico score and income. This is what we know about the customers, and it is what we will use to predict whether or not a customer defaults. For these customers we also know the “TruthValue” of whether or not they defaulted; it is stored in DataTable[:,0] – the first column. For simplicity, if this column says “ontime” our TruthValue will be 1 (True), and otherwise it will be 0 (False).
DataPoints,TruthValues = (DataTable[:,[1,2] ]).astype(float), (DataTable[:,0]=='ontime')
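A quick sanity check (not strictly necessary) confirms the shapes line up:
print(DataPoints.shape)    # (1000, 2): one row per customer, columns are fico and income
print(TruthValues.mean())  # fraction of on-time customers, roughly 0.5 for this pseudo-data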
Next we want to do the training. An instance of SVC is trained (or fit) according to the DataPoints and TruthValues. We are using a linear kernel, which basically means that if we plotted the fico score and income on a 2D plane, the bad and good customers would be separated by a straight line in the plane. We are also using a tuning of C=100. C is a penalty parameter that helps balance incorrect classifications against overtraining; that is a lesson for another day.
TrainedSVC = SVC(C = 100, kernel = 'linear').fit(DataPoints,TruthValues)
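The value C=100 here is just a starting point. As a minimal sketch of how one might pick it more carefully (assuming a scikit-learn version with the model_selection module), GridSearchCV can cross-validate a few candidate values:
from sklearn.model_selection import GridSearchCV
# try a few orders of magnitude for the penalty parameter C, with 5-fold cross-validation
search = GridSearchCV(SVC(kernel='linear'), {'C': [0.1, 1, 10, 100, 1000]}, cv=5)
search.fit(DataPoints, TruthValues)
print(search.best_params_)  # the C value that cross-validated best on this sample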
Now let's get the boundaries of a plot so we can draw some interesting quantities. Here we get the maxima and minima of the two variables (fico score and income).
x_max,y_max,x_min,y_min = DataPoints[:, 0].max(),DataPoints[:, 1].max(),DataPoints[:, 0].min(),DataPoints[:, 1].min()
Using the boundaries above, we can lay down a fine mesh of points (200 steps in each direction) covering our region of interest.
xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, (x_max-x_min)/200.0), numpy.arange(y_min, y_max, (y_max-y_min)/200.0))
We can then evaluate the TrainedSVC at every point in our mesh grid.
GridEvaluation = TrainedSVC.predict(numpy.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)
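If you prefer a smoother picture than the hard 0/1 predictions, SVC also provides decision_function, which returns the signed distance of each point from the separating line; a sketch:
# signed distance from each mesh point to the separating hyperplane:
# positive on one side of the line, negative on the other
GridDistance = TrainedSVC.decision_function(numpy.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)
Plotting GridDistance instead of GridEvaluation shades the plane by how far each point sits from the boundary, rather than by predicted class.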
Finally, we can draw our results. We will draw a light colormesh of the GridEvaluation at the xx,yy points calculated above. We will draw a scatter plot of all our data points, using the color argument c set to the truth values. So a red dot in the blue mesh would be misclassified, and vice versa.
pylab.pcolormesh(xx, yy, GridEvaluation,alpha=0.1)
pylab.scatter(DataPoints[:, 0], DataPoints[:, 1], c=TruthValues)
pylab.xlabel('Fico');pylab.ylabel('Income ($)');pylab.show()
My result looks like this:
[Plot: income versus fico score for all 1000 customers, colored by repayment status, over the lightly shaded decision regions of the trained SVC]
In just a couple of seconds, we have taken a spreadsheet with 1000 data points, used the information to separate two groups of customers, and can see great separation power. Only a few individuals in the group are wrongly classified.
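To put a number on “only a few”, the classifier's score method reports the fraction of points classified correctly. Keep in mind this is evaluated on the training data itself, so it is an optimistic estimate:
print(TrainedSVC.score(DataPoints, TruthValues))  # fraction classified correctly; close to 1 for this well-separated toy data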
The entire code
All summed up, the code looks like this:
import numpy,pylab
from sklearn.svm import SVC
DataTable = numpy.genfromtxt('data.csv',delimiter=',',dtype=str)[1:]
DataPoints,TruthValues = (DataTable[:,[1,2] ]).astype(float), (DataTable[:,0]=='ontime')
TrainedSVC = SVC(C = 100, kernel = 'linear').fit(DataPoints,TruthValues)
x_max,y_max,x_min,y_min = DataPoints[:, 0].max(),DataPoints[:, 1].max(),DataPoints[:, 0].min(),DataPoints[:, 1].min()
xx, yy = numpy.meshgrid(numpy.arange(x_min, x_max, (x_max-x_min)/200.0), numpy.arange(y_min, y_max, (y_max-y_min)/200.0))
GridEvaluation = TrainedSVC.predict(numpy.c_[xx.ravel(),yy.ravel()]).reshape(xx.shape)
pylab.pcolormesh(xx, yy, GridEvaluation,alpha=0.1)
pylab.scatter(DataPoints[:, 0], DataPoints[:, 1], c=TruthValues)
pylab.xlabel('Fico');pylab.ylabel('Income ($)');pylab.show()
Caveats
While working with scikit-learn can be pretty easy in practice, there are many issues to consider. Parameters like C should be tuned to avoid overtraining, and performance should be measured on orthogonal (held-out) data samples rather than on the training data itself. The nuances of training/testing, cross-validation, and different kernels will be explored in future posts.
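As a small preview, a minimal sketch of holding out an orthogonal test sample with scikit-learn's train_test_split (again assuming the model_selection module is available):
from sklearn.model_selection import train_test_split
# hold out 30% of the customers; evaluate only on data the model never saw during training
TrainPoints, TestPoints, TrainTruth, TestTruth = train_test_split(DataPoints, TruthValues, test_size=0.3, random_state=0)
HeldOutSVC = SVC(C = 100, kernel = 'linear').fit(TrainPoints, TrainTruth)
print(HeldOutSVC.score(TestPoints, TestTruth))  # accuracy on the unseen customers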
Making the pseudo-data
For reference, the csv file of pseudo-data for this exercise was made with the following snippet of code:
import random
csv = 'customer,fico,income\n'
for x in range(1000):
    a = random.choice(['ontime','default'])
    if a == 'ontime':
        fico = random.gauss(730,40)
        income = random.gauss(80000,12000)
    if a == 'default':
        fico = random.gauss(620,40)
        income = random.gauss(50000,12000)
    csv += a+','+str(round(fico,1))+','+str(round(income,1))+'\n'
f = open('data.csv','w')
f.write(csv)
f.close()