# The Kaggle Higgs Challenge – Beat the benchmarks with scikit-learn

This post is intended as a quick-start guide to getting a competitive score in the Higgs Boson Machine Learning Challenge, using just a bit of python and scikit-learn.

The data consists of just a training.csv and test.csv file. Each of these has a column of ID numbers and 30 columns of feature quantities. Each row represents an event which is either a signal higgs event (s) or a background event (b). The training.csv indicates the truth value (s or b) as well as an event weight.

The goal of this challenge is to classify the events as signal in a manner that optimizes a certain metric. The metric for this challenge is the AMS metric, which is a function of the weighted numbers of correctly and incorrectly guessed signal events.

My approach is pretty simple. I am using the gradient boosting classifier, and I am using the probability output for each event to rank the events. I don’t take the classification at face value – I make a cutoff on the probability prediction and call the upper 15% of events as signal. You can optimize this threshold to maximize the AMS, but after a few tests I have found that the upper 15% is usually about right.

I divide the training sample, keeping 90% as training, and 10% as validation. I calculate the AMS on both to check for overtraining.

To check that the results look reasonable, I also plot the probability prediction. The red is training background, and the blue is training signal. The black dots are the testing data, normalized to the training sample. I haven’t applied any event weights, because we don’t know these in the test data. The blue shaded region is the upper 15% which I call signal to maximize the AMS metric.

The probability prediction for training signal and background, as well as on testing data, based on the gradient boosting classifier.

Without further ado, this is the code. It should predict in AMS of about 3.47, and the submission scored about 3.38 (either due to random fluctuation or a tiny bit of overtraining). It runs in 5 minutes for me. I am using sklearn 0.14.1 and python 2.7.3.

```import numpy as np
from sklearn.ensemble import GradientBoostingClassifier as GBC
import math

data_train = np.loadtxt( 'training.csv', delimiter=',', skiprows=1, converters={32: lambda x:int(x=='s'.encode('utf-8')) } )

# Pick a random seed for reproducible results. Choose wisely!
np.random.seed(42)
# Random number for training/validation splitting
r =np.random.rand(data_train.shape[0])

# Put Y(truth), X(data), W(weight), and I(index) into their own arrays
print 'Assigning data to numpy arrays.'
# First 90% are training
Y_train = data_train[:,32][r<0.9]
X_train = data_train[:,1:31][r<0.9]
W_train = data_train[:,31][r<0.9]
# Lirst 10% are validation
Y_valid = data_train[:,32][r>=0.9]
X_valid = data_train[:,1:31][r>=0.9]
W_valid = data_train[:,31][r>=0.9]

# Train the GradientBoostingClassifier using our good features
print 'Training classifier (this may take some time!)'
gbc = GBC(n_estimators=50, max_depth=5,min_samples_leaf=200,max_features=10,verbose=1)
gbc.fit(X_train,Y_train)

# Get the probaility output from the trained method, using the 10% for testing
prob_predict_train = gbc.predict_proba(X_train)[:,1]
prob_predict_valid = gbc.predict_proba(X_valid)[:,1]

# Experience shows me that choosing the top 15% as signal gives a good AMS score.
# This can be optimized though!
pcut = np.percentile(prob_predict_train,85)

# This are the final signal and background predictions
Yhat_train = prob_predict_train > pcut
Yhat_valid = prob_predict_valid > pcut

# To calculate the AMS data, first get the true positives and true negatives
# Scale the weights according to the r cutoff.
TruePositive_train = W_train*(Y_train==1.0)*(1.0/0.9)
TrueNegative_train = W_train*(Y_train==0.0)*(1.0/0.9)
TruePositive_valid = W_valid*(Y_valid==1.0)*(1.0/0.1)
TrueNegative_valid = W_valid*(Y_valid==0.0)*(1.0/0.1)

# s and b for the training
s_train = sum ( TruePositive_train*(Yhat_train==1.0) )
b_train = sum ( TrueNegative_train*(Yhat_train==1.0) )
s_valid = sum ( TruePositive_valid*(Yhat_valid==1.0) )
b_valid = sum ( TrueNegative_valid*(Yhat_valid==1.0) )

# Now calculate the AMS scores
print 'Calculating AMS score for a probability cutoff pcut=',pcut
def AMSScore(s,b): return  math.sqrt (2.*( (s + b + 10.)*math.log(1.+s/(b+10.))-s))
print '   - AMS based on 90% training   sample:',AMSScore(s_train,b_train)
print '   - AMS based on 10% validation sample:',AMSScore(s_valid,b_valid)

# Now we load the testing data, storing the data (X) and index (I)
data_test = np.loadtxt( 'test.csv', delimiter=',', skiprows=1 )
X_test = data_test[:,1:31]
I_test = list(data_test[:,0])

# Get a vector of the probability predictions which will be used for the ranking
print 'Building predictions'
Predictions_test = gbc.predict_proba(X_test)[:,1]
# Assign labels based the best pcut
Label_test = list(Predictions_test>pcut)
Predictions_test =list(Predictions_test)

# Now we get the CSV data, using the probability prediction in place of the ranking
print 'Organizing the prediction results'
resultlist = []
for x in range(len(I_test)):
resultlist.append([int(I_test[x]), Predictions_test[x], 's'*(Label_test[x]==1.0)+'b'*(Label_test[x]==0.0)])

# Sort the result list by the probability prediction
resultlist = sorted(resultlist, key=lambda a_entry: a_entry[1])

# Loop over result list and replace probability prediction with integer ranking
for y in range(len(resultlist)):
resultlist[y][1]=y+1

# Re-sort the result list according to the index
resultlist = sorted(resultlist, key=lambda a_entry: a_entry[0])

# Write the result list data to a csv file
print 'Writing a final csv file Kaggle_higgs_prediction_output.csv'
fcsv = open('Kaggle_higgs_prediction_output.csv','w')
fcsv.write('EventId,RankOrder,Class\n')
for line in resultlist:
theline = str(line[0])+','+str(line[1])+','+line[2]+'\n'
fcsv.write(theline)
fcsv.close()
```

Lastly, if you are interested in drawing the distribution as I have, this is the code:

```
from matplotlib import pyplot as plt

Classifier_training_S = gbc.predict_proba(X_train[Y_train>0.5])[:,1].ravel()
Classifier_training_B = gbc.predict_proba(X_train[Y_train<0.5])[:,1].ravel()
Classifier_testing_A = gbc.predict_proba(X_test)[:,1].ravel()

c_max = max([Classifier_training_S.max(),Classifier_training_B.max(),Classifier_testing_A.max()])
c_min = min([Classifier_training_S.min(),Classifier_training_B.min(),Classifier_testing_A.min()])

# Get histograms of the classifiers
Histo_training_S = np.histogram(Classifier_training_S,bins=50,range=(c_min,c_max))
Histo_training_B = np.histogram(Classifier_training_B,bins=50,range=(c_min,c_max))
Histo_testing_A = np.histogram(Classifier_testing_A,bins=50,range=(c_min,c_max))

# Lets get the min/max of the Histograms
AllHistos= [Histo_training_S,Histo_training_B]
h_max = max([histo[0].max() for histo in AllHistos])*1.2
# h_min = max([histo[0].min() for histo in AllHistos])
h_min = 1.0

# Get the histogram properties (binning, widths, centers)
bin_edges = Histo_training_S[1]
bin_centers = ( bin_edges[:-1] + bin_edges[1:]  ) /2.
bin_widths = (bin_edges[1:] - bin_edges[:-1])

# To make error bar plots for the data, take the Poisson uncertainty sqrt(N)
ErrorBar_testing_A = np.sqrt(Histo_testing_A[0])
# ErrorBar_testing_B = np.sqrt(Histo_testing_B[0])

# Draw objects
ax1 = plt.subplot(111)

# Draw solid histograms for the training data
ax1.bar(bin_centers-bin_widths/2.,Histo_training_B[0],facecolor='red',linewidth=0,width=bin_widths,label='B (Train)',alpha=0.5)
ax1.bar(bin_centers-bin_widths/2.,Histo_training_S[0],bottom=Histo_training_B[0],facecolor='blue',linewidth=0,width=bin_widths,label='S (Train)',alpha=0.5)

ff = (1.0*(sum(Histo_training_S[0])+sum(Histo_training_B[0])))/(1.0*sum(Histo_testing_A[0]))

# # Draw error-bar histograms for the testing data
ax1.errorbar(bin_centers, ff*Histo_testing_A[0], yerr=ff*ErrorBar_testing_A, xerr=None, ecolor='black',c='black',fmt='.',label='Test (reweighted)')
# ax1.errorbar(bin_centers, Histo_testing_B[0], yerr=ErrorBar_testing_B, xerr=None, ecolor='red',c='red',fmt='o',label='B (Test)')

# Make a colorful backdrop to show the clasification regions in red and blue
ax1.axvspan(pcut, c_max, color='blue',alpha=0.08)
ax1.axvspan(c_min,pcut, color='red',alpha=0.08)

# Adjust the axis boundaries (just cosmetic)
ax1.axis([c_min, c_max, h_min, h_max])

# Make labels and title
plt.title("Higgs Kaggle Signal-Background Separation")
plt.ylabel("Counts/Bin")

# Make legend with smalll font
for alabel in legend.get_texts():
alabel.set_fontsize('small')

# Save the result to png
plt.savefig("Sklearn_gbc.png")
```

# Machine-Learning Examples: scikit-learn versus TMVA (CERN ROOT)

Basic Overview

When you think machine-learing+python, you usually think of scikit-learn. But there are alternatives! One of these is TMVA, a machine-learning package which is part of CERN’s ROOT analysis software. Conveniently, ROOT classes can be accessed in python with pyROOT.

For scikit-learn users, TMVA might have advantages – such as easy event weighting and interfaces to examine data correlations, and overtraining checks. This, along with the histogram-oriented tools of ROOT make it a good choice for physicists doing high-statistics signal-background separation.

For TMVA users, scikit-learn can provide a straightforward interface to machine learning, with with very easy applications of the trained classifier to new datasets. And of course, you get all the benefits of working with numpy arrays.

The exercises below are intended to show the same project being done in both TMVA and scikit-learn, with the goal of helping people be better able to switch between the two according to their needs. As you can see from the plots, very similar MVA classifiers for the SVM can be produced with either toolkit, but you’ll notice that the code for each is quite different.

Equivalent operations can be done in TMVA and scikit-learn. The goal of this post is to provide a guided example to working with both tools.

Before we get to the examples, here is a little cheat-sheet to convert between TMVA code and scikit-learn code, assuming a data file that contains both the X (Data) and Y(truth). The example here is an SVM with an rbf kernel.

 Action scikit-learn TMVA Load tools import numpy import ROOT from sklearn.svm import SVC ROOT.TMVA.Tools.Instance() Import data data=numpy.load(‘file.npy’) data=TFile.Open(‘file.root’).Get(‘ntuple_name’) Create Factory (n/a, user-guided) factory=ROOT.TMVA.Factory(“TMVAClassification”, … ) Initialize Data (n/a, will access numpy array) factory.AddSignalTree(ntuple); factory.AddBackgroundTree(ntuple) Prep Truth Y=data[:,truth_column]; X=data[:,[other_columns]] factory.PrepareTrainingAndTestTree(‘truth==1′,’truth==0’,OtherOptions) Split Train/Test r =numpy.random.rand(X.shape[0]) (n/a, done in factory OtherOptions with “SplitMode=Random”) Xtrain=X[r>0.5]; Xtest=X[r<=0.5] Ytrain=Y[r>0.5]; Ytest=Y[r<=0.5] Initialize SVM method = SVC(C = 10.0, kernel = ‘rbf’, method = factory.BookMethod( ROOT.TMVA.Types.kSVM, “SVM”, ,tol=0.001,gamma=0.25) “C=10.0:Gamma=0.25:Tol=0.001:VarTransform=Norm” ) Do Training method.fit(X,Y) factory.TrainAllMethods() Do Testing method.predict(Xtest) factory.TestAllMethods()

Making some pseudo-data to play with

Now we will examine an instance of machine-learning in scikit-learn and in TMVA. To do so, we will need some (pseudo) data. In the following code, I make 20K data-points with truth values of 0 or 1, which have some inherent separability because they are distributed according to different gaussian distributions. The data is saved both as a numpy array and as a ROOT ntuple.

```import random, numpy, ROOT

# Initialie root data file
f_root = ROOT.TFile.Open('data.root','RECREATE')
# Declare root ntuple, with 6 x-variables and one y (truth) variable
ntuple = ROOT.TNtuple("data","data","x1:x2:x3:x4:x5:x6:truth")

# Initialize list for numpy array
numpy_data = []

# Geneate and fill signal data (truth = 1.0), xvalues ceneterd at 100
for a in range(20000):
x =  [random.gauss(100.,5.) for i in range(6)]
ntuple.Fill(x[0],x[1],x[2],x[3],x[4],x[5],1.0)
numpy_data.append(x+[1.0])

# Geneate and fill signal data (truth = 0.0), xvalues ceneterd at 95
for a in range(20000):
x =  [random.gauss(95.0,5.) for i in range(6)]
ntuple.Fill(x[0],x[1],x[2],x[3],x[4],x[5],0.0)
numpy_data.append(x+[0.0])

# Save ROOT file
f_root.Write()
f_root.Close()

# Get and Save numpy array
numpy_data_as_array = numpy.array(numpy_data)
numpy.save('data.npy',numpy_data_as_array)

```

Now we have stored locally two identical datasets – ‘data.npy’ and ‘data.root’. It is worth noting that numpy.save() is not the most efficient means of storage, and that the .npy data is 11MB whereas the .root file is 4MB.

A machine-learning example in scikit-learn

Scikit-learn has the benefit of straightforward syntax and vectorized manipulations in numpy, which is useful for complicated splitting of the training and testing sample. Results are available on-call with the predict() and fit() functions. The plot below and its code show how easily one can separate datasets with thousands of points. In this example signal and background can be determined with 90% accuracy.

The output SVM classifier calculated with scikit-learn SVC. Compatibility between dots (testing data) and histograms (training data) indicates that overtraining is not a problem. The output SVM classifier calculated with scikit-learn SVC. Compatibility between dots (testing data) and histograms (training data) indicates that overtraining is not a problem.

```import numpy
from sklearn.svm import SVC
from matplotlib import pyplot as plt

# Get the pseudo-data from the npy file

# Split numpy array into X (data) and Y (truth)
# Y is just last column
Y=npy_data[:,-1]
# X is all but last column
X=npy_data[:,:-1]

# Random number list for splitting training and testing
r =numpy.random.rand(X.shape[0])

# Get training and testing samples, splitting in half (using 0.5)
Xtrain=X[r>0.5]
Xtest=X[r<=0.5]
Ytrain=Y[r>0.5]
Ytest=Y[r<=0.5]

# Create and run the SVC with an rbf kernel
# Some tuning has been done to get the gamma and C values.
method_sklearn = SVC(C = 1.0, kernel = 'rbf',tol=0.001,gamma=0.005)
method_sklearn.fit(Xtrain,Ytrain)

# The percent of correctly classified training and testing data
# should be roughly equivalent (i.e. not overtrained)
# and ideally, near 100%.  We will see about 90% success.
print '---------- Training/Testing info ----------'
print 'Trained SVM correctly classifies',
print 100*(sum(method_sklearn.predict(Xtest)==Ytest))/Xtest.shape[0],
print '%  of the testing data and ',
print 100*(sum(method_sklearn.predict(Xtrain)==Ytrain))/Xtrain.shape[0],
print '%  of the training data. ' print '-'*50

# Now lets view the trained classifier. To start, we will get the values of
# the classifier in training and testing.
# It come out like [[a] [b] [c]], so we ravel it into [a b c]
# Total of four things: Training and Testing For signal and background (S and B)

Classifier_training_S = method_sklearn.decision_function(Xtrain[Ytrain>0.5]).ravel()
Classifier_training_B = method_sklearn.decision_function(Xtrain[Ytrain<0.5]).ravel()
Classifier_testing_S = method_sklearn.decision_function(Xtest[Ytest>0.5]).ravel()
Classifier_testing_B = method_sklearn.decision_function(Xtest[Ytest<0.5]).ravel()

# This will be the min/max of our plots
c_max = 2.0
c_min = -2.0

# Get histograms of the classifiers
Histo_training_S = numpy.histogram(Classifier_training_S,bins=40,range=(c_min,c_max))
Histo_training_B = numpy.histogram(Classifier_training_B,bins=40,range=(c_min,c_max))
Histo_testing_S = numpy.histogram(Classifier_testing_S,bins=40,range=(c_min,c_max))
Histo_testing_B = numpy.histogram(Classifier_testing_B,bins=40,range=(c_min,c_max))

# Lets get the min/max of the Histograms
AllHistos= [Histo_training_S,Histo_training_B,Histo_testing_S,Histo_testing_B]
h_max = max([histo[0].max() for histo in AllHistos])*1.2
h_min = max([histo[0].min() for histo in AllHistos])

# Get the histogram properties (binning, widths, centers)
bin_edges = Histo_training_S[1]
bin_centers = ( bin_edges[:-1] + bin_edges[1:]  ) /2.
bin_widths = (bin_edges[1:] - bin_edges[:-1])

# To make error bar plots for the data, take the Poisson uncertainty sqrt(N)
ErrorBar_testing_S = numpy.sqrt(Histo_testing_S[0])
ErrorBar_testing_B = numpy.sqrt(Histo_testing_B[0])

# Draw objects
ax1 = plt.subplot(111)

# Draw solid histograms for the training data
ax1.bar(bin_centers-bin_widths/2.,Histo_training_S[0],facecolor='blue',linewidth=0,width=bin_widths,label='S (Train)',alpha=0.5)
ax1.bar(bin_centers-bin_widths/2.,Histo_training_B[0],facecolor='red',linewidth=0,width=bin_widths,label='B (Train)',alpha=0.5)

# # Draw error-bar histograms for the testing data
ax1.errorbar(bin_centers, Histo_testing_S[0], yerr=ErrorBar_testing_S, xerr=None, ecolor='blue',c='blue',fmt='o',label='S (Test)')
ax1.errorbar(bin_centers, Histo_testing_B[0], yerr=ErrorBar_testing_B, xerr=None, ecolor='red',c='red',fmt='o',label='B (Test)')

# Make a colorful backdrop to show the clasification regions in red and blue
ax1.axvspan(0.0, c_max, color='blue',alpha=0.08)
ax1.axvspan(c_min,0.0, color='red',alpha=0.08)

# Adjust the axis boundaries (just cosmetic)
ax1.axis([c_min, c_max, h_min, h_max])

# Make labels and title
plt.title("Classification with scikit-learn")
plt.xlabel("Classifier, SVM [rbf kernel, C=1, gamma=0.005]")
plt.ylabel("Counts/Bin")

# Make legend with smalll font
for alabel in legend.get_texts():
alabel.set_fontsize('small')

# Save the result to png
plt.savefig("Sklearn_example.png")

```

The equivalent procedure in TMVA

As you will see from the code, everything relies heavily on the TMVA factory which process the training and testing and outputs all relevant information into a single file, from which the plots can be easily produced, and a cut on the discriminant can be optimized for desired signal purity.

Again we see no evidence of overtraining. TMVA does not classify as signal or background inherently – it is up to the user to choose a discriminant cutoff.

```import ROOT

# Get the data from the ROOT file
root_data = ROOT.TFile.Open('data.root').Get('data')

# Useful output information will be stored in a new root file:
f_out = ROOT.TFile("LearningOutput.root","RECREATE")

# Create the TMVA factory
ROOT.TMVA.Tools.Instance()
factory = ROOT.TMVA.Factory("TMVAClassification", f_out,"AnalysisType=Classification")

# Add the six variables to the TMVA factory as floats
for x in ['x1','x2','x3','x4','x5','x6']:

# Link the signal and background to the root_data ntuple

# cuts defining the signal and background sample
sigCut = ROOT.TCut("truth > 0.5")
bgCut = ROOT.TCut("truth <= 0.5")
# Prepare the training/testing signal/background
factory.PrepareTrainingAndTestTree(sigCut,bgCut,"SplitMode=Random:NormMode=NumEvents:!V")

# Book the SVM method and train/test
method = factory.BookMethod( ROOT.TMVA.Types.kSVM, "SVM", "C=1.0:Gamma=0.005:Tol=0.001:VarTransform=None" )
factory.TrainAllMethods()
factory.TestAllMethods()
factory.EvaluateAllMethods()

# Histogrammed results are already stored in a file for us!
# We will open this file (LearningOutput.root) shortly.
# These are histogram (TH) one-dimensional double (1D) objects
Histo_training_S = ROOT.TH1D('Histo_training_S','S (Train)',40,0.0,1.0)
Histo_training_B = ROOT.TH1D('Histo_training_B','B (Train)',40,0.0,1.0)
Histo_testing_S = ROOT.TH1D('Histo_testing_S','S (Test)',40,0.0,1.0)
Histo_testing_B = ROOT.TH1D('Histo_testing_B','B (Test)',40,0.0,1.0)

# Fetch the trees of events from the root file
TrainTree = f_out.Get("TrainTree")
TestTree = f_out.Get("TestTree")

# Cutting on these objects in the trees will allow to separate true S/B SCut_Tree = 'classID>0.5'
BCut_Tree = 'classID<0.5'

# Now lets project the tree information into those histograms
TrainTree.Project("Histo_training_S","SVM",SCut_Tree)
TrainTree.Project("Histo_training_B","SVM",BCut_Tree)
TestTree.Project("Histo_testing_S","SVM",SCut_Tree)
TestTree.Project("Histo_testing_B","SVM",BCut_Tree)

# Create the color styles
Histo_training_S.SetLineColor(2)
Histo_training_S.SetMarkerColor(2)
Histo_training_S.SetFillColor(2)
Histo_testing_S.SetLineColor(2)
Histo_testing_S.SetMarkerColor(2)
Histo_testing_S.SetFillColor(2)

Histo_training_B.SetLineColor(4)
Histo_training_B.SetMarkerColor(4)
Histo_training_B.SetFillColor(4)
Histo_testing_B.SetLineColor(4)
Histo_testing_B.SetMarkerColor(4)
Histo_testing_B.SetFillColor(4)

# Histogram fill styles
Histo_training_S.SetFillStyle(3001)
Histo_training_B.SetFillStyle(3001)
Histo_testing_S.SetFillStyle(0)
Histo_testing_B.SetFillStyle(0)

# Histogram marker styles
Histo_testing_S.SetMarkerStyle(20)
Histo_testing_B.SetMarkerStyle(20)

# Set titles
Histo_training_S.GetXaxis().SetTitle("Classifier, SVM [rbf kernel, C=1, gamma=0.005]")
Histo_training_S.GetYaxis().SetTitle("Counts/Bin")

# Draw the objects
c1 = ROOT.TCanvas("c1","",800,600)
ROOT.gStyle.SetOptStat(0)
ROOT.gStyle.SetOptTitle(0)
Histo_training_S.Draw("HIST")
Histo_training_B.Draw("HISTSAME")
Histo_testing_S.Draw("EPSAME")
Histo_testing_B.Draw("EPSAME")

# Reset the y-max of the plot
ymax = max([h.GetMaximum() for h in [Histo_training_S,Histo_training_B,Histo_testing_S,Histo_testing_B] ])
ymax *=1.2
Histo_training_S.SetMaximum(ymax)

# Create Legend
c1.cd(1).BuildLegend( 0.42,  0.72,  0.57,  0.88).SetFillColor(0)

l1=ROOT.TLatex()
l1.SetNDC();
l1.DrawLatex(0.26,0.93,"Classification with TMVA (ROOT)")

# Finally, draw the figure
c1.Print('ROOT_example.png')

```

Exploring the TMVA Gui

TMVA has the benefit of outputting results in a neatly-stored ROOT file which can be examined in a GUI. There are pre-existing tools like TMVAGui to look at correlation studies, overtraining checks, and much more.

It is very easy to load the TMVA Gui and make plots from the menu. Simply run:

```import ROOT
ROOT.TMVAGui("LearningOutput.root")
raw_input('Press Enter to exit')
```

The first thing you’ll notice is a window of point-and-click options, to create a suite of plots. These include information about the signal-efficiency and background rejection, histograms of the classifier (like we made earlier), correlation matrices, and parallel-coordinate visualizations of the SVM. Basically, it has everything you could want to see.

Conclusions

Either scikit-learn or TMVA will get the job done in most instances, but each has it’s own useful features.

Some of the benefits of TMVA are automated sample splitting for training and testing, many built-in learning methods, and a GUI with many useful visualizations. If that interests you, TMVA is worth looking into.

If you need to quickly run analysis on common data formats, including text files, and want all the benefits manipulating arrays in numpy, and want the flexibility to do your training and testing individually, scikit-learn is a good choice.