Pandas and Snow White

``` {.sourceCode .python}
import pandas as pd
import numpy as np
%cd ~/Dev/Python/SnowWhite
```

::: {.parsed-literal}
/home/tim/Dev/Python/SnowWhite
:::

``` {.sourceCode .python}
# The CSV has a true/false label in column 0 and the hours people work in columns 1-15
data = np.genfromtxt('ts_learn.csv', delimiter=',', usecols=(1,2,3,4,5,6,7,8,9,10,11,12,13,14,15))
target = np.genfromtxt('ts_learn.csv', delimiter=',', usecols=(0))
```
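
pandas is imported above but not actually used for the load; an equivalent read with pandas.read_csv might look like the sketch below (an assumption, not run here: it presumes ts_learn.csv has no header row and that the label column is already numeric 0/1, as the outputs further down suggest).

``` {.sourceCode .python}
# Hypothetical alternative load using pandas (same column split as above)
df = pd.read_csv('ts_learn.csv', header=None)
data = df.iloc[:, 1:16].values    # hours worked, columns 1-15
target = df.iloc[:, 0].values     # label from column 0
```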

``` {.sourceCode .python}
data.shape
```

::: {.parsed-literal}
(81, 15)
:::

``` {.sourceCode .python}
target.shape
```

::: {.parsed-literal}
(81,)
:::

``` {.sourceCode .python}
target[0] = 0   # set the first label to 0, then check which distinct label values are present
set(target)
```

::: {.parsed-literal}
{0.0, 1.0}
:::

``` {.sourceCode .python}
from sklearn.naive_bayes import GaussianNB
```

``` {.sourceCode .python}
classifier = GaussianNB()
```

``` {.sourceCode .python}
# Fit the classifier on the data, using the target labels
classifier.fit(data, target)
```

::: {.parsed-literal}
GaussianNB()
:::

Let's test the classifier on the first sample:

``` {.sourceCode .python}
classifier.predict(data[:1])   # predict() expects a 2-D array, so pass the first row as a one-row slice
```

::: {.parsed-literal}
array([ 0.])
:::

``` {.sourceCode .python}
from sklearn.model_selection import train_test_split
```

Using 40% of the data set for testing:

``` {.sourceCode .python}
train, test, t_train, t_test = train_test_split(data, target, test_size=0.4, random_state=0)
```
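
With 81 samples and test_size=0.4, this should leave roughly 48 rows for training and 33 for testing (the classification report further down indeed shows a total support of 33). A quick sanity check, as a sketch:

``` {.sourceCode .python}
# Check the split sizes (expected for 81 samples: (48, 15) train, (33, 15) test)
print(train.shape, test.shape)
```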

``` {.sourceCode .python}
classifier.fit(train, t_train)              # train on 60% of the data
accuracy = classifier.score(test, t_test)   # test on the held-out 40%
print("We are %f%% Accurate " % (100.0 * accuracy))
```

::: {.parsed-literal}
We are 87.878788% Accurate 
:::

Another tool to estimate the performance of a classifier is the confusion matrix. Using the metrics module it is easy to compute and print. Note that because we pass the predictions as the first argument here, each row of the matrix below represents the instances in a predicted class, while each column represents the instances in an actual class.

``` {.sourceCode .python}
from sklearn.metrics import confusion_matrix
confusion_matrix(classifier.predict(test), t_test)
```

::: {.parsed-literal}
array([[27,  4],
       [ 0,  2]])
:::
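
As a quick sanity check (a minimal sketch, reusing only the numbers printed above), the diagonal of this matrix holds the correctly classified samples, so the overall accuracy can be recovered directly from it:

``` {.sourceCode .python}
import numpy as np

# Confusion matrix as printed above (rows: predicted class, columns: actual class)
cm = np.array([[27, 4],
               [ 0, 2]])

correct = np.trace(cm)         # 27 + 2 = 29 samples classified correctly
total = cm.sum()               # 33 samples in the test set
print(correct / float(total))  # ~0.8788, matching the accuracy reported earlier
```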

``` {.sourceCode .python}
# A function that gives us a complete report on the performance of the classifier is also available.
# Our classes are 0 and 1.
from sklearn.metrics import classification_report
print(classification_report(classifier.predict(test), t_test))
```

::: {.parsed-literal}
             precision    recall  f1-score   support

        0.0       1.00      0.87      0.93        31
        1.0       0.33      1.00      0.50         2

avg / total       0.96      0.88      0.90        33
:::

Here is a summary of the measures used by the report (a small worked example follows):

Precision: the proportion of the predicted positive cases that were correct.

Recall (also called the true positive rate): the proportion of positive cases that were correctly identified.

F1-Score: the harmonic mean of precision and recall.
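
To make these concrete, here is a minimal worked sketch for the 0.0 row of the report above, reading the counts off the confusion matrix with the same argument order as the classification_report call:

``` {.sourceCode .python}
# Counts for class 0.0, taken from the confusion matrix printed earlier
tp = 27.0   # 0.0 cases that were matched correctly
fp = 0.0    # cases assigned to 0.0 that should not have been
fn = 4.0    # 0.0 cases that were missed

precision = tp / (tp + fp)                          # 1.00
recall = tp / (tp + fn)                             # 27 / 31 ~= 0.87
f1 = 2 * precision * recall / (precision + recall)  # ~0.93
print(precision, recall, f1)
```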

``` {.sourceCode .python}
from sklearn.model_selection import cross_val_score
# cross-validation with 6 folds
scores = cross_val_score(classifier, data, target, cv=6)
scores
```

::: {.parsed-literal}
array([ 0.75 , 0.66666667, 1. , 0.5 , 1. , 1. ])
:::

As we can see, the output of this implementation is a vector that contains the accuracy obtained with each iteration of the model. We can easily compute the mean accuracy as follows:

``` {.sourceCode .python}
from numpy import mean
mean(scores)
```

::: {.parsed-literal}
0.81944444444444453
:::
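
The per-fold accuracies vary quite a bit on a data set this small (from 0.5 up to 1.0), so it can be worth reporting the spread alongside the mean; a minimal sketch:

``` {.sourceCode .python}
# Report mean accuracy together with its spread across the 6 folds
print("Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2))
```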