Pandas and ML

# Convert the sex column to a numeric code
data['sex_cat'] = data['sex'].astype('category')
data['sex_code'] = data['sex_cat'].cat.codes
# Do the same for the embarked column
data['embarked_cat'] = data['embarked'].astype('category')
data['embarked_code'] = data['embarked_cat'].cat.codes

This converts a column of "M", "F", etc. into integer codes 0, 1, ... (categories are numbered in sorted order, and missing values get -1).
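
A minimal sketch of the same conversion on a small made-up frame (the values are illustrative, not the original dataset). It also shows one way to recover the label behind each code; note that pandas numbers the categories from 0 and uses -1 for missing values.

```python
import pandas as pd

# Hypothetical sample data standing in for the real 'sex' column
data = pd.DataFrame({'sex': ['M', 'F', 'F', None, 'M']})

data['sex_cat'] = data['sex'].astype('category')
data['sex_code'] = data['sex_cat'].cat.codes

# Categories are sorted, so 'F' gets 0 and 'M' gets 1; the missing value gets -1
code_to_label = dict(enumerate(data['sex_cat'].cat.categories))
codes = data['sex_code'].tolist()
```

Keeping code_to_label around makes it easy to translate model output back into the original labels.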

Split Data into Test & Train

from sklearn.model_selection import train_test_split
X_train_raw, X_test_raw, y_train, y_test = train_test_split(data['array'], data['chid'])
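
As a rough sketch of what the split returns, here it is on a made-up 20-row frame (the frame contents are an assumption for illustration). test_size and random_state are optional arguments; setting random_state makes the shuffle repeatable.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the real frame: 20 rows, one feature and one label
data = pd.DataFrame({'array': list(range(20)), 'chid': [0, 1] * 10})

# With no test_size given, a quarter of the rows go to the test set
X_train_raw, X_test_raw, y_train, y_test = train_test_split(
    data['array'], data['chid'], test_size=0.25, random_state=42)

n_train, n_test = len(X_train_raw), len(X_test_raw)
# Features and labels keep matching index values after the shuffle
aligned = list(X_train_raw.index) == list(y_train.index)
```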

Split Data and Convert a Pandas Column of Lists into a NumPy Array

I have 2 variables in my pandas DataFrame:

  • array
  • chid

array holds a list per row (in this case a list of integers). I need to turn these into a NumPy array so I can feed them into Keras.

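import numpy as np
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(data['array'], data['chid'])
X_train_len = 400  # number of values in each row's list

def my_reshape(data, rows, cols):
    """Collect every row's values and reshape them into a (rows, cols) float32 array."""
    v = []
    for k in data:
        # k[0] is the list of values stored in this row's cell
        for i in k[0]:
            v.append(i)
    return np.array(v, dtype=np.float32).reshape(rows, cols)

# Row counts come from the split itself (the original run had 19 training rows)
X_train = my_reshape(X_train, len(X_train), X_train_len)
X_test = my_reshape(X_test, len(X_test), X_train_len)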
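
If every cell holds a flat list of the same length (an assumption; the helper above reads k[0], so the real data may be nested one level deeper), a shorter route is to let NumPy build the 2-D array directly. The column below is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical column where every cell is a flat list of 4 integers
col = pd.Series([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# One row per sample, one column per list element
X = np.array(col.tolist(), dtype=np.float32)
```

This avoids passing the row and column counts by hand, since the shape follows from the data.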

One-Hot Encoding

We need to represent 3 colours, say Red, Green, Blue.

So 1, 2, 3?

No!

from sklearn.preprocessing import OneHotEncoder

# Data could be Red, Green, Blue encoded as 1, 2, 3
data = [[1], [2], [3]]

# But to an ML model the integers imply that Blue (3) is worth more than Red (1),
# so we need to one-hot encode them

# Every column is treated as categorical by default
ohe = OneHotEncoder(dtype=int)
ohe.fit_transform(data).toarray()

Which becomes

array([[1, 0, 0],
       [0, 1, 0],
       [0, 0, 1]], dtype=int64)
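
The encoder also accepts the colour names themselves, which skips the intermediate 1, 2, 3 step. A small sketch, assuming a reasonably recent scikit-learn; the fitted categories_ attribute shows which output column belongs to which label.

```python
from sklearn.preprocessing import OneHotEncoder

# Colour names passed directly as strings, one sample per row
colours = [['Red'], ['Green'], ['Blue']]

ohe = OneHotEncoder(dtype=int)
encoded = ohe.fit_transform(colours).toarray()

# categories_ holds the sorted labels in output-column order
columns = list(ohe.categories_[0])
```

Because the labels are sorted, the columns come out as Blue, Green, Red rather than in the order the rows were given.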

One-Hot Encoding in Pandas

import pandas as pd
v={'Name':['Tim','Bob','Frank'],'Pay':[3,5,9],'Age':[20,30,40],'EyeColor':['Blue','Green','Grey']}
d=pd.DataFrame.from_dict(v)
d
   Age EyeColor   Name  Pay
0   20     Blue    Tim    3
1   30    Green    Bob    5
2   40     Grey  Frank    9

OK, I now want to represent some values in 1/0 (True/False) notation.

Get_Dummies

pd.get_dummies(d[['Age','Name','Pay','EyeColor']])
   Age  Pay  Name_Bob  Name_Frank  Name_Tim  EyeColor_Blue  EyeColor_Green  EyeColor_Grey
0   20    3         0           0         1              1               0              0
1   30    5         1           0         0              0               1              0
2   40    9         0           1         0              0               0              1

If you only want to encode a specific field (say EyeColor), then:

pd.get_dummies(d[['EyeColor']])

Yields

   EyeColor_Blue  EyeColor_Green  EyeColor_Grey
0              1               0              0
1              0               1              0
2              0               0              1
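
To encode one field but keep the remaining columns in the result, get_dummies takes a columns argument. A short sketch on the same frame; dtype=int is passed explicitly because newer pandas versions return True/False columns by default.

```python
import pandas as pd

d = pd.DataFrame({'Name': ['Tim', 'Bob', 'Frank'],
                  'Pay': [3, 5, 9],
                  'Age': [20, 30, 40],
                  'EyeColor': ['Blue', 'Green', 'Grey']})

# Encode only EyeColor; Name, Pay and Age pass through unchanged
encoded = pd.get_dummies(d, columns=['EyeColor'], dtype=int)
```

The untouched columns come first in the result, followed by one EyeColor_ column per distinct value.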