Machine Learning - Categorical Values with Keras and pandas

Categorical Data

Categorical data are variables that contain label values rather than numeric values. Some examples:

  • A “pet” variable with the values:
    • “dog” and “cat”.
  • A “color” variable with the values:
    • “red”, “green” and “blue”.
  • A “place” variable with the values:
    • “first”, “second” and “third”.

What is the Problem with Categorical Data?

Some algorithms can work with categorical data directly.

For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).

However, many machine learning algorithms cannot operate on label data directly: they require all input and output variables to be numeric.

How to Convert Categorical Data to Numerical Data?

The conversion typically involves two steps:

  • Integer Encoding
  • One-Hot Encoding
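The two steps can be sketched in plain Python. This is only an illustration: the sample values reuse the “color” variable from above, and the choice to number the categories in alphabetical order is an assumption, since any consistent mapping would do.

```python
# Sample values of the "color" variable (illustrative data only).
colors = ["red", "green", "blue", "green"]

# Step 1: integer encoding. Each distinct category gets an integer.
# Alphabetical order is an arbitrary but reproducible choice.
categories = sorted(set(colors))                 # ['blue', 'green', 'red']
to_int = {cat: i for i, cat in enumerate(categories)}
int_encoded = [to_int[c] for c in colors]

# Step 2: one-hot encoding. Each integer becomes a binary vector with a
# single 1 at the position of its category.
one_hot_encoded = [
    [1 if i == code else 0 for i in range(len(categories))]
    for code in int_encoded
]

print(int_encoded)
print(one_hot_encoded)
```

Integer encoding alone implies an ordering (blue < green < red) that does not exist for a variable like “color”, which is why the one-hot step usually follows it.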

Binary Variables

One-hot encoding replaces each category with its own binary (0/1) variable. These binary variables are often called “dummy variables” in other fields, such as statistics.
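As a small sketch of what these dummy variables look like, the “pet” variable from earlier can be encoded with pd.get_dummies. The sample values are made up, and dtype=int is passed only so the output shows 0/1 regardless of the pandas version's default.

```python
import pandas as pd

# Illustrative values for the "pet" variable.
pets = pd.Series(["dog", "cat", "dog"], name="pet")

# One binary column per category: pet_cat and pet_dog.
dummies = pd.get_dummies(pets, prefix="pet", dtype=int)
print(dummies)

# In statistics one category is often dropped, because it is implied by the
# others; drop_first=True removes the first category (here "cat").
reduced = pd.get_dummies(pets, prefix="pet", drop_first=True, dtype=int)
print(reduced)
```

Whether to drop a category depends on the model: linear models without regularization usually benefit from it, while many other algorithms work fine with the full set of columns.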

One-Hot Encoding with pandas

The following helper is a convenient solution to this problem: it uses pd.get_dummies to append one-hot columns for each listed column.

import pandas as pd

def one_hot(df, cols):
    """
    One-hot encode the given columns and append the results to the DataFrame.

    @param df pandas DataFrame
    @param cols a list of columns to encode
    @return a new DataFrame with the one-hot columns appended (originals are kept)
    """
    for each in cols:
        # One indicator column per category, named "<column>_<value>"
        dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
        df = pd.concat([df, dummies], axis=1)
    return df

# Example
df = pd.read_csv('spy_data.csv')
df2 = one_hot(df, ['uni', 'gender'])

Alternatively, drop the original columns after encoding so that only the one-hot columns remain:

cols_to_encode = ['uni', 'gender', 'department']
# Keep the result; drop() returns a new DataFrame rather than modifying df
df2 = one_hot(df, cols_to_encode).drop(columns=cols_to_encode)
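pandas can also encode and remove the original columns in a single call, by passing the column names to pd.get_dummies through its columns parameter. The sketch below uses a small hand-made DataFrame in place of spy_data.csv; the “age” column and all values are hypothetical and serve only to show which columns are left untouched.

```python
import pandas as pd

# Hypothetical stand-in for spy_data.csv; the values are made up.
df = pd.DataFrame({
    "age": [34, 41, 29],
    "uni": ["MIT", "Oxford", "MIT"],
    "gender": ["F", "M", "F"],
})

# columns= selects which columns to encode. The originals are replaced by
# the indicator columns, so no separate drop() is needed.
encoded = pd.get_dummies(df, columns=["uni", "gender"], dtype=int)

print(list(encoded.columns))
```

Columns not listed (here “age”) are passed through unchanged, and each new column is prefixed with the name of the column it came from.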