Some examples of categorical variables:
- A “pet” variable with the values “dog” and “cat”.
- A “color” variable with the values “red”, “green” and “blue”.
- A “place” variable with the values “first”, “second” and “third”.
What is the Problem with Categorical Data?
Some algorithms can work with categorical data directly.
For example, a decision tree can be learned directly from categorical data with no data transform required (this depends on the specific implementation).
Many machine learning algorithms cannot operate on label data directly. They require all input variables and output variables to be numeric.
How to Convert Categorical Data to Numerical Data?
This involves two steps:
- Integer Encoding
- One-Hot Encoding
The binary variables created by one-hot encoding are often called “dummy variables” in other fields, such as statistics.
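The two steps above can be sketched in plain Python. This is a minimal illustration, not a library implementation: the sample list `colors` is hypothetical data built from the “color” variable mentioned earlier, and the alphabetical ordering of the integer codes is an arbitrary choice made here for reproducibility.

```python
# Hypothetical sample data, reusing the "color" variable from the examples.
colors = ["red", "green", "blue", "green"]

# Step 1: integer encoding. Map each distinct category to an integer.
# Sorting makes the mapping deterministic: blue -> 0, green -> 1, red -> 2.
categories = sorted(set(colors))
to_int = {category: index for index, category in enumerate(categories)}
integer_encoded = [to_int[c] for c in colors]

# Step 2: one-hot encoding. Replace each integer with a binary vector
# that has a single 1 at the position of that integer.
one_hot_encoded = [
    [1 if position == value else 0 for position in range(len(categories))]
    for value in integer_encoded
]

print(integer_encoded)  # [2, 1, 0, 1]
print(one_hot_encoded)  # [[0, 0, 1], [0, 1, 0], [1, 0, 0], [0, 1, 0]]
```

Each row of `one_hot_encoded` contains exactly one 1, so no artificial ordering between the categories is implied, unlike with the integer codes alone.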
This is a great solution for this problem:
    import pandas as pd

    def one_hot(df, cols):
        """
        @param df pandas DataFrame
        @param cols a list of columns to encode
        @return a DataFrame with one-hot encoding
        """
        for each in cols:
            # Build the dummy columns for this column and append them to df
            dummies = pd.get_dummies(df[each], prefix=each, drop_first=False)
            df = pd.concat([df, dummies], axis=1)
        return df

    # Example
    df = pd.read_csv('spy_data.csv')
    df2 = one_hot(df, ['uni', 'gender'])
Or a slightly easier way, which also removes the original columns in the same step.
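A minimal sketch of one such shortcut, assuming it relies on the `columns` argument of `pd.get_dummies`, which encodes the listed columns and drops the originals from the result. The small DataFrame below is a hypothetical stand-in for `spy_data.csv`; only the column names `uni` and `gender` come from the example, and the `score` column is invented for illustration.

```python
import pandas as pd

# Hypothetical stand-in for the CSV; 'score' is an invented extra column.
df = pd.DataFrame({
    "uni": ["A", "B", "A"],
    "gender": ["f", "m", "f"],
    "score": [1, 2, 3],
})

# Passing `columns` encodes only those columns and replaces the originals
# with the new dummy columns in a single call.
df2 = pd.get_dummies(df, columns=["uni", "gender"])

print(list(df2.columns))
# ['score', 'uni_A', 'uni_B', 'gender_f', 'gender_m']
```

Columns that are not listed, such as `score` here, pass through unchanged and appear before the generated dummy columns.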