Introduction to Data Preprocessing

What is Data Preprocessing?

Data preprocessing comes right after you have cleaned up your data and done some exploratory data analysis. It is the step where we prepare the data for modeling, because most machine learning models in Python expect numerical input.

Refreshing Pandas Skills

You can skip this section if you know the basics.

Before we proceed with the series, it is important to know the commands that help you get to know your dataset.

import pandas as pd
hiking = pd.read_json("datasets/hiking.json")
print(hiking.head())


print(hiking.columns)


print(hiking.dtypes)

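A few other commands are also handy when you first load a dataset; the snippet below goes beyond the original examples but runs on the same hiking DataFrame.

# Shape of the DataFrame: (number of rows, number of columns)
print(hiking.shape)

# Column names, non-null counts, and dtypes in one summary
hiking.info()

# Summary statistics for the numeric columns
print(hiking.describe())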

Removing Missing Data

Sample Data

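For the examples below, assume a small DataFrame df with columns "A", "B", and "C" that contains a few missing values. The values here are purely illustrative stand-ins, not the original sample data.

import numpy as np
import pandas as pd

# Hypothetical sample data with a few missing values (NaN)
df = pd.DataFrame({
    "A": [1.0, 2.0, np.nan, 4.0, 5.0],
    "B": [10.0, np.nan, 30.0, 40.0, np.nan],
    "C": ["1.5", "2.5", "3.5", "4.5", "5.5"],
})
print(df)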

Dropping rows with null values

print(df.dropna())

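dropna also accepts a couple of useful parameters: subset drops a row only when the listed columns are missing, and how="all" drops a row only when every value in it is missing.

# Drop a row only if column "B" is missing
print(df.dropna(subset=["B"]))

# Drop a row only if every value in the row is missing
print(df.dropna(how="all"))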

Dropping specific rows by passing a list of index labels

print(df.drop([1,2,3]))


Dropping a specific column (here axis=1 specifies that a column, rather than a row, should be dropped)

print(df.drop("A", axis=1))

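If you prefer not to remember the axis argument, the columns keyword does the same thing and also accepts a list of column names.

# Equivalent to df.drop("A", axis=1)
print(df.drop(columns=["A"]))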

Fetching the rows where a specific column is not null.

print(df[df["B"].notnull()])

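Before dropping anything, it is often worth counting how many values are missing in each column.

# Number of missing values per column
print(df.isnull().sum())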

Working with Data Types

While preprocessing data, the datatype of a column is often not what you need. Use the following command to convert a column's datatype.

Remember: Always apply the datatype that fits all of the data in the particular column.

This code sample will help you convert column "C" to the float datatype.

df["C"] = df["C"].astype("float")
print(df.dtypes)
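
If a column contains values that cannot be parsed as the target type, astype will raise an error. One option in that case is pd.to_numeric with errors="coerce", which converts what it can and turns the rest into NaN, so you can treat them like any other missing data.

# Unparseable values become NaN instead of raising an error
df["C"] = pd.to_numeric(df["C"], errors="coerce")
print(df.dtypes)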

Stratified Sampling

A train-test split divides the dataset into one part for training the model and another for testing it. Say the original dataset is 80% class 1 and 20% class 2; you would want a similar distribution in both the train and the test sets to make sure each one is representative.

 # Total "labels" counts
y["labels"].value_counts()


from sklearn.model_selection import train_test_split

# stratify=y keeps the class proportions of y in both the train and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y)
y_train["labels"].value_counts() 
y_test["labels"].value_counts()

Both outputs show roughly the same class proportions as the original dataset, which is exactly what stratify ensures.
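
To compare the distributions as proportions rather than raw counts, pass normalize=True to value_counts.

# Class proportions instead of raw counts
print(y_train["labels"].value_counts(normalize=True))
print(y_test["labels"].value_counts(normalize=True))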

Check out the exercises linked to this post here.

Interested in Machine Learning content? Follow me on Twitter and HashNode.