Handling Missing Data and Categorical Features

By: Bindeshwar Singh Kushwaha


Data Preprocessing Flow

  • Raw Data → Handle Missing Values → Encode Categorical Variables → Feature Scaling → Preprocessed Data

Overview of Data Preprocessing

  • Load Titanic dataset from CSV file
  • Handle missing values using various techniques
  • Encode categorical data for machine learning
  • Save the cleaned dataset to a new CSV file

Step 1: Load the Titanic Dataset

import pandas as pd
df = pd.read_csv('titanic.csv')

Step 2: View the First Few Rows

print(df.head())

Use df.head() to preview the dataset structure.


Step 3: Checking for Missing Values

print(df.isnull().sum())

This prints the number of missing values in each column, which shows where cleaning is needed.
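Raw counts are hard to compare across datasets of different sizes, so it can help to look at the share of missing values per column as well. The sketch below is illustrative only: the `toy` frame and its values are made up and stand in for the Titanic data.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the Titanic data (values are made up)
toy = pd.DataFrame({
    'Age': [22.0, np.nan, 35.0, np.nan],
    'Embarked': ['S', 'C', None, 'S'],
    'Fare': [7.25, 71.28, 8.05, 53.10],
})

# isnull().mean() gives the fraction of missing entries per column
missing_pct = (toy.isnull().mean() * 100).round(1)

# Show the columns with the most missing data first
print(missing_pct.sort_values(ascending=False))
```

A column with a high missing share may be a better candidate for dropping than for filling.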


Step 4: Handle Missing Values

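# Assign the result back: recent pandas versions warn about inplace fillna on a selected column
df['Age'] = df['Age'].fillna(df['Age'].median())
# Drop the rows where Embarked is missing
df.dropna(subset=['Embarked'], inplace=True)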

Fill missing Age values with the column median, and drop the rows where Embarked is missing.
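Dropping rows also discards the values those rows hold in other columns. One possible alternative, which is not the approach used in this step, is to fill Embarked with its most frequent value. A minimal sketch on a made-up `toy` frame (the name `most_common_port` is hypothetical):

```python
import numpy as np
import pandas as pd

# Toy frame with one missing Age and one missing Embarked (values are made up)
toy = pd.DataFrame({
    'Age': [20.0, np.nan, 30.0, 40.0],
    'Embarked': ['S', 'C', None, 'S'],
})

# Numeric column: fill with the median, which is less sensitive to outliers than the mean
toy['Age'] = toy['Age'].fillna(toy['Age'].median())

# Categorical column: fill with the most frequent value instead of dropping the row
most_common_port = toy['Embarked'].mode()[0]
toy['Embarked'] = toy['Embarked'].fillna(most_common_port)

print(toy)
```

All four rows are kept here, at the cost of one guessed Embarked value.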


Step 5: Verify Missing Values

print(df.isnull().sum())

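Confirm that Age and Embarked now show zero missing values. Columns that were not handled in Step 4 (Cabin, in the standard Kaggle Titanic data) may still contain missing values.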


Step 6: Data Overview

print(df.describe())

describe() reports the count, mean, standard deviation, minimum, quartiles, and maximum of each numerical column.


Step 7: Encoding Categorical Data

df['Sex'] = df['Sex'].map({'male': 0, 'female': 1})

Map the Sex column to 0 (male) and 1 (female) so that models expecting numerical input can use it.
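A 0/1 mapping suits a column with two values. For a column with more than two unordered categories, such as Embarked, one-hot encoding is a common choice. The sketch below uses pd.get_dummies on a made-up `toy` frame; it is an illustration, not part of the steps in this post.

```python
import pandas as pd

# Toy frame with one binary and one multi-category column (values are made up)
toy = pd.DataFrame({
    'Sex': ['male', 'female', 'female'],
    'Embarked': ['S', 'C', 'Q'],
})

# Binary column: map each label to 0 or 1
toy['Sex'] = toy['Sex'].map({'male': 0, 'female': 1})

# Multi-category column: one 0/1 indicator column per category
toy = pd.get_dummies(toy, columns=['Embarked'], prefix='Embarked', dtype=int)

print(toy.columns.tolist())
```

Note that map() turns any label missing from the dictionary into NaN, so it is worth re-checking for missing values after encoding.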


Step 8: Save the Cleaned Dataset

df.to_csv('cleaned_titanic.csv', index=False)

Store the cleaned data for future use.
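Passing index=False leaves the row index out of the file, so reading the file back does not produce an extra 'Unnamed: 0' column. A small round-trip sketch, using an in-memory buffer in place of a file on disk and a made-up `toy` frame:

```python
import io
import pandas as pd

# Toy cleaned frame (values are made up)
toy = pd.DataFrame({'Age': [22.0, 38.0], 'Sex': [0, 1]})

# Write to an in-memory text buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)

# Rewind and read the CSV text back into a new DataFrame
buffer.seek(0)
reloaded = pd.read_csv(buffer)

print(reloaded.columns.tolist())
```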


Python Libraries Used

  • pandas – Data manipulation
  • numpy – Numerical operations
  • scikit-learn – Machine learning preprocessing tools
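The preprocessing flow lists feature scaling, which the steps above do not show. A minimal sketch of standardization (zero mean, unit variance) with pandas alone, on a made-up `toy` frame. scikit-learn's StandardScaler applies the same formula, using the population standard deviation.

```python
import numpy as np
import pandas as pd

# Toy numeric features (values are made up)
toy = pd.DataFrame({'Age': [20.0, 30.0, 40.0], 'Fare': [10.0, 20.0, 60.0]})
numeric_cols = ['Age', 'Fare']

# Per-column mean and population standard deviation (ddof=0)
means = toy[numeric_cols].mean()
stds = toy[numeric_cols].std(ddof=0)

# Standardize: subtract the mean, divide by the standard deviation
scaled = (toy[numeric_cols] - means) / stds

print(scaled.round(3))
```

When a model is trained on one split and evaluated on another, the means and standard deviations should come from the training split only.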


Thank You

©Postnetwork-All rights reserved.