Chapter 1.3: Data Magic with Pandas

Load, Explore, and Transform Data Like a Pro

๐Ÿš€ PROJECT 1.3 | Difficulty: Intermediate | Time: 20 minutes

๐Ÿ“Š Complexity Level: Intermediate โญโญ

Learn the powerful pandas library for data manipulation. Pandas is used by data scientists, analysts, and developers worldwide!

๐Ÿ’ป Interactive Options:

  • ๐Ÿ““ Open in JupyterLite - Full Jupyter environment in your browser
  • โ–ถ๏ธ Run code directly below - All code cells on this page are editable and runnable
  • ๐Ÿ“ฅ Download Notebook (Challenge) - For use in local Jupyter or Google Colab

๐Ÿ“– Introduction: What is Pandas?

Pandas is Pythonโ€™s most popular library for working with data. Think of it as Excel on steroidsโ€”you can:

  • Load data from CSV files, Excel spreadsheets, databases, and more
  • Filter and sort data based on conditions
  • Calculate statistics and aggregations
  • Transform and clean messy data
  • Merge multiple datasets together

๐ŸŽฏ Real-World Use: Data scientists at companies like Netflix, Spotify, and Uber use pandas every day to analyze user behavior, optimize recommendations, and make business decisions based on millions of data points!

๐Ÿ”‘ Core Pandas Concepts

DataFrames: Your Data Spreadsheet

A DataFrame is pandasโ€™ main data structure. Itโ€™s like a table with rows and columns:

import pandas as pd
import numpy as np

# Create a simple DataFrame
data = {
    'Name': ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve'],
    'Age': [22, 21, 23, 22, 20],
    'Major': ['CS', 'Math', 'CS', 'Biology', 'CS'],
    'GPA': [3.8, 3.6, 3.9, 3.7, 3.5]
}

df = pd.DataFrame(data)
print("๐Ÿ“Š Student DataFrame:")
print(df)

Series: A Single Column

Each column in a DataFrame is a Series:

# Access a single column (returns a Series)
names = df['Name']
print("Names column:")
print(names)
print(f"\nType: {type(names)}")

๐Ÿ“Š Loading Real Data

Letโ€™s create a realistic dataset to work with:

# Create a larger, more realistic dataset
np.random.seed(42)  # For reproducible results

students_data = {
    'StudentID': range(1001, 1051),
    'Name': [f'Student{i}' for i in range(1, 51)],
    'Age': np.random.randint(18, 25, 50),
    'Major': np.random.choice(['CS', 'Math', 'Biology', 'Physics', 'Engineering'], 50),
    'GPA': np.round(np.random.uniform(2.5, 4.0, 50), 2),
    'Credits': np.random.randint(30, 120, 50),
    'Scholarship': np.random.choice([True, False], 50)
}

students_df = pd.DataFrame(students_data)

print("๐Ÿ“š Large Student Dataset:")
print(students_df.head())  # Show first 5 rows
print(f"\nDataset shape: {students_df.shape[0]} rows ร— {students_df.shape[1]} columns")

๐Ÿ“ The head() Method

df.head(n) shows the first n rows (default is 5). This is useful for large datasets!

Similarly, df.tail(n) shows the last n rows.

๐Ÿ” Exploring Your Data

Basic Information

# Get dataset information
print("โ„น๏ธ Dataset Info:")
print(f"Shape: {students_df.shape}")
print(f"Columns: {list(students_df.columns)}")
print(f"\nData types:")
print(students_df.dtypes)

Statistical Summary

# Get statistical summary of numerical columns
print("๐Ÿ“Š Statistical Summary:")
print(students_df.describe())

Checking for Missing Data

# Check for missing values
print("๐Ÿ” Missing Values:")
print(students_df.isnull().sum())

๐ŸŽฏ Filtering and Selecting Data

Filter by Condition

# Find CS students with GPA > 3.5
top_cs_students = students_df[(students_df['Major'] == 'CS') & (students_df['GPA'] > 3.5)]

print("๐ŸŒŸ Top CS Students (GPA > 3.5):")
print(top_cs_students[['Name', 'GPA', 'Major']])
print(f"\nFound {len(top_cs_students)} students!")

Select Specific Columns

# Select only Name, Major, and GPA
subset = students_df[['Name', 'Major', 'GPA']]
print("๐Ÿ“‹ Selected Columns:")
print(subset.head())

Sorting Data

# Sort by GPA in descending order
sorted_by_gpa = students_df.sort_values('GPA', ascending=False)
print("๐Ÿ† Top 10 Students by GPA:")
print(sorted_by_gpa[['Name', 'Major', 'GPA']].head(10))

๐Ÿงฎ Calculations and Aggregations

Group By

# Calculate average GPA by major
major_stats = students_df.groupby('Major')['GPA'].agg(['mean', 'min', 'max', 'count'])
major_stats = major_stats.round(2)

print("๐Ÿ“Š GPA Statistics by Major:")
print(major_stats)

Adding Calculated Columns

# Add a column for years until graduation (assuming 120 credits needed)
students_df['CreditsRemaining'] = 120 - students_df['Credits']
students_df['YearsRemaining'] = (students_df['CreditsRemaining'] / 30).round(1)

print("๐ŸŽ“ Graduation Timeline:")
print(students_df[['Name', 'Credits', 'YearsRemaining']].head(10))

๐Ÿ› ๏ธ Data Transformation

Creating Categories

# Categorize students by GPA
def gpa_category(gpa):
    if gpa >= 3.7:
        return 'Excellent'
    elif gpa >= 3.3:
        return 'Good'
    elif gpa >= 3.0:
        return 'Satisfactory'
    else:
        return 'Needs Improvement'

students_df['Performance'] = students_df['GPA'].apply(gpa_category)

print("๐Ÿ“ˆ Performance Categories:")
print(students_df['Performance'].value_counts())

Value Counts

# Count students by major
print("๐Ÿ‘ฅ Students per Major:")
print(students_df['Major'].value_counts())

print("\n๐Ÿ’ฐ Scholarship Distribution:")
print(students_df['Scholarship'].value_counts())

๐Ÿ’พ Saving Your Data

# Save to CSV (this would run locally)
students_df.to_csv('student_data.csv', index=False)

# Save to Excel
students_df.to_excel('student_data.xlsx', index=False)

# Load it back
loaded_df = pd.read_csv('student_data.csv')

๐Ÿ’ก Pro Tip: Always use index=False when saving to CSV unless you specifically want to save the row numbers!

๐ŸŽฎ Practice Challenges

๐Ÿ† Challenge 1: Data Detective

Using the students_df dataset:

  1. Find all students with more than 90 credits
  2. Calculate the percentage of students on scholarship
  3. Find which major has the highest average GPA

Try it yourself!

# Your code here!
# Challenge 1
high_credits = students_df[students_df['Credits'] > 90]
print(f"Students with 90+ credits: {len(high_credits)}")

# Challenge 2
scholarship_pct = (students_df['Scholarship'].sum() / len(students_df)) * 100
print(f"Percentage on scholarship: {scholarship_pct:.1f}%")

# Challenge 3
major_gpa = students_df.groupby('Major')['GPA'].mean().sort_values(ascending=False)
print(f"\nHighest GPA Major: {major_gpa.index[0]} ({major_gpa.iloc[0]:.2f})")

๐Ÿ† Challenge 2: Create Your Own Analysis

Create a new DataFrame with at least 3 columns and 10 rows about a topic youโ€™re interested in (movies, games, sports, etc.). Then:

  • Filter the data based on some condition
  • Sort by one column
  • Calculate some statistics

๐Ÿš€ Whatโ€™s Next?

Now that you can manipulate data with pandas, in the next slide weโ€™ll:

  • Create beautiful visualizations with matplotlib
  • Make bar charts, line graphs, and scatter plots
  • Visualize the student data we just analyzed!