# data_processing.py
import pandas as pd


def load_data(filepath):
    """Load data from CSV with error handling"""
    try:
        df = pd.read_csv(filepath)
        print(f"✅ Loaded {len(df)} rows from {filepath}")
        return df
    except FileNotFoundError:
        print(f"❌ Error: {filepath} not found")
        return None
    except Exception as e:
        print(f"❌ Error loading data: {e}")
        return None


def clean_data(df):
    """Clean and validate dataframe"""
    # Remove duplicates (copy so later assignments don't touch the caller's frame)
    df = df.drop_duplicates().copy()
    # Handle missing values: fill numeric gaps with each column's median
    numeric_columns = df.select_dtypes(include=['number']).columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    print(f"✅ Cleaned data: {len(df)} rows remaining")
    return df


def add_calculated_columns(df):
    """Add derived columns for analysis"""
    # Example: Add age categories
    # Bins are right-inclusive: (0, 20], (20, 22], (22, 25], (25, 100]
    if 'Age' in df.columns:
        df['AgeGroup'] = pd.cut(df['Age'],
                                bins=[0, 20, 22, 25, 100],
                                labels=['18-20', '21-22', '23-25', '26+'])
    return df


# Example usage
print("Data Processing Module Ready!")
Chapter 1.6: Advanced Topics & Next Steps
Taking Your UV and Data Analysis Skills Further
PROJECT 1.6 | Difficulty: Intermediate-Advanced | Time: 15 minutes
Complexity Level: Intermediate-Advanced ⭐⭐⭐
Learn advanced UV features and best practices, and discover what's next in your data analysis journey!
Advanced UV Features
Now that you're comfortable with UV basics, let's explore some powerful advanced features!
1. Lock Files for Reproducibility
UV automatically creates a uv.lock file that freezes exact versions:
# Your uv.lock ensures everyone gets the EXACT same versions
uv sync  # Installs exactly what's in uv.lock
# Update all dependencies to the latest compatible versions
uv lock --upgrade
# Update just one package
uv lock --upgrade-package pandas
Why Lock Files Matter
Imagine your project works perfectly on your computer, but when your teammate tries to run it, it crashes! Often this happens because they have different package versions.
Lock files solve this by recording exact versions, ensuring everyone has identical environments.
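To make the "different package versions" problem concrete, here is a minimal sketch that compares a set of pinned versions against what an environment actually has. The helper names (installed_version, find_mismatches) and the version numbers are hypothetical, chosen only for illustration; uv does this kind of bookkeeping for you through uv.lock and uv sync.

```python
from importlib import metadata


def installed_version(package):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def find_mismatches(pinned, lookup=installed_version):
    """Compare pinned versions against what `lookup` reports for each package."""
    mismatches = {}
    for name, wanted in pinned.items():
        found = lookup(name)
        if found != wanted:
            mismatches[name] = (wanted, found)
    return mismatches


# Hypothetical pins, similar to what a lock file might record
pinned = {"pandas": "2.2.2", "matplotlib": "3.9.0"}
# Hypothetical teammate environment with an older matplotlib
teammate_env = {"pandas": "2.2.2", "matplotlib": "3.7.1"}

print(find_mismatches(pinned, lookup=teammate_env.get))
# → {'matplotlib': ('3.9.0', '3.7.1')}
```

Calling find_mismatches(pinned) without the lookup argument checks your own environment instead of the simulated one.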
2. Dev Dependencies
Separate tools you need for development from production dependencies:
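# Add development-only packages
uv add --dev pytest pytest-cov black ruff
# Add regular dependencies
uv add pandas matplotlib
Your pyproject.toml will keep them separate. Recent uv releases record --dev packages in a [dependency-groups] table (older releases used [tool.uv] dev-dependencies instead):
[project]
dependencies = [
    "pandas>=2.0.0",
    "matplotlib>=3.7.0",
]
[dependency-groups]
dev = [
    "pytest>=7.4.0",
    "black>=23.0.0",
    # pytest-cov and ruff are recorded here as well
]
3. Python Version Management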
UV can manage Python versions too!
# Install a specific Python version
uv python install 3.12
# Use it in your project
uv python pin 3.12
# List available Python versions
uv python list
4. Scripts and Tools
Run Python scripts without installing globally:
# Run a tool once (doesn't install permanently)
uvx ruff check .
# Run a script with its dependencies
uv run --with requests python fetch_data.py
Pro Tip: uvx is like npx for Python. It runs tools without installing them!
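As an alternative to passing --with on every run, uv can read dependencies declared inside the script itself through inline script metadata (PEP 723). The sketch below shows what a hypothetical fetch_data.py might look like. The function names, the data/raw target folder, and the URL argument are assumptions for illustration, not part of the original example.

```python
# /// script
# requires-python = ">=3.10"
# dependencies = ["requests"]
# ///
"""fetch_data.py: download a CSV file (hypothetical example script)."""
import sys
from pathlib import Path
from urllib.parse import urlparse


def output_name(url):
    """Derive a local file name from the last path segment of the URL."""
    name = Path(urlparse(url).path).name
    return name or "download.csv"


def fetch(url, target_dir=Path("data/raw")):
    """Download url into target_dir and return the saved path."""
    import requests  # imported lazily; declared in the metadata block above

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    target_dir.mkdir(parents=True, exist_ok=True)
    destination = target_dir / output_name(url)
    destination.write_bytes(response.content)
    return destination


if __name__ == "__main__":
    if len(sys.argv) == 2:
        print(f"Saved {fetch(sys.argv[1])}")
    else:
        print("Usage: uv run fetch_data.py <url>")
```

With the metadata block in place, uv run fetch_data.py <url> should set up a temporary environment containing requests before running the script, so the --with flag is no longer needed.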
Best Practices for Data Analysis Projects
Project Structure
Organize your projects like a pro:
my-analysis-project/
├── data/
│   ├── raw/          # Original, immutable data
│   ├── processed/    # Cleaned data
│   └── outputs/      # Analysis results
├── notebooks/        # Jupyter notebooks for exploration
├── src/
│   ├── data_processing.py
│   ├── analysis.py
│   └── visualization.py
├── tests/            # Unit tests
├── README.md
├── pyproject.toml
└── uv.lock
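The tests/ folder in this layout is where pytest (one of the dev dependencies added earlier) looks for checks on your code. Here is a minimal sketch of such a test file. To stay self-contained it inlines a quiet copy of clean_data; in a real project you would import the function from src/data_processing.py instead. The file name tests/test_data_processing.py and the sample values are assumptions.

```python
# tests/test_data_processing.py (hypothetical file name)
import pandas as pd


def clean_data(df):
    """Drop duplicate rows, then fill numeric gaps with each column's median."""
    df = df.drop_duplicates().copy()
    numeric_columns = df.select_dtypes(include="number").columns
    df[numeric_columns] = df[numeric_columns].fillna(df[numeric_columns].median())
    return df


def test_clean_data_removes_duplicate_rows():
    raw = pd.DataFrame({"Name": ["Ana", "Ana", "Ben"], "GPA": [3.5, 3.5, 3.0]})
    assert len(clean_data(raw)) == 2


def test_clean_data_fills_missing_numbers_with_median():
    raw = pd.DataFrame({"Name": ["Ana", "Ben", "Cy"], "GPA": [3.0, None, 4.0]})
    cleaned = clean_data(raw)
    assert cleaned["GPA"].isna().sum() == 0
    # The median of 3.0 and 4.0 fills the gap in row 1
    assert cleaned.loc[1, "GPA"] == 3.5
```

Run the tests from the project root with: uv run pytest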
Code Organization Example
Analysis Pipeline
# Create a reusable analysis pipeline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd  # needed for the example DataFrame below


class DataAnalysisPipeline:
    """Reusable pipeline for data analysis"""

    def __init__(self, data):
        self.data = data
        self.results = {}

    def analyze(self):
        """Run complete analysis"""
        self.descriptive_stats()
        self.correlation_analysis()
        self.group_analysis()
        return self.results

    def descriptive_stats(self):
        """Calculate descriptive statistics"""
        # numeric_only=True skips text columns such as 'Major'
        self.results['mean'] = self.data.mean(numeric_only=True)
        self.results['median'] = self.data.median(numeric_only=True)
        self.results['std'] = self.data.std(numeric_only=True)
        print("✅ Descriptive statistics calculated")

    def correlation_analysis(self):
        """Analyze correlations"""
        numeric_data = self.data.select_dtypes(include=[np.number])
        self.results['correlations'] = numeric_data.corr()
        print("✅ Correlation analysis complete")

    def group_analysis(self):
        """Group-based analysis"""
        # Example: if 'Major' column exists
        if 'Major' in self.data.columns:
            self.results['by_major'] = self.data.groupby('Major').mean(numeric_only=True)
            print("✅ Group analysis complete")

    def visualize(self):
        """Create summary visualizations"""
        fig, axes = plt.subplots(2, 2, figsize=(12, 10))
        fig.suptitle('Analysis Summary', fontsize=16, fontweight='bold')
        # Customize based on your data (plots up to four numeric columns)
        numeric_cols = self.data.select_dtypes(include=[np.number]).columns[:4]
        for idx, col in enumerate(numeric_cols):
            ax = axes[idx // 2, idx % 2]
            self.data[col].hist(ax=ax, bins=20, edgecolor='black')
            ax.set_title(f'Distribution of {col}')
            ax.grid(True, alpha=0.3)
        plt.tight_layout()
        plt.show()
        print("✅ Visualizations created")


# Example usage
sample_data = pd.DataFrame({
    'A': np.random.normal(100, 15, 50),
    'B': np.random.normal(75, 10, 50),
    'C': np.random.normal(85, 12, 50),
    'Major': np.random.choice(['CS', 'Math', 'Bio'], 50)
})
pipeline = DataAnalysisPipeline(sample_data)
results = pipeline.analyze()
print("\nPipeline Results:")
print(f"Mean values:\n{results['mean']}")
Beyond the Basics: Next Tools to Learn
1. Seaborn - Beautiful Statistical Plots
# Seaborn makes complex visualizations easy
# (Note: Seaborn would need to be installed first)
# Example of what you can do:
"""
import seaborn as sns
# Beautiful distribution plot
sns.histplot(data=df, x='GPA', hue='Major', multiple='stack')
# Correlation heatmap
sns.heatmap(df.corr(numeric_only=True), annot=True, cmap='coolwarm')
# Pair plot to see all relationships
sns.pairplot(df, hue='Major')
"""
print("Seaborn creates beautiful statistical visualizations!")
print("Install with: uv add seaborn")
2. Plotly - Interactive Visualizations
# Plotly creates interactive plots you can explore
# Example of what you can create:
"""
import plotly.express as px
# Interactive scatter plot
fig = px.scatter(df, x='StudyHours', y='GPA',
color='Major', size='Age',
hover_data=['Name'],
title='Interactive Student Performance')
fig.show()
# Interactive box plot
fig = px.box(df, x='Major', y='GPA', color='Scholarship')
fig.show()
"""
print("Plotly creates interactive charts you can zoom, pan, and explore!")
print("Install with: uv add plotly")
3. Scikit-learn - Machine Learning
# Machine learning for predictions
# Example workflow:
"""
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
# Predict GPA based on study hours and attendance
X = df[['StudyHours', 'Attendance']]
y = df['GPA']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
model = LinearRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
mse = mean_squared_error(y_test, predictions)
print(f'Model MSE: {mse:.4f}')
"""
print("Machine Learning can predict student performance!")
print("Install with: uv add scikit-learn")
4. Streamlit - Build Web Apps
# Turn your analysis into an interactive web app
# Create a file: streamlit_app.py
import streamlit as st
import pandas as pd
import matplotlib.pyplot as plt

st.title("Student Performance Dashboard")
uploaded_file = st.file_uploader("Upload your CSV file")
if uploaded_file:
    df = pd.read_csv(uploaded_file)
    st.write(df.head())
    st.subheader("GPA Distribution")
    fig, ax = plt.subplots()
    ax.hist(df['GPA'], bins=20)
    st.pyplot(fig)
    # Interactive filters
    major = st.selectbox("Select Major", df['Major'].unique())
    filtered_df = df[df['Major'] == major]
    st.write(f"Average GPA for {major}: {filtered_df['GPA'].mean():.2f}")
# Run with: uv run streamlit run streamlit_app.py
Streamlit turns your Python scripts into interactive web apps in minutes!
Install with: uv add streamlit
Run with: uv run streamlit run streamlit_app.py
Real-World Project Ideas
Ready to build something amazing? Try these:
Project Ideas for Your Portfolio
Beginner Projects
- Personal Finance Tracker
  - Track spending by category
  - Visualize monthly trends
  - Calculate savings rate
- Weather Data Analysis
  - Load historical weather data
  - Find patterns and trends
  - Predict tomorrow's temperature
- Movie/Book Ratings Analyzer
  - Load your ratings from a CSV
  - Find what genres you prefer
  - Compare with friends' ratings
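To get started on the Personal Finance Tracker, here is a small sketch of two of its pieces: spending by category and savings rate. The column names (month, category, amount), the sample transactions, and the convention that income is positive and spending is negative are all assumptions for illustration.

```python
import pandas as pd

# Hypothetical transactions; a real tracker would load these with pd.read_csv
transactions = pd.DataFrame({
    "month": ["2024-01", "2024-01", "2024-01", "2024-02", "2024-02"],
    "category": ["Income", "Rent", "Food", "Income", "Rent"],
    "amount": [2000.0, -800.0, -300.0, 2000.0, -800.0],
})


def spending_by_category(df):
    """Total spending per category, reported as positive amounts."""
    expenses = df[df["amount"] < 0]
    return expenses.groupby("category")["amount"].sum().abs()


def savings_rate(df):
    """Share of income left after spending: (income - spending) / income."""
    income = df.loc[df["amount"] > 0, "amount"].sum()
    spending = -df.loc[df["amount"] < 0, "amount"].sum()
    if income == 0:
        return 0.0
    return (income - spending) / income


print(spending_by_category(transactions))
print(f"Savings rate: {savings_rate(transactions):.1%}")
```

From here, grouping by the month column as well would give the monthly trends mentioned in the list.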
Intermediate Projects
- Sports Statistics Dashboard
  - Analyze player performance
  - Compare teams
  - Visualize season trends
- Social Media Analytics
  - Analyze post engagement
  - Find best posting times
  - Identify trending topics
- Health & Fitness Tracker
  - Log workouts and meals
  - Track progress over time
  - Calculate fitness metrics
Advanced Projects
- Stock Market Analysis
  - Load financial data
  - Calculate indicators
  - Visualize trends and predictions
- University Course Analyzer
  - Analyze grade distributions
  - Find easiest/hardest courses
  - Recommend course combinations
- Air Quality Monitor
  - Load environmental data
  - Track pollution levels
  - Identify patterns and alerts
Learning Resources
Official Documentation
Tutorials & Courses
- Kaggle Learn - Free data science courses
- Real Python - Python tutorials
- DataCamp - Interactive courses
Datasets to Practice With
- Kaggle Datasets - Millions of datasets
- Google Dataset Search
- UCI Machine Learning Repository
- FiveThirtyEight Data
Congratulations!
You've completed the UV & Data Analysis chapter! You now know:
✅ Modern Python package management with UV
✅ Data manipulation with Pandas
✅ Data visualization with Matplotlib
✅ Building complete analysis projects
✅ Best practices and next steps
These skills are highly valuable in:
- Data Science careers
- Software Engineering
- Research
- Business Analytics
- AI/Machine Learning
What's Next?
Continue your coding adventure with the next chapters:
- Chapter 2: Pygame - Build exciting games with Python!
- Chapter 3: Manim - Create stunning math animations!
Or dive deeper into data science by exploring machine learning, neural networks, and AI!
You're Ready!
You have the foundation to tackle real-world data problems. Start with a small project that interests you, and keep building from there. Every data scientist started exactly where you are now!