Pandas is the core library for data manipulation and analysis in Python. Built on top of NumPy, it provides fast, flexible data structures designed to work with structured and relational data.
This reference sheet covers DataFrame loading, indexing patterns, data cleaning operations, aggregations, merging, and performance optimizations.
Before diving into this cheatsheet, check out my previous deep-dive on NumPy Array Manipulations Cheatsheet: The Complete Reference to see how we structured these patterns in practice.
Loading & Initializing Data
Pandas supports reading from and writing to numerous file types and database engines.
import pandas as pd
import numpy as np
# 1. Loading data from files
df_csv = pd.read_csv('data.csv', parse_dates=['timestamp'])
df_parquet = pd.read_parquet('data.parquet')
# 2. Creating a DataFrame from a dictionary
data = {
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35],
'city': ['Mumbai', 'Delhi', 'Bangalore']
}
df = pd.DataFrame(data)
# 3. Quick structural inspection
print(df.info()) # Displays column types, null counts, and memory footprint
print(df.describe()) # Generates summary statistics for numerical columns
Indexing & Slicing Data
Pandas provides distinct ways to index and filter records based on labels (loc) or integer positions (iloc).
# 1. Label-based selection (.loc[row_label, col_label])
# Select rows where age > 28, displaying only the name and city columns
result_loc = df.loc[df['age'] > 28, ['name', 'city']]
# 2. Integer-position-based selection (.iloc[row_idx, col_idx])
# Select the first two rows and the first two columns
result_iloc = df.iloc[0:2, 0:2]
# 3. Clean boolean queries (useful for long conditional strings)
high_age_df = df.query('age >= 30')
# 4. Multi-level indexing (hierarchical indexes)
df_hierarchical = df.set_index(['city', 'name'])
Cleaning & Transforming Data
Preparing messy raw datasets involves handling nulls, modifying data types, and resolving duplicates.
# 1. Handling Null Values
df['age'] = df['age'].fillna(df['age'].mean()) # Fill nulls with mean value
df_clean = df.dropna(subset=['email']) # Drop rows where email is null
# 2. Removing Duplicates
df_unique = df.drop_duplicates(subset=['name'], keep='first')
# 3. Explicit Data Type Conversion
df['age'] = df['age'].astype('int32')
# 4. Applying element-wise operations (Vectorized maps)
df['city_upper'] = df['city'].map(str.upper)
df['age_squared'] = df['age'].apply(lambda x: x ** 2)
Aggregations & GroupBy
The standard split-apply-combine workflow enables fast summaries of structured groups.
# 1. Basic group and mean aggregation
age_by_city = df.groupby('city')['age'].mean()
# 2. Applying multiple aggregate operations to different columns
group_summary = df.groupby('city').agg({
'age': ['mean', 'min', 'max'],
'name': 'count'
})
# Flattening the resulting multi-index columns for cleaner exports
group_summary.columns = ['_'.join(col).strip() for col in group_summary.columns.values]
# 3. Creating a Pivot Table
pivot = df.pivot_table(index='city', values='age', aggfunc='mean')
Merging & Joining DataFrames
Combine multiple datasets using horizontal merges (similar to SQL Joins) or vertical concatenations.
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'val2': [4, 5, 6]})
# 1. Inner Join (SQL-equivalent)
inner_merged = pd.merge(df1, df2, on='key', how='inner')
# 2. Left Outer Join
left_merged = pd.merge(df1, df2, on='key', how='left')
# 3. Vertical Concatenation
concatenated = pd.concat([df1, df2], axis=0, ignore_index=True)
Optimizing Code Performance
Operating on large datasets requires memory awareness. Vectorized calculations are thousands of times faster than looping over records.
# 1. Leverage Categorical types for low-cardinality string columns
# Reduces RAM usage significantly for repetitive strings
df['city'] = df['city'].astype('category')
# 2. Avoid loops: Use Vectorized operations
# BAD (Slow):
# for idx, row in df.iterrows():
# df.loc[idx, 'adjusted_age'] = row['age'] + 5
# GOOD (Instant vectorization):
df['adjusted_age'] = df['age'] + 5
# 3. Processing massive files in chunks
chunk_size = 10000
for chunk in pd.read_csv('massive_dataset.csv', chunksize=chunk_size):
# Perform local aggregation step on each chunk
process_chunk_data = chunk.groupby('category')['value'].sum() Related Articles
Deepen your understanding with these curated continuations.
NumPy Array Manipulations Cheatsheet: The Complete Reference
Master NumPy: array initialization, indexing, slicing, broadcasting, linear algebra, and performance tuning.
PyTorch & CUDA ML Operations Cheatsheet: The Complete Reference
Optimize deep learning workloads: PyTorch tensor manipulations, CUDA memory management, multi-GPU training, and mixed precision.
FastAPI & Pydantic v2 Boilerplate Cheatsheet: The Complete Reference
Build high-performance APIs: FastAPI routers, Pydantic v2 models, dependency injection, async database integration, and security.