Pandas Dataframe & Operations Cheatsheet: The Complete Reference

Pandas is the core library for data manipulation and analysis in Python. Built on top of NumPy, it provides fast, flexible data structures designed to work with structured and relational data.

This reference sheet covers DataFrame loading, indexing patterns, data cleaning operations, aggregations, merging, and performance optimizations.

- **Data Loading**: Read structured datasets efficiently using standard parsers like `read_csv`, `read_parquet`, and `read_sql`. - **Indexing**: Access rows and columns using strict integer-based indexing (`iloc`) or label-based queries (`loc`). - **Data Cleaning**: Handle missing rows and duplicates with simple built-in handlers like `fillna`, `dropna`, and `drop_duplicates`. - **Groupings**: Summarize data using the high-performance split-apply-combine workflow (`groupby` and `agg`). - **Performance Optimization**: Reduce RAM footprint by leveraging categorical columns, vectorization, and fast file formats like Parquet.

Before diving into this cheatsheet, check out my previous deep-dive on NumPy Array Manipulations Cheatsheet: The Complete Reference to see how we structured these patterns in practice.

Loading & Initializing Data

Pandas supports reading from and writing to numerous file types and database engines.

import pandas as pd
import numpy as np

# 1. Loading data from files
df_csv = pd.read_csv('data.csv', parse_dates=['timestamp'])
df_parquet = pd.read_parquet('data.parquet')

# 2. Creating a DataFrame from a dictionary
data = {
    'name': ['Alice', 'Bob', 'Charlie'],
    'age': [25, 30, 35],
    'city': ['Mumbai', 'Delhi', 'Bangalore']
}
df = pd.DataFrame(data)

# 3. Quick structural inspection
print(df.info())      # Displays column types, null counts, and memory footprint
print(df.describe())  # Generates summary statistics for numerical columns

Indexing & Slicing Data

Pandas provides distinct ways to index and filter records based on labels (loc) or integer positions (iloc).

# 1. Label-based selection (.loc[row_label, col_label])
# Select rows where age > 28, displaying only the name and city columns
result_loc = df.loc[df['age'] > 28, ['name', 'city']]

# 2. Integer-position-based selection (.iloc[row_idx, col_idx])
# Select the first two rows and the first two columns
result_iloc = df.iloc[0:2, 0:2]

# 3. Clean boolean queries (useful for long conditional strings)
high_age_df = df.query('age >= 30')

# 4. Multi-level indexing (hierarchical indexes)
df_hierarchical = df.set_index(['city', 'name'])

Cleaning & Transforming Data

Preparing messy raw datasets involves handling nulls, modifying data types, and resolving duplicates.

# 1. Handling Null Values
df['age'] = df['age'].fillna(df['age'].mean()) # Fill nulls with mean value
df_clean = df.dropna(subset=['email'])          # Drop rows where email is null

# 2. Removing Duplicates
df_unique = df.drop_duplicates(subset=['name'], keep='first')

# 3. Explicit Data Type Conversion
df['age'] = df['age'].astype('int32')

# 4. Applying element-wise operations (Vectorized maps)
df['city_upper'] = df['city'].map(str.upper)
df['age_squared'] = df['age'].apply(lambda x: x ** 2)

Aggregations & GroupBy

The standard split-apply-combine workflow enables fast summaries of structured groups.

# 1. Basic group and mean aggregation
age_by_city = df.groupby('city')['age'].mean()

# 2. Applying multiple aggregate operations to different columns
group_summary = df.groupby('city').agg({
    'age': ['mean', 'min', 'max'],
    'name': 'count'
})

# Flattening the resulting multi-index columns for cleaner exports
group_summary.columns = ['_'.join(col).strip() for col in group_summary.columns.values]

# 3. Creating a Pivot Table
pivot = df.pivot_table(index='city', values='age', aggfunc='mean')

Merging & Joining DataFrames

Combine multiple datasets using horizontal merges (similar to SQL Joins) or vertical concatenations.

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'val1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['B', 'C', 'D'], 'val2': [4, 5, 6]})

# 1. Inner Join (SQL-equivalent)
inner_merged = pd.merge(df1, df2, on='key', how='inner')

# 2. Left Outer Join
left_merged = pd.merge(df1, df2, on='key', how='left')

# 3. Vertical Concatenation
concatenated = pd.concat([df1, df2], axis=0, ignore_index=True)

Optimizing Code Performance

Operating on large datasets requires memory awareness. Vectorized calculations are thousands of times faster than looping over records.

# 1. Leverage Categorical types for low-cardinality string columns
# Reduces RAM usage significantly for repetitive strings
df['city'] = df['city'].astype('category')

# 2. Avoid loops: Use Vectorized operations
# BAD (Slow):
# for idx, row in df.iterrows():
#     df.loc[idx, 'adjusted_age'] = row['age'] + 5

# GOOD (Instant vectorization):
df['adjusted_age'] = df['age'] + 5

# 3. Processing massive files in chunks
chunk_size = 10000
for chunk in pd.read_csv('massive_dataset.csv', chunksize=chunk_size):
    # Perform local aggregation step on each chunk
    process_chunk_data = chunk.groupby('category')['value'].sum()