Data Visualization Masterclass

🎯 Why Visualize Data?

Data visualization transforms abstract numbers into visual stories. Patterns such as trends, clusters, and outliers are usually far easier to spot in a chart than in a table of raw numbers. Visualization helps us explore, analyze, and communicate data effectively.

Anscombe's Quartet: Four datasets with nearly identical statistical properties (mean, variance, correlation) that look completely different when plotted. This demonstrates why visualization is essential - statistics alone can be misleading!
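The claim is easy to check numerically. The sketch below is illustrative: it uses the published Anscombe values with NumPy, and the variable names (quartet, summary) are assumptions made for this example.

```python
import numpy as np

# Anscombe's quartet (Anscombe, 1973). Datasets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

summary = {}
for name, (x, y) in quartet.items():
    x, y = np.asarray(x, dtype=float), np.asarray(y, dtype=float)
    summary[name] = {
        "mean_x": x.mean(),
        "mean_y": y.mean(),
        "var_x": x.var(ddof=1),          # sample variance
        "var_y": y.var(ddof=1),
        "corr": np.corrcoef(x, y)[0, 1],  # Pearson correlation
    }
    print(name, {k: round(float(v), 2) for k, v in summary[name].items()})
```

The four printed rows agree to about two decimal places. Drawing each pair with ax.scatter(x, y) is what reveals the curve, the outlier, and the single high-leverage point.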

Three Purposes of Visualization

1. Exploratory: Discover patterns, anomalies, and insights in your data
2. Explanatory: Communicate findings to stakeholders clearly
3. Confirmatory: Verify hypotheses and validate models

💡 "The greatest value of a picture is when it forces us to notice what we never expected to see." — John Tukey

✅ Always start with visualization before building ML models.

👁️ Visual Perception & Pre-attentive Attributes

The human visual system can detect certain visual attributes almost instantly (< 250ms) without conscious effort. These are called pre-attentive attributes.

Pre-attentive Attributes:

Position: Most accurate for quantitative data (use X/Y axes)
Length: Bar charts leverage this effectively
Color Hue: Best for categorical distinctions
Color Intensity: Good for gradients/magnitude
Size: Bubble charts, but humans underestimate area
Shape: Useful for categories, but limit to 5-7 shapes
Orientation: Lines, angles

Cleveland & McGill's Accuracy Ranking

Most Accurate → Least Accurate:
1. Position on common scale (dot plot, bars on a shared baseline)
2. Position on non-aligned scale (small multiples with separate axes)
3. Length (stacked bar segments)
4. Angle, Slope
5. Area
6. Volume, Curvature
7. Color saturation, Color hue

⚠️ Pie charts rely on angle and area, which are judged less accurately than position or length. Bar charts are almost always better!

✅ Use position for most important data, color for categories.

📐 The Grammar of Graphics

The Grammar of Graphics (Wilkinson, 1999) is a framework for describing statistical graphics. It's the foundation of ggplot2 (R) and influences Seaborn, Altair, and Plotly.

Components of a Graphic:

Data: The dataset being visualized
Aesthetics (aes): Mapping data to visual properties (x, y, color, size)
Geometries (geom): Visual elements (points, lines, bars, areas)
Facets: Subplots by categorical variable
Statistics: Transformations (binning, smoothing, aggregation)
Coordinates: Cartesian, polar, map projections
Themes: Non-data visual elements (fonts, backgrounds)

💡 Understanding Grammar of Graphics makes you a better visualizer in ANY library.
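The components can be mapped onto plain Matplotlib calls. This is a rough sketch, not how ggplot2 implements the grammar: the data is synthetic, and the column names (x, y, group) are invented for the example.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Data: a small synthetic table (all values are made up for illustration).
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "x": rng.uniform(0, 10, 60),
    "group": np.repeat(["a", "b", "c"], 20),
})
df["y"] = 2 * df["x"] + rng.normal(0, 2, 60)

groups = sorted(df["group"].unique())
# Facets: one panel per level of a categorical variable.
# Coordinates: Matplotlib axes are Cartesian by default.
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, figsize=(9, 3))
for ax, name in zip(axes, groups):
    panel = df[df["group"] == name]
    # Aesthetics + geometry: the x and y columns are mapped to point positions.
    ax.scatter(panel["x"], panel["y"], alpha=0.7)
    # Statistics: a least-squares line fitted to this panel's data.
    slope, intercept = np.polyfit(panel["x"], panel["y"], deg=1)
    xs = np.array([panel["x"].min(), panel["x"].max()])
    ax.plot(xs, slope * xs + intercept, color="black")
    # Theme: non-data elements such as titles and spines.
    ax.set_title(f"group = {name}")
    ax.spines[["top", "right"]].set_visible(False)
```

Libraries that follow the grammar more closely let you declare these layers instead of looping over panels by hand.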

🎨 Choosing the Right Chart

The best visualization depends on your data type and question. Here's a decision guide:

Single Variable (Univariate):
• Continuous: Histogram, KDE, Box plot, Violin plot
• Categorical: Bar chart, Count plot

Two Variables (Bivariate):
• Both Continuous: Scatter plot, Line chart, Hexbin, 2D histogram
• Continuous + Categorical: Box plot, Violin, Strip, Swarm
• Both Categorical: Heatmap, Grouped bar chart

Multiple Variables (Multivariate):
• Pair plot (scatterplot matrix)
• Parallel coordinates
• Heatmap correlation matrix
• Faceted plots (small multiples)

Common Chart Mistakes

⚠️ Pie charts for many categories - Use bar chart instead

⚠️ 3D effects on 2D data - Distorts perception

⚠️ Truncated Y-axis - Exaggerates differences

⚠️ Rainbow color scales - Not perceptually uniform
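The truncated-axis problem can be quantified with a little arithmetic. The helper below (apparent_ratio, a name invented for this sketch) compares how tall two bars are drawn when the axis starts above zero; the values are hypothetical.

```python
def apparent_ratio(a, b, baseline=0.0):
    """Ratio of drawn bar heights for values a and b when the axis starts at baseline."""
    if min(a, b) <= baseline:
        raise ValueError("baseline must lie below both values")
    return (a - baseline) / (b - baseline)

# Hypothetical values: two bars that differ by 4%.
true_ratio = apparent_ratio(52, 50)                 # axis starts at 0
drawn_ratio = apparent_ratio(52, 50, baseline=48)   # axis starts at 48
print(f"true ratio: {true_ratio:.2f}, drawn ratio: {drawn_ratio:.2f}")
```

With the axis starting at 48, the first bar is drawn twice as tall as the second even though the values differ by only 4%.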

🔬 Matplotlib Figure Anatomy

Understanding Matplotlib's object hierarchy is key to creating professional visualizations.

Hierarchical Structure:
Figure → Axes → Axis → Tick → Label

• Figure: The overall window/canvas
• Axes: The actual plot area (NOT the X/Y axis!)
• Axis: The X or Y axis with ticks and labels
• Artist: Everything visible (lines, text, patches)
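The hierarchy can be inspected directly on a live figure. A minimal sketch, using toy data chosen only for the demonstration:

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4], label="demo")

print(type(fig).__name__)               # the top-level container
print(fig.axes == [ax])                 # a Figure holds a list of Axes
print(type(ax.xaxis).__name__)          # each Axes owns an x-axis and a y-axis object
ticks = ax.xaxis.get_major_ticks()
print(type(ticks[0].label1).__name__)   # each Tick carries a text label

# Figure, Axes, Axis, and the plotted line are all Artists.
print(all(isinstance(obj, Artist) for obj in (fig, ax, ax.xaxis, line)))
```

Knowing which object owns what tells you where to look for a setting: tick formatting lives on the Axis, titles on the Axes, and overall size on the Figure.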

Two Interfaces

1. pyplot (MATLAB-style): Quick, implicit state
plt.plot(x, y)
plt.xlabel('Time')
plt.show()

2. Object-Oriented (OO): Explicit, recommended for complex plots
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('Time')

✅ Always use OO interface for publication-quality plots.

📈 Basic Matplotlib Plots

Master the fundamental plot types that form the foundation of data visualization.

Code Examples

Line Plot:
ax.plot(x, y, color='blue', linestyle='--', marker='o', label='Series A')

Scatter Plot:
ax.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')

Bar Chart:
ax.bar(categories, values, color='steelblue', edgecolor='black')

Histogram:
ax.hist(data, bins=30, edgecolor='white', density=True)
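The one-liners above assume that x, y, colors, sizes, categories, values, and data already exist. The sketch below fills them in with synthetic values so that all four calls run together on a 2×2 grid; the data and panel titles are placeholders.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)          # synthetic data, for illustration only
x = np.linspace(0, 10, 50)
y = np.sin(x)
colors = rng.uniform(0, 1, x.size)       # numbers mapped through the colormap
sizes = rng.uniform(20, 200, x.size)     # marker areas in points squared
categories = ["A", "B", "C"]
values = [5, 3, 8]
data = rng.normal(loc=0, scale=1, size=1000)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
ax_line, ax_scatter, ax_bar, ax_hist = axes.flat

ax_line.plot(x, y, color="blue", linestyle="--", marker="o", label="Series A")
ax_line.legend()
ax_scatter.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap="viridis")
ax_bar.bar(categories, values, color="steelblue", edgecolor="black")
ax_hist.hist(data, bins=30, edgecolor="white", density=True)

for ax, title in zip(axes.flat, ["Line", "Scatter", "Bar", "Histogram"]):
    ax.set_title(title)
fig.tight_layout()
```

Call plt.show() or fig.savefig("plots.png") afterwards to view the result.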

🔲 Subplots & Multi-panel Layouts

Combine multiple visualizations into a single figure for comprehensive analysis.

Methods:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
gs = fig.add_gridspec(3, 3); ax = fig.add_subplot(gs[0, :])

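✅ Use fig.tight_layout() or create the figure with plt.subplots(..., layout='constrained') to prevent overlaps. The older fig.set_constrained_layout(True) is discouraged in Matplotlib 3.6+ in favor of the layout argument.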
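For uneven layouts, a GridSpec lets one panel span several cells. A small sketch, assuming Matplotlib 3.5 or newer for the layout argument; the panel names are arbitrary.

```python
import matplotlib.pyplot as plt

# layout="constrained" asks Matplotlib to space the panels automatically.
fig = plt.figure(figsize=(9, 6), layout="constrained")
gs = fig.add_gridspec(3, 3)

ax_top = fig.add_subplot(gs[0, :])      # first row, all three columns
ax_left = fig.add_subplot(gs[1:, 0])    # second and third rows, first column
ax_main = fig.add_subplot(gs[1:, 1:])   # second and third rows, last two columns

for ax, name in [(ax_top, "top"), (ax_left, "left"), (ax_main, "main")]:
    ax.set_title(name)
```

Slicing the GridSpec works like slicing a NumPy array, so any rectangular block of cells can become one Axes.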

🎨 Styling & Professional Themes

Transform basic plots into publication-quality visualizations.

Available Styles:
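plt.style.available → Lists all built-in styles
plt.style.use('seaborn-v0_8-whitegrid') → Applies a style globally (this style name requires Matplotlib 3.6+)
with plt.style.context('dark_background'): → Applies a style only inside the with block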

Color Palettes

Perceptually Uniform (sequential): viridis, plasma, inferno, magma, cividis
Sequential: Blues, Greens, Oranges (for magnitude)
Diverging: coolwarm, RdBu (for +/- deviations)
Categorical: tab10, Set2, Paired (discrete groups)
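Colormaps are callable objects, so colors can be sampled from them directly. The sketch below is illustrative: it pulls colors from one sequential and one categorical map, then uses them inside a temporary style context.

```python
import numpy as np
import matplotlib.pyplot as plt

# Sample five evenly spaced colors from a sequential colormap.
viridis = plt.get_cmap("viridis")
sequential_colors = viridis(np.linspace(0, 1, 5))   # array with one RGBA row per color

# A categorical colormap stores a fixed list of distinct colors,
# so integer arguments index that list directly.
tab10 = plt.get_cmap("tab10")
categorical_colors = [tab10(i) for i in range(3)]

# Apply a style temporarily; the previous settings return after the block.
with plt.style.context("dark_background"):
    fig, ax = plt.subplots()
    for i, color in enumerate(categorical_colors):
        ax.plot([0, 1], [i, i + 1], color=color, label=f"group {i}")
    ax.legend()
```

Floats between 0 and 1 sample along a continuous map, while integers pick entries from a discrete one.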

🌊 Seaborn: Statistical Visualization

Seaborn is a high-level library built on Matplotlib that makes statistical graphics beautiful and easy.

Why Seaborn?

Beautiful default styles and color palettes
Works seamlessly with Pandas DataFrames
Statistical estimation built-in (confidence intervals, regression)
Faceting for multi-panel figures
Functions organized by plot purpose

Seaborn Function Categories

Figure-level: Create entire figures (displot, relplot, catplot)
Axes-level: Draw on specific axes (histplot, scatterplot, boxplot)

By Purpose:
• Distribution: histplot, kdeplot, ecdfplot, rugplot
• Relationship: scatterplot, lineplot, regplot
• Categorical: stripplot, swarmplot, boxplot, violinplot, barplot
• Matrix: heatmap, clustermap

📊 Distribution Plots

Visualize the distribution of a single variable or compare distributions across groups.

Histogram vs KDE:
• Histogram: Discrete bins, shows raw counts
• KDE: Smooth curve, estimates probability density
• Use both together: sns.histplot(data, kde=True)

💡 ECDF (Empirical Cumulative Distribution Function) avoids binning issues entirely.
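An ECDF needs no bin width because it is computed straight from the sorted data. A short sketch, where the helper name ecdf is chosen for this example:

```python
import numpy as np

def ecdf(values):
    """Return the sorted values and the fraction of observations at or below each."""
    x = np.sort(np.asarray(values, dtype=float))
    y = np.arange(1, x.size + 1) / x.size
    return x, y

x, y = ecdf([3.0, 1.0, 2.0, 5.0])
print(x, y)
```

Plot the result with ax.step(x, y, where="post"), or let Seaborn do both steps with sns.ecdfplot.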

🔗 Relationship Plots

Explore relationships between two or more continuous variables.

Key Functions:
sns.scatterplot(data=df, x='x', y='y', hue='category', size='magnitude')
sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
sns.pairplot(df, hue='species', diag_kind='kde')

📦 Categorical Plots

Visualize distributions and comparisons across categorical groups.

When to Use:
• Strip/Swarm: Show all data points (small datasets)
• Box: Summary statistics (median, quartiles, outliers)
• Violin: Full distribution shape + summary
• Bar: Mean (or another estimate) with error bars; use a count plot for raw counts
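To see what a box plot summarizes, the statistics can be computed by hand. This sketch assumes the common 1.5 × IQR whisker convention; the function name box_stats and the sample values are invented for the example.

```python
import numpy as np

def box_stats(values, whisker=1.5):
    """Quartiles, whisker ends, and outliers under the whisker * IQR rule."""
    data = np.asarray(values, dtype=float)
    q1, median, q3 = np.percentile(data, [25, 50, 75])
    iqr = q3 - q1
    low_fence, high_fence = q1 - whisker * iqr, q3 + whisker * iqr
    inside = data[(data >= low_fence) & (data <= high_fence)]
    return {
        "q1": q1, "median": median, "q3": q3,
        # Whiskers end at the most extreme points that are still inside the fences.
        "whisker_low": inside.min(), "whisker_high": inside.max(),
        "outliers": data[(data < low_fence) | (data > high_fence)],
    }

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 9, 30])
print(stats)
```

Here the value 30 falls beyond the upper fence, so it is drawn as an individual outlier point and the upper whisker stops at 9.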

🔥 Heatmaps & Correlation Matrices

Visualize matrices of values using color intensity. Essential for EDA correlation analysis.

Best Practices:
• Annotate with values when the matrix is small enough to read: annot=True
• Use diverging colormap for correlation: cmap='coolwarm', center=0
• Mask the redundant upper triangle: mask=np.triu(np.ones_like(corr, dtype=bool))
• Square cells: square=True

💡 Clustermap automatically clusters similar rows/columns together.
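Put together, the practices above look roughly like this. The data is random and the column names (a to d) are placeholders; one correlation is induced on purpose so the heatmap has something to show.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(7)
df = pd.DataFrame(rng.normal(size=(100, 4)), columns=["a", "b", "c", "d"])
df["b"] += 0.8 * df["a"]          # induce a positive correlation between a and b

corr = df.corr()
# True marks cells to hide: the diagonal and everything above it.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(corr, mask=mask, annot=True, fmt=".2f",
            cmap="coolwarm", center=0, vmin=-1, vmax=1, square=True, ax=ax)
```

Fixing vmin and vmax at -1 and 1 keeps the color scale comparable between different correlation matrices.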

🚀 Plotly Express: Interactive Visualization

Plotly creates interactive, web-based visualizations with zoom, pan, hover tooltips, and more.

Why Plotly?

Interactive out of the box (zoom, pan, select)
Hover tooltips with data details
Export as HTML or static images such as PNG (static export needs the kaleido package), or embed in dashboards
Works in Jupyter, Streamlit, Dash
plotly.express is the high-level API (like Seaborn for Matplotlib)

Common Functions:
px.scatter(df, x='x', y='y', color='category', size='value', hover_data=['name'])
px.line(df, x='date', y='price', color='stock')
px.bar(df, x='category', y='count', color='group', barmode='group')
px.histogram(df, x='value', nbins=50, marginal='box')

🎬 Animated Visualizations

Add time dimension to your visualizations with animations.

Plotly Animation:
px.scatter(df, x='gdp', y='life_exp', animation_frame='year', animation_group='country', size='pop', color='continent')

Matplotlib Animation:
from matplotlib.animation import FuncAnimation
ani = FuncAnimation(fig, update_func, frames=100, interval=50)
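The FuncAnimation call needs a figure and an update function to be useful. A small self-contained sketch with a moving sine wave, chosen only as an illustration:

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

x = np.linspace(0, 2 * np.pi, 200)
fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x))

def update_func(frame):
    """Shift the sine wave's phase; called once per frame."""
    line.set_ydata(np.sin(x + 2 * np.pi * frame / 100))
    return (line,)   # with blit=True, return the artists that changed

# Keep a reference to the animation, otherwise it may be garbage-collected.
ani = FuncAnimation(fig, update_func, frames=100, interval=50, blit=True)
```

To export, something like ani.save("wave.gif", writer="pillow") should work when the Pillow package is installed.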

✅ Hans Rosling's Gapminder is the classic example of animated scatter plots!

📱 Interactive Dashboards with Streamlit

Build interactive web apps for data exploration without web development experience.

Streamlit Basics:
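Run the app from a terminal:
streamlit run app.py

Contents of app.py:
import streamlit as st
st.title("My Dashboard")
value = st.slider("Select value", 0, 100, 50)  # returns the current value
choice = st.selectbox("Choose", ["A", "B", "C"])  # returns the selected option
st.plotly_chart(fig)  # fig: a Plotly figure created earlier in the script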

💡 Streamlit reruns the whole script from top to bottom whenever a widget value changes - no callbacks needed!

🗺️ Geospatial Visualization

Visualize geographic data with maps, choropleth, and point plots.

Libraries:
• Plotly: px.choropleth(df, locations='iso_code', color='value') (locations expects ISO 3166-1 alpha-3 country codes by default)
• Folium: Interactive Leaflet maps
• Geopandas + Matplotlib: Static maps with shapefiles
• Kepler.gl: Large-scale geospatial visualization

🎲 3D Visualization

Visualize three-dimensional relationships with surface plots, scatter plots, and more.

⚠️ 3D plots can obscure data. Often, multiple 2D views are more effective.

✅ Use Plotly for interactive 3D (rotate, zoom) instead of static Matplotlib 3D.

📖 Data Storytelling

Transform visualizations into compelling narratives that drive action.

The Data Storytelling Framework:

Context: Why does this matter? Who is the audience?
Data: What insights did you discover?
Narrative: What's the storyline (beginning, middle, end)?
Visual: Which chart best supports the story?
Call to Action: What should the audience do?

Design Principles

Remove Clutter: Eliminate chartjunk, gridlines, borders
Focus Attention: Use color strategically (grey + accent)
Think Like a Designer: Alignment, white space, hierarchy
Tell a Story: Title = conclusion, not description
Bad: "Sales by Region"
Good: "West Region Sales Dropped 23% in Q4"
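The grey-plus-accent and title-as-conclusion principles can be combined in a few lines. The sales figures below are invented purely so that the example reproduces the "Good" title above.

```python
import matplotlib.pyplot as plt

# Made-up quarterly sales in arbitrary units, for illustration only.
q3 = {"North": 90, "South": 80, "East": 85, "West": 100}
q4 = {"North": 92, "South": 81, "East": 84, "West": 77}

change = {region: (q4[region] - q3[region]) / q3[region] * 100 for region in q3}
focus = min(change, key=change.get)   # region with the largest percentage drop

fig, ax = plt.subplots(figsize=(6, 3.5))
# Grey for context, one accent color for the bar the story is about.
colors = ["tab:red" if region == focus else "lightgrey" for region in q4]
ax.bar(list(q4), list(q4.values()), color=colors)

# The title states the conclusion instead of describing the axes.
ax.set_title(f"{focus} Region Sales Dropped {abs(change[focus]):.0f}% in Q4", loc="left")
ax.set_ylabel("Q4 sales (units)")
ax.spines[["top", "right"]].set_visible(False)   # remove non-data clutter
```

Deriving the title from the data keeps the headline and the chart from drifting apart when the numbers are updated.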

💡 "If you can't explain it simply, you don't understand it well enough." (often attributed to Einstein, though the attribution is unverified)

✅ Read "Storytelling with Data" by Cole Nussbaumer Knaflic