🎯 Why Visualize Data?

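Data visualization transforms abstract numbers into visual stories. People spot patterns, trends, and outliers in a picture far more quickly than in a table of numbers (the often-quoted "60,000× faster than text" figure has no traceable source, so treat it as folklore). Visualization helps us explore, analyze, and communicate data effectively.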

Anscombe's Quartet: Four datasets with nearly identical statistical properties (mean, variance, correlation) that look completely different when plotted. This demonstrates why visualization is essential - statistics alone can be misleading!
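
The claim about near-identical statistics can be checked directly. The sketch below uses the published quartet values and only the Python standard library; `pearson_r` and `summary` are helper names introduced here for illustration.

```python
from statistics import mean, variance

# Anscombe's quartet (Anscombe, 1973). Sets I-III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


# (mean x, variance x, mean y, variance y, correlation) for each dataset
summary = {
    name: (mean(xs), variance(xs), mean(ys), variance(ys), pearson_r(xs, ys))
    for name, (xs, ys) in quartet.items()
}
for name, stats in summary.items():
    print(name, [round(s, 2) for s in stats])
```

The four printed rows agree to about two decimal places, yet a scatter plot of each set shows a linear cloud, a curve, a line with one outlier, and a vertical stack with one leverage point.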

Three Purposes of Visualization

1. Exploratory: Discover patterns, anomalies, and insights in your data
2. Explanatory: Communicate findings to stakeholders clearly
3. Confirmatory: Verify hypotheses and validate models
💡 "The greatest value of a picture is when it forces us to notice what we never expected to see." - John Tukey
✅ Always start with visualization before building ML models.

👁️ Visual Perception & Pre-attentive Attributes

The human visual system can detect certain visual attributes almost instantly (< 250ms) without conscious effort. These are called pre-attentive attributes.

Pre-attentive Attributes:
  • Position: Most accurate for quantitative data (use X/Y axes)
  • Length: Bar charts leverage this effectively
  • Color Hue: Best for categorical distinctions
  • Color Intensity: Good for gradients/magnitude
  • Size: Bubble charts, but humans underestimate area
  • Shape: Useful for categories, but limit to 5-7 shapes
  • Orientation: Lines, angles

Cleveland & McGill's Accuracy Ranking

Most Accurate → Least Accurate:
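1. Position on common scale (scatter plot, bars on a shared baseline)
2. Position on non-aligned scale (the same axis repeated across panels)
3. Length (bar segments without a shared baseline, as in stacked bars)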
4. Angle, Slope
5. Area
6. Volume, Curvature
7. Color saturation, Color hue
⚠️ Pie charts use angle (low accuracy). Bar charts are almost always better!
✅ Use position for most important data, color for categories.
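
To get a feel for the ranking, the sketch below draws the same five values as a pie (angle) and as a bar chart (position on a common scale). The labels and shares are hypothetical numbers chosen to be close together.

```python
import matplotlib.pyplot as plt

# Hypothetical shares that differ by only a few points.
labels = ["A", "B", "C", "D", "E"]
shares = [23, 21, 20, 19, 17]

fig, (ax_pie, ax_bar) = plt.subplots(1, 2, figsize=(9, 4))

# Pie: each value is encoded as an angle, so near-equal slices look alike.
ax_pie.pie(shares, labels=labels)
ax_pie.set_title("Angle encoding (pie)")

# Bar: each value is encoded as a position on one common scale.
ax_bar.bar(labels, shares, color="steelblue")
ax_bar.set_ylabel("Share (%)")
ax_bar.set_title("Position encoding (bar)")

fig.tight_layout()
```

Call `plt.show()` or `fig.savefig(...)` to view the result. Ranking the slices by eye is noticeably harder than ranking the bars.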

📐 The Grammar of Graphics

The Grammar of Graphics (Wilkinson, 1999) is a framework for describing statistical graphics. It's the foundation of ggplot2 (R) and influences Seaborn, Altair, and Plotly.

Components of a Graphic:
  • Data: The dataset being visualized
  • Aesthetics (aes): Mapping data to visual properties (x, y, color, size)
  • Geometries (geom): Visual elements (points, lines, bars, areas)
  • Facets: Subplots by categorical variable
  • Statistics: Transformations (binning, smoothing, aggregation)
  • Coordinates: Cartesian, polar, map projections
  • Themes: Non-data visual elements (fonts, backgrounds)
💡 Understanding Grammar of Graphics makes you a better visualizer in ANY library.
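
The components listed above can be traced even in plain Matplotlib, which has no grammar layer of its own. The following is a rough sketch on synthetic data; the group names and the linear-fit statistic are assumptions made for illustration, and each comment marks which component a line corresponds to.

```python
import numpy as np
import matplotlib.pyplot as plt

# DATA: synthetic observations for two hypothetical groups.
rng = np.random.default_rng(0)
groups = ["control", "treated"]
data = {}
for i, group in enumerate(groups):
    x = rng.uniform(0, 10, 40)
    data[group] = (x, (1 + i) * x + rng.normal(0, 2, 40))

# AESTHETICS: group -> color (x and y map to position below).
colors = {"control": "tab:blue", "treated": "tab:orange"}

# FACETS: one panel per group. COORDINATES: Cartesian, shared across panels.
fig, axes = plt.subplots(1, len(groups), sharex=True, sharey=True, figsize=(8, 3.5))

for ax, group in zip(axes, groups):
    x, y = data[group]
    ax.scatter(x, y, color=colors[group], alpha=0.6)    # GEOMETRY: points
    slope, intercept = np.polyfit(x, y, 1)              # STATISTICS: least-squares line
    xs = np.linspace(0, 10, 50)
    ax.plot(xs, slope * xs + intercept, color="black")  # GEOMETRY: line
    ax.set_title(group)
    ax.grid(alpha=0.3)                                  # THEME: non-data styling

fig.tight_layout()
```

Libraries built on the grammar let you declare these pieces separately instead of wiring them together by hand in a loop.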

🎨 Choosing the Right Chart

The best visualization depends on your data type and question. Here's a decision guide:

Single Variable (Univariate):
• Continuous: Histogram, KDE, Box plot, Violin plot
• Categorical: Bar chart, Count plot

Two Variables (Bivariate):
• Both Continuous: Scatter plot, Line chart, Hexbin, 2D histogram
• Continuous + Categorical: Box plot, Violin, Strip, Swarm
• Both Categorical: Heatmap, Grouped bar chart

Multiple Variables (Multivariate):
• Pair plot (scatterplot matrix)
• Parallel coordinates
• Heatmap correlation matrix
• Faceted plots (small multiples)
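
The decision guide above can be read as a lookup from variable types to candidate charts. A minimal sketch follows; `CHARTS`, `MULTIVARIATE`, and `suggest_charts` are hypothetical names introduced here, not part of any library.

```python
# Hypothetical lookup table built from the decision guide above.
CHARTS = {
    ("continuous",): ["Histogram", "KDE", "Box plot", "Violin plot"],
    ("categorical",): ["Bar chart", "Count plot"],
    ("continuous", "continuous"): ["Scatter plot", "Line chart", "Hexbin", "2D histogram"],
    ("categorical", "continuous"): ["Box plot", "Violin", "Strip", "Swarm"],
    ("categorical", "categorical"): ["Heatmap", "Grouped bar chart"],
}
MULTIVARIATE = ["Pair plot", "Parallel coordinates", "Heatmap correlation matrix", "Faceted plots"]


def suggest_charts(*kinds):
    """Return candidate charts for variables of the given kinds."""
    if len(kinds) > 2:
        return MULTIVARIATE
    key = tuple(sorted(kinds))  # the order of the two variables does not matter
    if key not in CHARTS:
        raise ValueError(f"expected 'continuous' or 'categorical', got {kinds!r}")
    return CHARTS[key]


print(suggest_charts("continuous", "categorical"))
```

The list is a starting point rather than a rule: the question you are asking still decides which candidate fits.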

Common Chart Mistakes

⚠️ Pie charts for many categories - Use bar chart instead
⚠️ 3D effects on 2D data - Distorts perception
⚠️ Truncated Y-axis - Exaggerates differences
⚠️ Rainbow color scales - Not perceptually uniform
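
To see how much a truncated Y-axis can exaggerate a difference, the sketch below draws the same two hypothetical values twice, once with the axis starting at 97 and once at 0. `visible_height_ratio` is a helper introduced here to quantify the effect.

```python
import matplotlib.pyplot as plt

products = ["X", "Y"]
revenue = [98, 100]  # hypothetical values that differ by about 2%

fig, (ax_trunc, ax_zero) = plt.subplots(1, 2, figsize=(8, 3.5))

ax_trunc.bar(products, revenue, color="indianred")
ax_trunc.set_ylim(97, 100.5)  # truncated axis
ax_trunc.set_title("Truncated Y-axis")

ax_zero.bar(products, revenue, color="steelblue")
ax_zero.set_ylim(0, 110)      # axis starts at zero
ax_zero.set_title("Zero-based Y-axis")


def visible_height_ratio(ax):
    """Ratio of the visible bar heights (bar top minus the axis floor)."""
    floor = ax.get_ylim()[0]
    return (revenue[1] - floor) / (revenue[0] - floor)


ratio_trunc = visible_height_ratio(ax_trunc)
ratio_zero = visible_height_ratio(ax_zero)
print(f"truncated: {ratio_trunc:.2f}x, zero-based: {ratio_zero:.2f}x")
```

On the truncated panel the second bar is drawn three times as tall as the first, although the underlying values differ by only about 2%.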

🔬 Matplotlib Figure Anatomy

Understanding Matplotlib's object hierarchy is key to creating professional visualizations.

Hierarchical Structure:
Figure → Axes → Axis → Tick → Label

• Figure: The overall window/canvas
• Axes: The actual plot area (NOT the X/Y axis!)
• Axis: The X or Y axis with ticks and labels
• Artist: Everything visible (lines, text, patches)
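
The hierarchy can be walked in code. The sketch below creates one plot and then steps from the Figure down to a single tick label; the variable names are illustrative.

```python
import matplotlib.pyplot as plt
from matplotlib.artist import Artist
from matplotlib.text import Text

fig, ax = plt.subplots()
(line,) = ax.plot([0, 1, 2], [0, 1, 4])

# Figure -> Axes: the figure keeps a list of the Axes it contains.
print(fig.axes)

# Axes -> Axis: each Axes owns an x-axis and a y-axis object.
x_axis = ax.xaxis

# Axis -> Tick -> Label: ticks are objects, and each carries a Text label.
first_tick = x_axis.get_major_ticks()[0]
tick_label = first_tick.label1

# Everything above, including the plotted line, is an Artist.
for obj in (fig, ax, x_axis, first_tick, tick_label, line):
    print(type(obj).__name__, isinstance(obj, Artist))
```

Knowing which object owns a property tells you where to set it: figure size on the Figure, titles and limits on the Axes, tick formatting on the Axis.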

Two Interfaces

1. pyplot (MATLAB-style): Quick, implicit state
plt.plot(x, y)
plt.xlabel('Time')
plt.show()

2. Object-Oriented (OO): Explicit, recommended for complex plots
fig, ax = plt.subplots()
ax.plot(x, y)
ax.set_xlabel('Time')
✅ Always use OO interface for publication-quality plots.

📈 Basic Matplotlib Plots

Master the fundamental plot types that form the foundation of data visualization.

Code Examples

Line Plot:
ax.plot(x, y, color='blue', linestyle='--', marker='o', label='Series A')

Scatter Plot:
ax.scatter(x, y, c=colors, s=sizes, alpha=0.7, cmap='viridis')

Bar Chart:
ax.bar(categories, values, color='steelblue', edgecolor='black')

Histogram:
ax.hist(data, bins=30, edgecolor='white', density=True)
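
The one-line calls above assume that `ax`, `x`, `y`, and the other inputs already exist. A self-contained sketch that supplies synthetic data and draws all four plot types in one figure could look like this (the data and the panel layout are assumptions for illustration):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
x = np.linspace(0, 10, 50)

fig, axes = plt.subplots(2, 2, figsize=(10, 8))
ax_line, ax_scatter, ax_bar, ax_hist = axes.flat

# Line plot
ax_line.plot(x, np.sin(x), color="blue", linestyle="--", marker="o", label="Series A")
ax_line.legend()
ax_line.set_title("Line")

# Scatter plot: color and size each encode an extra (here random) variable
colors = rng.uniform(0, 1, x.size)
sizes = rng.uniform(20, 200, x.size)
points = ax_scatter.scatter(x, np.cos(x), c=colors, s=sizes, alpha=0.7, cmap="viridis")
fig.colorbar(points, ax=ax_scatter, label="color variable")
ax_scatter.set_title("Scatter")

# Bar chart
ax_bar.bar(["A", "B", "C"], [5, 3, 8], color="steelblue", edgecolor="black")
ax_bar.set_title("Bar")

# Histogram: density=True rescales bar heights so the total area is 1
counts, edges, _ = ax_hist.hist(rng.normal(size=1000), bins=30, edgecolor="white", density=True)
ax_hist.set_title("Histogram")

fig.tight_layout()
```

Finish with `plt.show()` in a script, or let a notebook render the figure.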

🔲 Subplots & Multi-panel Layouts

Combine multiple visualizations into a single figure for comprehensive analysis.

Methods:
fig, axes = plt.subplots(2, 2, figsize=(12, 10))
fig, axes = plt.subplots(2, 2, sharex=True, sharey=True)
gs = fig.add_gridspec(3, 3); ax = fig.add_subplot(gs[0, :])
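✅ Use fig.tight_layout() or create the figure with layout='constrained' (for example plt.subplots(2, 2, layout='constrained')) to prevent overlaps. fig.set_constrained_layout(True) still works but has been discouraged since Matplotlib 3.6.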
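
For layouts where panels span several grid cells, a GridSpec gives finer control than a regular grid. The sketch below assumes Matplotlib 3.5 or newer for the `layout="constrained"` argument; the 2×3 arrangement is just an example.

```python
import matplotlib.pyplot as plt

# Constrained layout adjusts spacing so labels and titles do not overlap.
fig = plt.figure(figsize=(9, 6), layout="constrained")
gs = fig.add_gridspec(2, 3)

ax_wide = fig.add_subplot(gs[0, :])    # top row, spanning all three columns
ax_left = fig.add_subplot(gs[1, 0])    # bottom row, first column
ax_right = fig.add_subplot(gs[1, 1:])  # bottom row, last two columns

ax_wide.set_title("Overview")
ax_left.set_title("Detail A")
ax_right.set_title("Detail B")
```

Slicing the GridSpec works like slicing a NumPy array, so `gs[0, :]` means "row 0, every column".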

🎨 Styling & Professional Themes

Transform basic plots into publication-quality visualizations.

Available Styles:
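plt.style.available  # list of all built-in style names
plt.style.use('seaborn-v0_8-whitegrid')  # Matplotlib 3.6+; older versions call it 'seaborn-whitegrid'
with plt.style.context('dark_background'):  # applies the style only inside the with-block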

Color Palettes

Perceptually Uniform: viridis, plasma, inferno, magma, cividis
Sequential: Blues, Greens, Oranges (for magnitude)
Diverging: coolwarm, RdBu (for +/- deviations)
Categorical: tab10, Set2, Paired (discrete groups)
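
The palette families map onto data types as follows. This is a small sketch on random matrices: it uses a perceptually uniform sequential map for non-negative values, a diverging map with symmetric limits for signed values, and a qualitative map for discrete groups.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)
magnitude = rng.uniform(0, 1, (8, 8))  # non-negative values
deviation = rng.normal(0, 1, (8, 8))   # values above and below zero

fig, (ax_seq, ax_div) = plt.subplots(1, 2, figsize=(9, 4))

# Sequential / perceptually uniform: lightness tracks magnitude.
im_seq = ax_seq.imshow(magnitude, cmap="viridis")
fig.colorbar(im_seq, ax=ax_seq)
ax_seq.set_title("viridis (magnitude)")

# Diverging: symmetric limits put the neutral color exactly at zero.
limit = float(np.abs(deviation).max())
im_div = ax_div.imshow(deviation, cmap="RdBu", vmin=-limit, vmax=limit)
fig.colorbar(im_div, ax=ax_div)
ax_div.set_title("RdBu (+/- deviation)")

# Categorical: pick distinct colors by index from a qualitative palette.
tab10 = plt.get_cmap("tab10")
group_colors = [tab10(i) for i in range(3)]  # RGBA tuples
```

Without the symmetric `vmin`/`vmax`, the midpoint of a diverging map drifts away from zero whenever the data are skewed.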

🌊 Seaborn: Statistical Visualization

Seaborn is a high-level library built on Matplotlib that makes statistical graphics beautiful and easy.

Why Seaborn?
  • Beautiful default styles and color palettes
  • Works seamlessly with Pandas DataFrames
  • Statistical estimation built-in (confidence intervals, regression)
  • Faceting for multi-panel figures
  • Functions organized by plot purpose

Seaborn Function Categories

Figure-level: Create entire figures (displot, relplot, catplot)
Axes-level: Draw on specific axes (histplot, scatterplot, boxplot)

By Purpose:
• Distribution: histplot, kdeplot, ecdfplot, rugplot
• Relationship: scatterplot, lineplot, regplot
• Categorical: stripplot, swarmplot, boxplot, violinplot, barplot
• Matrix: heatmap, clustermap

📊 Distribution Plots

Visualize the distribution of a single variable or compare distributions across groups.

Histogram vs KDE:
• Histogram: Discrete bins, shows raw counts
• KDE: Smooth curve, estimates probability density
• Use both together: sns.histplot(data, kde=True)
💡 ECDF (Empirical Cumulative Distribution Function) avoids binning issues entirely.
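
An ECDF needs no bin width at all: sort the sample and step up by 1/n at each observation. The sketch below computes it by hand with NumPy on a synthetic exponential sample and contrasts it with two histograms of the same data.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
sample = rng.exponential(scale=2.0, size=200)  # synthetic data

# ECDF: fraction of observations less than or equal to each sorted value.
xs = np.sort(sample)
ys = np.arange(1, xs.size + 1) / xs.size

fig, (ax_hist, ax_ecdf) = plt.subplots(1, 2, figsize=(9, 3.5))

# The histogram's apparent shape depends on the bin count chosen.
ax_hist.hist(sample, bins=10, alpha=0.5, density=True, label="10 bins")
ax_hist.hist(sample, bins=40, alpha=0.5, density=True, label="40 bins")
ax_hist.legend()
ax_hist.set_title("Histogram")

# The ECDF has no tuning parameter.
ax_ecdf.step(xs, ys, where="post")
ax_ecdf.set_ylabel("Proportion <= x")
ax_ecdf.set_title("ECDF")

fig.tight_layout()
```

`sns.ecdfplot` draws the same curve in one call; the manual version is shown only to make the definition concrete.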

🔗 Relationship Plots

Explore relationships between two or more continuous variables.

Key Functions:
sns.scatterplot(data=df, x='x', y='y', hue='category', size='magnitude')
sns.regplot(data=df, x='x', y='y', scatter_kws={'alpha':0.5})
sns.pairplot(df, hue='species', diag_kind='kde')

📦 Categorical Plots

Visualize distributions and comparisons across categorical groups.

When to Use:
• Strip/Swarm: Show all data points (small datasets)
• Box: Summary statistics (median, quartiles, outliers)
• Violin: Full distribution shape + summary
• Bar: Mean/count with error bars

🔥 Heatmaps & Correlation Matrices

Visualize matrices of values using color intensity. Essential for EDA correlation analysis.

Best Practices:
• Annotate cells with values when the matrix is small enough to read: annot=True
• Use diverging colormap for correlation: cmap='coolwarm', center=0
• Mask upper/lower triangle: mask=np.triu(np.ones_like(corr, dtype=bool))
• Square cells: square=True
💡 Clustermap automatically clusters similar rows/columns together.
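
Put together, these practices give a compact correlation heatmap. The sketch below builds a synthetic DataFrame with known relationships (columns `a` to `d` are invented for the example) and masks the redundant upper triangle.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

rng = np.random.default_rng(3)
base = rng.normal(size=300)
df = pd.DataFrame({
    "a": base,
    "b": base + rng.normal(scale=0.5, size=300),   # positively related to a
    "c": -base + rng.normal(scale=1.0, size=300),  # negatively related to a
    "d": rng.normal(size=300),                     # unrelated noise
})

corr = df.corr()

# True cells are hidden: the upper triangle (and diagonal) repeats the lower one.
mask = np.triu(np.ones_like(corr, dtype=bool))

fig, ax = plt.subplots(figsize=(5, 4))
sns.heatmap(
    corr, mask=mask, annot=True, fmt=".2f",
    cmap="coolwarm", center=0, vmin=-1, vmax=1,
    square=True, ax=ax,
)
```

Fixing `vmin=-1, vmax=1` keeps colors comparable between heatmaps drawn from different datasets.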

🚀 Plotly Express: Interactive Visualization

Plotly creates interactive, web-based visualizations with zoom, pan, hover tooltips, and more.

Why Plotly?
  • Interactive out of the box (zoom, pan, select)
  • Hover tooltips with data details
  • Export as HTML or static images such as PNG (static export needs the kaleido package), or embed in dashboards
  • Works in Jupyter, Streamlit, Dash
  • plotly.express is the high-level API (like Seaborn for Matplotlib)
Common Functions:
px.scatter(df, x='x', y='y', color='category', size='value', hover_data=['name'])
px.line(df, x='date', y='price', color='stock')
px.bar(df, x='category', y='count', color='group', barmode='group')
px.histogram(df, x='value', nbins=50, marginal='box')

🎬 Animated Visualizations

Add time dimension to your visualizations with animations.

Plotly Animation:
px.scatter(df, x='gdp', y='life_exp', animation_frame='year', animation_group='country', size='pop', color='continent')

Matplotlib Animation:
from matplotlib.animation import FuncAnimation
ani = FuncAnimation(fig, update_func, frames=100, interval=50)
✅ Hans Rosling's Gapminder is the classic example of animated scatter plots!
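
The `FuncAnimation` call above needs a figure and an update function to be complete. A minimal sketch that animates a travelling sine wave follows; the wave itself is an arbitrary choice for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.animation import FuncAnimation

x = np.linspace(0, 2 * np.pi, 200)

fig, ax = plt.subplots()
(line,) = ax.plot(x, np.sin(x))
ax.set_ylim(-1.1, 1.1)


def update_func(frame):
    """Shift the wave's phase for this frame and return the changed artists."""
    line.set_ydata(np.sin(x + 2 * np.pi * frame / 100))
    return (line,)


# 100 frames at 50 ms each gives one full cycle every 5 seconds.
# Keep a reference to the animation, or it may be garbage-collected.
ani = FuncAnimation(fig, update_func, frames=100, interval=50)
```

Call `plt.show()` to play it in a window. Saving with `ani.save(...)` additionally requires a writer such as Pillow or ffmpeg to be installed.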

📱 Interactive Dashboards with Streamlit

Build interactive web apps for data exploration without web development experience.

Streamlit Basics:
streamlit run app.py

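import plotly.express as px
import streamlit as st

st.title("My Dashboard")
value = st.slider("Select value", 0, 100, 50)  # returns the current slider value
choice = st.selectbox("Choose", ["A", "B", "C"])  # returns the selected option
fig = px.line(x=[0, 1], y=[0, value], title=f"Option {choice}")  # placeholder figure for illustration
st.plotly_chart(fig)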
💡 Streamlit reruns the whole script whenever an input changes, so no callbacks are needed!

🗺️ Geospatial Visualization

Visualize geographic data with maps, choropleth, and point plots.

Libraries:
• Plotly: px.choropleth(df, locations='country', color='value') (by default, the locations column must hold ISO-3 country codes)
• Folium: Interactive Leaflet maps
• Geopandas + Matplotlib: Static maps with shapefiles
• Kepler.gl: Large-scale geospatial visualization

🎲 3D Visualization

Visualize three-dimensional relationships with surface plots, scatter plots, and more.

⚠️ 3D plots can obscure data. Often, multiple 2D views are more effective.
✅ Use Plotly for interactive 3D (rotate, zoom) instead of static Matplotlib 3D.

📖 Data Storytelling

Transform visualizations into compelling narratives that drive action.

The Data Storytelling Framework:
  1. Context: Why does this matter? Who is the audience?
  2. Data: What insights did you discover?
  3. Narrative: What's the storyline (beginning, middle, end)?
  4. Visual: Which chart best supports the story?
  5. Call to Action: What should the audience do?

Design Principles

Remove Clutter: Eliminate chartjunk, gridlines, borders
Focus Attention: Use color strategically (grey + accent)
Think Like a Designer: Alignment, white space, hierarchy
Tell a Story: Title = conclusion, not description
Bad: "Sales by Region"
Good: "West Region Sales Dropped 23% in Q4"
💡 "If you can't explain it simply, you don't understand it well enough." - commonly attributed to Einstein
✅ Read "Storytelling with Data" by Cole Nussbaumer Knaflic
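
Several of the design principles fit in a few lines of Matplotlib. The sketch below uses made-up regional figures: everything is grey except the one bar the story is about, the title states the conclusion, and non-essential borders are removed.

```python
import matplotlib.pyplot as plt

# Hypothetical quarter-over-quarter sales change per region, in percent.
regions = ["North", "South", "East", "West"]
change_pct = [4, 2, -1, -23]

# Focus attention: grey for context, one accent color for the key bar.
colors = ["tab:red" if region == "West" else "lightgrey" for region in regions]

fig, ax = plt.subplots(figsize=(6, 3.5))
ax.barh(regions, change_pct, color=colors)
ax.axvline(0, color="black", linewidth=0.8)

# Tell a story: the title is the conclusion, not a description.
ax.set_title("West Region Sales Dropped 23% in Q4", loc="left")
ax.set_xlabel("Change in sales (%)")

# Remove clutter: drop the top and right borders.
for side in ("top", "right"):
    ax.spines[side].set_visible(False)

fig.tight_layout()
```

The same data under the title "Sales by Region" with four equally bright bars would leave the reader to find the point on their own.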