Exploratory Data Analysis (EDA) Techniques: A Beginner’s Guide to Tools, Steps, and Best Practices

Exploratory Data Analysis, or EDA, is a crucial process in data science that involves inspecting a dataset to uncover patterns, anomalies, and relationships. This article is designed for beginners eager to understand EDA and provides practical techniques for analyzing data using popular tools like Python’s pandas and seaborn. You’ll learn the importance of EDA in the data science workflow and gain insights on cleaning, visualizing, and summarizing data effectively. By the end of this guide, you will be equipped to perform EDA confidently and avoid common pitfalls.

Why EDA Matters — Benefits and Goals

EDA serves important goals that influence both technical outcomes and business decisions. Here are typical objectives:

  • Error Detection: Identify inconsistencies such as missing values and duplicate entries.
  • Distribution Analysis: Understand the distribution of variables and outliers that may affect model selection.
  • Hypothesis Generation: Formulate domain-related questions and segmentations.
  • Model Selection: Determine suitable modeling approaches and necessary feature transformations.

The key benefits of EDA include:

  • Cleaner Data: Reducing unexpected surprises during modeling, like data leakage.
  • Enhanced Model Performance: Achieving better model results through informed feature selection.
  • Alignment with Business Goals: Ensuring modeling objectives are in sync with stakeholder needs.

EDA is an iterative process; each insight often leads to new questions. Maintain documentation to ensure reproducibility in later stages like feature engineering and modeling.

Tools and Environment Setup — What Beginners Need

For beginners, a recommended toolkit includes:

  • Programming Language: Python
  • Data Wrangling Library: pandas
  • Visualization Libraries: seaborn and matplotlib
  • Interactive Environments: Jupyter Notebook or VS Code (with Python extension)

You can install the necessary packages using pip:

pip install pandas matplotlib seaborn jupyterlab

Alternatives:

  • R with tidyverse: Ideal for statisticians, it includes dplyr and ggplot2.
  • Spreadsheets: Excel or Google Sheets can efficiently manage small datasets.

Environment Tips:

  • Jupyter notebooks work well for iterative exploration that blends code, visuals, and text.
  • For Windows users, consider using WSL (Windows Subsystem for Linux) for a Linux-like development experience.

Getting to Know Your Data — Initial Checks

Begin by loading your dataset and performing initial inspections to grasp its structure and data types.

Example initial checks with pandas:

import pandas as pd

# Load data
df = pd.read_csv('data.csv')

# Quick checks
df.head()
df.info()
df.describe()

# Percentage of missing values per column
(df.isnull().sum() / len(df)) * 100

Pay attention to:

  • The number of rows and columns to verify if they match expectations.
  • Data types of columns (e.g., object, int64, float64, datetime64).
  • Unexpected constants or high cardinality in columns.
  • Significant percentages of missing values in key columns.

Sanity Checks:

  • Check for duplicates using df.duplicated().sum().
  • Inspect unique values, e.g., df['age'].unique().

Ensuring correct data types early on, especially for dates and categories, will streamline grouping and plotting phases.
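The cardinality and constant-column checks mentioned above can be sketched with `nunique()`; the column names here are hypothetical, chosen only to illustrate the pattern:

```python
import pandas as pd

# Toy frame: 'status' is a suspicious constant, 'id' has one value per row
df = pd.DataFrame({
    'id': [1, 2, 3, 4],
    'status': ['active', 'active', 'active', 'active'],
    'city': ['NY', 'LA', 'SF', 'NY'],
})

# Unique-value counts per column: 1 flags a constant,
# a count near len(df) flags high cardinality (e.g. an ID column)
cardinality = df.nunique().sort_values()
constant_cols = cardinality[cardinality == 1].index.tolist()
print(constant_cols)  # ['status']
```

Constant columns carry no information for modeling, and near-unique columns (IDs) can leak or overfit, so both are worth flagging early.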

Data Cleaning Essentials

Cleaning the data is often the most time-consuming aspect of EDA. Here’s a breakdown of essential steps and practical code snippets:

1) Missing Values Visualization

# Percentage of missing values
missing_pct = (df.isnull().sum() / len(df)) * 100
missing_pct.sort_values(ascending=False).head(10)

# Visualize missingness
missing_pct.plot.barh()

To delve deeper, you can use packages like missingno for advanced visualizations.

Strategies for handling missing values include:

  • Drop rows/columns with high missingness that cannot be recovered.
  • Impute with mean/median for numeric data with slight missingness.
  • Use the mode for categorical variables.
  • Apply domain-specific imputation when applicable.
  • Create flags to denote originally missing values for modeling.
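A minimal sketch combining several of these strategies on a toy frame (column names hypothetical): flag the originally missing rows first, then impute with median and mode.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'age': [22.0, np.nan, 35.0, 29.0],
    'embarked': ['S', 'C', None, 'S'],
})

# Flag rows that were originally missing before imputing
df['age_was_missing'] = df['age'].isna()

# Median imputation for the (possibly skewed) numeric column
df['age'] = df['age'].fillna(df['age'].median())

# Mode imputation for the categorical column
df['embarked'] = df['embarked'].fillna(df['embarked'].mode()[0])
```

The flag column preserves the missingness signal for later modeling even after the gaps are filled.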

2) Identifying and Removing Duplicates

# Identify duplicates on all columns
dups = df[df.duplicated()]

# Remove duplicates in place
df.drop_duplicates(inplace=True)

Be cautious, as some duplicates may be legitimate (e.g., repeated transactions).
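To distinguish the two cases, it helps to check duplicates on a subset of columns as well as on full rows; this sketch uses hypothetical transaction columns:

```python
import pandas as pd

df = pd.DataFrame({
    'customer_id': [1, 1, 2],
    'amount': [9.99, 9.99, 5.00],
    'logged_at': ['2024-01-01', '2024-01-02', '2024-01-01'],
})

# Full-row duplicates: none here, because logged_at differs
print(df.duplicated().sum())  # 0

# Duplicates on a subset of columns: the repeated transaction shows up
print(df.duplicated(subset=['customer_id', 'amount']).sum())  # 1
```

A repeated (customer, amount) pair with different timestamps is likely a legitimate repeat purchase, not a data error, which is why subset checks should inform rather than trigger deletion.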

3) Correcting Data Types and Parsing Dates

# Convert column to datetime
df['date'] = pd.to_datetime(df['date'], errors='coerce')

# Convert to categorical
df['region'] = df['region'].astype('category')

4) Dealing with Outliers

  • Use boxplots and the IQR rule to flag outliers: values below Q1 − 1.5×IQR or above Q3 + 1.5×IQR.
  • Differentiate between valid extreme observations and data entry errors.
  • Avoid blindly removing outliers; document your rationale and consider alternatives like winsorizing.

import seaborn as sns
sns.boxplot(x=df['salary'])
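The IQR rule itself can be computed directly; this sketch uses a hypothetical salary series with one obvious extreme value:

```python
import pandas as pd

salary = pd.Series([40_000, 45_000, 50_000, 52_000, 55_000, 250_000])

q1, q3 = salary.quantile(0.25), salary.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Flag (don't silently drop) values outside the fences
outliers = salary[(salary < lower) | (salary > upper)]
print(outliers.tolist())  # [250000]
```

Flagging rather than dropping keeps the decision explicit and documentable, in line with the caution above.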

Univariate Analysis — Understanding Single Variables

Univariate EDA focuses on analyzing each variable independently. Key summary statistics include:

  • Central Tendency: Mean, median, mode
  • Spread: Variance, standard deviation, IQR
  • Shape: Skewness and kurtosis for distribution insights

When dealing with skewed distributions, it’s advisable to use the median rather than the mean.
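These statistics are all one-liners in pandas; the data here is a made-up right-skewed series to show why the median is more robust:

```python
import pandas as pd

income = pd.Series([30, 32, 35, 38, 40, 45, 200])  # right-skewed

print(income.mean())    # 60.0 — pulled up by the extreme value
print(income.median())  # 38.0 — robust to the tail
print(income.skew())    # positive => right-skewed
print(income.kurt())    # heavier tails than a normal distribution
```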

Visualizations:

  • Histograms: For continuous variables (use sns.histplot).
  • Boxplots: To visualize spread and outliers (use sns.boxplot).
  • Bar Charts: For categorical variables (use value_counts()).

import seaborn as sns
sns.histplot(df['age'].dropna(), kde=True)

# Categorical counts
df['embarked'].value_counts().plot.bar()

Key Transformations:

  • Employ log or square-root transformations to stabilize variance.

import numpy as np
# log1p computes log(1 + x), so it also handles zeros; use on non-negative columns
df['log_income'] = np.log1p(df['income'])

Interpretation of Skewness and Kurtosis:

Skewness indicates symmetry, while kurtosis assesses the weight of the tails compared to a normal distribution.

Bivariate Analysis — Relationships Between Variables

Bivariate analysis examines relationships between pairs of variables. For example:

  • Numerical vs Numerical: Analyze correlation and visualize with scatter plots.

# Calculate correlation (Pearson by default)
df.corr(numeric_only=True)

# Scatter plot
sns.scatterplot(x='age', y='income', data=df)

  • Categorical vs Numerical: Use grouped boxplots.

sns.boxplot(x='pclass', y='age', data=df)

  • Categorical vs Categorical: Generate crosstabs to examine joint distributions.

pd.crosstab(df['sex'], df['survived'], normalize='index')

Be mindful that correlation does not imply causation; use insights to guide modeling.

Multivariate Analysis & Dimensionality Basics

With multiple variables, pairwise checks can become impractical. Consider these tools:

  • Pairplot (sns.pairplot) for small feature sets (3-6 numeric features).
  • Correlation Heatmap for a quick overview of relationships.

sns.pairplot(df, hue='species',
             vars=['sepal_length', 'sepal_width', 'petal_length', 'petal_width'])

sns.heatmap(df.corr(numeric_only=True), annot=True, fmt='.2f', cmap='coolwarm')

Caution with High-Dimensional Data:

As dimensionality increases, spurious correlations become more likely. Always validate significant discoveries on holdout data.

Dimensionality Reduction:

PCA (Principal Component Analysis) helps summarize variance into a few orthogonal components, useful for visualization or reducing dimensionality before clustering.
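A minimal PCA sketch using scikit-learn (assumed installed; the random data stands in for four numeric features). Standardizing first matters because PCA is scale-sensitive:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))  # stand-in for 4 numeric features

# Standardize first: PCA directions are dominated by large-scale features otherwise
X_scaled = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)
components = pca.fit_transform(X_scaled)

print(components.shape)               # (100, 2)
print(pca.explained_variance_ratio_)  # share of total variance per component
```

The explained-variance ratios tell you how much structure the 2-D projection preserves before you trust it for visualization or clustering.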

Feature Engineering Basics

Feature engineering connects EDA to modeling. Here are a few simple ideas:

  • Date Features: Extract year, month, day of the week, or weekend flags.

df['month'] = df['date'].dt.month

  • Text Analysis: Measure length, token counts, or presence indicators.
  • Combine Columns: For instance, compute family_size as siblings + parents + 1 (as in the Titanic example).
  • Binning: Group continuous variables into categories for better capture of non-linear effects.
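Binning can be done with pd.cut; the bin edges and labels here are hypothetical:

```python
import pandas as pd

df = pd.DataFrame({'age': [5, 17, 25, 40, 67]})

# Group a continuous variable into labeled categories
# (intervals are right-inclusive by default: (0, 12], (12, 18], ...)
df['age_group'] = pd.cut(
    df['age'],
    bins=[0, 12, 18, 65, 120],
    labels=['child', 'teen', 'adult', 'senior'],
)
print(df['age_group'].tolist())  # ['child', 'teen', 'adult', 'adult', 'senior']
```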

Encoding Categorical Variables:

  • Use one-hot encoding (e.g., pd.get_dummies()) for nominal categories.
  • Apply ordinal encoding for ordered categories. For high-cardinality columns, consider target encoding or hashing.
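Both encodings can be sketched in a few lines (column names and category ordering hypothetical):

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['north', 'south', 'east'],   # nominal: no natural order
    'size': ['small', 'large', 'medium'],   # ordinal: small < medium < large
})

# One-hot encode the nominal column
dummies = pd.get_dummies(df['region'], prefix='region')

# Ordinal-encode the ordered column with an explicit mapping
size_order = ['small', 'medium', 'large']
df['size_code'] = df['size'].map({s: i for i, s in enumerate(size_order)})

print(dummies.columns.tolist())  # ['region_east', 'region_north', 'region_south']
print(df['size_code'].tolist())  # [0, 2, 1]
```

The explicit mapping keeps the order under your control, rather than relying on alphabetical codes that would scramble it.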

Scaling Numeric Features:

  • Implement standardization (z-score) or MinMax scaling, especially for models that require feature scale normalization, such as KNN or neural networks.
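Both scalings are simple arithmetic on a Series; this sketch uses a hypothetical fare column:

```python
import pandas as pd

fare = pd.Series([10.0, 20.0, 30.0, 40.0])

# Standardization (z-score): mean 0, unit standard deviation
fare_std = (fare - fare.mean()) / fare.std()

# Min-max scaling: squeeze values into [0, 1]
fare_minmax = (fare - fare.min()) / (fare.max() - fare.min())
```

For production pipelines, scikit-learn's StandardScaler and MinMaxScaler do the same thing while remembering the fitted parameters for new data.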

Visualization Best Practices

For effective communication, choose the right plot based on your investigative questions:

  • Distributions: Use histograms or KDE plots.
  • Spread and Outliers: Boxplots are ideal.
  • Relationships: Opt for scatter plots or line plots for time series.
  • Correlation: Utilize heatmaps.

Design Considerations:

  • Clearly label axes and titles; avoid misleading scales.
  • Ensure high contrast for readability and accessibility in reports.

sns.set_palette('colorblind')
ax = sns.histplot(df['age'].dropna())  # histplot returns a matplotlib Axes
ax.set_title('Age Distribution')

Practical EDA Workflow — Step-by-Step Checklist

A streamlined workflow to follow for new datasets:

  1. Load data and inspect its shape and initial rows (df.head()).
  2. Check data types (df.info()) and convert as necessary.
  3. Assess missing values and their distributions.
  4. Identify and flag duplicates.
  5. Produce basic summary statistics for numeric and categorical variables.
  6. Visualize univariate distributions.
  7. Perform bivariate analysis.
  8. Investigate outliers and decide how to address them.
  9. Try transformations and create simple new features.
  10. Document findings in Markdown cells, capturing hypotheses and next steps.
  11. Save plots and versioned notebooks (consider using containers for reproducibility).

Tip:

Maintain a dedicated Findings section for summarizing top issues, implications for modeling, and recommended features.
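The first few checklist steps can be wrapped into a small reusable helper; the function name and toy data here are hypothetical:

```python
import pandas as pd

def quick_overview(df: pd.DataFrame) -> dict:
    """Summarize shape, missingness, and duplicates (checklist steps 1-4)."""
    return {
        'shape': df.shape,
        'missing_pct': (df.isna().mean() * 100).round(1).to_dict(),
        'duplicate_rows': int(df.duplicated().sum()),
    }

# Example usage on a toy frame with one missing value and one duplicate row
summary = quick_overview(pd.DataFrame({'a': [1, 2, 2], 'b': [None, 'x', 'x']}))
print(summary)
```

Returning a dict rather than printing makes the summary easy to log or compare across dataset versions.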

Hands-on Example Walkthrough (Titanic)

Let’s apply the EDA process using the Titanic dataset:

  1. Loading and Initial Checks:

import pandas as pd
import seaborn as sns

df = pd.read_csv('titanic.csv')
df.head()
df.info()
(df.isnull().sum() / len(df)) * 100

  2. Univariate Checks: Examine Age (histogram and boxplot) and Fare (consider a log transform). Check embarked values for missingness.

  3. Bivariate Checks: Analyze survival rates by sex using crosstabs or bar plots, and compare age by survival with boxplots.

sns.barplot(x='sex', y='survived', data=df)
sns.boxplot(x='survived', y='age', data=df)

  4. Feature Engineering: Create a family_size column.

df['family_size'] = df['sibsp'] + df['parch'] + 1

  5. Correlation Heatmap for numeric features to identify relationships.

sns.heatmap(df[['survived','age','fare','family_size']].corr(), annot=True)

Observations:

  • Important variables like sex and passenger class influence survival significantly and should be encoded accurately.
  • Fare may need a log transformation for normality assumptions in modeling.
  • Missing ages can be imputed based on passenger class or flagged.
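The class-based age imputation suggested above can be sketched with a grouped transform (the toy frame below assumes Titanic-style pclass and age columns):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'pclass': [1, 1, 3, 3],
    'age': [38.0, np.nan, 22.0, np.nan],
})

# Flag originally missing values before filling them
df['age_was_missing'] = df['age'].isna()

# Impute each missing age with the median age of that passenger class
df['age'] = df.groupby('pclass')['age'].transform(lambda s: s.fillna(s.median()))

print(df['age'].tolist())  # [38.0, 38.0, 22.0, 22.0]
```

Group-wise medians respect the fact that first-class passengers tended to be older than third-class ones, which a single global median would blur.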

The full walkthrough is also available as a runnable Jupyter notebook (search for “titanic_eda.ipynb”).

Common Pitfalls and How to Avoid Them

  • Causation from Correlation: Always clarify that correlations do not indicate causal relationships without controlled experiments.
  • Sampling Bias and Class Imbalance: Ensure you check for distribution in target classes and the methods used for sampling.
  • Blindly Removing Outliers: Carefully assess the validity of outliers before taking action — document reasoning.

Mitigation Strategies:

  • Keep clear documentation of assumptions and sample sources.
  • Validate discoveries using holdout data.
  • Communicate uncertainties and potential limitations to stakeholders.

Resources, Next Steps & Conclusion

Practical Projects and Datasets:

  • Explore datasets on Kaggle (Titanic, House Prices, Iris) to practice EDA and feature engineering.
  • Examine public COVID-19 case data for time-series analysis.

If you enjoyed this guide, consider trying the following:

  • Implement the EDA checklist on a new dataset and document your discoveries.
  • Attempt simple modeling after EDA, using lightweight toolkits such as SmollM2 and Hugging Face.
  • Create a presentation summarizing your EDA findings using effective visual storytelling techniques.

Quick EDA Checklist (Cheatsheet)

EDA Cheatsheet — Copy this into your notebook

  1. Load data, check df.head(), df.info(), and df.shape
  2. Check for missing values and duplicates
  3. Convert datatypes (dates, categoricals)
  4. Produce univariate plots (histograms, boxplots, bar charts)
  5. Conduct bivariate checks (scatter plots, grouped boxplots, crosstabs)
  6. Generate a correlation heatmap for numeric features
  7. Identify outliers and decide further action
  8. Apply transformations and generate new features
  9. Document findings and save your notebook

Happy exploring! If you run the Titanic notebook, feel free to share your notebook link in the comments.

TBO Editorial

About the Author

TBO Editorial writes about the latest products and services related to Technology, Business, Finance & Lifestyle. Get in touch if you would like to share a useful article with our community.