Feature Engineering Techniques: A Beginner’s Guide to Building Better ML Features
Feature engineering is the process of transforming raw data into meaningful inputs (features) that improve machine learning (ML) model performance. By converting raw data, such as logs or text, into useful representations like table columns or embedding vectors, you improve how effectively models learn patterns. This guide is written for beginners who want to understand feature engineering and its impact on model success.
In this guide, you’ll explore:
- The importance of feature engineering for ML success.
- Different feature types and handling methods.
- Beginner-friendly techniques, including imputation, encoding, scaling, extraction, and selection.
- A structured workflow using tools such as pandas and scikit-learn.
- Two detailed walkthroughs: predicting house prices and classifying customer churn.
Along the way, you’ll find short code snippets, a comparison table for categorical encodings, and links to authoritative resources for further learning and experimentation.
Why Feature Engineering Matters
- Models derive their learning from patterns embedded in features. Transforming raw signals into informative features that reveal predictive structures often leads to better performance than merely switching to a more complex algorithm.
- For example, a timestamp column may offer limited value compared to derived features like hour-of-day or whether a date falls on a holiday. These modifications can uncover patterns typically hidden from the model.
- Feature engineering aids interpretability; domain-specific features (like customer tenure or purchase frequency) facilitate stakeholder comprehension of model outputs.
- Moreover, engineered features clarify deployment needs, defining exactly what calculations are necessary in production.
- However, deep learning models trained on large volumes of raw data (images, text, audio) can learn features automatically. Even so, careful preprocessing and good input representations (such as augmentations and embeddings) remain crucial.
Feature Types and Their Handling
Different feature types necessitate distinct treatments before they can be input to models.
Numerical Features
- Continuous vs discrete: Continuous values (age, salary) usually benefit from scaling, while discrete counts (number of visits) may call for different handling, such as binning or log transforms.
- Common issues include different scales (meters vs dollars), outliers, and missing values.
Categorical Features
- Nominal (unordered) vs ordinal (ordered): Apply ordinal encoding when order matters (e.g., education level) and nominal encoding otherwise.
- Consider cardinality: low-cardinality (few unique values) vs high-cardinality (many unique values like user IDs); high-cardinality features may require techniques like target encoding or hashing.
Datetime Features
- Decompose timestamps into year, month, day, hour, weekday, and seasonal indicators.
- For cyclical features, use sine/cosine transformations:
import numpy as np
# Map the hour onto a circle so 23:00 and 00:00 end up close together
hour = df['timestamp'].dt.hour
df['hour_sin'] = np.sin(2 * np.pi * hour / 24)
df['hour_cos'] = np.cos(2 * np.pi * hour / 24)
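Beyond cyclical encodings, plain calendar parts (year, month, weekday) often carry most of the signal. A minimal pandas sketch, assuming df['timestamp'] is already a datetime column:
df['year'] = df['timestamp'].dt.year
df['month'] = df['timestamp'].dt.month
df['weekday'] = df['timestamp'].dt.weekday
df['is_weekend'] = (df['timestamp'].dt.weekday >= 5).astype(int)  # Saturday/Sunday flag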
Text Features
- Utilize bag-of-words, TF-IDF, or n-grams for basic techniques. Preprocess with tokenization, lowercasing, and possibly stopword removal or lemmatization.
- For more advanced techniques, consider pretrained embeddings like word2vec or BERT. Beginners interested in embeddings should check this guide on working with language models.
Image and Sensor Features
- Differentiate between pixel-level features (raw images) and learned features (CNN embeddings). Refer to this resource on camera and sensor technology for insight into factors that influence feature quality, such as noise and dynamic range.
- Employ transfer learning (using pretrained CNN backbones) to extract embeddings rather than manually crafting pixel features when feasible.
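As an illustration of the transfer-learning route, the sketch below extracts image embeddings with a pretrained ResNet-18 backbone. It assumes PyTorch and torchvision are installed and that batch is a hypothetical, already-preprocessed (N, 3, 224, 224) image tensor; treat it as a minimal sketch, not a production feature extractor:
import torch
from torchvision import models

weights = models.ResNet18_Weights.DEFAULT
backbone = models.resnet18(weights=weights)
backbone.eval()
# Drop the final classification layer to keep the 512-dimensional pooled features
feature_extractor = torch.nn.Sequential(*list(backbone.children())[:-1])
with torch.no_grad():
    embeddings = feature_extractor(batch).squeeze()  # shape: (N, 512)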
Interaction and Aggregated Features
- Feature crossings (for example, city × device type) and group aggregates (e.g., customer mean purchase value) often reveal hidden patterns absent in raw data.
- Aggregations are vital for transactional and time-series data, converting event-level information into entity-level features such as per-customer or per-item summaries.
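A minimal pandas sketch of both ideas, assuming hypothetical city, device_type, customer_id, and purchase_amount columns:
df['city_x_device'] = df['city'] + '_' + df['device_type']  # feature cross
df['customer_mean_purchase'] = (
    df.groupby('customer_id')['purchase_amount'].transform('mean')  # group aggregate
)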
Common Feature Engineering Techniques
Below are practical techniques with tips and examples. See scikit-learn’s preprocessing documentation for further details: Scikit-learn Preprocessing Docs.
1) Missing Value Handling (Imputation)
- Basic: numeric → mean/median; categorical → mode or a new category labeled “missing”.
- Advanced: KNN imputation and IterativeImputer (model-based). It’s advisable to add a missing indicator column to identify rows with imputed values.
- Understand why data is missing (MCAR, MAR, MNAR) so you can choose an appropriate strategy and avoid encoding target information into the imputed values.
Example (scikit-learn pipeline):
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
# numeric_cols and categorical_cols are lists of column names from your DataFrame
num_imputer = SimpleImputer(strategy='median')
cat_imputer = SimpleImputer(strategy='constant', fill_value='missing')
preprocessor = ColumnTransformer([
    ('num', num_imputer, numeric_cols),
    ('cat', cat_imputer, categorical_cols)
])
2) Scaling and Normalization
- Use StandardScaler (zero mean, unit variance), MinMaxScaler (0 to 1), or RobustScaler (handles outliers).
- Scaling is important for distance-based models (KNN), regularized linear models, and gradient-based optimizers.
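A quick sketch with StandardScaler, assuming X_train/X_test DataFrames and a numeric_cols list; the scaler is fit on training data only so test-set statistics never leak in:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train[numeric_cols])  # learn mean/std from training data
X_test_scaled = scaler.transform(X_test[numeric_cols])        # reuse the training statistics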
3) Encoding Categorical Variables
- One-hot encoding suits low-cardinality nominal features.
- Use ordinal encoding for ordered categories and target (mean) encoding for high-cardinality categories, managing the leakage risk with cross-validation and smoothing.
- For extremely high-cardinality features, consider the hashing trick for memory efficiency.
| Encoding | Use When | Pros | Cons |
|---|---|---|---|
| One-hot | Low cardinality nominal | Simple, interpretable | Dimensionality explosion |
| Ordinal | Ordered categories | Preserves order | Assumes numeric spacing |
| Target Encoding | High-cardinality categories | Compact, predictive | Risk of target leakage |
| Hashing | Very high cardinality | Memory efficient | Collisions and less interpretable |
Example of one-hot with scikit-learn:
from sklearn.preprocessing import OneHotEncoder
onehot = OneHotEncoder(handle_unknown='ignore', sparse_output=False)  # use sparse=False on scikit-learn < 1.2
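For target encoding, scikit-learn 1.3+ ships a TargetEncoder that applies cross fitting internally to reduce leakage. A minimal sketch, assuming a target vector y and a high-cardinality 'neighbourhood' column:
from sklearn.preprocessing import TargetEncoder  # requires scikit-learn >= 1.3
te = TargetEncoder(smooth='auto')  # blends category means with the global mean
X_encoded = te.fit_transform(df[['neighbourhood']], y)  # cross-fitted during fit_transform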
4) Binning and Discretization
- Convert continuous variables into bins (equal-width or quantiles) to reduce sensitivity to outliers and enhance model robustness.
Example (pandas):
import pandas as pd
df['age_bin'] = pd.qcut(df['age'], q=4, labels=False)  # Quartile bins
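For equal-width bins, pd.cut works the same way:
df['age_bin_eq'] = pd.cut(df['age'], bins=5, labels=False)  # Five equal-width bins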
5) Polynomial and Interaction Features
- Use PolynomialFeatures or create pairwise interactions to allow linear models to capture non-linear relationships, watching for combinatorial explosions in dimensionality.
from sklearn.preprocessing import PolynomialFeatures
poly = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
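A short usage example, assuming hypothetical numeric columns living_area and num_rooms:
X_inter = poly.fit_transform(df[['living_area', 'num_rooms']])
print(poly.get_feature_names_out(['living_area', 'num_rooms']))
# ['living_area' 'num_rooms' 'living_area num_rooms']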
6) Feature Extraction and Dimensionality Reduction
- Consider PCA for numerical data, TruncatedSVD for sparse textual data, and autoencoders for non-linear compression.
- Methods like t-SNE and UMAP are excellent for visualization but typically aren’t directly used as inputs for supervised tasks.
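A minimal PCA sketch, assuming X_scaled is an already-scaled numeric matrix:
from sklearn.decomposition import PCA
pca = PCA(n_components=0.95)             # keep enough components for 95% of the variance
X_reduced = pca.fit_transform(X_scaled)  # scale features first, e.g., with StandardScaler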
7) Text Feature Engineering
- Basic techniques like bag-of-words and TF-IDF work well with classical ML models; word and character n-grams help with short texts.
- For semantic comprehension, use pretrained embeddings. Beginners can enhance their skills in feature engineering through Kaggle’s course on Feature Engineering.
Example: TF-IDF with scikit-learn:
from sklearn.feature_extraction.text import TfidfVectorizer
vec = TfidfVectorizer(ngram_range=(1,2), max_features=5000)
X_text = vec.fit_transform(df['text'])
8) Time-Series and Lag Features
- Create lag features (previous values), rolling aggregates, differences, and seasonal indicators, being cautious about data alignment to avoid leakage.
Simple lag creation with pandas:
df = df.sort_values(['customer_id', 'date'])
df['purchase_lag_1'] = df.groupby('customer_id')['purchase_amount'].shift(1)
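Rolling aggregates follow the same pattern; shifting before rolling keeps the current row out of its own window and avoids leakage:
df['purchase_rolling_mean_3'] = (
    df.groupby('customer_id')['purchase_amount']
      .transform(lambda s: s.shift(1).rolling(window=3, min_periods=1).mean())
)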
9) Automated Feature Engineering
- Tools like Featuretools leverage Deep Feature Synthesis to automate the generation of aggregates and transformation features. For an accessible introduction to DFS, check this overview on Deep Feature Synthesis.
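A minimal Deep Feature Synthesis sketch, assuming the Featuretools 1.x API and a transactions DataFrame with transaction_id, customer_id, amount, and date columns:
import featuretools as ft
es = ft.EntitySet(id='retail')
es = es.add_dataframe(dataframe_name='transactions', dataframe=transactions,
                      index='transaction_id', time_index='date')
# Derive a customers dataframe so DFS can build per-customer aggregates
es = es.normalize_dataframe(base_dataframe_name='transactions',
                            new_dataframe_name='customers', index='customer_id')
feature_matrix, feature_defs = ft.dfs(entityset=es, target_dataframe_name='customers', max_depth=2)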
10) Feature Selection Methods
- Explore filter methods (correlation thresholding, variance threshold), wrapper methods (Recursive Feature Elimination), and embedded methods (L1 regularization, tree-based importances).
- Conduct feature selection within cross-validation folds to prevent selection bias.
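A sketch of embedded selection run inside cross-validation, so the selection step is refit on each training fold (assumes a feature matrix X and target y):
from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectFromModel
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

select_pipe = Pipeline([
    ('select', SelectFromModel(RandomForestClassifier(n_estimators=200, random_state=0))),
    ('model', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(select_pipe, X, y, cv=5)  # selection happens inside each fold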
Practical Workflow & Tools
A structured feature engineering workflow promotes reproducibility and minimizes leakage.
Typical pipeline includes:
- Exploratory Data Analysis (EDA): Examine distributions, missing data, and correlations.
- Cleaning: Adjust data types, manage missing values, and eliminate duplicates.
- Transformation: Apply scaling, encoding, and create new features.
- Selection: Reduce feature dimensions and eliminate noise.
- Modeling & Validation: Implement cross-validation and fine-tune hyperparameters.
- Monitoring: Assess feature drift and model performance in production.
Utilize scikit-learn’s Pipelines and ColumnTransformer to encapsulate preprocessing and modeling, ensuring no training information is leaked into validation:
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
pipeline = Pipeline([
('pre', preprocessor),
('model', RandomForestRegressor())
])
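Evaluating the whole pipeline with cross-validation refits the preprocessing on each training fold, which is what keeps validation data out of the fitted transformers (X and y are your feature table and target):
from sklearn.model_selection import cross_val_score
scores = cross_val_score(pipeline, X, y, cv=5)  # preprocessing is refit per fold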
Key tools for beginners:
- pandas for data manipulation.
- scikit-learn for preprocessing and modeling (check the preprocessing guide).
- Featuretools for automated feature synthesis (see previous overview).
- TSFresh for automatic extraction of time-series features.
- spaCy and Hugging Face Transformers for text embeddings and NLP tasks (see the internal guide on embeddings).
For reproducibility and a solid working environment:
- Track datasets and transformations with DVC or MLflow, or at minimum keep your scripts under version control.
- If you’re setting experiments locally, consult the hardware guidelines here: Building Home Lab.
- For Windows users wishing to utilize Linux tools, consider installing WSL and configuring it for your development: WSL Configuration Guide.
Simple Walkthroughs (2 mini-cases)
These mini-cases provide practical decisions along with code snippets to aid understanding.
Case A: House Price Prediction (Regression)
Goal: Predict house sale prices using tabular data (both numerical and categorical).
Steps:
- EDA: Analyze missingness and skewness (e.g., sale price is frequently skewed).
- Imputation:
- Numeric: use median imputation.
- Categorical: introduce a ‘missing’ category or use the mode based on semantics.
- Transformations:
- Log-transform the target variable if skewed: y = np.log1p(y).
- Apply log transformations to skewed numeric features (e.g., living area).
- Encoding:
- One-hot encode low-cardinality features such as ‘roof_style’.
- Utilize target encoding on high-cardinality features like ‘neighbourhood’, using K-fold smoothing.
- Interaction Features:
- For example, compute living_area * num_rooms or age_of_house = year_sold - year_built.
- Selection & Model:
- Train tree-based models (like RandomForest/GradientBoosting) and inspect feature importances.
- Prune features that show little importance.
Pseudocode pipeline:
# Simple sketch
preprocessor = ColumnTransformer([...])
pipeline = Pipeline([('pre', preprocessor), ('model', GradientBoostingRegressor())])
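A fuller, runnable sketch of the same idea, assuming numeric_cols/categorical_cols lists and a feature table X with target y; TransformedTargetRegressor handles the log transform of the target and inverts it at prediction time:
import numpy as np
from sklearn.compose import ColumnTransformer, TransformedTargetRegressor
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='median'), numeric_cols),
    ('cat', Pipeline([
        ('impute', SimpleImputer(strategy='constant', fill_value='missing')),
        ('onehot', OneHotEncoder(handle_unknown='ignore'))
    ]), categorical_cols)
])
model = TransformedTargetRegressor(
    regressor=Pipeline([('pre', preprocessor), ('model', GradientBoostingRegressor())]),
    func=np.log1p, inverse_func=np.expm1  # train on log-price, predict in the original units
)
model.fit(X, y)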
Quick wins in house price predictions can include log transformations, median imputations, and selecting a few interaction terms.
Case B: Customer Churn (Classification)
Goal: Predict whether customers will churn using transactional logs and account metadata.
Steps:
- Aggregate transactional data into customer-level features: compute monthly spend, recency, frequency, and average order value.
agg = transactions.groupby('customer_id').agg({
    'amount': ['sum', 'mean'],
    'date': ['max', 'min', 'count']
})  # Basis for recency, tenure, and frequency (see the sketch after this list)
- Define time-based features: tenure (days since sign-up), days since last purchase, and seasonal indicators (by month).
- Encoding: apply target encoding to plan types, leveraging cross-validation folds to mitigate leakage.
- Address class imbalances using stratified cross-validation and focus on metrics like AUC or F1.
- Feature selection: adopt tree-based importances or L1-regularized models for noise reduction.
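A minimal sketch of turning the aggregate table above into recency, tenure, and frequency features, assuming a hypothetical snapshot_date as the reference point:
import pandas as pd
snapshot_date = pd.Timestamp('2024-01-01')  # hypothetical cutoff for the training window
agg.columns = ['total_spend', 'avg_order_value', 'last_purchase', 'first_purchase', 'n_orders']
agg['recency_days'] = (snapshot_date - agg['last_purchase']).dt.days
agg['tenure_days'] = (snapshot_date - agg['first_purchase']).dt.days
agg['orders_per_day'] = agg['n_orders'] / agg['tenure_days'].clip(lower=1)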
Quick wins for churn predictions include incorporating recency, frequency, and monetary (RFM) aggregates as well as days since the last purchase, which are generally highly predictive.
Pitfalls, Evaluation & Best Practices
- Avoid Data Leakage: Never compute statistics (e.g., the mean target per category) on the full dataset and then apply them during training; fit such statistics on training folds only, and use cross-validation schemes for target encoding and feature selection.
- Overfitting: Engineered features might capture peculiarities of the training dataset. Regularly employ cross-validation and regularization; keep features simple where possible.
- Feature Stability: Verify if features maintain predictive power over time and monitor distributions in production to detect drift.
- Interpretability: Favor explainable features to facilitate troubleshooting and clear communication with stakeholders.
- Documentation and Reproducibility: Record transformation steps thoroughly and implement pipelines, as small differences in feature calculations can significantly affect model performance.
Checklist Before Deploying Features
- Are all transformations encapsulated in a pipeline?
- Are imputation and encoding processes reproducible in production?
- Is there any leakage from the target or time?
- Have cross-validation steps for selection and tuning been executed?
- Are feature distributions and model metrics monitored post-deployment?
Resources & Next Steps
- Explore the Scikit-learn preprocessing guide for implementation insights and usage of Pipelines and ColumnTransformer.
- Review the Deep Feature Synthesis / Featuretools overview for automated feature engineering.
- Engage with Kaggle Learn’s Feature Engineering course for hands-on practice.
- For further resources, examine our internal guide on working with language models.
- Setting up a local development environment? See our guides on building a home lab and installing WSL.
Glossary (Quick)
- Imputation: Filling in missing values.
- Target Leakage: Occurs when training features include information not available at prediction time, causing overly optimistic performance metrics.
- Cardinality: The number of distinct values within a categorical feature.
- CV Smoothing: Blending per-category statistics with the overall statistic (weighted by category counts) to reduce variance, computed within cross-validation folds to limit target leakage.
Calls to Action
- Primary: Test these techniques on a small dataset from Kaggle (such as the House Prices dataset), implementing a scikit-learn pipeline that includes imputation, encoding, and feature selection.
- Secondary: Share your results or any questions in the comments of this article and link to your notebook for community feedback.