Audio Enhancement Algorithms: A Beginner’s Guide to Noise Suppression, Dereverberation, and ML Techniques
Audio enhancement improves the clarity and quality of recorded or live audio by reducing noise, removing reverberation, and enhancing speech intelligibility. This guide covers core concepts, classic DSP methods (spectral subtraction, Wiener filtering, adaptive filters), and modern machine learning (ML) approaches for noise suppression and dereverberation. It’s written for beginners, audio engineers, developers, podcasters, and anyone building real-time or batch speech-enhancement systems. Expect practical examples, libraries to try, evaluation metrics, deployment tips, and a FAQ/troubleshooting section.
Introduction — What is Audio Enhancement and Why It Matters
Audio enhancement includes techniques that increase perceived quality or intelligibility of audio. Common applications:
- Phone/VoIP calls (noise suppression, echo cancellation)
- Conference systems and streaming (beamforming, dereverberation)
- Podcasts and content creation (denoise, de-click, normalize)
- Assistive devices (hearing aids, cochlear implants)
High-level goals:
- Improve intelligibility (make speech easier to understand)
- Reduce unwanted noise and reverberation
- Preserve naturalness and fidelity (avoid robotic or distorted speech)
When to use DSP vs ML:
- Use classic DSP (spectral subtraction, Wiener filtering, adaptive filters) when you need low latency, low compute, or have limited training data (embedded devices, hearing aids).
- Use ML approaches when you have paired noisy/clean data and compute for higher-quality results with complex or non-stationary noise.
Immediate takeaway: choose methods based on latency, compute, and available data.
Basic Audio & Signal Processing Concepts for Beginners
Waveform vs. spectrogram
- Waveform: raw time-domain signal (amplitude vs. time).
- Spectrogram: time–frequency view (magnitude of STFT). Many enhancement methods operate in the time–frequency domain because speech and noise separate more easily there.
Sampling rate, frames, windowing, STFT
- Sampling rate (e.g., 16 kHz) sets the maximum representable frequency: by the Nyquist theorem it is half the sampling rate, so 16 kHz audio carries content up to 8 kHz.
- STFT slices audio into overlapping frames, windows (Hann, Hamming), and computes a complex spectrogram (magnitude + phase).
- Frame length trade-off: longer frames = better frequency resolution but more latency; shorter frames = lower latency but less resolution.
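A quick calculation makes this trade-off concrete (a minimal sketch; the frame sizes are illustrative, not recommendations):
# frame duration vs. frequency resolution at a 16 kHz sampling rate
sr = 16000
for n_fft in (256, 512, 1024):
    frame_ms = 1000 * n_fft / sr      # how much audio each frame spans
    bin_hz = sr / n_fft               # spacing between FFT bins
    print(f"n_fft={n_fft}: frame = {frame_ms:.0f} ms, bin width = {bin_hz:.1f} Hz")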
Common noise types
- Stationary noise: steady-state noises (hum, engine). Easier to remove.
- Non-stationary noise: intermittent/changing noise (keyboard clicks, music, babble). Harder to remove.
- Broadband vs. tonal: broadband covers many frequencies (white noise); tonal is narrowband (60 Hz hum).
Key terms
- SNR (Signal-to-Noise Ratio): ratio of speech power to noise power, usually expressed in dB.
- Artifacts: undesirable byproducts (e.g., musical noise from spectral subtraction).
- Latency: buffering and algorithmic delay — crucial for real-time apps.
Evaluation basics: SNR is simple, but perceptual metrics (PESQ, STOI) better reflect human hearing.
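As a concrete illustration of the SNR term above, a minimal sketch that computes it in dB from aligned clean and noise arrays (the signals here are synthetic, just for demonstration):
import numpy as np

def snr_db(clean, noise, eps=1e-12):
    # ratio of speech power to noise power, in decibels
    return 10 * np.log10((np.mean(clean ** 2) + eps) / (np.mean(noise ** 2) + eps))

rng = np.random.default_rng(0)
clean = np.sin(2 * np.pi * 220 * np.arange(16000) / 16000)   # 1 s, 220 Hz tone as a stand-in for speech
noise = 0.1 * rng.standard_normal(16000)                      # white noise
print(f"SNR = {snr_db(clean, noise):.1f} dB")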
Classic DSP Algorithms (Non-ML)
Classic approaches are deterministic, lightweight, and well-suited to constrained environments.
Spectral subtraction
- Idea: estimate noise magnitude (often during pauses) and subtract it from the noisy spectrum.
- Steps: voice activity detection (VAD) → build noise spectral estimate → subtract with flooring to avoid negative magnitudes.
- Pros: simple and cheap.
- Cons: musical noise artifacts; requires parameter tuning.
Pseudocode (spectral subtraction):
for each frame:
    X = STFT(frame)
    if VAD(frame) == noise:
        update_noise_estimate(|X|)
    Y_mag = max(|X| - alpha * noise_mag, floor)
    Y = Y_mag * exp(j * angle(X))   # reuse noisy phase
    out_frame = ISTFT(Y)
    overlap_add(out_frame)
Wiener filtering
- Treats enhancement as an MMSE estimation problem.
- Uses estimated speech-to-noise ratio per TF bin to compute gain: G = SNR / (SNR + 1).
- Produces smoother results than naive spectral subtraction and works well for stationary or slowly varying noises.
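To make the gain rule concrete, here is a minimal per-bin sketch, assuming you already have a noisy magnitude spectrogram mag and a noise power estimate noise_pow (variable names are illustrative, not from any particular library):
import numpy as np

def wiener_gain(mag, noise_pow, eps=1e-12):
    # a posteriori SNR per time-frequency bin, clamped at zero
    snr = np.maximum(mag ** 2 / (noise_pow + eps) - 1.0, 0.0)
    # Wiener gain G = SNR / (SNR + 1), bounded in [0, 1)
    return snr / (snr + 1.0)

# usage: enh_mag = wiener_gain(mag, noise_pow) * mag, then resynthesize with the noisy phase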
Adaptive filters and echo cancellation
- LMS/NLMS adaptive filters learn coefficients online to cancel known reference signals (e.g., echo path in hands-free devices).
- Echo cancellation uses a reference (speaker output) and adapts with double-talk detection.
- Lightweight and real-time friendly — standard in telephony.
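To show the adaptation loop itself, here is a minimal NLMS sketch; it is a toy single-channel canceller, not a full echo canceller with double-talk detection:
import numpy as np

def nlms(reference, noisy, num_taps=128, mu=0.1, eps=1e-8):
    # adaptively model the path from the reference to the noisy signal and subtract it
    w = np.zeros(num_taps)
    out = np.zeros_like(noisy)
    for n in range(num_taps, len(noisy)):
        x = reference[n - num_taps:n][::-1]        # most recent reference samples, newest first
        y = np.dot(w, x)                           # predicted echo/interference sample
        e = noisy[n] - y                           # error = enhanced output sample
        w += mu * e * x / (np.dot(x, x) + eps)     # normalized LMS update
        out[n] = e
    return out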
Median-like filters and gating
- Time–frequency median filters suppress bursts or clicks by replacing magnitudes with medians over time or frequency.
- Simple gating (thresholding) mutes low-energy bins — effective but may clip low-level speech.
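A minimal sketch combining a time-axis median filter with a crude energy gate on a magnitude spectrogram (mag is a freq x time numpy array; the kernel size and threshold are illustrative):
import numpy as np
from scipy.ndimage import median_filter

def suppress_bursts(mag, kernel_time=5, gate_db=-50.0):
    # median over the time axis knocks down short clicks and bursts
    smoothed = median_filter(mag, size=(1, kernel_time))
    # crude gate: zero bins far below the spectrogram's peak level
    threshold = np.max(smoothed) * 10 ** (gate_db / 20)
    return np.where(smoothed < threshold, 0.0, smoothed)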
Dereverberation basics
- Reverberation, especially late reverberation, is correlated with the speech signal across time, which makes it harder to remove than additive noise.
- Single-mic dereverberation: inverse filtering, spectral subtraction variants tuned to reverb, and statistical models (linear prediction on STFT magnitudes).
- Multi-mic approaches (beamforming, multichannel inverse filtering) exploit spatial diversity for better dereverberation.
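As a taste of the multi-mic idea, here is a minimal delay-and-sum beamformer sketch, assuming the integer-sample steering delays per channel are already known (estimating them is a separate problem):
import numpy as np

def delay_and_sum(channels, delays_samples):
    # channels: list of equal-length 1-D arrays, one per microphone
    # delays_samples: integer delay that time-aligns each channel toward the target
    aligned = [np.roll(ch, -d) for ch, d in zip(channels, delays_samples)]
    # averaging the aligned channels reinforces the target and attenuates diffuse reverb and noise
    return np.mean(aligned, axis=0)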
When to prefer classical methods: low-power devices, predictable stationary noise, and strict latency needs.
Modern Machine Learning Approaches
ML has driven major improvements in perceptual quality for non-stationary noise and complex acoustics.
Supervised learning setup
- Train on paired noisy/clean data to map noisy inputs to clean targets.
- Losses: MSE on waveform or spectrogram, SI-SNR, and perceptual losses.
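For the SI-SNR loss mentioned above, a minimal PyTorch sketch (negated so it can be minimized; est and target are 1-D waveform tensors of equal length):
import torch

def si_snr_loss(est, target, eps=1e-8):
    # remove the mean so the measure is invariant to offset and scale
    est = est - est.mean()
    target = target - target.mean()
    # project the estimate onto the target to isolate the "speech" component
    s_target = (torch.dot(est, target) / (torch.dot(target, target) + eps)) * target
    e_noise = est - s_target
    si_snr = 10 * torch.log10(s_target.pow(2).sum() / (e_noise.pow(2).sum() + eps))
    return -si_snr    # higher SI-SNR is better, so minimize the negative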
Architectures
- DNNs (fully connected): simple but potentially large.
- CNNs: capture local time–frequency patterns.
- RNNs/LSTMs/GRUs: model temporal dependencies (e.g., RNNoise uses a small RNN).
- Transformers: capture long-range dependencies with attention; used increasingly in audio.
Generative models (GANs)
- SEGAN introduced adversarial training for waveform enhancement to improve perceptual realism.
- GANs can reduce artifacts and increase naturalness but are harder to train and less stable.
Time-domain vs time–frequency methods
- Mask-based methods predict multiplicative masks applied to magnitude or complex spectrograms; they are simpler and more stable to train (see the sketch after this list).
- End-to-end time-domain models (Conv-TasNet, WaveNet-like) directly predict waveforms and can outperform TF-mask methods for some tasks.
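A minimal sketch of the mask-based idea in PyTorch: a tiny GRU predicts a per-bin gain in [0, 1] that multiplies the noisy magnitude spectrogram (the architecture and sizes are purely illustrative):
import torch
import torch.nn as nn

class TinyMaskNet(nn.Module):
    def __init__(self, n_bins=257, hidden=128):
        super().__init__()
        self.rnn = nn.GRU(n_bins, hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_bins)

    def forward(self, noisy_mag):              # noisy_mag: (batch, time, freq)
        h, _ = self.rnn(noisy_mag)
        mask = torch.sigmoid(self.out(h))       # per-bin gain in [0, 1]
        return mask * noisy_mag                 # enhanced magnitude

# train by comparing the output to the clean magnitude (MSE) or to the clean waveform after ISTFT (SI-SNR)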
Lightweight hybrid systems
- RNNoise (https://github.com/xiph/rnnoise) combines DSP heuristics with a small recurrent network. It demonstrates efficient real-time denoising with low compute — a good starting point.
Datasets and benchmarks
- Microsoft DNS Challenge provides datasets, baselines, and evaluation scripts — useful for benchmarking.
Practical Implementation: Tools, Libraries, and Starter Recipes
Key libraries and tools
- Audio I/O & transforms: librosa, torchaudio
- ML frameworks: PyTorch, TensorFlow
- Speech toolkits: SpeechBrain
- Lightweight/production: RNNoise, WebRTC audio processing
WebRTC audio modules provide production-grade noise suppression, echo cancellation, and gain control; reviewing WebRTC is instructive for real-time constraints and robust implementations.
Running lightweight models
- RNNoise: easy to compile and run; designed for real-time on modest hardware.
- WebRTC: contains NS (noise suppression), AEC (acoustic echo cancellation), and AGC (automatic gain control) used in many VoIP products.
Spectral subtraction example (expanded):
# high-level spectral subtraction with librosa + soundfile
import numpy as np
import librosa
import soundfile as sf

audio, sr = librosa.load('noisy.wav', sr=16000)
S = librosa.stft(audio, n_fft=512, hop_length=128, window='hann')
mag = np.abs(S)
phase = np.angle(S)
# estimate noise from the first few frames (or use a VAD to pick noise-only frames)
n_noise_frames = 10
noise_mag = np.mean(mag[:, :n_noise_frames], axis=1, keepdims=True)
alpha = 1.0    # over-subtraction factor
floor = 1e-6   # spectral floor to avoid negative magnitudes
enh_mag = np.maximum(mag - alpha * noise_mag, floor)
S_enh = enh_mag * np.exp(1j * phase)
enh_audio = librosa.istft(S_enh, hop_length=128, window='hann')
sf.write('enhanced.wav', enh_audio, sr)
Dataset handling and augmentation
- Mix clean speech with recorded noise at random SNRs.
- Convolve clean speech with room impulse responses (RIRs) to simulate reverberation.
- Add device/codec distortions if relevant.
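A minimal augmentation sketch covering the first two points: mix clean speech with noise at a requested SNR and convolve with an RIR (assumes 1-D numpy arrays; the function names are illustrative):
import numpy as np

def mix_at_snr(clean, noise, snr_db, eps=1e-12):
    # scale the noise so the mixture hits the requested SNR
    gain = np.sqrt(np.mean(clean ** 2) / ((np.mean(noise ** 2) + eps) * 10 ** (snr_db / 10)))
    return clean + gain * noise

def add_reverb(clean, rir):
    # convolve with a room impulse response and trim back to the original length
    return np.convolve(clean, rir)[:len(clean)]

# usage: noisy = mix_at_snr(add_reverb(clean, rir), noise, snr_db=np.random.uniform(0, 20))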
Practical pointers:
- Start with short files and low sample rates before scaling.
- Use SpeechBrain or torchaudio for pre-built models and utilities.
- For deploying small models, consult guides on pruning, quantization, and hybrid DSP+NN designs.
Evaluation: How to Measure Enhancement Quality
Objective metrics
- SNR / SDR: energy-based metrics — limited but useful.
- PESQ (Perceptual Evaluation of Speech Quality, ITU-T P.862): measures perceived speech quality.
- STOI (Short-Time Objective Intelligibility): correlates with intelligibility.
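If the pip-installable pesq and pystoi packages are available, scoring a file pair looks roughly like this (a hedged sketch; check each package's documentation for the exact options):
import soundfile as sf
from pesq import pesq
from pystoi import stoi

ref, sr = sf.read('clean.wav')         # clean reference
deg, _ = sf.read('enhanced.wav')       # enhanced signal, same length and sample rate
print('PESQ (wideband):', pesq(sr, ref, deg, 'wb'))   # PESQ expects 8 or 16 kHz audio
print('STOI:', stoi(ref, deg, sr, extended=False))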
Subjective testing
- ABX or short listening tests help catch artifacts objective metrics miss.
- Rate naturalness, intelligibility, and artifacts with a small panel.
Common failure modes
- Musical noise (spectral subtraction artifacts)
- Over-suppression that removes low-level speech
- Latency artifacts like clipped transients in real-time systems
Combine objective metrics (PESQ, STOI) with quick listening tests for reliable assessment.
Deployment Considerations & Best Practices
Latency and real-time constraints
- Frame size and lookahead determine algorithmic latency. For live VoIP or hearing aids, keep end-to-end latency minimal.
- Budget for buffering, I/O, and processing latency.
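A rough latency budget helps sanity-check a design before building it (a minimal sketch with illustrative numbers, not targets for any particular product):
# rough algorithmic latency for a frame-based enhancer at 16 kHz
sr = 16000
frame = 512        # analysis frame of 32 ms
hop = 128          # hop of 8 ms: output is released one hop at a time
lookahead = 2      # frames of future context the algorithm waits for

algorithmic_ms = 1000 * frame / sr + lookahead * 1000 * hop / sr
print(f"algorithmic latency = {algorithmic_ms:.0f} ms, before buffering, I/O, and compute")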
CPU/GPU trade-offs and model compression
- On-device inference favors small models and DSP hybrids. On server/GPU, larger models are feasible.
- Reduce model size with pruning, quantization, and distillation.
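As one concrete compression step, PyTorch's dynamic quantization stores the weights of linear and recurrent layers as int8 (a minimal sketch on a stand-in model; the quality impact must be re-checked on real audio):
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(257, 128), nn.ReLU(), nn.Linear(128, 257))   # stand-in for a trained enhancer
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear}, dtype=torch.qint8)
# weights are stored as int8 and dequantized on the fly; model size and CPU inference cost drop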
Robustness and domain mismatch
- Domain mismatch is common: mitigate with diverse augmentations (RIRs, noise types, SNRs) and multi-mic beamforming.
Privacy: on-device vs cloud
- On-device preserves privacy and reduces bandwidth but limits compute.
- Cloud allows heavier models but requires secure transmission and privacy compliance.
Getting Started: A 30–60 Minute Roadmap for Beginners
- Install Python and libraries: librosa, numpy, soundfile.
- Visualize a spectrogram of a noisy file to see noise distribution.
- Run the spectral subtraction script above and listen to results.
- Download and run RNNoise on a short recording to compare results.
- Measure PESQ/STOI improvements using available implementations or DNS scripts.
Small project ideas:
- Enhance conference call recordings and compare PESQ before/after.
- Build a batch podcast denoiser with spectral subtraction + manual tweaks.
- Try a pretrained SpeechBrain model for a more advanced demo.
Production checklist:
- Measure latency and CPU usage.
- Confirm SNR/PESQ/STOI improvements across representative audio.
- Run brief subjective listening tests with teammates.
Further Reading & Resources
- RNNoise — https://github.com/xiph/rnnoise
- WebRTC audio processing — https://webrtc.org/
- Microsoft DNS Challenge — https://github.com/microsoft/DNS-Challenge
- SEGAN (Pascual et al., 2017) — https://arxiv.org/abs/1703.09452
- PESQ standard (ITU-T P.862) — https://www.itu.int/rec/T-REC-P.862/en
Also check SpeechBrain and torchaudio tutorials, and resources on model deployment and small ML models.
Conclusion
Key takeaways:
- Classical DSP is essential for low-latency, low-compute scenarios and remains interpretable and reliable.
- ML methods often give superior results for complex, non-stationary noise and reverberation when you have data and compute.
- Start small: try spectral subtraction, then compare RNNoise or a pretrained SpeechBrain model.
Next steps:
- Run the 30–60 minute experiment: record a noisy clip, apply RNNoise, and evaluate PESQ/STOI.
- If scaling up, gather diverse data (noises, RIRs) and benchmark using DNS Challenge resources.
FAQ & Troubleshooting Tips
Q: Which method should I choose for my project? A: If you need low-latency and low compute (embedded/real-time), start with classical DSP or a hybrid like RNNoise. If you have paired data and compute, train an ML model for better performance on non-stationary noise.
Q: How do I reduce musical noise from spectral subtraction? A: Smooth the gain across time and frequency, apply over-subtraction conservatively, raise the spectral floor, or switch to Wiener filtering or ML-based mask estimation.
Q: My model fails in new environments. How do I improve robustness? A: Augment training with diverse noises, SNRs, and RIRs; use domain randomization; consider fine-tuning on a small set of in-domain samples.
Q: How can I measure real user-perceived improvement quickly? A: Combine objective metrics (PESQ, STOI) with short ABX or MOS-like listening tests to detect artifacts and naturalness.
Q: Latency is too high in my real-time pipeline. What can I do? A: Reduce frame size and lookahead, optimize I/O and buffering, use smaller models, or offload non-real-time parts to a server if privacy allows.
Q: What are simple debugging steps if enhanced audio sounds muffled or distorted? A: Check whether the algorithm is over-suppressing low-energy bins, verify the noise estimate accuracy, listen to intermediate outputs (magnitude vs reconstructed waveform), and ensure correct STFT/ISTFT parameters and windowing.
References
- RNNoise — Xiph/Jean-Marc Valin (GitHub)
- WebRTC (Official)
- Microsoft DNS Challenge
- SEGAN (Pascual et al., 2017)
- ITU-T P.862 — PESQ standard
Try this next:
- Try the spectral subtraction script, run RNNoise on a clip, and compare PESQ/STOI scores.
- Explore Microsoft DNS Challenge datasets and SpeechBrain models for benchmarking and learning.