Spectrogram of Speech in Python

 

Spectrogram of Speech

Author: Bindeshwar Singh Kushwaha — PostNetwork Academy

What is a Spectrogram?

  • Spectrogram — a visual representation of sound.
  • Shows how the frequency content of a signal changes over time.
  • Axes of a spectrogram:
    • X-axis: Time (seconds)
    • Y-axis: Frequency (Hz)
    • Color / Intensity: Amplitude or Power (dB)
  • Computed using the Short-Time Fourier Transform (STFT).
  • Widely used in speech, music, and audio signal processing.

Steps to Generate Spectrogram in Python

  1. Load the audio file with librosa.load().
  2. Compute STFT with librosa.stft().
  3. Convert amplitude to decibels with librosa.amplitude_to_db() for visualization.
  4. Plot with librosa.display.specshow().

Python Code


import librosa
import librosa.display
import matplotlib.pyplot as plt
import numpy as np

# Load the speech audio file
y, sr = librosa.load("stop.wav", sr=None)

# Compute Short-Time Fourier Transform (STFT)
D = librosa.stft(y)

# Convert to decibels (log scale for better visualization)
S_db = librosa.amplitude_to_db(np.abs(D), ref=np.max)

# Plot the spectrogram
plt.figure(figsize=(10, 5))
librosa.display.specshow(S_db, sr=sr, x_axis="time", y_axis="hz", cmap="magma")
plt.colorbar(format="%+2.0f dB")
plt.title("Spectrogram of Speech (stop.wav)")
plt.tight_layout()
plt.show()
    

Mathematical Explanation of STFT

  • The speech signal in the time domain: \\(x(t)\\).
  • Fourier Transform (no time info): \\(X(\\omega) = \\int_{-\\infty}^{\\infty} x(t) e^{-j\\omega t}\\, dt\\).
  • STFT (sliding window) :
    \\[
    \\mathrm{STFT}\\{x(t)\\}(\\tau,\\omega) = \\int_{-\\infty}^{\\infty} x(t)\\, w(t-\\tau)\\, e^{-j\\omega t}\\, dt
    \\]
  • Spectrogram is squared magnitude of STFT:
    \\[
    \\text{Spectrogram}(\\tau,\\omega) = \\left| \\mathrm{STFT}\\{x(t)\\}(\\tau,\\omega) \\right|^2
    \\]

Output: Spectrogram of Speech

Figure: Spectrogram (STFT) visualized from stop.wav.

Use of STFT in Feature Extraction

  • The raw audio \\(x(t)\\) is in time domain.
  • STFT converts the signal into a time–frequency representation (spectrogram).
  • From the spectrogram we extract features:
    • Energy / Power Spectrum — loudness measures
    • Pitch and Formants — speech characteristics
    • MFCCs — Mel-Frequency Cepstral Coefficients, used in speech recognition
  • These features feed ML models for:
    • Speech recognition
    • Speaker identification
    • Emotion / sentiment analysis

Reach PostNetwork Academy

Website: www.postnetwork.co

YouTube: www.youtube.com/@postnetworkacademy

Facebook: www.facebook.com/postnetworkacademy

LinkedIn: www.linkedin.com/company/postnetworkacademy

GitHub: www.github.com/postnetworkacademy

Thank You!

©Postnetwork-All rights reserved.