gluonar.nn

Neural Network Components.

Hint

Not every component listed here is a HybridBlock, so some of them are not hybridizable. However, we try to make sure that the components required during inference are hybridizable, so that the entire network can be exported and run in other languages.

For example, encoders are usually non-hybridizable but are only required during training. In contrast, decoders are mostly `HybridBlock`s.

Basic Blocks

Blocks that are commonly used in audio processing.

SincConv1D       Sinc Conv Block from “Speaker Recognition from Raw Waveform with SincNet” paper.
ZScoreNormBlock  Z-score normalization block.
STFTBlock        Short-Time Fourier Transform Block.
DCT1D            Compute the Discrete Cosine Transform of input data.
MelSpectrogram   Compute a mel-scaled spectrogram.
MFCC             Mel-frequency cepstral coefficients (MFCCs).
PowerToDB        Convert a power spectrogram (amplitude squared) to decibel (dB) units.

API Reference

Basic Blocks used in GluonAR.

class gluonar.nn.basic_blocks.SincConv1D

Sinc Conv Block from “Speaker Recognition from Raw Waveform with SincNet” paper.

class gluonar.nn.basic_blocks.ZScoreNormBlock

Z-score normalization block.
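
As a rough illustration of z-score normalization in general (not this block's actual implementation), the NumPy sketch below rescales each example in a batch to zero mean and unit variance. The helper name `zscore_normalize`, the choice to normalize each example over all non-batch axes, and the `eps` term are assumptions made for this example; the block itself may normalize over different axes.

```python
import numpy as np

def zscore_normalize(x, eps=1e-8):
    """Hypothetical helper: scale each example to zero mean and unit variance."""
    axes = tuple(range(1, x.ndim))            # every axis except the batch axis
    mean = x.mean(axis=axes, keepdims=True)
    std = x.std(axis=axes, keepdims=True)
    return (x - mean) / (std + eps)           # eps guards against division by zero

# A batch of 4 synthetic "waveforms" with non-zero mean and non-unit variance.
batch = np.random.RandomState(0).randn(4, 16000) * 3.0 + 5.0
normed = zscore_normalize(batch)
```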

class gluonar.nn.basic_blocks.STFTBlock

Short-Time Fourier Transform Block.

Parameters:
  • audio_length (int.) – target audio length.
  • n_fft (int > 0 [scalar]) – length of the FFT window
  • hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft
  • win_length (int <= n_fft [scalar]) – Each frame of audio is windowed by window(). The window will be of length win_length and then padded with zeros to match n_fft. If unspecified, defaults to win_length = n_fft.
  • window (string [shape=(n_fft,)]) – A window specification (string, tuple, or number), see scipy.signal.get_window
  • center (boolean) – If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length]
  • power (float > 0 [scalar]) – Exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power.
  • Inputs:
    • x: the input audio signal, with shape (batch_size, audio_length).
  • Outputs:
    • specs: specs tensor with shape (batch_size, 1, num_frames, n_fft/2).

Notes

num_frames is computed as 1 + (len(y) - n_fft) / hop_length when center is True. Unlike librosa, this block puts the frame axis before the frequency axis, so the output should be transposed before visualization.
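
The shape bookkeeping in this note can be sketched with plain NumPy. The example takes the frame-count formula at face value, assumes integer division, and uses an array of zeros as a stand-in for a real STFTBlock output; the helper name `num_frames` and the concrete sizes are made up for illustration.

```python
import numpy as np

def num_frames(signal_length, n_fft, hop_length):
    """Frame count from the formula in the note (integer division assumed)."""
    return 1 + (signal_length - n_fft) // hop_length

batch_size, audio_length, n_fft, hop_length = 2, 16000, 512, 160
frames = num_frames(audio_length, n_fft, hop_length)

# Stand-in for an STFTBlock output with the documented layout
# (batch_size, 1, num_frames, n_fft/2); zeros replace real data here.
specs = np.zeros((batch_size, 1, frames, n_fft // 2))

# librosa-style plotting expects (frequency, time), so take one example
# from the batch and transpose it before visualization.
spec_for_plot = specs[0, 0].T
```

With these sizes, `frames` is 97 and `spec_for_plot` has shape (256, 97).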

class gluonar.nn.basic_blocks.DCT1D

Compute the Discrete Cosine Transform of input data. This block follows the SciPy implementation.

For inputs with more than 2 dimensions, DCT1D computes the DCT along the last axis.

Parameters:
  • mode ({1, 2, 3}, optional.) – Type of the DCT (see Notes). Default type is 2.
  • N (int.) – Length of the transform. The required value is N = x.shape[axis].
  • norm ({None, 'ortho'}, optional.) – Normalization mode (see Notes). Default is None.

Notes

Type II

There are several definitions of the DCT-II; this block uses the SciPy definition below (for norm=None):

          N-1
y[k] = 2* sum x[n]*cos(pi*k*(2n+1)/(2*N)), 0 <= k < N.
          n=0

If norm='ortho', y[k] is multiplied by a scaling factor f:

f = sqrt(1/(4*N)) if k = 0,
f = sqrt(1/(2*N)) otherwise.

This makes the corresponding matrix of coefficients orthonormal (OO' = Id).
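
The Type II definition can be checked numerically. The sketch below evaluates the sum directly in NumPy along the last axis and applies the 'ortho' scaling factor f; it is a reference calculation for the formula, not the DCT1D implementation, and the helper name `dct2_reference` is made up for this example.

```python
import numpy as np
from scipy.fft import dct

def dct2_reference(x, norm=None):
    """Direct O(N^2) evaluation of the DCT-II definition along the last axis."""
    N = x.shape[-1]
    n = np.arange(N)
    k = n[:, None]
    # basis[k, n] = cos(pi * k * (2n + 1) / (2N))
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * N))
    y = 2.0 * x @ basis.T                      # y[..., k] = 2 * sum_n x[..., n] * basis[k, n]
    if norm == 'ortho':
        f = np.full(N, np.sqrt(1.0 / (2 * N)))
        f[0] = np.sqrt(1.0 / (4 * N))          # separate factor for k = 0
        y = y * f
    return y

x = np.random.RandomState(0).randn(3, 5, 8)
matches_plain = np.allclose(dct2_reference(x), dct(x, type=2, axis=-1))
matches_ortho = np.allclose(dct2_reference(x, norm='ortho'),
                            dct(x, type=2, axis=-1, norm='ortho'))
```

Both comparisons against `scipy.fft.dct` should hold, which confirms that the formula and the scaling factors above agree with SciPy's convention.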

class gluonar.nn.basic_blocks.MelSpectrogram

Compute a mel-scaled spectrogram.

Parameters:
  • audio_length (int.) – target audio length.
  • sr (number > 0 [scalar]) – sampling rate of audio
  • n_fft (int > 0 [scalar]) – length of the FFT window
  • hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft
  • power (float > 0 [scalar]) – Exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power, etc.
  • others (additional arguments) – Mel filter bank parameters. See librosa.filters.mel for details.

class gluonar.nn.basic_blocks.MFCC

Mel-frequency cepstral coefficients (MFCCs)

Parameters:
  • audio_length (int.) – target audio length.
  • sr (number > 0 [scalar]) – sampling rate of audio
  • n_mfcc (int > 0 [scalar]) – number of MFCCs to return
  • dct_type (None, or {1, 2, 3}) – Discrete cosine transform (DCT) type. Currently only DCT type 2 is used.
  • norm (None or 'ortho') – If dct_type is 2 or 3, setting norm='ortho' uses an orthonormal DCT basis. Normalization is not supported for dct_type=1.

See also

librosa.melspectrogram, scipy.fftpack.dct
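
The last step of an MFCC computation, going from a log-scaled mel spectrogram to cepstral coefficients, can be sketched with SciPy alone. This is an illustration of the general technique, not this block's code: the helper name `mfcc_from_log_mel`, the (batch, 1, num_frames, n_mels) layout, and the random input are assumptions made for the example.

```python
import numpy as np
from scipy.fft import dct

def mfcc_from_log_mel(log_mel, n_mfcc=20, norm='ortho'):
    """Hypothetical helper: DCT-II over the mel axis, keeping the first n_mfcc."""
    coeffs = dct(log_mel, type=2, axis=-1, norm=norm)
    return coeffs[..., :n_mfcc]

# Random stand-in for a log-mel spectrogram with an assumed layout of
# (batch, 1, num_frames, n_mels).
log_mel = np.random.RandomState(0).randn(2, 1, 97, 40)
mfcc = mfcc_from_log_mel(log_mel, n_mfcc=13)
```

Here `mfcc` has shape (2, 1, 97, 13): one vector of 13 coefficients per frame.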

class gluonar.nn.basic_blocks.PowerToDB

Convert a power spectrogram (amplitude squared) to decibel (dB) units. This block is adapted from librosa.power_to_db so that it can process batched input.

For input of shape (batch, channel, w, h) or (batch, w, h), this block computes the power-to-dB conversion over the last 2 axes.

Parameters:
  • ref (float.) – The amplitude abs(S) is scaled relative to ref: 10 * log10(S / ref).
  • amin (float > 0 [scalar]) – minimum threshold for abs(S) and ref
  • top_db (float >= 0 [scalar]) – threshold the output at top_db below the peak: max(10 * log10(S)) - top_db
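
How ref, amin and top_db interact can be sketched in NumPy. The function below mirrors the formulas in the parameter list and takes the peak over the last two axes so that each spectrogram in a batch is thresholded separately. The name `power_to_db_batch` and the default values (borrowed from librosa.power_to_db) are assumptions, and this block's behavior may differ in detail.

```python
import numpy as np

def power_to_db_batch(S, ref=1.0, amin=1e-10, top_db=80.0):
    """Hypothetical batched power-to-dB conversion over the last two axes."""
    S = np.asarray(S, dtype=np.float64)
    log_spec = 10.0 * np.log10(np.maximum(amin, S))       # clamp S at amin
    log_spec -= 10.0 * np.log10(np.maximum(amin, ref))    # scale relative to ref
    if top_db is not None:
        # Peak of each spectrogram, taken over the last two axes only.
        peak = log_spec.max(axis=(-2, -1), keepdims=True)
        log_spec = np.maximum(log_spec, peak - top_db)
    return log_spec

power = np.array([[[1.0, 100.0], [1e-12, 10.0]]])   # shape (1, 2, 2)
db = power_to_db_batch(power, top_db=80.0)
```

For this input the result is [[[0, 20], [-60, 10]]]: the value 1e-12 is first clamped to amin (-100 dB) and then raised to the floor of peak - top_db = 20 - 80 = -60 dB.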