gluonar.nn

Neural Network Components.

Hint

Not every component listed here is a HybridBlock, so some of them are not hybridizable. However, we try to ensure that the components required during inference are hybridizable, so that the entire network can be exported and run in other languages.

For example, encoders are usually non-hybridizable but are only required during training. In contrast, decoders are mostly `HybridBlock`s.

Basic Blocks

Blocks that are commonly used in audio processing.

`SincConv1D`	Sinc Conv Block from the “Speaker Recognition from Raw Waveform with SincNet” paper.
`ZScoreNormBlock`	Z-score normalization block.
`STFTBlock`	Short-Time Fourier Transform Block.
`DCT1D`	Compute the Discrete Cosine Transform of input data.
`MelSpectrogram`	Compute a mel-scaled spectrogram.
`MFCC`	Mel-frequency cepstral coefficients (MFCCs)
`PowerToDB`	Convert a power spectrogram (amplitude squared) to decibel (dB) units.

API Reference

Basic Blocks used in GluonAR.

class gluonar.nn.basic_blocks.SincConv1D: Sinc Conv Block from the “Speaker Recognition from Raw Waveform with SincNet” paper.

class gluonar.nn.basic_blocks.ZScoreNormBlock: Z-score normalization block.
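
The page gives no formula for this block. Assuming it performs ordinary z-score (standard score) normalization, the operation can be sketched in plain Python as below. `zscore_normalize` and its `eps` parameter are hypothetical names used only for illustration; they are not part of GluonAR, and the real block may differ in which axis it normalizes over and how it handles zero variance.

```python
import statistics

def zscore_normalize(samples, eps=1e-8):
    """Shift samples to zero mean and scale them to unit standard deviation."""
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)  # population standard deviation
    # eps (an assumption) guards against division by zero on constant input.
    return [(s - mean) / (std + eps) for s in samples]

normalized = zscore_normalize([0.1, 0.4, -0.2, 0.3])
print(normalized)
```

After normalization the samples have a mean of approximately 0 and a standard deviation of approximately 1, regardless of the original recording level.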

class gluonar.nn.basic_blocks.STFTBlock

Short-Time Fourier Transform Block.

Parameters:

audio_length (int.) – target audio length.
n_fft (int > 0 [scalar]) – length of the FFT window
hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft
win_length (int <= n_fft [scalar]) – Each frame of audio is windowed by window(). The window will be of length win_length and then padded with zeros to match n_fft. If unspecified, defaults to win_length = n_fft.
window (string [shape=(n_fft,)]) – A window specification (string, tuple, or number), see scipy.signal.get_window
center (boolean) – If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length]
power (float > 0 [scalar]) – Exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power.

Inputs:
- x: the input audio signal, with shape (batch_size, audio_length).
Outputs:
- specs: specs tensor with shape (batch_size, 1, num_frames, n_fft/2).

Notes

num_frames is calculated as 1+(len(y)-n_fft)/hop_length when center is True. Unlike librosa, the output should be transposed before visualization.
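
As a quick sanity check, the frame-count expression from the note can be evaluated directly. This is a minimal sketch: `stft_num_frames` is a hypothetical helper, not part of GluonAR, and it assumes the division is integer (floor) division, which the note does not state.

```python
def stft_num_frames(audio_length, n_fft, hop_length):
    """Evaluate 1 + (audio_length - n_fft) / hop_length using floor division."""
    if audio_length < n_fft:
        raise ValueError("audio_length must be at least n_fft")
    # Floor division is an assumption; the note only writes "/".
    return 1 + (audio_length - n_fft) // hop_length

# One second of 16 kHz audio, a 512-sample FFT window, a 160-sample hop.
frames = stft_num_frames(16000, 512, 160)
print(frames)
```

Under these assumptions, the result is the num_frames dimension of the specs output described above.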

class gluonar.nn.basic_blocks.DCT1D

Compute the Discrete Cosine Transform of input data. This block follows the scipy implementation.

For inputs with more than 2 dimensions, DCT1D computes the DCT along the last axis.

Parameters:

mode ({1, 2, 3}, optional.) – Type of the DCT (see Notes). Default type is 2.
N (int.) – Length of the transform. The required value is `N = x.shape[axis]`.
norm ({None, 'ortho'}, optional.) – Normalization mode (see Notes). Default is None.

Notes

Type II

There are several definitions of the DCT-II; this block uses the scipy definition (for norm=None):

          N-1
y[k] = 2* sum x[n]*cos(pi*k*(2n+1)/(2*N)), 0 <= k < N.
          n=0

If norm='ortho', y[k] is multiplied by a scaling factor f:

f = sqrt(1/(4*N)) if k = 0,
f = sqrt(1/(2*N)) otherwise.

This makes the corresponding matrix of coefficients orthonormal (OO' = Id).
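
The Type II definition and the 'ortho' scaling above can be written out directly. The following is a minimal pure-Python sketch for a single 1-D sequence, meant only to make the formula concrete; `dct_type2` is a hypothetical name, and this is not how the block itself is implemented.

```python
import math

def dct_type2(x, norm=None):
    """DCT-II of a 1-D sequence, following the formula in the Notes."""
    N = len(x)
    y = []
    for k in range(N):
        # y[k] = 2 * sum_{n=0}^{N-1} x[n] * cos(pi * k * (2n + 1) / (2N))
        value = 2.0 * sum(
            x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * N))
            for n in range(N)
        )
        if norm == "ortho":
            # Scaling factor f: sqrt(1/(4N)) for k = 0, sqrt(1/(2N)) otherwise.
            value *= math.sqrt(1.0 / (4 * N)) if k == 0 else math.sqrt(1.0 / (2 * N))
        y.append(value)
    return y

signal = [1.0, 2.0, 3.0, 4.0]
print(dct_type2(signal))
print(dct_type2(signal, norm="ortho"))
```

Because the 'ortho' variant corresponds to an orthonormal matrix, it preserves the sum of squares of the input, which is a convenient property to test against.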

class gluonar.nn.basic_blocks.MelSpectrogram

Compute a mel-scaled spectrogram.

Parameters:

audio_length (int.) – target audio length.
sr (number > 0 [scalar]) – sampling rate of audio
n_fft (int > 0 [scalar]) – length of the FFT window
hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft
power (float > 0 [scalar]) – Exponent for the magnitude melspectrogram. e.g., 1 for energy, 2 for power, etc.
others (additional arguments) – Mel filter bank parameters. See librosa.filters.mel for details.

class gluonar.nn.basic_blocks.MFCC

Mel-frequency cepstral coefficients (MFCCs)

Parameters:

audio_length (int.) – target audio length.
sr (number > 0 [scalar]) – sampling rate of the audio
n_mfcc (int > 0 [scalar]) – number of MFCCs to return
dct_type (None, or {1, 2, 3}) – Discrete cosine transform (DCT) type. Currently only DCT type 2 is used.
norm (None or 'ortho') – If dct_type is 2 or 3, setting norm='ortho' uses an orthonormal DCT basis. Normalization is not supported for dct_type=1.