gluonar.nn
Neural Network Components.
Hint
Not every component listed here is a HybridBlock, which means some of them are not hybridizable. However, we try to make sure that the components required during inference are hybridizable, so that the entire network can be exported and run in other languages.
For example, encoders are usually non-hybridizable but are only required during training. In contrast, decoders are mostly `HybridBlock`s.
Basic Blocks
Blocks that are commonly used in audio processing.
SincConv1D | Sinc Conv Block from the “Speaker Recognition from Raw Waveform with SincNet” paper. |
ZScoreNormBlock | Z-score normalization block. |
STFTBlock | Short-Time Fourier Transform Block. |
DCT1D | Compute the Discrete Cosine Transform of input data. |
MelSpectrogram | Compute a mel-scaled spectrogram. |
MFCC | Mel-frequency cepstral coefficients (MFCCs). |
PowerToDB | Convert a power spectrogram (amplitude squared) to decibel (dB) units. |
API Reference
Basic Blocks used in GluonAR.
class gluonar.nn.basic_blocks.SincConv1D
Sinc Conv Block from the “Speaker Recognition from Raw Waveform with SincNet” paper.
class gluonar.nn.basic_blocks.ZScoreNormBlock
Z-score normalization block.
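As a rough illustration of what z-score normalization means in general, the sketch below shifts a list of samples to zero mean and scales it to unit variance. The function name `zscore_normalize`, the `eps` guard, and the choice to normalize over the whole signal are assumptions made for this example; they are not taken from the ZScoreNormBlock implementation.

```python
import statistics


def zscore_normalize(samples, eps=1e-8):
    """Return samples shifted to zero mean and scaled to unit variance.

    Hypothetical helper for illustration only; eps avoids division by zero
    for a constant signal.
    """
    mean = statistics.fmean(samples)
    std = statistics.pstdev(samples)  # population standard deviation
    return [(s - mean) / (std + eps) for s in samples]


normalized = zscore_normalize([1.0, 2.0, 3.0, 4.0, 5.0])
```

After normalization the values are centered on zero and have a standard deviation close to one, which is the usual motivation for applying this step before feeding audio to a network.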
class gluonar.nn.basic_blocks.STFTBlock
Short-Time Fourier Transform Block.
Parameters:
- audio_length (int) – target audio length.
- n_fft (int > 0 [scalar]) – length of the FFT window.
- hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft.
- win_length (int <= n_fft [scalar]) – Each frame of audio is windowed by window(). The window will be of length win_length and then padded with zeros to match n_fft. If unspecified, defaults to win_length = n_fft.
- window (string [shape=(n_fft,)]) – A window specification (string, tuple, or number); see scipy.signal.get_window.
- center (boolean) – If True, the signal y is padded so that frame D[:, t] is centered at y[t * hop_length]. If False, then D[:, t] begins at y[t * hop_length].
- power (float > 0 [scalar]) – Exponent for the magnitude spectrogram, e.g., 1 for energy, 2 for power, etc.
Inputs:
- x: the input audio signal, with shape (batch_size, audio_length).
Outputs:
- specs: specs tensor with shape (batch_size, 1, num_frames, n_fft/2).
Notes
num_frames is calculated as 1 + (len(y) - n_fft) / hop_length when center is True. Unlike librosa, the output should be transposed before visualization.
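The frame-count formula quoted in the Notes can be checked with a few lines of plain Python. This is only a sketch of the arithmetic: the helper name `num_frames` is hypothetical, and the use of integer (floor) division is an assumption, since the Notes do not say how a non-integer quotient is rounded.

```python
def num_frames(signal_length, n_fft, hop_length):
    """Frame count from the formula in the Notes.

    Hypothetical helper; floor division is an assumption about rounding.
    """
    return 1 + (signal_length - n_fft) // hop_length


# One second of 16 kHz audio, 512-point FFT, 10 ms hop (160 samples).
frames = num_frames(signal_length=16000, n_fft=512, hop_length=160)
```

With these example values the formula gives 97 frames, which would be the num_frames dimension of the output tensor under the stated assumptions.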
class gluonar.nn.basic_blocks.DCT1D
Compute the Discrete Cosine Transform of input data. This block follows the scipy implementation.
For inputs with more than two dimensions, DCT1D computes the DCT along the last axis.
Parameters:
- mode ({1, 2, 3}, optional) – Type of the DCT (see Notes). Default type is 2.
- N (int) – Length of the transform. The required value is N = x.shape[axis].
- norm ({None, 'ortho'}, optional) – Normalization mode (see Notes). Default is None.
Notes
Type II
There are several definitions of the DCT-II; this block uses the scipy definition (for norm=None):
$$y[k] = 2 \sum_{n=0}^{N-1} x[n] \cos\left(\frac{\pi k (2n+1)}{2N}\right), \quad 0 \le k < N.$$
If norm='ortho', y[k] is multiplied by a scaling factor f:
$$f = \begin{cases} \sqrt{1/(4N)} & \text{if } k = 0, \\ \sqrt{1/(2N)} & \text{otherwise,} \end{cases}$$
which makes the corresponding matrix of coefficients orthonormal (OO' = Id).
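The Type II definition above can be written out directly as a reference implementation. The sketch below is a plain-Python, O(N²) transcription of the formula and its 'ortho' scaling; the function name `dct2` is hypothetical, and it is meant for checking small inputs by hand, not as a substitute for DCT1D or scipy.

```python
import math


def dct2(x, norm=None):
    """Naive DCT-II of a 1-D sequence, following the formula in the Notes."""
    n_points = len(x)
    out = []
    for k in range(n_points):
        # y[k] = 2 * sum_n x[n] * cos(pi * k * (2n + 1) / (2N))
        total = sum(
            x[n] * math.cos(math.pi * k * (2 * n + 1) / (2 * n_points))
            for n in range(n_points)
        )
        y_k = 2.0 * total
        if norm == "ortho":
            # Scaling factor f from the Notes.
            if k == 0:
                y_k *= math.sqrt(1.0 / (4 * n_points))
            else:
                y_k *= math.sqrt(1.0 / (2 * n_points))
        out.append(y_k)
    return out


signal = [1.0, 2.0, 3.0, 4.0]
plain = dct2(signal)
ortho = dct2(signal, norm="ortho")
```

With norm='ortho' the transform preserves the sum of squares of the input, which is a quick way to confirm that the scaling factors make the coefficient matrix orthonormal.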
class gluonar.nn.basic_blocks.MelSpectrogram
Compute a mel-scaled spectrogram.
Parameters:
- audio_length (int) – target audio length.
- sr (number > 0 [scalar]) – sampling rate of the audio.
- n_fft (int > 0 [scalar]) – length of the FFT window.
- hop_length (int > 0 [scalar]) – number of samples between successive frames. See librosa.core.stft.
- power (float > 0 [scalar]) – Exponent for the magnitude melspectrogram, e.g., 1 for energy, 2 for power, etc.
- others (additional arguments) – Mel filter bank parameters. See librosa.filters.mel for details.
class gluonar.nn.basic_blocks.MFCC
Mel-frequency cepstral coefficients (MFCCs).
Parameters:
- audio_length (int) – target audio length.
- sr (number > 0 [scalar]) – sampling rate of the audio.
- n_mfcc (int > 0 [scalar]) – number of MFCCs to return.
- dct_type (None, or {1, 2, 3}) – Discrete cosine transform (DCT) type. Currently only DCT type 2 is used.
- norm (None or 'ortho') – If dct_type is 2 or 3, setting norm='ortho' uses an orthonormal DCT basis. Normalization is not supported for dct_type=1.
See also: librosa.melspectrogram, scipy.fftpack.dct
class gluonar.nn.basic_blocks.PowerToDB
Convert a power spectrogram (amplitude squared) to decibel (dB) units. This block is adapted from librosa.power_to_db so that it can process batched input.
For an input of shape (batch, channel, w, h) or (batch, w, h), this block computes the power-to-dB conversion over the last two axes.
Parameters:
- ref (float) – The amplitude abs(S) is scaled relative to ref: 10 * log10(S / ref).
- amin (float > 0 [scalar]) – minimum threshold for abs(S) and ref.
- top_db (float >= 0 [scalar]) – threshold the output at top_db below the peak: max(10 * log10(S)) - top_db.
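The way ref, amin, and top_db interact is easier to see on a handful of numbers. The sketch below applies the conversion described above to a flat list of power values in plain Python. The function name `power_to_db` and the default values (ref=1.0, amin=1e-10, top_db=80.0) are assumptions borrowed from the librosa function this block is modeled on, and the sketch ignores the batch and channel axes that PowerToDB handles.

```python
import math


def power_to_db(power, ref=1.0, amin=1e-10, top_db=80.0):
    """Convert power values to dB relative to ref (illustrative sketch).

    Defaults are assumed, not taken from the PowerToDB block itself.
    """
    # Clamp both the input and the reference at amin before taking logs.
    ref_db = 10.0 * math.log10(max(amin, ref))
    log_spec = [10.0 * math.log10(max(amin, p)) - ref_db for p in power]
    if top_db is not None:
        # Nothing may fall more than top_db below the peak value.
        floor = max(log_spec) - top_db
        log_spec = [max(v, floor) for v in log_spec]
    return log_spec


db = power_to_db([1.0, 0.1, 1e-12])
```

In this example the smallest input is first clamped to amin (giving -100 dB) and then raised to the top_db floor of -80 dB, while the other two values map to 0 dB and -10 dB.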