torchaudio.compliance.kaldi
torchaudio can perform several of Kaldi's commonly used processing operations. The functions below take the same parameters as their Kaldi counterparts so that torchaudio produces outputs similar to Kaldi's.
Functions
spectrogram
torchaudio.compliance.kaldi.spectrogram(waveform: torch.Tensor, blackman_coeff: float = 0.42, channel: int = -1, dither: float = 0.0, energy_floor: float = 1.0, frame_length: float = 25.0, frame_shift: float = 10.0, min_duration: float = 0.0, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, sample_frequency: float = 16000.0, snip_edges: bool = True, subtract_mean: bool = False, window_type: str = 'povey') → torch.Tensor
Create a spectrogram from a raw audio signal. This matches the input/output of Kaldi’s compute-spectrogram-feats.
Parameters:
- waveform (Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
- blackman_coeff (float, optional) – Constant coefficient for generalized Blackman window. (Default: 0.42)
- channel (int, optional) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
- dither (float, optional) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 0.0)
- energy_floor (float, optional) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 1.0)
- frame_length (float, optional) – Frame length in milliseconds (Default: 25.0)
- frame_shift (float, optional) – Frame shift in milliseconds (Default: 10.0)
- min_duration (float, optional) – Minimum duration of segments to process (in seconds). (Default: 0.0)
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis (Default: 0.97)
- raw_energy (bool, optional) – If True, compute energy before preemphasis and windowing (Default: True)
- remove_dc_offset (bool, optional) – Subtract mean from waveform on each frame (Default: True)
- round_to_power_of_two (bool, optional) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
- sample_frequency (float, optional) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
- snip_edges (bool, optional) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
- subtract_mean (bool, optional) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
- window_type (str, optional) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')
Returns: A spectrogram identical to what Kaldi would output. The shape is (m, padded_window_size // 2 + 1), where m is calculated in _get_strided.
Return type: Tensor
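The output shape can be estimated from the parameters alone. The following is a minimal sketch, assuming Kaldi-style framing (only fully contained frames when snip_edges is True, roughly one frame per shift otherwise). The helper names `next_power_of_two` and `expected_spectrogram_shape` are hypothetical and not part of torchaudio, and the empty result for too-short input is an assumption.

```python
def next_power_of_two(x: int) -> int:
    """Smallest power of two that is >= x (1 for x == 0)."""
    return 1 if x == 0 else 2 ** (x - 1).bit_length()


def expected_spectrogram_shape(
    num_samples: int,
    sample_frequency: float = 16000.0,
    frame_length: float = 25.0,  # milliseconds
    frame_shift: float = 10.0,   # milliseconds
    round_to_power_of_two: bool = True,
    snip_edges: bool = True,
) -> tuple:
    """Hypothetical helper: predict (m, padded_window_size // 2 + 1)."""
    # Convert the frame length and shift from milliseconds to samples.
    window_size = int(sample_frequency * frame_length * 0.001)
    window_shift = int(sample_frequency * frame_shift * 0.001)
    padded_window_size = (
        next_power_of_two(window_size) if round_to_power_of_two else window_size
    )

    if snip_edges:
        # Only frames that fit completely inside the signal are kept.
        if num_samples < window_size:
            return (0, 0)  # assumption: too-short input yields an empty result
        m = 1 + (num_samples - window_size) // window_shift
    else:
        # The frame count depends only on the shift; the ends are reflected.
        m = (num_samples + window_shift // 2) // window_shift
    return (m, padded_window_size // 2 + 1)


# One second of 16 kHz audio with the default framing parameters.
print(expected_spectrogram_shape(16000))
print(expected_spectrogram_shape(16000, snip_edges=False))
```

With the defaults, a 25 ms frame is 400 samples, which is padded to 512 for the FFT and therefore yields 257 frequency bins per frame.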
fbank
torchaudio.compliance.kaldi.fbank(waveform: torch.Tensor, blackman_coeff: float = 0.42, channel: int = -1, dither: float = 0.0, energy_floor: float = 1.0, frame_length: float = 25.0, frame_shift: float = 10.0, high_freq: float = 0.0, htk_compat: bool = False, low_freq: float = 20.0, min_duration: float = 0.0, num_mel_bins: int = 23, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, sample_frequency: float = 16000.0, snip_edges: bool = True, subtract_mean: bool = False, use_energy: bool = False, use_log_fbank: bool = True, use_power: bool = True, vtln_high: float = -500.0, vtln_low: float = 100.0, vtln_warp: float = 1.0, window_type: str = 'povey') → torch.Tensor
Create an fbank from a raw audio signal. This matches the input/output of Kaldi’s compute-fbank-feats.
Parameters:
- waveform (Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
- blackman_coeff (float, optional) – Constant coefficient for generalized Blackman window. (Default: 0.42)
- channel (int, optional) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
- dither (float, optional) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 0.0)
- energy_floor (float, optional) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 1.0)
- frame_length (float, optional) – Frame length in milliseconds (Default: 25.0)
- frame_shift (float, optional) – Frame shift in milliseconds (Default: 10.0)
- high_freq (float, optional) – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (Default: 0.0)
- htk_compat (bool, optional) – If true, put energy last. Warning: not sufficient to get HTK compatible features (need to change other parameters). (Default: False)
- low_freq (float, optional) – Low cutoff frequency for mel bins (Default: 20.0)
- min_duration (float, optional) – Minimum duration of segments to process (in seconds). (Default: 0.0)
- num_mel_bins (int, optional) – Number of triangular mel-frequency bins (Default: 23)
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis (Default: 0.97)
- raw_energy (bool, optional) – If True, compute energy before preemphasis and windowing (Default: True)
- remove_dc_offset (bool, optional) – Subtract mean from waveform on each frame (Default: True)
- round_to_power_of_two (bool, optional) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
- sample_frequency (float, optional) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
- snip_edges (bool, optional) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
- subtract_mean (bool, optional) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
- use_energy (bool, optional) – Add an extra dimension with energy to the FBANK output. (Default: False)
- use_log_fbank (bool, optional) – If true, produce log-filterbank, else produce linear. (Default: True)
- use_power (bool, optional) – If true, use power, else use magnitude. (Default: True)
- vtln_high (float, optional) – High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq) (Default: -500.0)
- vtln_low (float, optional) – Low inflection point in piecewise linear VTLN warping function (Default: 100.0)
- vtln_warp (float, optional) – VTLN warp factor (only applicable if vtln_map not specified) (Default: 1.0)
- window_type (str, optional) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')
Returns: An fbank identical to what Kaldi would output. The shape is (m, num_mel_bins + use_energy), where m is calculated in _get_strided.
Return type: Tensor
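To illustrate how low_freq, high_freq and num_mel_bins interact, here is a minimal sketch of the triangular bin layout, assuming the Kaldi-style mel scale 1127 * ln(1 + f / 700) and no VTLN warping (vtln_warp = 1.0). The helper names are hypothetical and not part of torchaudio.

```python
import math


def mel_scale(freq_hz: float) -> float:
    """Kaldi-style mel scale (assumed): 1127 * ln(1 + f / 700)."""
    return 1127.0 * math.log(1.0 + freq_hz / 700.0)


def inverse_mel_scale(mel: float) -> float:
    """Inverse of mel_scale, returning a frequency in Hz."""
    return 700.0 * (math.exp(mel / 1127.0) - 1.0)


def mel_bin_edges(
    num_mel_bins: int = 23,
    low_freq: float = 20.0,
    high_freq: float = 0.0,
    sample_frequency: float = 16000.0,
):
    """Hypothetical helper: (left, center, right) in Hz for each triangular bin."""
    nyquist = 0.5 * sample_frequency
    if high_freq <= 0.0:
        # Non-positive values are interpreted as an offset from Nyquist.
        high_freq += nyquist
    mel_low = mel_scale(low_freq)
    mel_high = mel_scale(high_freq)
    # Bins are equally spaced on the mel axis and overlap by half their width.
    delta = (mel_high - mel_low) / (num_mel_bins + 1)
    return [
        tuple(inverse_mel_scale(mel_low + (i + k) * delta) for k in range(3))
        for i in range(num_mel_bins)
    ]


for left, center, right in mel_bin_edges()[:3]:
    print(f"{left:8.1f} {center:8.1f} {right:8.1f}")
```

With the defaults, the first bin starts at 20 Hz and the last bin ends at the 8000 Hz Nyquist frequency, since high_freq = 0.0 means "zero offset from Nyquist".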
mfcc
torchaudio.compliance.kaldi.mfcc(waveform: torch.Tensor, blackman_coeff: float = 0.42, cepstral_lifter: float = 22.0, channel: int = -1, dither: float = 0.0, energy_floor: float = 1.0, frame_length: float = 25.0, frame_shift: float = 10.0, high_freq: float = 0.0, htk_compat: bool = False, low_freq: float = 20.0, num_ceps: int = 13, min_duration: float = 0.0, num_mel_bins: int = 23, preemphasis_coefficient: float = 0.97, raw_energy: bool = True, remove_dc_offset: bool = True, round_to_power_of_two: bool = True, sample_frequency: float = 16000.0, snip_edges: bool = True, subtract_mean: bool = False, use_energy: bool = False, vtln_high: float = -500.0, vtln_low: float = 100.0, vtln_warp: float = 1.0, window_type: str = 'povey') → torch.Tensor
Create an MFCC from a raw audio signal. This matches the input/output of Kaldi’s compute-mfcc-feats.
Parameters:
- waveform (Tensor) – Tensor of audio of size (c, n) where c is in the range [0,2)
- blackman_coeff (float, optional) – Constant coefficient for generalized Blackman window. (Default: 0.42)
- cepstral_lifter (float, optional) – Constant that controls scaling of MFCCs (Default: 22.0)
- channel (int, optional) – Channel to extract (-1 -> expect mono, 0 -> left, 1 -> right) (Default: -1)
- dither (float, optional) – Dithering constant (0.0 means no dither). If you turn this off, you should set the energy_floor option, e.g. to 1.0 or 0.1 (Default: 0.0)
- energy_floor (float, optional) – Floor on energy (absolute, not relative) in Spectrogram computation. Caution: this floor is applied to the zeroth component, representing the total signal energy. The floor on the individual spectrogram elements is fixed at std::numeric_limits<float>::epsilon(). (Default: 1.0)
- frame_length (float, optional) – Frame length in milliseconds (Default: 25.0)
- frame_shift (float, optional) – Frame shift in milliseconds (Default: 10.0)
- high_freq (float, optional) – High cutoff frequency for mel bins (if <= 0, offset from Nyquist) (Default: 0.0)
- htk_compat (bool, optional) – If true, put energy last. Warning: not sufficient to get HTK compatible features (need to change other parameters). (Default: False)
- low_freq (float, optional) – Low cutoff frequency for mel bins (Default: 20.0)
- num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0) (Default: 13)
- min_duration (float, optional) – Minimum duration of segments to process (in seconds). (Default: 0.0)
- num_mel_bins (int, optional) – Number of triangular mel-frequency bins (Default: 23)
- preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis (Default: 0.97)
- raw_energy (bool, optional) – If True, compute energy before preemphasis and windowing (Default: True)
- remove_dc_offset (bool, optional) – Subtract mean from waveform on each frame (Default: True)
- round_to_power_of_two (bool, optional) – If True, round window size to power of two by zero-padding input to FFT. (Default: True)
- sample_frequency (float, optional) – Waveform data sample frequency (must match the waveform file, if specified there) (Default: 16000.0)
- snip_edges (bool, optional) – If True, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame_length. If False, the number of frames depends only on the frame_shift, and we reflect the data at the ends. (Default: True)
- subtract_mean (bool, optional) – Subtract mean of each feature file [CMS]; not recommended to do it this way. (Default: False)
- use_energy (bool, optional) – Use energy (not C0) in the MFCC computation. (Default: False)
- vtln_high (float, optional) – High inflection point in piecewise linear VTLN warping function (if negative, offset from high-mel-freq) (Default: -500.0)
- vtln_low (float, optional) – Low inflection point in piecewise linear VTLN warping function (Default: 100.0)
- vtln_warp (float, optional) – VTLN warp factor (only applicable if vtln_map not specified) (Default: 1.0)
- window_type (str, optional) – Type of window ('hamming' | 'hanning' | 'povey' | 'rectangular' | 'blackman') (Default: 'povey')
Returns: An MFCC identical to what Kaldi would output. The shape is (m, num_ceps), where m is calculated in _get_strided.
Return type: Tensor
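The effect of cepstral_lifter can be made concrete with a short sketch. It assumes the Kaldi-style liftering weights 1 + 0.5 * Q * sin(pi * i / Q), where Q is cepstral_lifter and i is the cepstral index, and it assumes that a value of 0.0 disables liftering. The helper name `lifter_coefficients` is hypothetical and not part of torchaudio.

```python
import math


def lifter_coefficients(cepstral_lifter: float = 22.0, num_ceps: int = 13):
    """Hypothetical helper: per-coefficient scaling applied to the cepstra."""
    q = cepstral_lifter
    if q == 0.0:
        # Assumption: a zero lifter leaves the cepstra unscaled.
        return [1.0] * num_ceps
    # Higher-order cepstra are boosted; C0 (index 0) keeps a weight of 1.
    return [1.0 + 0.5 * q * math.sin(math.pi * i / q) for i in range(num_ceps)]


for i, c in enumerate(lifter_coefficients()):
    print(i, round(c, 3))
```

Each column of the (m, num_ceps) output would be multiplied by the corresponding weight, so with the default Q = 22 the weights grow from 1 at C0 to 12 at index 11, where the sine reaches its peak.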
resample_waveform
torchaudio.compliance.kaldi.resample_waveform(waveform: torch.Tensor, orig_freq: float, new_freq: float, lowpass_filter_width: int = 6) → torch.Tensor
Resamples the waveform at the new frequency. This matches Kaldi’s OfflineFeatureTpl ResampleWaveform, which uses a LinearResample (resample a signal at linearly spaced intervals to upsample/downsample a signal). LinearResample (LR) means that the output signal is at linearly spaced intervals (i.e., the output signal has a frequency of new_freq). It uses sinc/bandlimited interpolation to upsample/downsample the signal.
https://ccrma.stanford.edu/~jos/resample/Theory_Ideal_Bandlimited_Interpolation.html
https://github.com/kaldi-asr/kaldi/blob/master/src/feat/resample.h#L56
Parameters:
- waveform (Tensor) – The input signal of size (c, n)
- orig_freq (float) – The original frequency of the signal
- new_freq (float) – The desired frequency
- lowpass_filter_width (int, optional) – Controls the sharpness of the filter; larger values give a sharper filter but are less efficient. We suggest around 4 to 10 for normal use. (Default: 6)
Returns: The waveform at the new frequency
Return type: Tensor
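The role of lowpass_filter_width can be seen in a sketch of the interpolation kernel. This is an illustration only, assuming a Hann-windowed sinc whose window spans lowpass_filter_width zero crossings on each side, and a cutoff set slightly below the Nyquist frequency of the lower of the two rates (the 0.99 factor is an assumption). The helper names are hypothetical and not part of torchaudio.

```python
import math


def lowpass_cutoff(orig_freq: float, new_freq: float) -> float:
    """Assumed cutoff in Hz: just below the Nyquist frequency of the lower rate."""
    return 0.99 * 0.5 * min(orig_freq, new_freq)


def filter_weight(t: float, cutoff: float, lowpass_filter_width: int = 6) -> float:
    """Hann-windowed sinc evaluated at a time offset t (in seconds)."""
    # Half-width of the window: lowpass_filter_width zero crossings of the sinc.
    window_width = lowpass_filter_width / (2.0 * cutoff)
    if abs(t) >= window_width:
        return 0.0  # the kernel is zero outside the window
    window = 0.5 * (1.0 + math.cos(2.0 * math.pi * cutoff / lowpass_filter_width * t))
    if t == 0.0:
        # Limit of sin(2 * pi * fc * t) / (pi * t) as t approaches 0.
        return window * 2.0 * cutoff
    return window * math.sin(2.0 * math.pi * cutoff * t) / (math.pi * t)


cutoff = lowpass_cutoff(16000.0, 8000.0)
for width in (4, 6, 10):
    # A larger width gives a longer kernel, so more input samples per output sample.
    support_seconds = 2.0 * width / (2.0 * cutoff)
    print(width, round(support_seconds * 16000.0, 1), "input samples spanned")
```

A wider kernel approximates the ideal bandlimited interpolator more closely, which is why larger values give a sharper filter at the cost of more multiplications per output sample.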