torchaudio.transforms
Transforms are common audio transforms. They can be chained together using torch.nn.Sequential.
Spectrogram
class torchaudio.transforms.Spectrogram(n_fft: int = 400, win_length: Union[int, NoneType] = None, hop_length: Union[int, NoneType] = None, pad: int = 0, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, power: Union[float, NoneType] = 2.0, normalized: bool = False, wkwargs: Union[dict, NoneType] = None) → None
Create a spectrogram from an audio signal.
Parameters:
- n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
- win_length (int or None, optional) – Window size. (Default: n_fft)
- hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
- pad (int, optional) – Two-sided padding of signal. (Default: 0)
- window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
- power (float or None, optional) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. If None, then the complex spectrum is returned instead. (Default: 2)
- normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)
- wkwargs (dict or None, optional) – Arguments for window function. (Default: None)
forward(waveform: torch.Tensor) → torch.Tensor
Parameters: waveform (Tensor) – Tensor of audio of dimension (…, time).
Returns: Dimension (…, freq, time), where freq is n_fft // 2 + 1, n_fft is the number of Fourier bins, and time is the number of window hops (n_frame).
Return type: Tensor
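The output dimensions described for Spectrogram can be worked out by hand. The sketch below is plain Python and does not call torchaudio. `spectrogram_shape` is a hypothetical helper, and the frame count assumes a centered STFT (one frame per hop, plus one), which is an assumption about the framing rather than something this page states.

```python
def spectrogram_shape(num_samples, n_fft=400, win_length=None, hop_length=None, pad=0):
    """Return (freq, time) for a signal of num_samples samples.

    Defaults mirror the documented ones: win_length falls back to n_fft,
    hop_length falls back to win_length // 2.
    """
    win_length = n_fft if win_length is None else win_length
    hop_length = win_length // 2 if hop_length is None else hop_length
    n_freq = n_fft // 2 + 1
    # Assumption: centered frames, so the count depends only on the
    # padded length and the hop size.
    n_frames = 1 + (num_samples + 2 * pad) // hop_length
    return n_freq, n_frames


# One second of 16 kHz audio with the default settings.
print(spectrogram_shape(16000))  # (201, 81)
```

With the defaults, n_fft = 400 gives 201 frequency bins and a hop of 200 samples.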
GriffinLim
class torchaudio.transforms.GriffinLim(n_fft: int = 400, n_iter: int = 32, win_length: Union[int, NoneType] = None, hop_length: Union[int, NoneType] = None, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, power: float = 2.0, normalized: bool = False, wkwargs: Union[dict, NoneType] = None, momentum: float = 0.99, length: Union[int, NoneType] = None, rand_init: bool = True) → None
Compute waveform from a linear scale magnitude spectrogram using the Griffin-Lim transformation.
Implementation ported from librosa.
[1] McFee, Brian, Colin Raffel, Dawen Liang, Daniel PW Ellis, Matt McVicar, Eric Battenberg, and Oriol Nieto. “librosa: Audio and music signal analysis in python.” In Proceedings of the 14th python in science conference, pp. 18-25. 2015.
[2] Perraudin, N., Balazs, P., & Søndergaard, P. L. “A fast Griffin-Lim algorithm,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (pp. 1-4), Oct. 2013.
[3] D. W. Griffin and J. S. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. ASSP, vol.32, no.2, pp.236–243, Apr. 1984.
Parameters:
- n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
- n_iter (int, optional) – Number of iterations for the phase recovery process. (Default: 32)
- win_length (int or None, optional) – Window size. (Default: n_fft)
- hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
- window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
- power (float, optional) – Exponent for the magnitude spectrogram (must be > 0), e.g., 1 for energy, 2 for power, etc. (Default: 2)
- normalized (bool, optional) – Whether to normalize by magnitude after stft. (Default: False)
- wkwargs (dict or None, optional) – Arguments for window function. (Default: None)
- momentum (float, optional) – The momentum parameter for fast Griffin-Lim. Setting this to 0 recovers the original Griffin-Lim method. Values near 1 can lead to faster convergence, but above 1 may not converge. (Default: 0.99)
- length (int, optional) – Array length of the expected output. (Default: None)
- rand_init (bool, optional) – Initializes phase randomly if True and to zero otherwise. (Default: True)
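To make the role of momentum concrete, here is a minimal sketch of a single phase update in the style of fast Griffin-Lim [2]. It operates on plain Python complex numbers instead of tensors. `update_phase`, its arguments, and the `eps` guard are hypothetical names chosen for illustration, and the exact update form is an assumption modeled on the published algorithm, not a copy of this class's code.

```python
def update_phase(rebuilt, previous, momentum=0.99, eps=1e-16):
    """One phase update over a list of complex STFT bins.

    rebuilt:  STFT of the waveform reconstructed in the current iteration
    previous: the `rebuilt` bins from the preceding iteration
    Returns unit-magnitude phase factors for the next iteration.
    """
    scale = momentum / (1.0 + momentum)
    phases = []
    for r, p in zip(rebuilt, previous):
        # Extrapolate away from the previous estimate (the momentum term).
        t = r - scale * p
        # Keep only the phase; eps guards against division by zero.
        phases.append(t / (abs(t) + eps))
    return phases


# With momentum = 0 the previous estimate is ignored, which is the
# classic Griffin-Lim projection onto unit-magnitude phases.
classic = update_phase([3 + 4j], [1 + 1j], momentum=0.0)
```

In a full loop, these phase factors would be multiplied by the target magnitudes, inverted to a waveform, and transformed again for n_iter rounds.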
AmplitudeToDB
class torchaudio.transforms.AmplitudeToDB(stype: str = 'power', top_db: Union[float, NoneType] = None) → None
Turn a tensor from the power/amplitude scale to the decibel scale.
This output depends on the maximum value in the input tensor, and so may return different values for an audio clip split into snippets vs. a full clip.
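Parameters:
- stype (str, optional) – Scale of the input tensor, 'power' or 'magnitude'. (Default: 'power')
- top_db (float or None, optional) – Minimum negative cut-off in decibels. (Default: None)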
forward(x: torch.Tensor) → torch.Tensor
Numerically stable implementation from Librosa. https://librosa.github.io/librosa/_modules/librosa/core/spectrum.html
Parameters: x (Tensor) – Input tensor before being converted to decibel scale.
Returns: Output tensor in decibel scale.
Return type: Tensor
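The dependence on the maximum input value is easiest to see in a small example. The following is a simplified sketch in plain Python, not the library code: `amplitude_to_db` here is a hypothetical stand-in that works on a list of floats, the `amin` floor is an assumed guard against log(0), and the reference level is assumed to be 1.0.

```python
import math


def amplitude_to_db(values, stype="power", top_db=None, amin=1e-10):
    """Convert power (or magnitude) values to decibels.

    Power values use 10 * log10(x); any other stype is treated as
    magnitude and uses 20 * log10(x) (assumption for this sketch).
    """
    multiplier = 10.0 if stype == "power" else 20.0
    db = [multiplier * math.log10(max(v, amin)) for v in values]
    if top_db is not None:
        # Clip everything more than top_db below the loudest value.
        floor = max(db) - top_db
        db = [max(d, floor) for d in db]
    return db


full_clip = amplitude_to_db([1.0, 1e-12], top_db=80.0)  # quiet value clipped to -80 dB
snippet = amplitude_to_db([1e-12], top_db=80.0)         # same value alone stays at -100 dB
```

The same quiet sample maps to a different decibel value depending on what else is in the tensor, which is why splitting a clip into snippets can change the result.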
MelScale
class torchaudio.transforms.MelScale(n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Union[float, NoneType] = None, n_stft: Union[int, NoneType] = None) → None
Turn a normal STFT into a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
The user can control which device the filter bank (fb) is on (e.g. fb.to(spec_f.device)).
Parameters:
- n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
- sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
- f_min (float, optional) – Minimum frequency. (Default: 0.)
- f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)
- n_stft (int, optional) – Number of bins in STFT. Calculated from first input if None is given. See n_fft in Spectrogram. (Default: None)
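As a rough illustration of how triangular mel filters are usually placed, the sketch below computes the band edge frequencies from n_mels, f_min, and f_max. It assumes the common HTK-style mel formula (2595 * log10(1 + f / 700)); this page does not say which mel formula the class uses, so treat that choice and the helper names as assumptions.

```python
import math


def hz_to_mel(freq_hz):
    # HTK-style mel scale (assumed for this sketch).
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel):
    # Inverse of hz_to_mel.
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_mels=128, sample_rate=16000, f_min=0.0, f_max=None):
    """Return n_mels + 2 frequencies in Hz, equally spaced on the mel scale.

    Filter i rises from edge i to edge i + 1 and falls to edge i + 2.
    """
    f_max = float(sample_rate // 2) if f_max is None else f_max
    m_min, m_max = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (m_max - m_min) / (n_mels + 1)
    return [mel_to_hz(m_min + i * step) for i in range(n_mels + 2)]


edges = mel_band_edges(n_mels=4)
```

Because the edges are uniform in mel rather than in Hz, the filters are narrow at low frequencies and widen toward f_max.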
InverseMelScale
class torchaudio.transforms.InverseMelScale(n_stft: int, n_mels: int = 128, sample_rate: int = 16000, f_min: float = 0.0, f_max: Union[float, NoneType] = None, max_iter: int = 100000, tolerance_loss: float = 1e-05, tolerance_change: float = 1e-08, sgdargs: Union[dict, NoneType] = None) → None
Solve for a normal STFT from a mel frequency STFT, using a conversion matrix. This uses triangular filter banks.
It minimizes the Euclidean norm between the input mel-spectrogram and the product of the estimated spectrogram and the filter banks, using SGD.
Parameters:
- n_stft (int) – Number of bins in STFT. See n_fft in Spectrogram.
- n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
- sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
- f_min (float, optional) – Minimum frequency. (Default: 0.)
- f_max (float or None, optional) – Maximum frequency. (Default: sample_rate // 2)
- max_iter (int, optional) – Maximum number of optimization iterations. (Default: 100000)
- tolerance_loss (float, optional) – Value of loss to stop optimization at. (Default: 1e-5)
- tolerance_change (float, optional) – Difference in losses to stop optimization at. (Default: 1e-8)
- sgdargs (dict or None, optional) – Arguments for the SGD optimizer. (Default: None)
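The optimization described for InverseMelScale, together with the two stopping tolerances, can be sketched for a single frame with plain gradient descent. This is an illustrative toy in pure Python, not the class's implementation: `invert_mel`, the learning rate `lr`, the zero starting point, and the non-negativity clamp are all assumptions made for the example.

```python
def invert_mel(mel, fb, lr=0.1, max_iter=100000,
               tolerance_loss=1e-5, tolerance_change=1e-8):
    """Estimate a non-negative STFT frame `spec` with spec @ fb ~= mel.

    fb holds n_stft rows of n_mels filter weights each.
    """
    n_stft, n_mels = len(fb), len(mel)
    spec = [0.0] * n_stft  # assumption: zero start keeps the demo deterministic
    loss = float("inf")
    for _ in range(max_iter):
        # Residual between the target mel frame and the projected estimate.
        diff = [mel[m] - sum(spec[f] * fb[f][m] for f in range(n_stft))
                for m in range(n_mels)]
        new_loss = sum(d * d for d in diff)
        # Gradient of the squared error with respect to each STFT bin.
        grad = [-2.0 * sum(diff[m] * fb[f][m] for m in range(n_mels))
                for f in range(n_stft)]
        # Gradient step, then clamp so magnitudes stay non-negative.
        spec = [max(s - lr * g, 0.0) for s, g in zip(spec, grad)]
        # Stop on a small loss or when the loss has stopped changing.
        if new_loss < tolerance_loss or abs(loss - new_loss) < tolerance_change:
            break
        loss = new_loss
    return spec


# Toy filter bank: 3 STFT bins mapped onto 2 overlapping mel bands.
toy_fb = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]
estimate = invert_mel([2.0, 4.0], toy_fb)
```

The problem is underdetermined whenever n_stft exceeds n_mels, so the result is one spectrogram consistent with the mel input, not necessarily the original.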
MelSpectrogram
class torchaudio.transforms.MelSpectrogram(sample_rate: int = 16000, n_fft: int = 400, win_length: Union[int, NoneType] = None, hop_length: Union[int, NoneType] = None, f_min: float = 0.0, f_max: Union[float, NoneType] = None, pad: int = 0, n_mels: int = 128, window_fn: Callable[[...], torch.Tensor] = torch.hann_window, wkwargs: Union[dict, NoneType] = None) → None
Create MelSpectrogram for a raw audio signal. This is a composition of Spectrogram and MelScale.
- Sources
Parameters:
- sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
- win_length (int or None, optional) – Window size. (Default: n_fft)
- hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
- n_fft (int, optional) – Size of FFT, creates n_fft // 2 + 1 bins. (Default: 400)
- f_min (float, optional) – Minimum frequency. (Default: 0.)
- f_max (float or None, optional) – Maximum frequency. (Default: None)
- pad (int, optional) – Two-sided padding of signal. (Default: 0)
- n_mels (int, optional) – Number of mel filterbanks. (Default: 128)
- window_fn (Callable[.., Tensor], optional) – A function to create a window tensor that is applied/multiplied to each frame/window. (Default: torch.hann_window)
- wkwargs (Dict[.., ..] or None, optional) – Arguments for window function. (Default: None)
Example
>>> import torchaudio
>>> from torchaudio import transforms
>>> waveform, sample_rate = torchaudio.load('test.wav', normalization=True)
>>> mel_specgram = transforms.MelSpectrogram(sample_rate)(waveform)  # (channel, n_mels, time)
MFCC
class torchaudio.transforms.MFCC(sample_rate: int = 16000, n_mfcc: int = 40, dct_type: int = 2, norm: str = 'ortho', log_mels: bool = False, melkwargs: Union[dict, NoneType] = None) → None
Create the Mel-frequency cepstrum coefficients from an audio signal.
By default, this calculates the MFCC on the DB-scaled Mel spectrogram. This is not the textbook implementation, but is implemented here to give consistency with librosa.
This output depends on the maximum value in the input spectrogram, and so may return different values for an audio clip split into snippets vs. a full clip.
Parameters:
- sample_rate (int, optional) – Sample rate of audio signal. (Default: 16000)
- n_mfcc (int, optional) – Number of MFC coefficients to retain. (Default: 40)
- dct_type (int, optional) – Type of DCT (discrete cosine transform) to use. (Default: 2)
- norm (str, optional) – Norm to use. (Default: 'ortho')
- log_mels (bool, optional) – Whether to use log-mel spectrograms instead of db-scaled. (Default: False)
- melkwargs (dict or None, optional) – Arguments for MelSpectrogram. (Default: None)
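The final step of an MFCC computation is a DCT over the mel axis. The sketch below shows a type-2 DCT with 'ortho' scaling applied to one frame of log-scaled or dB-scaled mel energies. It is a plain-Python illustration with hypothetical helper names (`dct_matrix`, `mfcc_frame`) and assumes the standard orthonormal DCT-II definition.

```python
import math


def dct_matrix(n_mfcc, n_mels):
    """Rows of an orthonormal DCT-II basis: n_mfcc rows of length n_mels."""
    rows = []
    for k in range(n_mfcc):
        # 'ortho' scaling: the k = 0 row gets sqrt(1/N), the rest sqrt(2/N).
        scale = math.sqrt(1.0 / n_mels) if k == 0 else math.sqrt(2.0 / n_mels)
        rows.append([scale * math.cos(math.pi / n_mels * (n + 0.5) * k)
                     for n in range(n_mels)])
    return rows


def mfcc_frame(scaled_mel_frame, n_mfcc):
    """Project one frame of scaled mel energies onto the first n_mfcc DCT rows."""
    basis = dct_matrix(n_mfcc, len(scaled_mel_frame))
    return [sum(b * x for b, x in zip(row, scaled_mel_frame)) for row in basis]


# A flat frame has all of its energy in coefficient 0.
coeffs = mfcc_frame([1.0, 1.0, 1.0, 1.0], n_mfcc=3)
```

Keeping only the first n_mfcc coefficients discards fine spectral detail while retaining the overall spectral envelope.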
MuLawEncoding
class torchaudio.transforms.MuLawEncoding(quantization_channels: int = 256) → None
Encode signal based on mu-law companding. For more info see the Wikipedia Entry.
This algorithm assumes the signal has been scaled to between -1 and 1 and returns a signal encoded with values from 0 to quantization_channels - 1.
Parameters: quantization_channels (int, optional) – Number of channels. (Default: 256)
MuLawDecoding
class torchaudio.transforms.MuLawDecoding(quantization_channels: int = 256) → None
Decode mu-law encoded signal. For more info see the Wikipedia Entry.
This expects an input with values between 0 and quantization_channels - 1 and returns a signal scaled between -1 and 1.
Parameters: quantization_channels (int, optional) – Number of channels. (Default: 256)
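The encode/decode pair can be sketched on single samples using the standard mu-law companding formula. This is a scalar, plain-Python illustration with hypothetical function names; the rounding to the nearest code (add 0.5, then truncate) is an assumption of this sketch.

```python
import math


def mu_law_encode(x, quantization_channels=256):
    """Map a sample in [-1, 1] to an integer code in [0, quantization_channels - 1]."""
    mu = quantization_channels - 1
    # Compress: logarithmic curve that preserves the sign of x.
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift from [-1, 1] to [0, mu] and round to the nearest code.
    return int((compressed + 1) / 2 * mu + 0.5)


def mu_law_decode(code, quantization_channels=256):
    """Map an integer code back to an approximate sample in [-1, 1]."""
    mu = quantization_channels - 1
    y = code / mu * 2 - 1
    # Expand: inverse of the logarithmic compression.
    return math.copysign((math.exp(abs(y) * math.log1p(mu)) - 1) / mu, y)


code = mu_law_encode(0.5)
restored = mu_law_decode(code)  # close to 0.5, up to quantization error
```

The round trip is lossy: quantization steps are fine near zero and coarse near the extremes, which suits signals whose energy is concentrated at low amplitudes.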
Resample
ComplexNorm
ComputeDeltas
TimeStretch
class torchaudio.transforms.TimeStretch(hop_length: Union[int, NoneType] = None, n_freq: int = 201, fixed_rate: Union[float, NoneType] = None) → None
Stretch stft in time without modifying pitch for a given rate.
Parameters:
- hop_length (int or None, optional) – Length of hop between STFT windows. (Default: win_length // 2)
- n_freq (int, optional) – Number of filter banks from stft. (Default: 201)
- fixed_rate (float or None, optional) – Rate to speed up or slow down by. If None is provided, rate must be passed to the forward method. (Default: None)
forward(complex_specgrams: torch.Tensor, overriding_rate: Union[float, NoneType] = None) → torch.Tensor
Parameters:
- complex_specgrams (Tensor) – Complex spectrogram (…, freq, time, complex=2).
- overriding_rate (float or None, optional) – Speed-up to apply to this batch. If no rate is passed, use self.fixed_rate. (Default: None)
Returns: Stretched complex spectrogram of dimension (…, freq, ceil(time/rate), complex=2).
Return type: Tensor
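The rate selection and the resulting number of frames follow directly from the description of forward. The helpers below are hypothetical and written in plain Python; raising ValueError when no rate is available is an assumption about error handling, not documented behavior.

```python
import math


def resolve_rate(fixed_rate=None, overriding_rate=None):
    """Pick the rate for a call: overriding_rate wins, else fixed_rate."""
    rate = fixed_rate if overriding_rate is None else overriding_rate
    if rate is None:
        # Assumption: a missing rate is treated as an error.
        raise ValueError("set fixed_rate or pass overriding_rate")
    return rate


def stretched_num_frames(num_frames, rate):
    """Time dimension after stretching: ceil(time / rate)."""
    return math.ceil(num_frames / rate)


# Speeding up by 1.3x shortens 100 frames to 77; slowing to 0.5x doubles them.
faster = stretched_num_frames(100, resolve_rate(fixed_rate=1.3))
slower = stretched_num_frames(100, resolve_rate(fixed_rate=1.3, overriding_rate=0.5))
```

A rate above 1 shortens the spectrogram and a rate below 1 lengthens it, while the freq dimension is unchanged.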
Fade
class torchaudio.transforms.Fade(fade_in_len: int = 0, fade_out_len: int = 0, fade_shape: str = 'linear') → None
Add a fade in and/or fade out to a waveform.
Parameters:
- fade_in_len (int, optional) – Length of fade-in (time frames). (Default: 0)
- fade_out_len (int, optional) – Length of fade-out (time frames). (Default: 0)
- fade_shape (str, optional) – Shape of fade. Must be one of: “quarter_sine”, “half_sine”, “linear”, “logarithmic”, “exponential”. (Default: "linear")
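For the "linear" shape, a fade amounts to multiplying the first fade_in_len samples by a rising ramp and the last fade_out_len samples by a falling one. The sketch below shows this on a Python list. `ramp` and `linear_fade` are hypothetical helpers, and the endpoint convention (the ramp runs from exactly 0 to exactly 1) is an assumption.

```python
def ramp(length):
    """Linear gains from 0.0 to 1.0 over `length` samples."""
    if length == 1:
        return [0.0]
    return [i / (length - 1) for i in range(length)]


def linear_fade(samples, fade_in_len=0, fade_out_len=0):
    """Return a copy of samples with a linear fade-in and fade-out applied."""
    out = list(samples)
    n = len(out)
    # Fade-in: gains rise from 0 at the first sample.
    for i, gain in enumerate(ramp(min(fade_in_len, n))):
        out[i] *= gain
    # Fade-out: the same ramp mirrored, so the last sample reaches 0.
    for i, gain in enumerate(ramp(min(fade_out_len, n))):
        out[n - 1 - i] *= gain
    return out


faded = linear_fade([1.0] * 5, fade_in_len=3, fade_out_len=2)
print(faded)  # [0.0, 0.5, 1.0, 1.0, 0.0]
```

The other shapes differ only in the curve used for the gains; the multiplication step is the same.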
FrequencyMasking
TimeMasking
Vol
class torchaudio.transforms.Vol(gain: float, gain_type: str = 'amplitude')
Add a volume to a waveform.
Parameters:
- gain (float) – Interpreted according to the given gain_type: If gain_type = 'amplitude', gain is a positive amplitude ratio. If gain_type = 'power', gain is a power (voltage squared). If gain_type = 'db', gain is in decibels.
- gain_type (str, optional) – Type of gain. One of: 'amplitude', 'power', 'db'. (Default: "amplitude")
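The three gain types can all be reduced to one amplitude multiplier, which makes their relationship easier to compare. The following plain-Python sketch uses a hypothetical helper, `amplitude_ratio`, and relies on the standard relations between amplitude, power, and decibels; it does not reproduce any clipping or other post-processing the class may apply.

```python
import math


def amplitude_ratio(gain, gain_type="amplitude"):
    """Convert a gain of the given type to an amplitude multiplier."""
    if gain_type == "amplitude":
        return gain
    if gain_type == "power":
        # Power is amplitude squared, so take the square root.
        return math.sqrt(gain)
    if gain_type == "db":
        # 20 dB corresponds to a factor of 10 in amplitude.
        return 10.0 ** (gain / 20.0)
    raise ValueError("gain_type must be 'amplitude', 'power', or 'db'")


def apply_gain(samples, gain, gain_type="amplitude"):
    ratio = amplitude_ratio(gain, gain_type)
    return [s * ratio for s in samples]


louder = apply_gain([0.01, -0.02], 20.0, gain_type="db")  # about 10x the amplitude
```

For example, a power gain of 4 and an amplitude gain of 2 describe the same change in level.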