Python API to bob.kaldi
This section includes information for using the Python API of bob.kaldi.
Functions
- bob.kaldi.cepstral(data, cepstral_type, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, delta_order=2, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)
Computes the cepstral (mfcc/plp) features for given speech samples.
- Parameters
data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the cepstral features from. The input needs to be normalized between [-1, 1].
rate (float) – The sampling rate of the input signal in data.
cepstral_type (str) – The type of cepstral features: mfcc or plp.
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
delta_order (int, optional) – Add deltas to raw mfcc or plp features.
frame_length (int, optional) – Frame length in milliseconds.
frame_shift (int, optional) – Frame shift in milliseconds.
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
low_freq (int, optional) – Low cutoff frequency for mel bins.
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
dither (float, optional) – Dithering constant (0.0 means no dither).
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.
normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].
- Returns
The cepstral features calculated for the input signal (2D array of 32-bit floats).
- Return type
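The effect of snip_edges on the number of output frames can be sketched in plain Python. This is only an illustration under stated assumptions: the helper name num_frames is hypothetical, and the rounding rules follow what Kaldi's framing is understood to do rather than anything stated in this page.

```python
def num_frames(num_samples, rate=8000, frame_length=25, frame_shift=10,
               snip_edges=True):
    """Estimate how many feature frames a signal of num_samples yields.

    Hypothetical helper, not part of bob.kaldi.
    """
    # Convert the window sizes from milliseconds to samples.
    length = int(rate * frame_length / 1000)
    shift = int(rate * frame_shift / 1000)
    if snip_edges:
        # Only frames that fit completely inside the signal are kept,
        # so the count depends on the frame length.
        if num_samples < length:
            return 0
        return 1 + (num_samples - length) // shift
    # Assumed rule: the count depends only on the shift, rounded to nearest.
    return (num_samples + shift // 2) // shift
```

For one second of audio at 8000 Hz with the default 25 ms window and 10 ms shift, this gives 98 frames with snip_edges=True and 100 frames with snip_edges=False.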
- bob.kaldi.compute_dnn_phone(samples, rate)
Computes phone posteriors on a Kaldi feature matrix.
- Parameters
feats (numpy.ndarray) – A 2-D numpy array, with log-energy being in the first component of each feature vector
rate (float) – The sampling rate of the input signal in samples.
- Returns
The phone posteriors and labels.
- Return type
- bob.kaldi.compute_dnn_vad(samples, rate, silence_threshold=0.9, posterior=0)
Performs Voice Activity Detection on a Kaldi feature matrix.
- Parameters
feats (numpy.ndarray) – A 2-D numpy array, with log-energy being in the first component of each feature vector
rate (float) – The sampling rate of the input signal in samples.
silence_threshold (float, optional) – Silence threshold to be used for silence posterior evaluation.
posterior (int, optional) – Index of posterior feature to be used for detection. Useful ones are 0, 1 and 2, for silence, laughter and noise, respectively.
- Returns
The labels [1/0] of voiced features (1D array of floats).
- Return type
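How silence_threshold and posterior might combine into per-frame labels can be sketched as follows. The decision rule here (a frame is voiced when the selected posterior stays below the threshold) is an assumption made for illustration, and vad_labels is a hypothetical helper, not a bob.kaldi function.

```python
def vad_labels(posteriors, silence_threshold=0.9, posterior=0):
    """Turn per-frame posteriors into 1/0 voicing labels.

    posteriors is a list of frames, each a list of class posteriors
    (index 0: silence, 1: laughter, 2: noise, as described above).
    """
    labels = []
    for frame in posteriors:
        # Assumed rule: voiced when the chosen posterior is below the threshold.
        labels.append(1.0 if frame[posterior] < silence_threshold else 0.0)
    return labels
```

With the defaults, a frame whose silence posterior is 0.95 would be labelled 0 and a frame whose silence posterior is 0.1 would be labelled 1.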
- bob.kaldi.nnet_forward(feats, nnet, feats_transform='', apply_log=False, no_softmax=False, prior_floor=1e-10, prior_scale=1, use_gpu=False)
Computes the forward pass for given features.
- Parameters
feats (numpy.ndarray) – The input cepstral features (2D array of 32-bit floats).
nnet (str) – The neural network
feats_transform (str, optional) – The input feature transform for feats.
apply_log (bool, optional) – Transform NN output by log().
no_softmax (bool, optional) – Removes the last component with Softmax.
prior_floor (float, optional) – Flooring constant for prior probability.
prior_scale (float, optional) – Scaling factor to be applied on pdf-log-priors.
use_gpu (bool, optional) – Compute forward pass on GPU.
- Returns
The posterior features.
- Return type
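The interaction of apply_log and no_softmax on the network output can be illustrated with a small sketch. The function names are hypothetical, and rejecting the combination of both flags is an assumption modelled on how the underlying Kaldi tool is understood to behave.

```python
import math


def softmax(activations):
    """Numerically stable softmax over one frame of output activations."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def postprocess(activations, apply_log=False, no_softmax=False):
    """Apply the optional softmax removal and log transform to one frame."""
    if apply_log and no_softmax:
        # Assumption: the two options are mutually exclusive.
        raise ValueError("use only one of apply_log and no_softmax")
    # no_softmax returns the raw activations of the last layer.
    out = list(activations) if no_softmax else softmax(activations)
    if apply_log:
        # apply_log turns posteriors into log-posteriors.
        out = [math.log(v) for v in out]
    return out
```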
- bob.kaldi.gmm_score(feats, spkubm, ubm)
Prints out per-frame log-likelihoods for the input utterance.
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
spkubm (str) – A text formatted Kaldi adapted global DiagGMM.
ubm (str) – A text formatted Kaldi global DiagGMM.
- Returns
The average of per-frame log-likelihoods.
- Return type
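For readers unfamiliar with the quantity being returned, the average per-frame log-likelihood under a diagonal-covariance GMM can be sketched in plain Python. This is a textbook formulation with hypothetical function names; it does not read Kaldi model files and is not the code bob.kaldi runs.

```python
import math


def frame_loglike(frame, weights, means, variances):
    """Log-likelihood of one feature vector under a diagonal GMM."""
    components = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for x, m, v in zip(frame, mu, var):
            # Log-density of a 1D Gaussian, summed over dimensions.
            ll -= 0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        components.append(ll)
    # Log-sum-exp over the mixture components.
    peak = max(components)
    return peak + math.log(sum(math.exp(c - peak) for c in components))


def average_loglike(feats, weights, means, variances):
    """Average of the per-frame log-likelihoods over an utterance."""
    total = sum(frame_loglike(f, weights, means, variances) for f in feats)
    return total / len(feats)
```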
- bob.kaldi.ubm_enroll(feats, ubm)
Performs MAP adaptation of a GMM-UBM model.
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
ubm (str) – A text formatted Kaldi global DiagGMM.
- Returns
A text formatted Kaldi enrolled DiagGMM.
- Return type
- bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4, min_gaussian_weight=0.0001)
Implements Kaldi egs/sre10/v1/train_full_ubm.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
dubm (str) – A text formatted trained Kaldi global DiagGMM model.
fubmfile (str) – A path to the full covariance UBM model.
num_gselect (int, optional) – Number of Gaussians to keep per frame.
num_iters (int, optional) – Number of iterations of training.
min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.
- Returns
A path to the full covariance UBM model.
- Return type
- bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000, min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0, num_gselect=30, num_iters_init=20, num_iters=4, remove_low_count_gaussians=True)
Implements Kaldi egs/sre10/v1/train_diag_ubm.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
ubmname (str) – A path to the UBM model.
num_threads (int, optional) – Number of threads used for statistics accumulation.
num_frames (int, optional) – Number of feature vectors to store in memory and train on (randomly chosen from the input features).
min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.
num_gauss (int, optional) – Number of Gaussians in the model.
num_gauss_init (int, optional) – Number of Gaussians in the model initially (if nonzero and less than num_gauss, we’ll do mixture splitting).
num_gselect (int, optional) – Number of Gaussians to keep per frame.
num_iters_init (int, optional) – Number of iterations of training for initialization of the single diagonal GMM.
num_iters (int, optional) – Number of iterations of training.
remove_low_count_gaussians (bool, optional) – Kaldi MleDiagGmmOptions: If true, remove Gaussians that fall below the floors.
- Returns
A text formatted trained Kaldi global DiagGMM model.
- Return type
- bob.kaldi.train_mono(feats, trans_words, fst_L, topology_in, shared_phones='', numgauss=1000, power=0.25, num_iters=40, beam=6)
Monophone model training.
- Parameters
feats (dict) – The input cepstral features (2D array of 32-bit floats).
trans_words (str) – Text transcription of the feats (the word labels)
fst_L (str) – A filename of the lexicon compiled as FST.
topology_in (str) – A topology file that specifies 3-state left-to-right HMM, and default transition probs.
shared_phones (str, optional) – A filename of the phones whose pdfs should be shared.
numgauss (int, optional) – Number of Gaussians of the GMMs.
power (float, optional) – Power to allocate Gaussians to states.
num_iters (int, optional) – Number of iterations for re-estimation of the GMMs.
beam (float, optional) – Decoding beam used in alignment.
- Returns
The mono-phones acoustic models.
- Return type
- bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20, min_post=0.025, posterior_scale=1.0)
Implements Kaldi egs/sre10/v1/extract_ivectors.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
fubm (str) – A full-diagonal UBM
ivector_extractor (str) – An ivector extractor model
num_gselect (int, optional) – Number of Gaussians to keep per frame.
min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
posterior_scale (float, optional) – A posterior scaling with a global scale.
- Returns
The iVectors calculated for the input signal.
- Return type
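The roles of num_gselect and min_post described above can be illustrated for a single frame. This is a simplified sketch: prune_posteriors is a hypothetical helper, and keeping the best Gaussian when everything falls below the threshold is an assumption added so the result is always well defined.

```python
def prune_posteriors(posteriors, num_gselect=20, min_post=0.025):
    """Select, prune and renormalize the Gaussian posteriors of one frame.

    Returns a dict mapping Gaussian index to renormalized posterior.
    """
    # Keep only the num_gselect most likely Gaussians for this frame.
    ranked = sorted(range(len(posteriors)),
                    key=lambda i: posteriors[i], reverse=True)
    kept = {i: posteriors[i] for i in ranked[:num_gselect]}
    if min_post > 0.0:
        pruned = {i: p for i, p in kept.items() if p >= min_post}
        # Assumption: never prune everything; fall back to the best Gaussian.
        kept = pruned if pruned else {ranked[0]: posteriors[ranked[0]]}
    # Renormalize the survivors so that they sum to one.
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}
```

For the posteriors [0.6, 0.38, 0.02] and the default min_post of 0.025, the third Gaussian is dropped and the remaining two are rescaled by 1/0.98.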
- bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20, ivector_dim=600, use_weights=False, num_iters=5, min_post=0.025, num_samples_for_weights=3, posterior_scale=1.0)
Implements Kaldi egs/sre10/v1/train_ivector_extractor.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
fubm (str) – A full-diagonal UBM
ivector_extractor (str) – A path for the ivector extractor
num_gselect (int, optional) – Number of Gaussians to keep per frame.
ivector_dim (int, optional) – Dimension of iVector.
use_weights (bool, optional) – If true, regress the log-weights on the iVector.
num_iters (int, optional) – Number of iterations of training.
min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
num_samples_for_weights (int, optional) – Number of samples from iVector distribution to use for accumulating stats for weight update. Must be >1.
posterior_scale (float, optional) – A posterior scaling with a global scale.
- Returns
A text formatted trained Kaldi IvectorExtractor.
- Return type
- bob.kaldi.plda_enroll(feats, pldamean)
Implements Kaldi egs/sre10/v1/plda_scoring.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors (of a single speaker).
pldamean (str) – A path to the global PLDA mean file
- Returns
A path to enrolled PLDA model (average iVectors).
- Return type
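Since the enrolled model is described as the average of a speaker's iVectors, the averaging step alone can be sketched as below. The helper average_ivectors is hypothetical and works on plain lists; it leaves out any use of the global PLDA mean, whose exact role is not described here.

```python
def average_ivectors(ivectors):
    """Average a 2D list of iVectors (one row per utterance) into one vector."""
    count = len(ivectors)
    dim = len(ivectors[0])
    # Mean of each dimension across all utterances of the speaker.
    return [sum(vec[d] for vec in ivectors) / count for d in range(dim)]
```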
- bob.kaldi.plda_score(feats, model, plda, globalmean, smoothing=0)
Implements Kaldi egs/sre10/v1/plda_scoring.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors.
model (str) – A speaker model (average iVectors).
plda (str) – A PLDA model.
globalmean (str) – A global PLDA mean.
smoothing (float) – Factor used in smoothing within-class covariance (add this factor times between-class covar).
- Returns
A PLDA score.
- Return type
- bob.kaldi.plda_train(feats, plda_file, mean_file)
Implements Kaldi egs/sre10/v1/plda_scoring.sh
- Parameters
feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
plda_file (str) – A path to the trained PLDA model
mean_file (str) – A path to the global PLDA mean file
- Returns
Trained PLDA model and global mean (2D str array)
- Return type
- bob.kaldi.compute_vad(samples, rate, vad_energy_mean_scale=0.5, vad_energy_th=5, vad_frames_context=0, vad_proportion_th=0.6)
Performs Voice Activity Detection on a Kaldi feature matrix.
- Parameters
feats (numpy.ndarray) – A 2-D numpy array, with log-energy being in the first component of each feature vector
rate (float) – The sampling rate of the input signal in samples.
vad_energy_mean_scale (float, optional) – If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + vad_energy_th.
vad_energy_th (float, optional) – Constant term in energy threshold for MFCC0 for VAD.
vad_frames_context (int, optional) – Number of frames of context on each side of the central frame, in the window for which energy is monitored.
vad_proportion_th (float, optional) – Parameter controlling the proportion of frames within the window that need to have more energy than the threshold.
- Returns
The labels [1/0] of voiced features (1D array of floats).
- Return type
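The way the four VAD parameters fit together can be sketched on a list of per-frame log-energies. This is an illustrative reimplementation under assumptions: energy_vad is a hypothetical helper, the window is simply truncated at the edges of the signal, and the comparison operators are chosen to match how Kaldi's energy-based VAD is understood to work.

```python
def energy_vad(log_energy, vad_energy_mean_scale=0.5, vad_energy_th=5.0,
               vad_frames_context=0, vad_proportion_th=0.6):
    """Label each frame as voiced (1.0) or unvoiced (0.0) from its log-energy."""
    n = len(log_energy)
    if n == 0:
        return []
    # Threshold s*m + vad_energy_th, with m the mean log-energy of the file.
    mean = sum(log_energy) / n
    threshold = vad_energy_mean_scale * mean + vad_energy_th
    labels = []
    for t in range(n):
        # Window of vad_frames_context frames on each side, clipped to the signal.
        lo = max(0, t - vad_frames_context)
        hi = min(n, t + vad_frames_context + 1)
        window = log_energy[lo:hi]
        above = sum(1 for e in window if e > threshold)
        # Voiced when enough frames in the window exceed the threshold.
        voiced = above >= vad_proportion_th * len(window)
        labels.append(1.0 if voiced else 0.0)
    return labels
```

For the log-energies [0, 0, 10, 10, 10, 0] the mean is 5, so the default threshold is 0.5*5 + 5 = 7.5 and only the three frames with log-energy 10 are labelled as voiced.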
- bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)
Computes the MFCCs for given speech samples.
- Parameters
data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the MFCCs from. The input needs to be normalized between [-1, 1].
rate (float) – The sampling rate of the input signal in data.
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
frame_length (int, optional) – Frame length in milliseconds.
frame_shift (int, optional) – Frame shift in milliseconds.
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
low_freq (int, optional) – Low cutoff frequency for mel bins.
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
dither (float, optional) – Dithering constant (0.0 means no dither).
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.
normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].
- Returns
The MFCCs calculated for the input signal (2D array of 32-bit floats).
- Return type
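The "offset from Nyquist" convention for high_freq can be made concrete with a short sketch. The helper effective_high_freq is hypothetical, and treating the default value 0 as a zero offset (that is, the Nyquist frequency itself) is an assumption based on Kaldi's usual convention.

```python
def effective_high_freq(rate, high_freq=0):
    """Resolve high_freq into an absolute cutoff frequency in Hz."""
    nyquist = rate / 2.0
    if high_freq > 0:
        # A positive value is used directly as the cutoff.
        return float(high_freq)
    # Zero or negative values are read as an offset from the Nyquist frequency.
    return nyquist + high_freq
```

At the default rate of 8000 Hz, high_freq=0 resolves to 4000 Hz and high_freq=-200 resolves to 3800 Hz.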
- bob.kaldi.mfcc_from_path(filename, channel=0, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True)
Computes the MFCCs for a given input signal recorded into a file.
- Parameters
filename (str) – A path to a valid WAV or NIST Sphere file to read data from
channel (int) – The audio channel to read from inside the file
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis.
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing.
frame_length (int, optional) – Frame length in milliseconds.
frame_shift (int, optional) – Frame shift in milliseconds.
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0).
num_mel_bins (int, optional) – Number of triangular mel-frequency bins.
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs.
low_freq (int, optional) – Low cutoff frequency for mel bins.
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist).
dither (float, optional) – Dithering constant (0.0 means no dither).
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.
- Returns
The MFCCs calculated for the input signal (2D array of 32-bit floats).
- Return type
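The cepstral_lifter constant that appears in the MFCC functions above controls a per-coefficient scaling. A common (HTK-style) formulation is sketched below; that Kaldi, and therefore bob.kaldi, applies exactly this formula is an assumption, and lifter_coefficients is a hypothetical helper.

```python
import math


def lifter_coefficients(num_ceps=13, cepstral_lifter=22):
    """Scaling factor applied to each cepstral coefficient C0..C(num_ceps-1)."""
    if cepstral_lifter == 0:
        # Assumption: a value of zero disables liftering.
        return [1.0] * num_ceps
    q = float(cepstral_lifter)
    # Sinusoidal lifter: 1 + (Q/2) * sin(pi * i / Q).
    return [1.0 + 0.5 * q * math.sin(math.pi * i / q) for i in range(num_ceps)]
```

Under this formula C0 is left unscaled, and higher-order coefficients are boosted progressively up to index Q/2.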