Python API to bob.kaldi

This section includes information for using the Python API of bob.kaldi.

Functions

bob.kaldi.cepstral(data, cepstral_type, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, delta_order=2, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)

Computes the cepstral (mfcc/plp) features for given speech samples.

Parameters

data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the cepstral features from. The input needs to be normalized between [-1, 1].
rate (float, optional) – The sampling rate of the input signal in data.
cepstral_type (str) – The type of cepstral features: mfcc or plp
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing
delta_order (int, optional) – Add deltas to raw mfcc or plp features
frame_length (int, optional) – Frame length in milliseconds
frame_shift (int, optional) – Frame shift in milliseconds
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)
num_mel_bins (int, optional) – Number of triangular mel-frequency bins
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs
low_freq (int, optional) – Low cutoff frequency for mel bins
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)
dither (float, optional) – Dithering constant (0.0 means no dither)
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.
normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].

Returns

The cepstral features calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray
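
The effect of frame_length, frame_shift and snip_edges on the number of output rows can be sketched as below. This is a minimal illustration of Kaldi's frame-counting convention, not code taken from bob.kaldi. The helper names num_frames and expected_shape are hypothetical, and the column count assumes that each delta order appends one more block of num_ceps coefficients.

```python
def num_frames(num_samples, rate=8000, frame_length=25, frame_shift=10,
               snip_edges=True):
    """Number of analysis frames for a signal of num_samples samples."""
    # Convert the window parameters from milliseconds to samples.
    length = int(rate * frame_length / 1000)
    shift = int(rate * frame_shift / 1000)
    if snip_edges:
        # Only frames that fit completely inside the signal are kept.
        if num_samples < length:
            return 0
        return 1 + (num_samples - length) // shift
    # Otherwise the count depends only on the shift (rounded to nearest).
    return (num_samples + shift // 2) // shift


def expected_shape(num_samples, num_ceps=13, delta_order=2, **kwargs):
    """Hypothetical helper: (rows, columns) of the returned feature matrix."""
    # Assumption: every delta order adds another num_ceps columns.
    return num_frames(num_samples, **kwargs), num_ceps * (delta_order + 1)


# One second of audio at 8 kHz, with and without edge snipping.
print(expected_shape(8000))
print(expected_shape(8000, snip_edges=False))
```

With snip_edges=False the signal is reflected at both ends, so short trailing segments still produce frames.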

bob.kaldi.compute_dnn_phone(samples, rate)

Computes phone posteriors on a Kaldi feature matrix

Parameters

samples (numpy.ndarray) – The input audio signal, from which a Kaldi feature matrix (with log-energy in the first component of each feature vector) is computed
rate (float) – The sampling rate of the input signal in samples.

Returns

The phone posteriors and labels.

Return type

numpy.ndarray

bob.kaldi.compute_dnn_vad(samples, rate, silence_threshold=0.9, posterior=0)

Performs Voice Activity Detection on a Kaldi feature matrix

Parameters

samples (numpy.ndarray) – The input audio signal, from which a Kaldi feature matrix (with log-energy in the first component of each feature vector) is computed
rate (float) – The sampling rate of the input signal in samples.
silence_threshold (float, optional) – Silence threshold to be used for silence posterior evaluation.
posterior (int, optional) – Index of the posterior feature to be used for detection. Useful values are 0, 1 and 2, for silence, laughter and noise, respectively.

Returns

The labels [1/0] of voiced features (1D array of floats).

Return type

numpy.ndarray
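
How silence_threshold and posterior might combine into per-frame labels can be sketched as follows. The decision rule is an assumption made for illustration (a frame is treated as non-speech when the selected posterior reaches the threshold). The function vad_labels is hypothetical and does not call bob.kaldi.

```python
def vad_labels(posteriors, silence_threshold=0.9, posterior=0):
    """Return 1.0 for frames kept as speech and 0.0 for the others."""
    labels = []
    for frame in posteriors:
        # Assumption: the selected class (silence by default) marks a frame
        # as non-speech once its posterior reaches the threshold.
        is_non_speech = frame[posterior] >= silence_threshold
        labels.append(0.0 if is_non_speech else 1.0)
    return labels


# Two frames with posteriors ordered as (silence, laughter, noise).
frames = [[0.95, 0.03, 0.02], [0.10, 0.00, 0.90]]
print(vad_labels(frames))               # decide on the silence posterior
print(vad_labels(frames, posterior=2))  # decide on the noise posterior
```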

bob.kaldi.nnet_forward(feats, nnet, feats_transform='', apply_log=False, no_softmax=False, prior_floor=1e-10, prior_scale=1, use_gpu=False)

Computes the forward pass for given features.

Parameters

feats (numpy.ndarray) – The input cepstral features (2D array of 32-bit floats).
nnet (str) – The neural network
feats_transform (str, optional) – The input feature transform for feats.
apply_log (bool, optional) – Transform NN output by log().
no_softmax (bool, optional) – Removes the last component with Softmax.
prior_floor (float, optional) – Flooring constant for prior probability.
prior_scale (float, optional) – Scaling factor to be applied on pdf-log-priors.
use_gpu (bool, optional) – Compute forward pass on GPU.

Returns

The posterior features.

Return type

numpy.ndarray
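
The apply_log and no_softmax options act on the last stage of the network. The sketch below illustrates that stage for a single frame of output activations. It is a simplified stand-in rather than the Kaldi implementation: network_output is a hypothetical name, prior handling is left out, and rejecting the combination of both flags is an assumption about the underlying tool.

```python
import math


def softmax(activations):
    """Numerically stable softmax over one frame of activations."""
    peak = max(activations)
    exps = [math.exp(a - peak) for a in activations]
    total = sum(exps)
    return [e / total for e in exps]


def network_output(activations, apply_log=False, no_softmax=False):
    """Hypothetical final stage: optional softmax, then optional log."""
    if apply_log and no_softmax:
        # Assumption: the two options are mutually exclusive.
        raise ValueError("use only one of apply_log and no_softmax")
    if no_softmax:
        # Return the raw pre-softmax activations.
        return list(activations)
    out = softmax(activations)
    if apply_log:
        # Log-posteriors instead of posteriors.
        out = [math.log(v) for v in out]
    return out


print(network_output([1.0, 2.0, 3.0]))
print(network_output([1.0, 2.0, 3.0], apply_log=True))
```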

bob.kaldi.gmm_score(feats, spkubm, ubm)

Computes per-frame log-likelihoods for the input utterance and returns their average.

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
spkubm (str) – A text formatted Kaldi adapted global DiagGMM.
ubm (str) – A text formatted Kaldi global DiagGMM.

Returns

The average of per-frame log-likelihoods.

Return type

float
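
As an illustration of what an average per-frame log-likelihood is, the sketch below evaluates a diagonal-covariance GMM in pure Python. The model is passed as plain lists rather than a Kaldi text model, and the names frame_loglike and average_loglike are hypothetical. How the scores of spkubm and ubm are combined is not stated here, so the sketch scores a single model only.

```python
import math


def frame_loglike(x, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    comps = []
    for w, mu, var in zip(weights, means, variances):
        ll = math.log(w)
        for xi, mi, vi in zip(x, mu, var):
            # Log of a univariate Gaussian density, summed over dimensions.
            ll += -0.5 * (math.log(2.0 * math.pi * vi) + (xi - mi) ** 2 / vi)
        comps.append(ll)
    # Log-sum-exp over the mixture components.
    peak = max(comps)
    return peak + math.log(sum(math.exp(c - peak) for c in comps))


def average_loglike(feats, weights, means, variances):
    """Average of the per-frame log-likelihoods over an utterance."""
    total = sum(frame_loglike(x, weights, means, variances) for x in feats)
    return total / len(feats)


# Two 1-D frames scored against a two-component mixture.
print(average_loglike([[0.0], [1.0]], [0.5, 0.5], [[0.0], [1.0]], [[1.0], [1.0]]))
```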

bob.kaldi.ubm_enroll(feats, ubm)

Performs MAP adaptation of a GMM-UBM model.

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
ubm (str) – A text formatted Kaldi global DiagGMM.

Returns

A text formatted Kaldi enrolled DiagGMM.

Return type

str
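
For readers unfamiliar with MAP adaptation, the sketch below shows the classical mean-only update for one Gaussian. It is a textbook formulation under stated assumptions, not the code path used by ubm_enroll: the function name map_adapt_mean and the relevance factor tau are hypothetical, and the settings actually passed to Kaldi are not documented in this section.

```python
def map_adapt_mean(prior_mean, frames, posteriors, tau=10.0):
    """MAP update of one Gaussian mean (tau is a hypothetical relevance factor)."""
    dim = len(prior_mean)
    # Zeroth-order statistic: soft count of frames assigned to this Gaussian.
    count = sum(posteriors)
    # First-order statistics: posterior-weighted sum of the frames.
    first = [sum(p * x[d] for p, x in zip(posteriors, frames))
             for d in range(dim)]
    # Interpolate between the data mean and the prior (UBM) mean.
    return [(first[d] + tau * prior_mean[d]) / (count + tau)
            for d in range(dim)]


# Two frames fully assigned to a Gaussian whose prior mean is 0.
print(map_adapt_mean([0.0], [[1.0], [3.0]], [1.0, 1.0], tau=2.0))
```

A larger tau keeps the adapted mean closer to the UBM mean, which matters when only a few enrollment frames are available.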

bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4, min_gaussian_weight=0.0001)

Implements Kaldi egs/sre10/v1/train_full_ubm.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
dubm (str) – A text formatted trained Kaldi global DiagGMM model.
fubmfile (str) – A path to the full covariance UBM model.
num_gselect (int, optional) – Number of Gaussians to keep per frame.
num_iters (int, optional) – Number of iterations of training.
min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.

Returns

A path to the full covariance UBM model.

Return type

str

bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000, min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0, num_gselect=30, num_iters_init=20, num_iters=4, remove_low_count_gaussians=True)

Implements Kaldi egs/sre10/v1/train_diag_ubm.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
ubmname (str) – A path to the UBM model.
num_threads (int, optional) – Number of threads used for statistics accumulation.
num_frames (int, optional) – Number of feature vectors to store in memory and train on (randomly chosen from the input features).
min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.
num_gauss (int, optional) – Number of Gaussians in the model.
num_gauss_init (int, optional) – Number of Gaussians in the model initially (if nonzero and less than num_gauss, we’ll do mixture splitting).
num_gselect (int, optional) – Number of Gaussians to keep per frame.
num_iters_init (int, optional) – Number of iterations of training for initialization of the single diagonal GMM.
num_iters (int, optional) – Number of iterations of training.
remove_low_count_gaussians (bool, optional) – Kaldi MleDiagGmmOptions: If true, remove Gaussians that fall below the floors.

Returns

A text formatted trained Kaldi global DiagGMM model.

Return type

str

bob.kaldi.train_mono(feats, trans_words, fst_L, topology_in, shared_phones='', numgauss=1000, power=0.25, num_iters=40, beam=6)

Monophone model training.

Parameters

feats (dict) – The input cepstral features (2D array of 32-bit floats).
trans_words (str) – Text transcription of the feats (the word labels)
fst_L (str) – A filename of the lexicon compiled as FST.
topology_in (str) – A topology file that specifies 3-state left-to-right HMM, and default transition probs.
shared_phones (str, optional) – A filename of the phones whose pdfs should be shared.
numgauss (int, optional) – Number of Gaussians in the GMMs.
power (float, optional) – Power to allocate Gaussians to states.
num_iters (int, optional) – Number of iterations for re-estimation of the GMMs.
beam (float, optional) – Decoding beam used in alignment.

Returns

The mono-phones acoustic models.

Return type

str

bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20, min_post=0.025, posterior_scale=1.0)

Implements Kaldi egs/sre10/v1/extract_ivectors.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
fubm (str) – A full-covariance UBM
ivector_extractor (str) – An ivector extractor model
num_gselect (int, optional) – Number of Gaussians to keep per frame.
min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
posterior_scale (float, optional) – A posterior scaling with a global scale.

Returns

The iVectors calculated for the input signal.

Return type

numpy.ndarray
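
The min_post behaviour described above (prune small posteriors, then renormalize) can be sketched for a single frame as follows. The helper prune_posteriors is hypothetical. Keeping the single best Gaussian when everything falls below the threshold is an assumption added so the result always sums to one.

```python
def prune_posteriors(posteriors, min_post=0.025):
    """Zero out posteriors below min_post and renormalize the rest."""
    kept = [p if p >= min_post else 0.0 for p in posteriors]
    if not any(kept):
        # Assumption: never discard every Gaussian, keep the most likely one.
        best = max(range(len(posteriors)), key=lambda i: posteriors[i])
        kept[best] = posteriors[best]
    total = sum(kept)
    return [p / total for p in kept]


# The third Gaussian falls below the threshold and is removed.
print(prune_posteriors([0.60, 0.39, 0.01]))
```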

bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20, ivector_dim=600, use_weights=False, num_iters=5, min_post=0.025, num_samples_for_weights=3, posterior_scale=1.0)

Implements Kaldi egs/sre10/v1/train_ivector_extractor.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
fubm (str) – A full-covariance UBM
ivector_extractor (str) – A path for the ivector extractor
num_gselect (int, optional) – Number of Gaussians to keep per frame.
ivector_dim (int, optional) – Dimension of iVector.
use_weights (bool, optional) – If true, regress the log-weights on the iVector
num_iters (int, optional) – Number of iterations of training.
min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.
num_samples_for_weights (int, optional) – Number of samples from iVector distribution to use for accumulating stats for weight update. Must be >1.
posterior_scale (float, optional) – A posterior scaling with a global scale.

Returns

A text formatted trained Kaldi IvectorExtractor.

Return type

str

bob.kaldi.plda_enroll(feats, pldamean)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors (of a single speaker).
pldamean (str) – A path to the global PLDA mean file

Returns

A path to enrolled PLDA model (average iVectors).

Return type

str
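
The enrolled model is described as the average of a speaker's iVectors. A minimal sketch of that averaging step, using plain lists in place of a numpy array, might look like this. The function average_ivectors is hypothetical. Any mean subtraction or length normalization applied by the underlying Kaldi tools is not shown, since this section does not specify it.

```python
def average_ivectors(ivectors):
    """Column-wise mean of a 2D list of iVectors (one row per utterance)."""
    n = len(ivectors)
    # zip(*rows) iterates over columns, i.e. over iVector dimensions.
    return [sum(column) / n for column in zip(*ivectors)]


# Two 2-dimensional iVectors from the same speaker.
print(average_ivectors([[1.0, 2.0], [3.0, 4.0]]))
```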

bob.kaldi.plda_score(feats, model, plda, globalmean, smoothing=0)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors.
model (str) – A speaker model (average iVectors).
plda (str) – A PLDA model.
globalmean (str) – A global PLDA mean.
smoothing (float, optional) – Factor used in smoothing within-class covariance (add this factor times between-class covar).

Returns

A PLDA score.

Return type

float

bob.kaldi.plda_train(feats, plda_file, mean_file)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters

feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.
plda_file (str) – A path to the trained PLDA model
mean_file (str) – A path to the global PLDA mean file

Returns

Trained PLDA model and global mean (2D str array)

Return type

str

bob.kaldi.compute_vad(samples, rate, vad_energy_mean_scale=0.5, vad_energy_th=5, vad_frames_context=0, vad_proportion_th=0.6)

Performs Voice Activity Detection on a Kaldi feature matrix

Parameters

samples (numpy.ndarray) – The input audio signal, from which a Kaldi feature matrix (with log-energy in the first component of each feature vector) is computed
rate (float) – The sampling rate of the input signal in samples.
vad_energy_mean_scale (float, optional) – If this is set to s, to get the actual threshold we let m be the mean log-energy of the file, and use s*m + vad-energy-th
vad_energy_th (float, optional) – Constant term in energy threshold for MFCC0 for VAD.
vad_frames_context (int, optional) – Number of frames of context on each side of central frame, in window for which energy is monitored
vad_proportion_th (float, optional) – Parameter controlling the proportion of frames within the window that need to have more energy than the threshold

Returns

The labels [1/0] of voiced features (1D array of floats).

Return type

numpy.ndarray
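
The four vad_* parameters interact as in the sketch below, which follows the energy-based decision rule described above: the threshold is s*m + vad_energy_th, and a frame is voiced when a large enough proportion of its context window exceeds it. The function energy_vad is a hypothetical pure-Python illustration operating on a list of per-frame log-energies. It does not call Kaldi.

```python
def energy_vad(log_energy, vad_energy_mean_scale=0.5, vad_energy_th=5.0,
               vad_frames_context=0, vad_proportion_th=0.6):
    """Label each frame 1.0 (voiced) or 0.0 from its log-energy."""
    n = len(log_energy)
    mean = sum(log_energy) / n
    # Threshold = constant term + scale * mean log-energy of the file.
    threshold = vad_energy_th + vad_energy_mean_scale * mean
    labels = []
    for t in range(n):
        # Context window around frame t, clipped at the signal edges.
        lo = max(0, t - vad_frames_context)
        hi = min(n, t + vad_frames_context + 1)
        window = log_energy[lo:hi]
        above = sum(1 for e in window if e > threshold)
        voiced = above >= len(window) * vad_proportion_th
        labels.append(1.0 if voiced else 0.0)
    return labels


# Two quiet frames followed by two loud ones.
print(energy_vad([0.0, 0.0, 20.0, 20.0]))
print(energy_vad([0.0, 0.0, 20.0, 20.0], vad_frames_context=1))
```

A non-zero vad_frames_context smooths the decision, so isolated loud or quiet frames are less likely to flip the label.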

bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)

Computes the MFCCs for given speech samples.

Parameters

data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the MFCCs from. The input needs to be normalized between [-1, 1].
rate (float, optional) – The sampling rate of the input signal in data.
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing
frame_length (int, optional) – Frame length in milliseconds
frame_shift (int, optional) – Frame shift in milliseconds
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)
num_mel_bins (int, optional) – Number of triangular mel-frequency bins
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs
low_freq (int, optional) – Low cutoff frequency for mel bins
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)
dither (float, optional) – Dithering constant (0.0 means no dither)
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.
normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].

Returns

The MFCCs calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray
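
Since data is expected in the range [-1, 1], audio read as signed 16-bit PCM integers has to be rescaled first. One common convention, shown here as an assumption rather than a requirement of bob.kaldi, is to divide by 32768. The helper pcm16_to_float is hypothetical and uses a plain list to stay dependency-free. In practice the result would be converted to a 1D numpy array of 64-bit floats.

```python
def pcm16_to_float(samples):
    """Scale signed 16-bit PCM integers to floats in [-1, 1)."""
    # 32768 is the magnitude of the most negative 16-bit sample.
    return [s / 32768.0 for s in samples]


print(pcm16_to_float([-32768, 0, 16384]))
```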

bob.kaldi.mfcc_from_path(filename, channel=0, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True)

Computes the MFCCs for a given input signal recorded into a file

Parameters

filename (str) – A path to a valid WAV or NIST Sphere file to read data from
channel (int, optional) – The audio channel to read from inside the file
preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis
raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing
frame_length (int, optional) – Frame length in milliseconds
frame_shift (int, optional) – Frame shift in milliseconds
num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)
num_mel_bins (int, optional) – Number of triangular mel-frequency bins
cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs
low_freq (int, optional) – Low cutoff frequency for mel bins
high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)
dither (float, optional) – Dithering constant (0.0 means no dither)
snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.

Returns

The MFCCs calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray
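
The high_freq convention (a non-positive value is an offset from the Nyquist frequency) can be made concrete with a small sketch. The helper effective_high_freq is hypothetical, and the sampling rate is passed explicitly here, whereas mfcc_from_path has no rate argument and presumably reads it from the file.

```python
def effective_high_freq(rate, high_freq=0):
    """Upper mel-bin cutoff in Hz implied by rate and high_freq."""
    nyquist = rate / 2.0
    if high_freq <= 0:
        # Zero means Nyquist itself, negative values move below it.
        return nyquist + high_freq
    return float(high_freq)


print(effective_high_freq(8000))         # default: the Nyquist frequency
print(effective_high_freq(8000, -200))   # 200 Hz below Nyquist
print(effective_high_freq(16000, 7600))  # explicit cutoff
```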

bob.kaldi.get_config()

Returns a string containing the configuration information.