Python API to bob.kaldi

This section includes information for using the Python API of bob.kaldi.

Functions

bob.kaldi.cepstral(data, cepstral_type, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, delta_order=2, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)

Computes the cepstral (mfcc/plp) features for given speech samples.

Parameters
  • data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the cepstral features from. The input needs to be normalized between [-1, 1].

  • rate (float) – The sampling rate of the input signal in data.

  • cepstral_type (str) – The type of cepstral features: mfcc or plp

  • preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis

  • raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing

  • delta_order (int, optional) – Add deltas to raw mfcc or plp features

  • frame_length (int, optional) – Frame length in milliseconds

  • frame_shift (int, optional) – Frame shift in milliseconds

  • num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)

  • num_mel_bins (int, optional) – Number of triangular mel-frequency bins

  • cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs

  • low_freq (int, optional) – Low cutoff frequency for mel bins

  • high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)

  • dither (float, optional) – Dithering constant (0.0 means no dither)

  • snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.

  • normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].

Returns

The cepstral features calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray
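The effect of frame_length, frame_shift and snip_edges on the size of the returned array can be worked out by hand. The sketch below is not part of bob.kaldi: num_frames and feature_dim are hypothetical helpers. They assume the usual Kaldi framing arithmetic, and that each delta order appends another num_ceps columns.

```python
def num_frames(num_samples, rate=8000, frame_length=25, frame_shift=10,
               snip_edges=True):
    """Estimate how many frames a signal of num_samples samples yields."""
    # Convert the window and shift from milliseconds to samples.
    window = int(rate * frame_length / 1000)
    shift = int(rate * frame_shift / 1000)
    if snip_edges:
        # Only frames that fit completely inside the signal are kept.
        if num_samples < window:
            return 0
        return 1 + (num_samples - window) // shift
    # Otherwise the count depends on the shift alone (rounded to nearest).
    return (num_samples + shift // 2) // shift


def feature_dim(num_ceps=13, delta_order=2):
    """Assumed feature width: static cepstra plus one block per delta order."""
    return num_ceps * (delta_order + 1)


one_second = 8000  # samples at the default rate of 8000 Hz
print(num_frames(one_second), num_frames(one_second, snip_edges=False))
print(feature_dim())
```

Under these assumptions, one second of audio at the default settings gives slightly fewer frames with snip_edges=True than with snip_edges=False, because the last partial windows are dropped.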

bob.kaldi.compute_dnn_phone(samples, rate)

Computes phone posteriors on a Kaldi feature matrix

Parameters
  • samples (numpy.ndarray) – The input audio signal, sampled at rate, from which the feature matrix is computed

  • rate (float) – The sampling rate of the input signal in samples.

Returns

The phone posteriors and labels.

Return type

numpy.ndarray

bob.kaldi.compute_dnn_vad(samples, rate, silence_threshold=0.9, posterior=0)

Performs Voice Activity Detection on a Kaldi feature matrix

Parameters
  • samples (numpy.ndarray) – The input audio signal, sampled at rate, from which the feature matrix is computed

  • rate (float) – The sampling rate of the input signal in samples.

  • silence_threshold (float, optional) – Silence threshold to be used for silence posterior evaluation.

  • posterior (int, optional) – Index of the posterior feature to be used for detection. Useful values are 0, 1 and 2, for silence, laughter and noise, respectively.

Returns

The labels [1/0] of voiced features (1D array of floats).

Return type

numpy.ndarray
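How silence_threshold and posterior might turn a posterior matrix into 0/1 labels can be sketched as follows. This is an illustration only: labels_from_posteriors is a hypothetical helper, and the comparison direction (a frame is voiced when the selected posterior stays below the threshold) is an assumption, not something this section confirms.

```python
def labels_from_posteriors(posteriors, silence_threshold=0.9, posterior=0):
    """Return 1.0 for frames treated as voiced and 0.0 otherwise."""
    labels = []
    for frame in posteriors:
        # Assumption: a frame is voiced when the selected posterior
        # (index 0 is silence) is below the threshold.
        labels.append(1.0 if frame[posterior] < silence_threshold else 0.0)
    return labels


# Three made-up frames with posteriors for silence, laughter and noise.
example = [
    [0.95, 0.01, 0.04],
    [0.20, 0.05, 0.75],
    [0.10, 0.00, 0.90],
]
print(labels_from_posteriors(example))
```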

bob.kaldi.nnet_forward(feats, nnet, feats_transform='', apply_log=False, no_softmax=False, prior_floor=1e-10, prior_scale=1, use_gpu=False)

Computes the forward pass for given features.

Parameters
  • feats (numpy.ndarray) – The input cepstral features (2D array of 32-bit floats).

  • nnet (str) – The neural network

  • feats_transform (str, optional) – The input feature transform for feats.

  • apply_log (bool, optional) – Transform NN output by log().

  • no_softmax (bool, optional) – Removes the last component with Softmax.

  • prior_floor (float, optional) – Flooring constant for prior probability.

  • prior_scale (float, optional) – Scaling factor to be applied on pdf-log-priors.

  • use_gpu (bool, optional) – Compute forward pass on GPU.

Returns

The posterior features.

Return type

numpy.ndarray

bob.kaldi.gmm_score(feats, spkubm, ubm)

Computes per-frame log-likelihoods for the input utterance.

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • spkubm (str) – A text formatted Kaldi adapted global DiagGMM.

  • ubm (str) – A text formatted Kaldi global DiagGMM.

Returns

The average of per-frame log-likelihoods.

Return type

float
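For reference, the average per-frame log-likelihood under a diagonal-covariance GMM can be computed as in the sketch below. It illustrates the quantity only, using plain Python lists. It does not read Kaldi model files, the function names are hypothetical, and how gmm_score combines spkubm and ubm is not specified in this section.

```python
import math


def diag_gmm_frame_loglike(frame, weights, means, variances):
    """Log-likelihood of one frame under a diagonal-covariance GMM."""
    component_logs = []
    for w, mu, var in zip(weights, means, variances):
        log_p = math.log(w)
        # Sum the per-dimension Gaussian log-densities.
        for x, m, v in zip(frame, mu, var):
            log_p += -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        component_logs.append(log_p)
    # Log-sum-exp over the components for numerical stability.
    top = max(component_logs)
    return top + math.log(sum(math.exp(c - top) for c in component_logs))


def average_loglike(frames, weights, means, variances):
    """Average of the per-frame log-likelihoods over an utterance."""
    total = sum(diag_gmm_frame_loglike(f, weights, means, variances)
                for f in frames)
    return total / len(frames)


# One-dimensional toy model with a single standard-normal component.
print(average_loglike([[0.0], [0.0]], [1.0], [[0.0]], [[1.0]]))
```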

bob.kaldi.ubm_enroll(feats, ubm)

Performs MAP adaptation of the GMM-UBM model.

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • ubm (str) – A text formatted Kaldi global DiagGMM.

Returns

A text formatted Kaldi enrolled DiagGMM.

Return type

str

bob.kaldi.ubm_full_train(feats, dubm, fubmfile, num_gselect=20, num_iters=4, min_gaussian_weight=0.0001)

Implements Kaldi egs/sre10/v1/train_full_ubm.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • dubm (str) – A text formatted trained Kaldi global DiagGMM model.

  • fubmfile (str) – A path to the full covariance UBM model.

  • num_gselect (int, optional) – Number of Gaussians to keep per frame.

  • num_iters (int, optional) – Number of iterations of training.

  • min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.

Returns

A path to the full covariance UBM model.

Return type

str

bob.kaldi.ubm_train(feats, ubmname, num_threads=4, num_frames=500000, min_gaussian_weight=0.0001, num_gauss=2048, num_gauss_init=0, num_gselect=30, num_iters_init=20, num_iters=4, remove_low_count_gaussians=True)

Implements Kaldi egs/sre10/v1/train_diag_ubm.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • ubmname (str) – A path to the UBM model.

  • num_threads (int, optional) – Number of threads used for statistics accumulation.

  • num_frames (int, optional) – Number of feature vectors to store in memory and train on (randomly chosen from the input features).

  • min_gaussian_weight (float, optional) – Kaldi MleDiagGmmOptions: Min Gaussian weight before we remove it.

  • num_gauss (int, optional) – Number of Gaussians in the model.

  • num_gauss_init (int, optional) – Number of Gaussians in the model initially (if nonzero and less than num_gauss, we’ll do mixture splitting).

  • num_gselect (int, optional) – Number of Gaussians to keep per frame.

  • num_iters_init (int, optional) – Number of iterations of training for initialization of the single diagonal GMM.

  • num_iters (int, optional) – Number of iterations of training.

  • remove_low_count_gaussians (bool, optional) – Kaldi MleDiagGmmOptions: If true, remove Gaussians that fall below the floors.

Returns

A text formatted trained Kaldi global DiagGMM model.

Return type

str
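The documented signatures of ubm_train, ubm_enroll and gmm_score suggest a simple GMM-UBM verification flow. The sketch below strings them together. gmm_ubm_trial is a hypothetical helper, and its kaldi argument is assumed to be the imported bob.kaldi module, passed in explicitly so the flow can be exercised with a stand-in object when Kaldi is not installed.

```python
def gmm_ubm_trial(kaldi, train_feats, enroll_feats, probe_feats, ubm_path):
    """Train a UBM, enroll one speaker and score one probe utterance.

    kaldi is expected to expose ubm_train, ubm_enroll and gmm_score
    with the signatures documented in this section.
    """
    # Train the diagonal UBM; the result is a text-formatted DiagGMM.
    ubm = kaldi.ubm_train(train_feats, ubm_path, num_gauss=2048, num_iters=4)
    # MAP-adapt the UBM to the enrollment features of one speaker.
    speaker_model = kaldi.ubm_enroll(enroll_feats, ubm)
    # Score the probe features against the adapted model and the UBM.
    return kaldi.gmm_score(probe_feats, speaker_model, ubm)
```

With bob.kaldi installed, a call would look like gmm_ubm_trial(bob.kaldi, train_mfcc, enroll_mfcc, probe_mfcc, 'model.dubm'), where the three feature arrays come from bob.kaldi.mfcc and the file name is a placeholder.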

bob.kaldi.train_mono(feats, trans_words, fst_L, topology_in, shared_phones='', numgauss=1000, power=0.25, num_iters=40, beam=6)

Monophone model training.

Parameters
  • feats (dict) – The input cepstral features, given as a dictionary of 2D arrays of 32-bit floats.

  • trans_words (str) – Text transcription of the feats (the word labels)

  • fst_L (str) – A filename of the lexicon compiled as FST.

  • topology_in (str) – A topology file that specifies a 3-state left-to-right HMM and default transition probabilities.

  • shared_phones (str, optional) – A filename of the list of phones whose pdfs should be shared.

  • numgauss (int, optional) – Number of Gaussians in the GMMs.

  • power (float, optional) – Power to allocate Gaussians to states.

  • num_iters (int, optional) – Number of iterations for re-estimation of the GMMs.

  • beam (float, optional) – Decoding beam used in alignment.

Returns

The monophone acoustic models.

Return type

str

bob.kaldi.ivector_extract(feats, fubm, ivector_extractor, num_gselect=20, min_post=0.025, posterior_scale=1.0)

Implements Kaldi egs/sre10/v1/extract_ivectors.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • fubm (str) – A full-covariance UBM

  • ivector_extractor (str) – An ivector extractor model

  • num_gselect (int, optional) – Number of Gaussians to keep per frame.

  • min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.

  • posterior_scale (float, optional) – Global scale applied to the posteriors.

Returns

The iVectors calculated for the input signal.

Return type

numpy.ndarray
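The pruning described for min_post can be illustrated on a single frame of Gaussian posteriors. This is a sketch, not the bob.kaldi or Kaldi code: prune_posteriors is a hypothetical helper, and the fallback for a frame whose posteriors all fall below the threshold (keep only the largest one) is an assumption.

```python
def prune_posteriors(posteriors, min_post=0.025):
    """Drop posteriors below min_post and renormalize the rest to sum to one.

    Returns a dict mapping the kept Gaussian index to its new posterior.
    """
    kept = {i: p for i, p in enumerate(posteriors) if p >= min_post}
    if not kept:
        # Assumption: if every value is below the threshold, keep only
        # the largest one with probability 1.
        best = max(range(len(posteriors)), key=lambda i: posteriors[i])
        return {best: 1.0}
    total = sum(kept.values())
    return {i: p / total for i, p in kept.items()}


frame = [0.50, 0.30, 0.19, 0.01]
print(prune_posteriors(frame))
```

Here the last entry is below the default threshold, so it is removed and the remaining three are rescaled.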

bob.kaldi.ivector_train(feats, fubm, ivector_extractor, num_gselect=20, ivector_dim=600, use_weights=False, num_iters=5, min_post=0.025, num_samples_for_weights=3, posterior_scale=1.0)

Implements Kaldi egs/sre10/v1/train_ivector_extractor.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing MFCCs.

  • fubm (str) – A full-covariance UBM

  • ivector_extractor (str) – A path for the ivector extractor

  • num_gselect (int, optional) – Number of Gaussians to keep per frame.

  • ivector_dim (int, optional) – Dimension of iVector.

  • use_weights (bool, optional) – If true, regress the log-weights on the iVector

  • num_iters (int, optional) – Number of iterations of training.

  • min_post (float, optional) – If nonzero, posteriors below this threshold will be pruned away and the rest will be renormalized to sum to one.

  • num_samples_for_weights (int, optional) – Number of samples from iVector distribution to use for accumulating stats for weight update. Must be >1.

  • posterior_scale (float, optional) – Global scale applied to the posteriors.

Returns

A text formatted trained Kaldi IvectorExtractor.

Return type

str

bob.kaldi.plda_enroll(feats, pldamean)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors (of a single speaker).

  • pldamean (str) – A path to the global PLDA mean file

Returns

A path to enrolled PLDA model (average iVectors).

Return type

str
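The "average iVectors" mentioned above amounts to an element-wise mean over the enrollment iVectors of one speaker. A minimal sketch follows, with average_ivectors as a hypothetical helper. Any mean subtraction or length normalization that the real enrollment may apply is not modeled here.

```python
def average_ivectors(ivectors):
    """Element-wise mean of a list of equal-length iVectors."""
    count = len(ivectors)
    # zip(*ivectors) walks over the columns, one dimension at a time.
    return [sum(column) / count for column in zip(*ivectors)]


# Two made-up 2-dimensional iVectors of the same speaker.
print(average_ivectors([[1.0, 2.0], [3.0, 6.0]]))
```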

bob.kaldi.plda_score(feats, model, plda, globalmean, smoothing=0)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors.

  • model (str) – A speaker model (average iVectors).

  • plda (str) – A PLDA model.

  • globalmean (str) – A global PLDA mean.

  • smoothing (float, optional) – Factor used in smoothing the within-class covariance (add this factor times the between-class covariance).

Returns

A PLDA score.

Return type

float

bob.kaldi.plda_train(feats, plda_file, mean_file)

Implements Kaldi egs/sre10/v1/plda_scoring.sh

Parameters
  • feats (numpy.ndarray) – A 2D numpy ndarray object containing iVectors.

  • plda_file (str) – A path to the trained PLDA model

  • mean_file (str) – A path to the global PLDA mean file

Returns

Trained PLDA model and global mean (2D str array)

Return type

str

bob.kaldi.compute_vad(samples, rate, vad_energy_mean_scale=0.5, vad_energy_th=5, vad_frames_context=0, vad_proportion_th=0.6)

Performs Voice Activity Detection on a Kaldi feature matrix

Parameters
  • samples (numpy.ndarray) – The input audio signal, sampled at rate, from which the feature matrix (with log-energy as the first component of each feature vector) is computed

  • rate (float) – The sampling rate of the input signal in samples.

  • vad_energy_mean_scale (float, optional) – If this is set to s, the actual threshold is s*m + vad_energy_th, where m is the mean log-energy of the file

  • vad_energy_th (float, optional) – Constant term in energy threshold for MFCC0 for VAD.

  • vad_frames_context (int, optional) – Number of frames of context on each side of central frame, in window for which energy is monitored

  • vad_proportion_th (float, optional) – Parameter controlling the proportion of frames within the window that need to have more energy than the threshold

Returns

The labels [1/0] of voiced features (1D array of floats).

Return type

numpy.ndarray
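The four vad_* parameters interact as sketched below. The sketch operates on a plain list of per-frame log-energies and follows the descriptions above. energy_vad is a hypothetical helper, not the bob.kaldi implementation, and the treatment of windows at the signal edges (truncated to the available frames) is an assumption.

```python
def energy_vad(log_energy, vad_energy_mean_scale=0.5, vad_energy_th=5.0,
               vad_frames_context=0, vad_proportion_th=0.6):
    """Label each frame 1.0 (voiced) or 0.0 from its log-energy."""
    n = len(log_energy)
    if n == 0:
        return []
    # Threshold: s * m + vad_energy_th, with m the mean log-energy.
    mean = sum(log_energy) / n
    threshold = vad_energy_mean_scale * mean + vad_energy_th
    labels = []
    for t in range(n):
        # Window of frames around t, truncated at the signal edges.
        start = max(0, t - vad_frames_context)
        stop = min(n, t + vad_frames_context + 1)
        window = log_energy[start:stop]
        above = sum(1 for e in window if e > threshold)
        # Voiced if a large enough share of the window exceeds the threshold.
        voiced = above >= vad_proportion_th * len(window)
        labels.append(1.0 if voiced else 0.0)
    return labels


energies = [2.0, 2.0, 12.0, 12.0, 2.0]
print(energy_vad(energies))
```

Raising vad_frames_context smooths the decision: an isolated loud frame surrounded by quiet ones is then less likely to be labeled as voiced.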

bob.kaldi.mfcc(data, rate=8000, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True, normalization=True)

Computes the MFCCs for given speech samples.

Parameters
  • data (numpy.ndarray) – A 1D numpy ndarray object containing 64-bit float numbers with the audio signal to calculate the MFCCs from. The input needs to be normalized between [-1, 1].

  • rate (float) – The sampling rate of the input signal in data.

  • preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis

  • raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing

  • frame_length (int, optional) – Frame length in milliseconds

  • frame_shift (int, optional) – Frame shift in milliseconds

  • num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)

  • num_mel_bins (int, optional) – Number of triangular mel-frequency bins

  • cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs

  • low_freq (int, optional) – Low cutoff frequency for mel bins

  • high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)

  • dither (float, optional) – Dithering constant (0.0 means no dither)

  • snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.

  • normalization (bool, optional) – If true, the input samples in data are normalized to [-1, 1].

Returns

The MFCCs calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray
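A call to bob.kaldi.mfcc might look like the sketch below. make_tone and mfcc_of_tone are hypothetical helpers. The tone is synthetic test input, and the imports of numpy and bob.kaldi are kept inside mfcc_of_tone so that the rest of the example runs without them. The keyword names and defaults are taken from the signature documented above.

```python
import math


def make_tone(rate=8000, seconds=1.0, freq=440.0, amplitude=0.5):
    """Synthesize a sine tone whose samples lie within [-1, 1]."""
    count = int(rate * seconds)
    return [amplitude * math.sin(2.0 * math.pi * freq * n / rate)
            for n in range(count)]


def mfcc_of_tone(rate=8000):
    """Run bob.kaldi.mfcc on the tone (requires numpy and bob.kaldi)."""
    import numpy
    import bob.kaldi

    # The API expects a 1D array of 64-bit floats.
    data = numpy.array(make_tone(rate=rate), dtype="float64")
    return bob.kaldi.mfcc(data, rate=rate, frame_length=25, frame_shift=10,
                          num_ceps=13, snip_edges=True)


tone = make_tone()
print(len(tone), max(abs(s) for s in tone))
```

With bob.kaldi installed, mfcc_of_tone() should return the 2D array of 32-bit floats described under Returns.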

bob.kaldi.mfcc_from_path(filename, channel=0, preemphasis_coefficient=0.97, raw_energy=True, frame_length=25, frame_shift=10, num_ceps=13, num_mel_bins=23, cepstral_lifter=22, low_freq=20, high_freq=0, dither=1.0, snip_edges=True)

Computes the MFCCs for a given input signal recorded into a file

Parameters
  • filename (str) – A path to a valid WAV or NIST Sphere file to read data from

  • channel (int, optional) – The audio channel to read from inside the file

  • preemphasis_coefficient (float, optional) – Coefficient for use in signal preemphasis

  • raw_energy (bool, optional) – If true, compute energy before preemphasis and windowing

  • frame_length (int, optional) – Frame length in milliseconds

  • frame_shift (int, optional) – Frame shift in milliseconds

  • num_ceps (int, optional) – Number of cepstra in MFCC computation (including C0)

  • num_mel_bins (int, optional) – Number of triangular mel-frequency bins

  • cepstral_lifter (int, optional) – Constant that controls scaling of MFCCs

  • low_freq (int, optional) – Low cutoff frequency for mel bins

  • high_freq (int, optional) – High cutoff frequency for mel bins (if < 0, offset from Nyquist)

  • dither (float, optional) – Dithering constant (0.0 means no dither)

  • snip_edges (bool, optional) – If true, end effects will be handled by outputting only frames that completely fit in the file, and the number of frames depends on the frame-length. If false, the number of frames depends only on the frame-shift, and we reflect the data at the ends.

Returns

The MFCCs calculated for the input signal (2D array of 32-bit floats).

Return type

numpy.ndarray

bob.kaldi.get_config()

Returns a string containing the configuration information.