# How to

The `training` module provides several implementations of `ImportanceTraining`
that can wrap a *Keras* model and train it with importance sampling.

```
from importance_sampling.training import ImportanceTraining, BiasedImportanceTraining

# assuming model is a compiled Keras model; the two wrappers below are
# alternatives, use one or the other
wrapped_model = ImportanceTraining(model)
wrapped_model = BiasedImportanceTraining(model, k=1.0, smooth=0.5)

wrapped_model.fit(x_train, y_train, epochs=10)
model.evaluate(x_test, y_test)
```

## Sampling probabilities and sample weights

All of the `fit` methods accept two extra keyword arguments, `on_sample` and
`on_scores`. They are callables that give the user of the library read access
to the sampling probabilities, sample weights and importance scores computed
during importance sampling. Their APIs are the following:

```
on_sample(sampler, idxs, w, predicted_scores)
```

**Arguments**

- **sampler**: The instance of `BaseSampler` currently being used
- **idxs**: A numpy array containing the indices that were sampled
- **w**: A numpy array containing the computed sample weights
- **predicted_scores**: A numpy array containing the unnormalized importance scores

```
on_scores(sampler, scores)
```

**Arguments**

- **sampler**: The instance of `BaseSampler` currently being used
- **scores**: A numpy array containing all the importance scores of the presampled data
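
As a sketch (the accumulator `stats` and the callback names are purely
illustrative; any callables with the signatures above will do), these hooks
can be used to record statistics of the sampling while training proceeds:

```python
import numpy as np

# Illustrative accumulator; any callable with the right signature works
stats = {"mean_weights": [], "score_stds": []}

def on_sample(sampler, idxs, w, predicted_scores):
    # Record the mean sample weight of every sampled mini-batch
    stats["mean_weights"].append(float(w.mean()))

def on_scores(sampler, scores):
    # Record the spread of the importance scores of the presampled data
    stats["score_stds"].append(float(scores.std()))

# wrapped_model.fit(x_train, y_train, epochs=10,
#                   on_sample=on_sample, on_scores=on_scores)
```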

## Bias

The `BiasedImportanceTraining` and `ApproximateImportanceTraining` classes
accept a constructor parameter `k` that biases the gradient estimator to focus
more on hard examples; the smaller the value, the closer the algorithm is to
max-loss minimization. The default is `k=0.5`, which is found to often improve
the generalization performance of the final model.
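
One way to picture the role of `k` is the following toy sketch. It assumes
the estimator dampens the unbiased importance-sampling weights `1/(N * p_i)`
by raising them to the power `k`; the library's exact computation may differ:

```python
import numpy as np

def biased_weights(p, k=0.5):
    # Unbiased importance sampling would weight each sampled point by
    # 1 / (N * p_i). Raising that weight to the power k < 1 dampens the
    # correction, leaving the estimator biased towards hard examples;
    # k = 0 removes the correction entirely (max-loss-like behaviour).
    return (1.0 / (len(p) * p)) ** k

p = np.array([0.7, 0.2, 0.1])  # sampling probabilities of 3 points
biased_weights(p, k=1.0)  # full correction: unbiased estimator
biased_weights(p, k=0.0)  # all weights equal 1: max-loss minimization
```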

## Smoothing

Modern deep networks often have innate sources of randomness (e.g. dropout,
batch normalization) that can result in noisy importance predictions. To
alleviate this noise one can smooth the importance using additive smoothing.
The proposed `ImportanceTraining` class does not use smoothing; instead, we
propose to replace *Dropout* and *BatchNormalization* with regularization and
*LayerNormalization*.

The classes that accept smoothing do so in the following way: the `smooth`
parameter is added to all importance predictions before computing the sampling
distribution. In addition, they accept the `adaptive_smoothing` parameter
which, when set to `True`, multiplies `smooth` by the moving average of the
mini-batch losses.

Although `smooth` is initialized at `smooth=0.0`, if instability is observed
during training it can be set to small values (e.g. `0.05`, `0.1`, `0.5`), or
one can use adaptive smoothing, for which a sane default is `smooth=0.5`.
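
The effect of additive smoothing on the sampling distribution can be sketched
as follows (a toy illustration, not the library's internal code):

```python
import numpy as np

def sampling_distribution(scores, smooth=0.0):
    # Additive smoothing: a constant is added to every importance
    # prediction before normalizing, pushing the distribution towards
    # uniform and making it robust to noisy score estimates.
    smoothed = scores + smooth
    return smoothed / smoothed.sum()

scores = np.array([0.1, 0.1, 9.8])
sampling_distribution(scores)              # heavily skewed towards one point
sampling_distribution(scores, smooth=5.0)  # noticeably closer to uniform
```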

## Methods

The wrapped models aim to expose the same `fit` methods as the original
*Keras* models in order to make their use as simple as possible. The following
is a list of deviations and additions:

- `class_weights` and `sample_weights` are **not** supported
- `fit_generator` accepts a `batch_size` argument
- `fit_generator` is not supported by all `ImportanceTraining` classes
- `fit_dataset` has been added as a method (see Datasets)

Below follows the list of methods with their arguments.

### fit

```
fit(x, y, batch_size=32, epochs=1, verbose=1, callbacks=None, validation_split=0.0, validation_data=None, steps_per_epoch=None, on_sample=None, on_scores=None)
```

**Arguments**

- **x**: Numpy array of training data; lists and dictionaries are not supported
- **y**: Numpy array of target data; lists and dictionaries are not supported
- **batch_size**: The number of samples per gradient update
- **epochs**: Multiplied by `steps_per_epoch` defines the total number of parameter updates
- **verbose**: When set `>0` the *Keras* progress callback is added to the list of callbacks
- **callbacks**: A list of *Keras* callbacks for logging, changing training parameters, monitoring, etc.
- **validation_split**: A float in `[0, 1)` that defines the percentage of the training data to use for evaluation
- **validation_data**: A tuple of numpy arrays containing data and targets to evaluate the network on
- **steps_per_epoch**: The number of gradient updates to do in order to assume that an epoch has passed
- **on_sample**: A callable that accepts the sampler, idxs, w, scores
- **on_scores**: A callable that accepts the sampler and scores

**Returns**

A *Keras* `History`

instance.

### fit_generator

```
fit_generator(train, steps_per_epoch, batch_size=32, epochs=1, verbose=1, callbacks=None, validation_data=None, validation_steps=None, on_sample=None, on_scores=None)
```

**Arguments**

- **train**: A generator yielding tuples of (data, targets)
- **steps_per_epoch**: The number of gradient updates to do in order to assume that an epoch has passed
- **batch_size**: The number of samples per gradient update (in contrast to *Keras* this can be variable)
- **epochs**: Multiplied by `steps_per_epoch` defines the total number of parameter updates
- **verbose**: When set `>0` the *Keras* progress callback is added to the list of callbacks
- **callbacks**: A list of *Keras* callbacks for logging, changing training parameters, monitoring, etc.
- **validation_data**: A tuple of numpy arrays containing data and targets to evaluate the network on, or a generator yielding tuples of (data, targets)
- **validation_steps**: The number of tuples to extract from the validation data generator (if a generator is given)
- **on_sample**: A callable that accepts the sampler, idxs, w, scores
- **on_scores**: A callable that accepts the sampler and scores

**Returns**

A *Keras* `History`

instance.
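
A generator compatible with `fit_generator` can be as simple as the following
sketch (the names here are hypothetical):

```python
import numpy as np

def batch_generator(x, y, batch_size=32):
    # Yield (data, targets) tuples indefinitely, as fit_generator expects
    n = len(x)
    while True:
        idxs = np.random.choice(n, batch_size)
        yield x[idxs], y[idxs]

# wrapped_model.fit_generator(batch_generator(x_train, y_train),
#                             steps_per_epoch=100, epochs=10)
```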

### fit_dataset

```
fit_dataset(dataset, steps_per_epoch=None, batch_size=32, epochs=1, verbose=1, callbacks=None, on_sample=None, on_scores=None)
```

The calls to the other `fit*` methods are delegated to this one after a
`Dataset` instance has been created. See Datasets for details on how to create
a `Dataset` and what datasets are available by default.

**Arguments**

- **dataset**: Instance of the `Dataset` class
- **steps_per_epoch**: The number of gradient updates to do in order to assume that an epoch has passed (if not given it equals the number of training samples)
- **batch_size**: The number of samples per gradient update (in contrast to *Keras* this can be variable)
- **epochs**: Multiplied by `steps_per_epoch` defines the total number of parameter updates
- **verbose**: When set `>0` the *Keras* progress callback is added to the list of callbacks
- **callbacks**: A list of *Keras* callbacks for logging, changing training parameters, monitoring, etc.
- **on_sample**: A callable that accepts the sampler, idxs, w, scores
- **on_scores**: A callable that accepts the sampler and scores

**Returns**

A *Keras* `History`

instance.

## ImportanceTraining

```
importance_sampling.training.ImportanceTraining(model, presample=3.0, tau_th=None, forward_batch_size=None, score="gnorm", layer=None)
```

`ImportanceTraining` uses the passed model to compute the importance of the
samples. It computes the variance reduction and enables importance sampling
only when the variance will be reduced by more than `tau_th`. When importance
sampling is enabled, it samples `presample * batch_size` points uniformly,
runs a **forward pass** on all of them to compute the `score`, and **resamples
the mini-batch according to the importance**.

See our paper for a precise definition of the algorithm.

**Arguments**

- **model**: The Keras model to train
- **presample**: The number of samples to presample for scoring, given as a factor of the batch size
- **tau_th**: The variance reduction threshold after which importance sampling is enabled; when not given it is computed from eq. 29 (it is given in units of batch size increment)
- **forward_batch_size**: The batch size to use when running the forward pass to compute the importance
- **score**: The importance score to use; `gnorm` computes an upper bound to the full gradient norm that requires only one forward pass
- **layer**: Defines which layer will be used to compute the upper bound (if not given it is automatically inferred); it can also be given as an index into the model's layers property
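
The presample-then-resample step described above can be sketched in isolation
(a toy illustration; `score_fn` stands in for the forward pass that scores the
candidates):

```python
import numpy as np

def sample_batch(n, batch_size, presample, score_fn):
    # 1. Uniformly presample `presample * batch_size` candidate indices
    candidates = np.random.choice(n, int(presample * batch_size), replace=False)
    # 2. A single forward pass scores every candidate
    scores = score_fn(candidates)
    # 3. Resample the actual mini-batch according to the importance
    p = scores / scores.sum()
    return np.random.choice(candidates, batch_size, p=p)

# With uniform scores this degenerates to plain uniform sampling
batch = sample_batch(1000, 32, presample=3.0, score_fn=lambda c: np.ones(len(c)))
```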

## BiasedImportanceTraining

```
importance_sampling.training.BiasedImportanceTraining(model, k=0.5, smooth=0.0, adaptive_smoothing=False, presample=256, forward_batch_size=128)
```

`BiasedImportanceTraining` uses the model and the loss to compute the per
sample importance. `presample` data points are sampled uniformly and, after a
forward pass on all of them, the importance distribution is calculated and the
mini-batch is resampled from it.

See the corresponding paper for details.

**Arguments**

- **model**: The Keras model to train
- **k**: Controls the bias of the sampling that focuses the network on the hard examples
- **smooth**: Influences the sampling distribution towards uniform by additive smoothing
- **adaptive_smoothing**: When set to `True` multiplies `smooth` with the average training loss
- **presample**: Defines the number of samples to compute the importance for before creating each batch
- **forward_batch_size**: The batch size to use when running the forward pass to compute the importance

## ApproximateImportanceTraining

```
importance_sampling.training.ApproximateImportanceTraining(model, k=0.5, smooth=0.0, adaptive_smoothing=False, presample=2048)
```

`ApproximateImportanceTraining` creates a small model that uses the per sample
history of the loss and the class to predict the importance of each sample. It
can be faster than `BiasedImportanceTraining` but less effective.

See the corresponding paper for details.

**Arguments**

- **model**: The Keras model to train
- **k**: Controls the bias of the sampling that focuses the network on the hard examples
- **smooth**: Influences the sampling distribution towards uniform by additive smoothing
- **adaptive_smoothing**: When set to `True` multiplies `smooth` with the average training loss
- **presample**: Defines the number of samples to compute the importance for before creating each batch

## SVRG

```
importance_sampling.training.SVRG(model, B=10., B_rate=1.0, B_over_b=128)
```

`SVRG` trains a Keras model with stochastic variance reduced gradient.
Specifically, it implements the following two variants of SVRG:

- SVRG: "Accelerating stochastic gradient descent using predictive variance reduction" by Johnson R. and Zhang T.
- SCSG: "Less than a single pass: Stochastically controlled stochastic gradient" by Lei L. and Jordan M.

**Arguments**

- **model**: The Keras model to train
- **B**: The number of batches to use to compute the full batch gradient; for SVRG this should be either a very large number or 0, for SCSG it can be any number larger than 1
- **B_rate**: A factor to multiply `B` with after every update
- **B_over_b**: Compute a batch gradient after every `B_over_b` gradient updates
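
Both variants build on the variance-reduced gradient of Johnson & Zhang, which
combines the per-sample gradient at the current weights with the same sample's
gradient at a stored snapshot. As a minimal sketch:

```python
import numpy as np

def svrg_gradient(grad_i, w, w_snapshot, full_grad_snapshot):
    # g = grad_i(w) - grad_i(w_snapshot) + full-batch gradient at snapshot.
    # The two snapshot terms cancel in expectation, so the estimator stays
    # unbiased while its variance shrinks as w approaches w_snapshot.
    return grad_i(w) - grad_i(w_snapshot) + full_grad_snapshot

# Toy quadratic: f_i(w) = 0.5 * (w - a_i)^2, so grad_i(w) = w - a_i
a = np.array([1.0, 3.0])
grad_0 = lambda w: w - a[0]
full_grad = lambda w: float(np.mean(w - a))

w_snap = 0.0
g = svrg_gradient(grad_0, w_snap, w_snap, full_grad(w_snap))
# At w == w_snapshot the estimate equals the full-batch gradient
```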