My research is about statistical learning, from techniques using hand-designed stochastic models, to learning problems such as feature selection, active learning, or efficient Boosting. My work is applied mainly to computer-vision, both in my group at the Idiap research institute and in collaboration with the Computer Vision Lab at EPFL and the Universitat Politècnica de Catalunya.

This research is supported by:

- The Swiss National Science Foundation.
- The Hasler Foundation.
- The Swiss Commission for Technology and Innovation.
- The Ark foundation.
- The European Commission.

Joint work with Cijo Jose.

We have investigated the learning of a Mahalanobis distance for re-identification. The technique we developed optimizes (a proxy for) the precision-at-rank-k, combined with a new regularization term that favors orthonormal embeddings (Jose & Fleuret 2016). This penalty prevents the learning to degenerate to rank-deficient mappings and makes the algorithm behave very well on small and medium-sized data-sets.

Joint work with James Newling.

The k-means procedure is one of the most (maybe *the* most)
used machine-learning algorithm. It iteratively optimizes k
centroids to minimize the summed distance to training
points. Despite its simplicity, it can be computationally intensive
when dealing with very large data-sets or high dimension
spaces. Accelerated exact variants have been developed which use the
triangle inequality to maintain bounds on the distances between
data-points and centroids, and avoid unnecessary distance
computations.

We have developed a new exact algorithm named "Exponion" that organizes the data points in shells of increasing thicknesses around each centroid, and maintains the corresponding distance bounds (Newling & Fleuret 2016). On standard benchmarks, this approach beats existing exact fast variants of k-means, with a computation time cut in half in the best cases compared to the second best method.

Joint work with Olivier Canévet, Cijo Jose, and Leonidas Lefakis.

Most of the learning-based detectors rely on a form
of *bootstrapping*, which consists of building a sequence of
detectors getting more and more accurate, each one trained from the
samples incorrectly classified by the previous ones.

At the core of such a technique lies a re-sampling procedure, to extract informative samples from a large data-set. The proportion of these informative samples is very small, and most of the computation goes in filtering out samples properly classified.

We have investigated different strategies to address this problem in an efficient manner. The first approach we have developed uses the regularity of the predictor's response in the image plan to reject groups of samples each time one is examined (Canévet & Fleuret 2014). The second one relies on a hierarchical version of a sample-selection procedure we developed for Boosting, and iteratively filters out large sets of samples to keep only an informative sub-set (Canévet, Lefakis & Fleuret 2014).

Following that idea, we have applied a modified version of the Monte-Carlo tree search to the problem. Instead of sampling a large tree of possible moves as for playing a strategy game, here we want to sample a large tree of samples to sample among the good ones. The resulting algorithm leverages a tree-structure over the samples (datasets / families / images / sub-images) where we define a "win" as finding a false-positives. The main challenge is to adapt the MCTS from finding the single best move to sampling among all the "good moves" (Canévet & Fleuret 2016).

Recently we have started to investigate a different approach, still based on a prior tree-structure over the samples, to improve the stochastic gradient descent by re-sampling samples according to an estimate of the gradient norm (Canévet et al. 2016).

Joint work with Leonidas Lefakis.

Feature selection is often based on the individual information content of features, which may overlook groups of features jointly informative, or select groups of redundant features.

We have developed a novel method dubbed GC.MI to explicitly
estimate the mutual information between a group of continuous features
and a discrete class to predict, under the assumption that the
conditional distribution of the former given the latter is Gaussian
(Lefakis &
Fleuret 2016). This procedure relies on a *Gaussian
compromise* to approximate the Entropy of the features without
conditioning—since they follow a mixture of Gaussians—and
we have developed an efficient organization of the computation to make
the overall procedure tractable.

Our reference open-source implementation is available under the GPLv3.

Joint work with Adrián Peñate Sánchez, Francesc Moreno-Noguer, and Juan Andrade Cetto.

Don Geman and I introduced a few years
ago the idea of *pose-indexed
features*, which are designed to extract measurements from
a signal given a hypothesized target position, so that the
measurements are *stationary* and do not depend
statistically on the said position. We have recently extended the
use of these pose-indexed features from classical object detection
to rigid object pose estimation.

The strategy we have devised (Penate et al. 2014) consists of using high-quality training images that allow us to do reliable key-point matching, so that we can automatically derive an analytical definition of pose-indexed features, and train a discriminative predictor on low-quality images from the said features.

Experimental evaluation shows that this procedure combines the benefits of full-image matching and that of point-of-interest based methods, and can deal with extremely low-quality test images.

Joint work with Leonidas Lefakis.

To solve a mimicking problem, that is the learning of a policy from
demonstrations provided by a *teacher*, we developed a novel
discriminative procedure able to automatically detect
*macro-actions*.

Consider for instance a task where you have to reach a blue flag, touch it, and then reach a red flag. Without the notion of macro-action, a demonstration of the correct policy seems inconsistent: You see the teacher sometime moving toward the blue flag, even if the red is visible, and sometime moving toward the red, even if the blue is visible.

Our DPBoost procedure (Lefakis & Fleuret 2014b) alternatively optimizes several action predictors corresponding to as many macro-actions, and the segmentation of the demonstration sequences into sub-sequences associated to the said macro-actions. Both steps minimize the same exponential loss, the first by adding weak learners, the second with a dynamic programming procedure.

Joint work with Leonidas Lefakis.

The current trend toward very large data-sets is more and more obsoleting batch learning approaches. However, standard on-line alternatives such as the stochastic gradient do not leverage the availability of a memory to store samples temporarily.

We propose the new idea of *reservoir learning*, to use a
sub-set of samples extracted from a stream, and updated while novel
samples get available. Our first algorithm based on this idea extends
the Boosting approach
(Lefakis &
Fleuret 2013).

Joint work with Charles Dubout.

The most standard strategy for object detection consists of evaluating a linear predictor at multiple scales and locations in the image to process. Even sophisticated Deformable Part Models are based on such linear filters for detecting the object itself and its individual "parts".

To speed-up this process, we have proposed to use the standard trick of doing the convolution in the frequency domain, in which case it takes the form of a point-wise product. Making this idea practical requires a careful organization of the computation to deal with the memory size and memory access speed restriction (Dubout & Fleuret 2012). We distribute our implementation under the GPL3. It is close to one order of magnitude faster than the best pre-existing algorithms.

We have recently extended the same idea to the training of the SVM itself (Dubout & Fleuret 2013), and used it to speed-up a version of the DPM with individual part scaling (Dubout & Fleuret 2013b).

Joint work with Charles Dubout.

In the context of the MASH project (Fleuret et al. 2011), we want to build classifiers from a large number of families of features. We are currently handling ~30 such families, and we target learning systems able to cope with at least one order of magnitude more.

We have proposed two variants of Adaboost to cope with these difficulties. The first one, dubbed Tasting (Dubout & Fleuret 2011), consists of sampling a few features from each family before the learning starts, and to use this features to estimate at every Boosting step the most promising feature family, so that we can bias the sampling accordingly. We distribute the source code for that work under a mix of GPLv2 and BSD licenses.

The second one, that we named Maximum Adaptive Sampling (Dubout & Fleuret 2011b) models the loss reduction as a function of the number of features looked at, and the number of samples used to estimate edges. This model allows to optimize the trade-off between the two.

Joint work with Nicolae Suditu.

We are adapting a state-of-the-art interactive image retrieval system to very large image data sets. The algorithm we started from relies on the estimate of a probability of relevance of any image, given the interaction the user had with the system, under a sound statistical model.

We proposed to use a hierarchical partitioning of the image collection computed off-line, and modulate during the interactive search the resolution in different parts of the collection (Suditu & Fleuret 2011). This strategy maintains an accurate approximation of individual probabilities of relevance of images, while fixing an upper bound on the required computation.

More recently we have investigated the exploitation / exploration dilemma inherent to interactive image retrieval systems. The exploration of a collection of images usually consists of a first part where the user roughly tries to locate the category of images she is interested in semantically, followed by a part where she wants to select specific images, with certain signal properties. Our new algorithm evaluates how well the user's behavior matched the system's predictions, and concentrates the sampling of images in the relevant ones accordingly. This allows progressive transitions from a purely exploratory behavior to an exploiting one, and provides a significant increase of performance (Suditu & Fleuret 2012).

Joint work with Karim Ali and David Hasler.

To limit the need for labelled data, we have developed a new method to exploit motion consistency in videos (Ali et al. 2011). We start by labeling the targets in a few frames of the video, and from then we alternate the training of an appearance-based classifier, and the estimation of trajectories physically possible and consistent with the appearance-based detection.

We managed to minimize the same loss in both steps, and leverage the flow-based multi-target tracking we have developed for multi-camera tracking.

The resulting procedure is amazingly stable and we obtain in certain cases with a fraction of the frames labeled better results than with the full sequence labeled. This is probably due to the ability of the system to be more consistent through the sequence in its labeling, and to chose locations of target more friendly to the appearance-based detector.

Joint work with Leonidas Lefakis.

One of the very standard approaches to object detection consists of training a sequence of two-class classifiers, each one designed to catch all the positive examples and filter out as much as possible the negative samples not caught by its predecessors.

We have developed a Boosting variant which trains all these classifiers simultaneously, by adding weak learners in each of them sequentially (Lefakis & Fleuret 2010). This procedure relies on a stochastic interpretation of the classifier responses: Each one is interpreted as the (log ratio of the) probability that the classifier Binary response is positive, and the overall response of the cascade is the (log ratio of the) probability that they all respond positively, under an assumption of independence.

The resulting procedure pushes all the classifiers to respond properly on the positive samples, and pushes the classifiers which are "already good" to get even better on each negative samples.

Joint work with Horesh Ben Shitrit, Jérôme Berclaz and Pascal Fua.

We are investigating the detection of individuals in video streams and estimation of their locations on the ground plane (Fleuret et al. 2008). The first part of our algorithm is the Probabilistic Occupancy Map, which consists of estimating in each time frame independently an approximation of the marginal probabilities of presence at every location with a naive mean-field type procedure.

The second part filters these detections with a time-based
regularization. We introduce a graph whose vertices are locations in
time/space, and edges are motions which are physically possible
(Berclaz et
al. 2011). Finding the trajectories of an *a priori* unknown
number of targets boils down to finding a flow minimizing a linear
cost in this graph, which is a convex problem that can be solved very
efficiently. Recently, we have added an appearance-based model to
identify the two teams in a basketball game
(Ben Shitrit et al.
2011).

Also, we have investigated how to model more sophisticated behaviors by introducing "behavioral maps" (Berclaz et al. 2008). The motion model is estimated through a generalized EM procedure from raw frame-by-frame detection results. This new model improves the quality of the tracking and allows for the automatic detection of atypical behaviors.

There are plenty of results and videos on the CVLab page.

You can download the source code of the Probabilistic Occupancy Map distributed under the terms of the version 3 of the GNU General Public Licence.

Joint work with Donald Geman.

We developed a novel algorithm for the detection and the estimation of the pose of complex objects in cluttered scene (Fleuret & Geman 2008). We propose the notion of pose-indexed features which should ideally have a response distribution which does not depend on the pose, given that a target is present. This novel idea allows to train a single classifier common to many different poses, and fixes the main weakness of the coarse-to-fine strategy for object detection (Fleuret & Geman 2001). We demonstrate the performance of that approach on cat detection. You can watch a talk given at Google Zurich in October 2008 on the topic.

You can download the complete data set and the source code. The latter is distributed under the terms of the version 3 of the GNU General Public Licence.

Joint work with Karim Ali, David Hasler and Pascal Fua.

The pose-indexed features we introduced for the cat detection require a dense exploration of a geometrical pose space during detection. Controlling the cost of this search is a key issue, and we have usually used methods combining coarse-to-fine representations and lazy evaluations.

In this project, we try another venue to address the same issue. We propose to learn jointly both the features and estimators of the latent pose parameters. These estimators are closed form rules able to compute a parameter directly from the signal.

We tested the efficiency of this new idea in the context of hand detection for industrial applications (Ali et al. 2009).

Boosting can be seen as a gradient descent: at every step, the learning procedure adds a weak learner corresponding to the direction of maximum local reduction of a loss. Hence, there is a natural generalization to a multi-layer structure similar to MLP. The derivative of the loss with respect to the intermediate responses is propagated through the predictor, and weak learners are added in the inner functionals in a similar fashion as with classical boosting.

As expected, such a structure with a hidden layer performs substantially better for pattern recognition than the combination of classical boosting with a local feature extraction trained in a non-supervised manner (Fleuret 2009).

Joint work with Germán González Serrano and Pascal Fua.

This work tackles the problem of automatic network delineation, with an emphasis on dendritic trees in neural tissues (Gonzalez et al. 2008). We proposed a novel approach mixing machine learning for the local characterization of filament-like locations, with a Bayesian modeling and stochastic optimization of the global tree network. We have investigated the use of steerable filters to create a filament detector which can be dedicated on-the-fly to an arbitrary orientation for a minimal computational overhead in both 2D (Gonzalez et al. 2009) and 3D (Gonzalez et al. 2009b).

You can see a short video (6Mb, divx) illustrating a simpler 2D algorithm we developed during a pre-study (Fleuret & Fua 2006) for dendrite reconstruction.

Joint work with Gilles Blanchard.

One of the main weakness of learning techniques is the need for
large training sets. We worked on the learning of high-level
invariance from a large set of images, to be able to learn the
appearance of a new object from a single image of it. We called our
approach *Chopping*
(Fleuret &
Blanchard 2005) since it relies on a very large number of binary
splits of the image space.

You can download the database of images of 150 latex symbols we used, in mnist-similar matlab format.

Feature selection based on conditional mutual information (Fleuret 2004) gives good statistical performances and is computationally efficient. This algorithm can select tens of features from a family of tens of thousands in a less than a second on a standard PC. Experiments demonstrate that on tasks such as drug screening or image classification, a naive Bayesian classifier combining features selected with this technique achieves error rates similar to those obtained with sophisticated techniques such as SVM or boosting.

You can download the source code of the CMIM feature selection algorithm, distributed under the terms of the version 3 of the GNU General Public Licence.

Beside this main themes, I also worked on multiple testing (Blanchard & Fleuret 2007), cancerous cells detection and on key-point characterization (Özuysal et al. 2006).

My PhD and part of my post-doc was about object recognition and face-detection. We developed an original coarse-to-fine algorithm very efficient both in speed and error rates (Fleuret 2000, Fleuret & Geman 2001, Fleuret & Geman 2002).

I also worked on kernel design (Fleuret & Sahbi 2003, Boughorbel et al. 2005, Fleuret & Gerstner 2005), functional neural networks (Rossi, Conan-Guez & Fleuret 2002a, Rossi, Conan-Guez & Fleuret 2002b), goal-planning (Fleuret & Brunet 2000), texture segmentation (Shahrokni et al 2009), content-based image retrieval (Boujemaa et al. 2001, Boujemaa et al. 2004) and object recognition (Jedynak & Fleuret 1996, unfortunately in French).