# Swiss Machine Learning Day 2018

The event took place on **Wednesday, Nov 7th, 2018**, in room ROOM 5ABC of the Swisstech Convention Center on EPFL campus.

Organized by François Fleuret and Martin Jaggi.

# Program

- 09:15 – Crowd-Robot Interaction: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning (Changan Chen, EPFL)
- 09:30 – A Physically-Consistent Bayesian Non-Parametric Mixture Model for Dynamical System Learning (Nadia Figueroa, EPFL)
- 09:45 – Sample-Efficient Imitation Learning via Generative Adversarial Nets (Lionel Blonde, UNIGE)
- 10:00 – Sparsified SGD with Memory (Jean-Baptiste Cordonnier, EPFL)
- 10:15 – Local Rotation Invariance and Directional Sensitivity of 3D Texture Operators: Comparing Classical Radiomics, CNNs and Spherical Harmonics (Vincent Andrearczyk, HES/SO)
- 10:30 – Coffee break & Posters
- 11:15 – Not All Samples Are Created Equal: Deep Learning with Importance Sampling (Angelos Katharopoulos, IDIAP)
- 11:30 – Evading classifiers in discrete domains with provable optimality guarantees (Bogdan Kulynych, EPFL)
- 11:45 – Regression Concept Vectors and Bidirectional Explanations of Deep Learning predictions (Mara Graziani, HES/SO)
- 12:00 – Beyond Optimism: Information Directed Exploration in Bandits (Johannes Kirschner, ETHZ)
- 12:15 – CNN-based face presentation attack detection using patches of faces (Amir Mohammadi, IDIAP)
- 12:30 – Lunch
- 13:45 – Semi-blind spatially-variant deconvolution and focus estimation in optical microscopy by use of convolutional neural networks (Adrian Shajkofci, IDIAP)
- 14:00 – Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmology (Nathanaël Perraudin, ETHZ)
- 14:15 – Weakly and unsupervised disentangling factors of variation (Attila Szabó, UNIBE)
- 14:30 – A few pixels can make a big difference (Apostolos Modas, EPFL)
- 14:45 – Deep Self-Organization: Interpretable Discrete Representation Learning on Time Series (Vincent Fortuin, ETHZ)
- 15:00 – Coffee break & Posters
- 15:45 – Autonomous Drone Racing (Elia Kaufmann, UZH)
- 16:00 – Privacy-Preserving Classification with Secret Vector Machines (Valentin Hartmann, EPFL)
- 16:15 – Patient-Attentive Sequential Strategy: A Reinforcement Learning-based Perimetry Test (Serife Kucur, UNIBE)
- 16:30 – Trace and Detect Adversarial Attacks on CNNs using Feature Response Maps. (Amirian Mohammadreza, ZHAW)
- 16:45 – Variational Methods to find all solutions of an Optimal Control problem (Thibaut Kulak, IDIAP)

# Abstracts

## Not All Samples Are Created Equal: Deep Learning with Importance Sampling (Angelos Katharopoulos, IDIAP)

Deep neural network training spends most of the computation on examples that are properly handled, and could be ignored.

We propose to mitigate this phenomenon with a principled importance sampling scheme that focuses computation on ``informative'' examples, and reduces the variance of the stochastic gradients during training. Our contribution is twofold: first, we derive a tractable upper bound to the per-sample gradient norm, and second we derive an estimator of the variance reduction achieved with importance sampling, which enables us to switch it on when it will result in an actual speedup.

The resulting scheme can be used by changing a few lines of code in a standard SGD procedure, and we demonstrate experimentally, on image classification, CNN fine-tuning, and RNN training, that for a fixed wall-clock time budget, it provides a reduction of the train losses of up to an order of magnitude and a relative improvement of test errors between 5% and 17%.

## Sparsified SGD with Memory (Jean-Baptiste Cordonnier, EPFL)

Huge scale machine learning problems are nowadays tackled by distributed optimization algorithms, i.e. algorithms that leverage the compute power of many devices for training. The communication overhead is a key bottleneck that hinders perfect scalability. Various recent works proposed to use quantization or sparsification techniques to reduce the amount of data that needs to be communicated, for instance by only sending the most significant entries of the stochastic gradient (top-k sparsification). Whilst such schemes showed very promising performance in practice, they have eluded theoretical analysis so far. In this work we analyze Stochastic Gradient Descent (SGD) with k-sparsification or compression (for instance top-k or random-k) and show that this scheme converges at the same rate as vanilla SGD when equipped with error compensation (keeping track of accumulated errors in memory). That is, communication can be reduced by a factor of the dimension of the problem (sometimes even more) whilst still converging at the same rate. We present numerical experiments to illustrate the theoretical findings and the better scalability for distributed applications.

## Local Rotation Invariance and Directional Sensitivity of 3D Texture Operators: Comparing Classical Radiomics, CNNs and Spherical Harmonics (Vincent Andrearczyk, HES/SO)

We define and investigate the Local Rotation Invariance (LRI) and Directional Sensitivity (DS) of radiomics features. Most of the classical features cannot combine the two properties, which are antagonist in simple designs. We propose texture operators based on spherical harmonic wavelets (SHW) invariants and show that they are both LRI and DS. An experimental comparison of SHW, popular radiomics operators and O-group equivariant Convolutional Neural Networks (CNNs) for classifying 3D textures reveals the importance of combining the two properties for optimal pattern characterization.

## Crowd-Robot Interaction: Crowd-aware Robot Navigation with Attention-based Deep Reinforcement Learning (Changan Chen, EPFL)

Mobility in an effective and socially-compliant manner is an essential yet challenging task for robots operatingin crowded spaces. Recent works have shown the power of deep reinforcement learning techniques to learn socially cooperative policies. However, their cooperation ability deteriorates as the crowd grows since they typically relax the problem as a one-way Human-Robot interaction problem. In this work, we want to go beyond first-order Human-Robot interaction and more explicitly model Crowd-Robot Interaction (CRI). We propose to (i) rethink pairwise interactions with a self-attention mechanism, and (ii) jointly model Human-Robot as well as Human-Human interactions in the deep reinforcement learning framework. Our model captures the Human-Human interactions occurring in dense crowds that indirectly affects the robot’s anticipation capability. Our proposed attentive pooling mechanism learns the collective importance of neighboring humans with respect to their future states. Various experiments demonstrate that our model can anticipate human dynamics and navigate in crowds with time efficiency, outperforming state-of-the-art methods.

## Sample-Efficient Imitation Learning via Generative Adversarial Nets (Lionel Blonde, UNIGE)

Recent work in imitation learning articulate their formulation around the GAIL architecture, relying on the adversarial training procedure introduced in GANs. Albeit successful at generating behaviours similar to those demonstrated to the agent, GAIL suffers from a high sample complexity in the number of interactions it has to carry out in the environment in order to achieve satisfactory performance. In this work, we dramatically shrink the amount of interactions with the environment by leveraging an off-policy actor-critic architecture. Additionally, employing deterministic policy gradients allows us to treat the learned reward as a differentiable node in the computational graph, while preserving the model-free nature of our approach. Our experiments span a variety of continuous control tasks.

## A Physically-Consistent Bayesian Non-Parametric Mixture Model for Dynamical System Learning (Nadia Figueroa, EPFL)

We propose a physically-consistent Bayesian non-parametric approach for fitting Gaussian Mixture Models (GMM) to trajectory data. Physical-consistency of the GMM is ensured by imposing a prior on the component assignments biased by a novel similarity metric that leverages locality and directionality of the trajectories. The resulting GMM is then used to learn globally asymptotically stable Dynamical Systems (DS) via a Linear Parameter Varying (LPV) re-formulation. The proposed DS learning scheme accurately encodes challenging nonlinear motions (i.e. high curvature, non-monotonic) automatically; i.e. without model selection. Finally, a data-efficient incremental learning approach is introduced that encodes a DS from batches of trajectories, while preserving global stability. Our proposed approaches are validated on a variety of tasks that involve single-target complex motions with a KUKA LWR 4+ robot arm.

## Evading classifiers in discrete domains with provable optimality guarantees (Bogdan Kulynych, EPFL)

It has been recently shown that machine learning models are vulnerable to efficient gradient-based attacks that produce adversarial examples, i.e., examples that induce an error of a classifier at test time. Most of these attacks rely on adding specific perturbations to normal examples, such that the perturbed examples stay within the original data distribution. The assumption that one can perturb an example while remaining within the data distribution is easily satisfied in continuous domains such as images. Security-critical applications of machine learning like malware, spam, or fraud detection, however, often use hand-crafted feature vectors as inputs which result in constrained discrete domains. Therefore, adding perturbations to such examples may result in feature vectors that cannot appear in real life. Moreover, even if a transformation of a feature vector is theoretically realizable, it may be costly to actually implement.

We introduce a framework for crafting adversarial examples in such constrained discrete domains. In this framework the space of possible adversarial manipulations is represented as a graph, and then the problem of finding adversarial examples is that of graph search. Our framework not only formalizes many existing attacks on machine learning models in discrete domains, but also allows to obtain formal guarantees that obtained adversarial examples have minimal cost. It additionally allows to specify more complex costs than commonly used $L_p$ metrics. We demonstrate the utility of the framework, successfully producing adversarial examples for several security-critical applications: credit scoring, social media bot detection, and website fingerprinting.

## Regression Concept Vectors and Bidirectional Explanations of Deep Learning predictions (Mara Graziani, HES/SO)

Interpreting deep neural network predictions can be valuable in applications where justifications are important for confidence in the decision-making, such as the medical domain. In this work, we propose a methodology to explain the internal state of any network architecture in terms of domain-related concepts. We exploit continuous concept measures as Regression Concept Vectors (RCVs) in the activation space of a layer. The directional derivative of the network output along a RCV represents the network sensitivity to increasing values of the concept measure. We evaluate score robustness and consistency by statistical analysis. We apply RCVs to various applications including traditional computer vision tasks, rethinopathy of prematurity and breast cancer grading. When applied to the latter, nuclei texture emerges as a relevant concept in the detection of tumor tissue in breast lymph node samples.

## Beyond Optimism: Information Directed Exploration in Bandits (Johannes Kirschner, ETHZ)

Upper confidence bound algorithms and Thompson Sampling are popular methods for trading of exploration and exploitation in bandits. However, some recent findings indicate, that these strategies fail to explore efficiently under non-standard assumptions. This is problematic, because policies based on optimism or Thompson Sampling are increasingly applied to complex reinforcement learning environments. We review an alternative design principle: Information Directed Sampling (IDS). The IDS framework links algorithm design and regret analysis, and allows to account for more fine-grained information-regret structures.

## CNN-based face presentation attack detection using patches of faces (Amir Mohammadi, IDIAP)

Face presentation attack detection (PAD) (a.k.a. anti-spoofing) is still a challenging issue. Most methods do not generalize to session differences. Early face PAD methods used cues such as texture, color distortion, and other hand-crafted features to detect presentation attacks.

Recently, convolutional neural networks (CNN) have been used on raw images for face PAD. However, generalization remains a challenge. In this work, we present a patch-based CNN using a simple CNN architecture for face PAD. The method is tested on several presentation attack datasets and results are promising in terms of the generalization of the method.

## Semi-blind spatially-variant deconvolution and focus estimation in optical microscopy by use of convolutional neural networks (Adrian Shajkofci, IDIAP)

Optical microscopy allows acquiring qualitative and quantitative data about cellular function, organ development, or diseases. Image formation can be modeled as the convolution of the original object with a point spread function (PSF), which fully characterizes the imaging system. Knowledge of the PSF corresponding to the optical system can be used to restore details in the image through deconvolution. Measurement or estimation of the PSF can be cumbersome and difficult in practice, in particular when the PSF varies in space. We recently developed a semi-blind, spatially-variant deconvolution technique aimed at optical microscopy that combines a local estimation step of the PSF and a deconvolution step based on a spatially-variant, regularized Richardson-Lucy algorithm. To find the local PSF map in a computationally tractable way, we rely on a convolutional neural network (CNN) to estimate the parameters of the PSF model via regression. Here, we describe our recent efforts to exploit the local parameter map: as a depth-from-focus image merging tool and as an auto-focus plugin which allows estimating the best local focus depth using only a few acquisition points.

## Efficient spherical Convolutional Neural Network with HEALPix sampling for cosmology (Nathanaël Perraudin, ETHZ)

Convolutional Neural Networks (CNNs) are a cornerstone of the Deep Learning toolbox and have led to many breakthroughs in Artificial Intelligence. Nevertheless, so far, these networks have mostly been developed for regular Euclidean domains such as those supporting images, audio, or video. In particular, in the field of cosmology, data often comes as spherical maps, which make the use of traditional CNNs more complicated. The commonly used pixelization scheme for spherical maps is the Hierarchical Equal Area isoLatitude Pixelisation (HEALPix). In this talk, we present HealpixNet, a new spherical neural network architecture specially tailored for HEALPix. As our approach rely on a graph approximating the sphere, we give inside on the graph Fourier transform and the graph convolution. While we focus on the spherical case, the approach presented is general and can be applied to many irregular domains.

## Weakly and unsupervised disentangling factors of variation (Attila Szabó, UNIBE)

We study the problem of disentangling factors of variation and attribute transfer. For that purpose we train autoencoders, where the latent features are split into chunks. On the feature level we aim to learn representations, where each feature chunk captures only one independent factor of the data and is invariant to the other factors. On the image level attribute transfer can be achieved when we mix the features of multiple input images and feed the mix to the decoder. The output image then has a mix of attributes of the inputs.

We present network architectures based on GANs for weakly supervised and unsupervised training.

In the weakly supervised case we face two main challenges: One is that the model may learn degenerate mappings, which we call shortcut problem, and the other is that the attribute representation for an image is not guaranteed to follow the same interpretation on another image, which we call reference ambiguity. To address the shortcut problem, we introduce novel constraints on image pairs and triplets and show their effectiveness both analytically and experimentally. In the case of the reference ambiguity, we formally prove that a model that guarantees an ideal feature separation cannot be built.

In the unsupervised case we train an autoencoder with two novel training objectives: first, we propose an invariance objective to encourage that encoding of each attribute, and decoding of each chunk, are invariant to changes in other attributes and chunks, respectively; second, we include a classification objective, which ensures that each chunk corresponds to a consistently discernible attribute in the represented image, hence avoiding degenerate feature mappings where some chunks are completely ignored.

We demonstrate the effectiveness of our approach on the MNIST, Sprites, and CelebA datasets.

## A few pixels can make a big difference (Apostolos Modas, EPFL)

Deep Neural Networks have achieved extraordinary results on many image classification tasks, but have been shown to be quite vulnerable to small, carefully crafted perturbations of the input data, known as adversarial perturbations. The majority of the existing methods for computing such perturbations, alter almost every pixel of the image with noise of small magnitude, resulting to imperceptible but dense perturbations.

However, the behavior of Deep Neural Networks when only a small fraction of the input is altered, has not been thoroughly studied. For better understanding the vulnerabilities of these architectures – and thus improve them –, it is of high importance to study their robustness against sparse adversarial perturbations, especially to extreme scenarios where only one, or a few pixels of the input image are changed.

In this work, we propose a fast and effective algorithm for computing sparse adversarial perturbations, and show that almost every image can fool the network, by perturbing on average only 1.81% (MNIST), 1.44% (CIFAR-10), and 0.37% (ImageNet) of its pixels.

## Deep Self-Organization: Interpretable Discrete Representation Learning on Time Series (Vincent Fortuin, ETHZ)

High-dimensional time series are common in many domains. Since human cognition is not optimized to work well in high-dimensional spaces, these areas could benefit from interpretable low-dimensional representations. However, most representation learning algorithms for time series data are difficult to interpret. This is due to non-intuitive mappings from data features to salient properties of the representation and non-smoothness over time.

To address this problem, we propose a new representation learning framework building on ideas from interpretable discrete dimensionality reduction and deep generative modeling. This framework allows us to learn discrete representations of time series, which give rise to smooth and interpretable embeddings with superior clustering performance. We introduce a new way to overcome the non-differentiability in discrete representation learning and present a gradient-based version of the traditional self-organizing map algorithm that is more performant than the original. Furthermore, to allow for a probabilistic interpretation of our method, we integrate a Markov model in the representation space.

This model uncovers the temporal transition structure, improves clustering performance even further and provides additional explanatory insights as well as a natural representation of uncertainty.

We evaluate our model in terms of clustering performance and interpretability on static (Fashion-)MNIST data, a time series of linearly interpolated (Fashion-)MNIST images, a chaotic Lorenz attractor system with two macro states, as well as on a challenging real world medical time series application on the eICU data set. Our learned representations compare favorably with competitor methods and facilitate downstream tasks on the real world data.

## Patient-Attentive Sequential Strategy: A Reinforcement Learning-based Perimetry Test (Serife Kucur, UNIBE)

Perimetry is a non-invasive clinical psychometric examination key in the diagnosis and management of ophthalmic and neurological conditions. At its core, perimetry relies on a human subject pressing a button whenever they have witnessed a light stimulus that could be of different intensities or at different locations of their filed of view. This sequential process yields a 2D visual field map critical for disease monitoring. Unfortunately, perimetry is painfully slow, lasting 7-8 minutes of time per eye which is exhausting for subjects who need to maintain high levels of concentration during the examination. In this work, we present a novel perimetry testing strategy that requires fewer locations to be tested in order to effectively estimate the 2D visual field map. Our approach relies on adaptively selecting which locations of the visual field to test given previously tested locations. Based on reinforcement learning, we learn which locations to examine in order to reconstruct the visual field as accurately as possible from few observations. This allows our approach to be extremely efficient in how it uses previous measurements in a patient-specific manner. We experimentally show on two datasets of patients that our method allows more accurate reconstruction of visual fields in less time than its counterparts.

## Privacy-Preserving Classification with Secret Vector Machines (Valentin Hartmann, EPFL)

Today, large amounts of valuable data are distributed among millions of user-held devices, such as personal computers or phones. Many companies collect such data in order to train machine learning models. User-held data, however, is often sensitive, and collecting it is problematic in terms of privacy. We address this issue by proposing a novel way of training a supervised classifier in a distributed setting akin to the recently proposed federated learning paradigm (McMahan et al. 2017), but under the stricter privacy requirement that the server that trains the model is assumed to be untrusted and potentially malicious; we thus preserve user privacy by design, rather than by trust. In particular, our framework, called secret vector machine (SecVM), provides an algorithm for training linear support vector machines (SVM) in a setting in which data-holding clients communicate with an untrusted server by exchanging messages designed to not reveal any personally identifiable information. In offline experiments, we show that we can preserve user privacy without sacrificing accuracy. We then implement SecVM’s distributed framework for a web browser and deploy it for predicting user gender in a large-scale online evaluation with thousands of clients.

## Autonomous Drone Racing (Elia Kaufmann, UZH)

Autonomous agile flight brings up fundamental challenges in robotics, such as coping with unreliable state estimation, reacting optimally to dynamically changing environments, and coupling perception and action in real time under severe resource constraints. In our work, we consider these challenges in the context of autonomous, vision-based drone racing in dynamic environments.

Our approach combines a convolutional neural network with a state-of-the-art path-planning and control system. The CNN directly maps raw images into a robust representation in the form of a waypoint and desired speed. This information is then used by the planner to generate a short, minimum-jerk trajectory segment and corresponding motor commands to reach the desired goal. We demonstrate our method in autonomous agile flight scenarios, in which a vision-based quadrotor traverses drone-racing tracks with possibly moving gates. Our method does not require any explicit map of the environment and runs fully onboard.

## Trace and Detect Adversarial Attacks on CNNs using Feature Response Maps. (Amirian Mohammadreza, ZHAW)

The existence of adversarial attacks on convolutional neural networks (CNN) questions the fitness of such models for serious applications. The attacks manipulate an input image such that misclassification is evoked while still looking normal to a human observer---they are thus not easily detectable. In a different context, backpropagated activations of CNN hidden layers---``feature responses'' to a given input---have been helpful to visualize for a human ``debugger'' what the CNN ``looks at'' while computing its output. In this work, we propose a novel detection method for adversarial examples to prevent attacks. We do so by tracking adversarial perturbations in feature responses, allowing for automatic detection using average local spatial entropy. The method does not alter the original network architecture and is fully human-interpretable. Experiments confirm the validity of our approach for state-of-the-art attacks on large-scale models trained on ImageNet.

I will also bring a poster with the recent results in the deep score project for musical symbol detection and Information Retrieval. I am looking forward to meeting you soon.

## Variational Methods to find all solutions of an Optimal Control problem (Thibaut Kulak, IDIAP)

In optimal control, a robot aims to optimize a series of motor commands in order to minimize the cost of a task over the whole trajectory. A typical task can be, for instance, reaching an object while minimizing control commands. Whereas in many applications it is sufficient to know the series of motor commands to apply to minimize the cost, there are many cases where it is interesting to know all of the different possible ways to achieve the task. I will show how variational methods can be used to calculate those, using a Gaussian mixture approximation. I will then show examples of this approach in the context of inverse optimal control and intention recognition.