# Swiss Machine Learning Day 2017

The event will take place on Tuesday, November 28th, 2017.

Due to the large number of registered attendees, the event has been moved to ROOM 5ABC of the SwissTech Convention Center.

**The event is free. Registration is now closed.**

Please contact François Fleuret or Martin Jaggi with any questions or comments.

# Program

- 09:00 – 09:25
- Direct Sparse Convolutional Neural Networks with Attention (Hackel Timo, ETHZ)
- 09:25 – 09:50
- How close are the eigenvectors of the sample and actual covariance matrices (Andreas Loukas, EPFL)
- 09:50 – 10:15
- Change point detection for high-dimensional linear regression and its applications for covariance matrices (Kovacs Solt, ETHZ)
- 10:15 – 10:40
- An Investigation of Deep Neural Network for Multilingual Speech Recognition Training and Adaptation (Sibo Tong, IDIAP)
- 10:40 – 10:55
- Break
- 10:55 – 11:20
- Composite minimization using global geometry and approximate subproblems (Sai Praneeth Reddy Karimireddy, EPFL)
- 11:20 – 11:45
- Geometric robustness of deep networks: analysis and improvement (Can Kanbak, EPFL)
- 11:45 – 12:10
- K-Medoids for K-Means Seeding (James Newling, IDIAP)
- 12:10 – 12:35
- Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization (Alacaoglu Ahmet, EPFL)
- 12:35 – 13:35
- Lunch
- 13:35 – 14:00
- Quality monitoring in laser welding using multi-kernel Laplacian graph support vector machine (Sergey Shevchik, EMPA)
- 14:00 – 14:25
- Learning to Run - Deep Reinforcement Learning with Musculoskeletal Models (Sharada Prasanna Mohanty, EPFL)
- 14:25 – 14:50
- DroNet: Learning to Fly by Driving (Antonio Loquercio, UZH)
- 14:50 – 15:15
- Learning Multi-Modal Problems in Computer Vision (Pierre Baqué, EPFL)
- 15:15 – 15:30
- Break
- 15:30 – 15:55
- Lifelong Generative Modeling (Jason Ramapuram, UNIGE)
- 15:55 – 16:20
- Representation Learning by Learning to Count (Mehdi Noroozi, UNIBE)
- 16:20 – 16:45
- Improving learning in robotics by exploiting the structure and geometry of the data: application to prosthetic hands (Noémie Jaquier, IDIAP)
- 16:45 – 17:10
- A Deep Learning Approach to Ultrasound Image Recovery (Dimitris Perdios, EPFL)
- 17:10 – 17:35
- Learning non-linear transform with discriminative and minimum information loss priors (Dimche Kostadinov, UNIGE)

# Abstracts

## Direct Sparse Convolutional Neural Networks with Attention (Timo Hackel, ETHZ)

Nowadays, the power of Convolutional Neural Networks (CNNs) depends strongly on computational resources. While CNNs are well designed to handle dense data, for instance 2D images, they lack the ability to handle sparse data efficiently.

The goal of this work is to introduce a neural network that exploits both the sparseness of the feature maps and of the filter weights to obtain a smaller memory footprint and faster computations than is possible with dense CNNs.

The basic design of our convolutional layer is a direct convolution, similar to recent works [1, 2]. However, two problems arise when working with sparse data: first, computational resources are limited; second, convolutions rapidly decrease the sparseness of the data. We propose a filter step within the convolutional layer, which we call attention, that creates an upper bound on the required computational resources while keeping the data sparse. Furthermore, we introduce a small modification of the backpropagation term so that our sparse model can be trained with existing optimization frameworks, such as TensorFlow, while exploiting the sparseness of the models and data.

[1] Parashar, Angshuman, et al. "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks." Proceedings of the 44th Annual International Symposium on Computer Architecture. ACM, 2017.

[2] Park, Jongsoo, et al. "Faster CNNs with Direct Sparse Convolutions and Guided Pruning." ICLR, 2017.
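The abstract does not spell out the filter step; as a rough illustration of the idea only (the top-k rule and all names below are assumptions, not the authors' method), one can cap the number of nonzero activations in a feature map so that the cost of the following sparse convolution is bounded:

```python
import numpy as np

def topk_sparsify(feature_map, k):
    """Keep only the k largest-magnitude activations of a feature map,
    zeroing the rest. This bounds the work of the next sparse layer."""
    flat = feature_map.ravel()
    if k >= flat.size:
        return feature_map.copy()
    # indices of the k entries with largest absolute value
    keep = np.argpartition(np.abs(flat), -k)[-k:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(feature_map.shape)

fm = np.random.default_rng(0).standard_normal((8, 8))
sparse_fm = topk_sparsify(fm, k=10)   # at most 10 nonzeros remain
```

Such a hard cap keeps the data sparse after each convolution, which is the property the attention step above is designed to preserve.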

## How close are the eigenvectors of the sample and actual covariance matrices? (Andreas Loukas, EPFL)

How many samples are sufficient to guarantee that the eigenvectors and eigenvalues of the sample covariance matrix are close to those of the actual covariance matrix? For a wide family of distributions, including distributions with finite second moment and distributions supported in a centered Euclidean ball, we prove that the inner product between eigenvectors of the sample and actual covariance matrices decreases proportionally to the respective eigenvalue distance. Our findings imply non-asymptotic concentration bounds for eigenvectors, eigenspaces, and eigenvalues. They also provide conditions for distinguishing principal components based on a constant number of samples.
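As a toy numpy illustration of the statement (not the paper's analysis), one can check how aligned the leading eigenvectors of the sample and actual covariance matrices are when the spectrum is well separated and the sample is large:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n = 10, 5000

# Actual covariance with well-separated eigenvalues
eigvals = np.linspace(1.0, 10.0, d)
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
Sigma = Q @ np.diag(eigvals) @ Q.T

# Sample covariance from n Gaussian draws
X = rng.multivariate_normal(np.zeros(d), Sigma, size=n)
S = X.T @ X / n

# Sign-invariant inner product between the leading eigenvectors
# (numpy's eigh returns eigenvalues in ascending order)
_, V_true = np.linalg.eigh(Sigma)
_, V_samp = np.linalg.eigh(S)
align = abs(V_true[:, -1] @ V_samp[:, -1])   # close to 1 for n >> d
```

With a larger eigenvalue gap or more samples the alignment improves, in line with the abstract's claim that the inner product degrades with eigenvalue distance.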

## Change point detection for high-dimensional linear regression and its applications for covariance matrices (Solt Kovacs, ETHZ)

We pursue the goal of high-dimensional (p > n) covariance matrix estimation for data with abrupt structural changes. We try to detect these changes and estimate the covariance matrices in the resulting segments. Our approaches closely follow the proposal of Leonardi and Bühlmann (arXiv:1601.03704) for change point detection in high-dimensional linear regression. We consider their estimator in more general setups and propose estimation procedures for covariance matrices based on this regression estimator, as well as another procedure that is the analogue of the regression estimator directly for covariance matrices. We present theoretical results, simulations comparing these proposals along with their advantages and disadvantages, and an illustration of the developed methodology on a real-life example of stock returns.

## An Investigation of Deep Neural Network for Multilingual Speech Recognition Training and Adaptation (Sibo Tong, IDIAP)

Different training and adaptation techniques for multilingual Automatic Speech Recognition (ASR) are explored in the context of hybrid systems, exploiting Deep Neural Networks (DNN) and Hidden Markov Models (HMM). In multilingual DNN training, the hidden layers (possibly extracting bottleneck features) are usually shared across languages, and the output layer can either model multiple sets of language-specific senones or one single universal IPA-based multilingual senone set. Both architectures are investigated, exploiting and comparing different language adaptive training (LAT) techniques originating from successful DNN-based speaker-adaptation. More specifically, speaker adaptive training methods such as Cluster Adaptive Training (CAT) and Learning Hidden Unit Contribution (LHUC) are considered. In addition, a language adaptive output architecture for IPA-based universal DNN is also studied and tested.

Experiments show that LAT improves performance, and that adaptation of the top layer further improves accuracy. By combining state-level minimum Bayes risk (sMBR) sequence training with LAT, we show that a language-adaptively trained IPA-based universal DNN outperforms a monolingually sequence-trained model.

## Composite minimization using global geometry and approximate subproblems (Sai Praneeth Reddy Karimireddy, EPFL)

Coordinate-based first-order optimization algorithms consist of two steps performed in every round: (i) computing the gradient (or some coordinates of it), and (ii) computing the next point. In practice, there is often a wide mismatch between the time required for the two steps, leading to underutilization of resources.

In this work, we propose a new framework, Approx Composite Minimization (ACM), which incorporates global geometry to make better use of the gradients in case the gradient computation is expensive, and performs approximate updates to ensure that the update step does not take too long. Our framework unifies and extends several coordinate descent schemes such as block-coordinate descent and SDNA, as well as their parallel and distributed counterparts. Along the way, we provide a novel proof for linear convergence of proximal algorithms which may be of independent interest.

## Geometric robustness of deep networks: analysis and improvement (Can Kanbak, EPFL)

Deep convolutional neural networks have been shown to be vulnerable to arbitrary geometric transformations. However, there is no systematic method to measure the invariance properties of deep networks to such transformations. We propose ManiFool, a simple yet scalable algorithm to measure the invariance of deep networks. In particular, our algorithm measures the robustness of deep networks to geometric transformations in a worst-case regime, as these can be problematic for sensitive applications. Our extensive experimental results show that ManiFool can be used to measure the invariance of fairly complex networks on high-dimensional datasets, and that these measurements can be used to analyze the reasons behind it. Furthermore, we build on ManiFool to propose a new adversarial training scheme, and we show its effectiveness in improving the invariance properties of deep neural networks.

## K-Medoids for K-Means Seeding (James Newling, IDIAP)

We show experimentally that the algorithm CLARANS of Ng and Han (1994) finds better K-medoids solutions than the Voronoi iteration algorithm of Hastie et al. (2001). This finding, along with the similarity between the Voronoi iteration algorithm and Lloyd's K-means algorithm, motivates us to use CLARANS as a K-means initializer. We show that CLARANS outperforms other algorithms on 23/23 datasets, with a mean decrease over K-means++ (Arthur and Vassilvitskii, 2007) of 30% for initialization mean squared error (MSE) and 3% for final MSE. We introduce algorithmic optimizations to CLARANS which improve its complexity and runtime, making it a viable initialization scheme for large datasets.
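The core idea can be sketched in a few lines: CLARANS explores the K-medoids search space by trying random single-medoid swaps and accepting those that lower the clustering cost. The sketch below is a heavily simplified illustration under that assumption, without the paper's algorithmic optimizations:

```python
import numpy as np

def mse(X, medoid_idx):
    """Mean squared distance from each point to its nearest medoid."""
    d2 = ((X[:, None, :] - X[medoid_idx][None, :, :]) ** 2).sum(-1)
    return d2.min(axis=1).mean()

def clarans_seed(X, k, n_swaps=200, seed=0):
    """CLARANS-style seeding: start from random medoids and accept
    random single-medoid swaps whenever they lower the cost."""
    rng = np.random.default_rng(seed)
    medoids = rng.choice(len(X), size=k, replace=False)
    cost = mse(X, medoids)
    for _ in range(n_swaps):
        cand = medoids.copy()
        cand[rng.integers(k)] = rng.integers(len(X))
        if len(set(cand)) < k:
            continue  # skip candidates with duplicate medoids
        c = mse(X, cand)
        if c < cost:
            medoids, cost = cand, c
    return medoids, cost

X = np.random.default_rng(1).standard_normal((300, 2))
medoids, cost = clarans_seed(X, k=5)
```

The resulting medoids would then be passed to Lloyd's K-means as initial centers, which is the seeding role the abstract describes.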

## Smooth Primal-Dual Coordinate Descent Algorithms for Nonsmooth Convex Optimization (Ahmet Alacaoglu, EPFL)

We propose a new randomized coordinate descent method for a convex optimization template with broad applications. Our analysis relies on a novel combination of four ideas applied to the primal-dual gap function: smoothing, acceleration, homotopy, and non-uniform sampling. As a result, our method features the first convergence rate guarantees that are the best-known under a variety of common structure assumptions on the template. We provide numerical evidence to support the theoretical results with a comparison to state-of-the-art algorithms.

## Quality monitoring in laser welding using multi-kernel Laplacian graph support vector machine (Sergey Shevchik, EMPA)

Online quality monitoring in laser welding remains unsolved today due to the complexity of the underlying physical phenomena. This talk presents a machine learning approach to this problem that classifies the measurable physical parameters of the weld in real time. The configuration of the features in the feature space is discussed, and a Laplacian support vector machine is applied to recognize different quality levels. The classification performance is further enhanced by a multi-kernel, constructed as a Gaussian mixture, that adapts the remapping into a higher-dimensional space to the geometry of the given features, as recovered by the graph Laplacian. The algorithm incorporates two optimization stages that are built into the training process of the classifier. The performance is tested on synthetic datasets and compared with that of the standard Laplacian support vector machine; results from a real-life process are also presented. The recovered geometry of the features is discussed with respect to the underlying physical phenomena, as is the use of machine learning for a better understanding of the process physics.

## Learning to Run - Deep Reinforcement Learning with Musculoskeletal Models (Sharada Prasanna Mohanty, EPFL)

Our movement originates in the brain. Many neurological disorders, such as cerebral palsy, multiple sclerosis, or stroke, can lead to problems with walking. Treatments are often symptomatic, and it is often hard to predict the outcomes of surgeries. Understanding the underlying mechanisms is key to improving treatments, which motivates our efforts to model the motor control unit of the brain.

In this problem, the task is to model the motor control unit in a simulated virtual environment. Given a musculoskeletal model with 18 muscles to control, signals can be sent every 10 ms to activate or deactivate these muscles. The objective is to walk or run as far as possible in 5 seconds. We use OpenSim, a biomechanical physics environment for musculoskeletal simulations, and explore TRPO and PPO to demonstrate the feasibility of finding interesting policies using reinforcement learning.
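The control setup above (an 18-dimensional activation vector every 10 ms over 5 seconds, i.e. 500 steps) can be sketched against a stand-in environment. The class below is a hypothetical stub for illustration, not the OpenSim or osim-rl API:

```python
import numpy as np

class StubRunEnv:
    """Minimal stand-in for the musculoskeletal environment: accepts an
    18-dimensional muscle-activation vector per 10 ms step and returns a
    distance-based reward. Illustrative only, not the real simulator."""
    n_muscles = 18

    def __init__(self):
        self.x = 0.0  # horizontal position of the model

    def step(self, action):
        assert action.shape == (self.n_muscles,)
        dx = float(action.mean()) * 0.01   # toy forward dynamics
        self.x += dx
        return self.x, dx                  # observation, reward

env = StubRunEnv()
horizon = int(5.0 / 0.010)                 # 5 s at 10 ms per step = 500 steps
rng = np.random.default_rng(0)
total = 0.0
for _ in range(horizon):
    # random policy; TRPO/PPO would replace this with a learned one
    action = rng.uniform(0.0, 1.0, size=env.n_muscles)
    _, reward = env.step(action)
    total += reward
```

A learned policy would map the environment's observation to the activation vector instead of sampling it at random.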

We are also running a challenge on this problem on CrowdAI, in the context of the NIPS 2017 Challenge Track. In this talk we present our baseline results and explore the results submitted by participants on CrowdAI.

## DroNet: Learning to Fly by Driving (Antonio Loquercio, UZH)

Civilian drones are soon expected to be used in a wide variety of tasks, such as aerial surveillance, delivery, or monitoring of existing architectures. Nevertheless, their deployment in urban environments has so far been limited. Indeed, in unstructured and highly dynamic scenarios drones face numerous challenges to navigate autonomously in a feasible and safe way. To cope with the above challenges, this paper proposes DroNet, a convolutional neural network that can safely drive a drone through the streets of a city.

Designed as a fast 8-layer residual network, DroNet produces, for each single input image, two outputs: a steering angle, to keep the drone navigating while avoiding obstacles, and a collision probability, to let the UAV recognize dangerous situations and promptly react to them. But how can enough data be collected in an unstructured outdoor environment such as a city?

Clearly, having an expert pilot provide training trajectories is not an option, given the large amount of data required and, above all, the risk it involves for other vehicles and pedestrians moving in the streets. Therefore, we propose to train a UAV from data collected by cars and bicycles which, being already integrated into urban environments, pose no danger to others.

Although trained on city streets from the viewpoint of urban vehicles, the navigation policy learned by DroNet is highly generalizable. Indeed, it allows a UAV to successfully fly at relatively high altitudes, and even in indoor environments such as parking lots and corridors.

## Learning Multi-Modal Problems in Computer Vision (Pierre Baqué, EPFL)

Current state-of-the-art machine learning algorithms used in computer vision often predict marginal distributions over sets of variables. Even though this information is often the one of interest, sampling from marginals can give very bad predictions when the true joint posterior is multi-modal.

We will see how Conditional Random Fields can be embedded into deep architectures in order to partially alleviate this problem. We will then demonstrate how the standard Back-Mean-Field technique can be extended to a multi-modal version to better learn such models end-to-end.

We will illustrate applications of our methods on several Computer Vision and Machine Learning problems.

## Lifelong Generative Modeling (Jason Ramapuram, UNIGE)

Lifelong learning is the problem of learning multiple consecutive tasks in an online manner, and is essential to the development of intelligent machines that can adapt to their surroundings. In this work we focus on a lifelong approach to generative modeling, whereby we continuously incorporate newly observed distributions into our model representation. We utilize two models, aptly named the student and the teacher, to aggregate information about all past distributions without preserving any of the past data or previous models. The teacher serves as a form of compressed memory that allows the student model to learn over past as well as present data. We demonstrate why a naive approach to lifelong generative modeling fails, and introduce a regularizer with which we demonstrate learning across a long range of distributions.

## Representation Learning by Learning to Count (Mehdi Noroozi, UNIBE)

We introduce a novel method for representation learning that uses an artificial supervision signal based on counting visual primitives. This supervision signal is obtained from an equivariance relation, which does not require any manual annotation. We relate transformations of images to transformations of the representations. More specifically, we look for the representation that satisfies such relation rather than the transformations that match a given representation. In this talk, we use two image transformations in the context of counting: scaling and tiling. The first transformation exploits the fact that the number of visual primitives should be invariant to scale. The second transformation allows us to equate the total number of visual primitives in each tile to that in the whole image. These two transformations are combined in one constraint and used to train a neural network with a contrastive loss. The proposed task produces representations that perform on par or exceed the state of the art in transfer learning benchmarks.
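The tiling constraint can be written as a simple loss: the count of the downscaled whole image should equal the sum of the counts over its four tiles. In the sketch below the counter is a stand-in function for shape only, not the learned network from the talk:

```python
import numpy as np

def count_features(img):
    """Stand-in for the learned 'counting' network: maps an image to a
    non-negative vector of primitive counts. Here: channel-wise mean."""
    return np.maximum(img.mean(axis=(0, 1)), 0.0)

def counting_loss(img):
    """Squared difference between the count of the downscaled whole image
    and the sum of counts over its four tiles."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    tiles = [img[:h, :w], img[:h, w:], img[h:, :w], img[h:, w:]]
    tile_sum = sum(count_features(t) for t in tiles)
    # downscale by 2 with simple 2x2 average pooling
    down = img.reshape(h, 2, w, 2, -1).mean(axis=(1, 3))
    whole = count_features(down)
    return float(((whole - tile_sum) ** 2).sum())

img = np.random.default_rng(0).random((32, 32, 3))
loss = counting_loss(img)
```

Note that a counter that outputs all zeros satisfies this constraint trivially, which is why the method trains with a contrastive loss rather than this constraint alone.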

## Improving learning in robotics by exploiting the structure and geometry of the data: application to prosthetic hands (Noémie Jaquier, IDIAP)

In many sensing and control applications, data are represented in the form of multidimensional arrays with particular geometric properties, such as symmetries. Considering the underlying structure and geometry of the data can be beneficial in many robotics applications. This is particularly important in the context of robot learning from demonstration, where only a small set of demonstrations is available and robots are required to reproduce tasks and generalize them to new situations.

I will present different regression techniques and their extension to models that can efficiently exploit the specific structure and geometry of the acquired datapoints. The approaches will be illustrated on the application of controlling a prosthetic hand, using tactile and electromyography data to predict hand movements.

## A Deep Learning Approach to Ultrasound Image Recovery (Dimitris Perdios, EPFL)

Based on the success of deep neural networks for image recovery, we propose a new paradigm for the compression and decompression of ultrasound (US) signals which relies on stacked denoising autoencoders. The first layer of the network is used to compress the signals and the remaining layers perform the reconstruction. We train the network on simulated US signals and evaluate its quality on images of the publicly available PICMUS dataset. We demonstrate that such a simple architecture outperforms state-of-the-art methods, based on the compressed sensing framework, both in terms of image quality and computational complexity.
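The data flow of this architecture (first layer compresses, remaining layers reconstruct) can be sketched with random weights to show the shapes involved; the dimensions, depth, and activation below are illustrative assumptions, not those of the paper:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_code = 1024, 256          # assumed: 4x compression in the first layer

# Encoder: a single linear layer compresses the raw US signal
W_enc = rng.standard_normal((n_code, n_in)) / np.sqrt(n_in)

# Decoder: the remaining layers reconstruct the signal from the code
W1 = rng.standard_normal((512, n_code)) / np.sqrt(n_code)
W2 = rng.standard_normal((n_in, 512)) / np.sqrt(512)
relu = lambda z: np.maximum(z, 0.0)

signal = rng.standard_normal(n_in)    # stand-in for one US channel signal
code = W_enc @ signal                 # compression: 1024 -> 256 values
recon = W2 @ relu(W1 @ code)          # reconstruction: 256 -> 1024 values
```

In the paper's setting all weights would be trained end-to-end as a denoising autoencoder, so that `recon` approximates `signal` from the compressed `code`.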

## Learning non-linear transform with discriminative and minimum information loss priors (Dimche Kostadinov, UNIGE)

This work proposes learning a non-linear transform with two priors. The first is a discriminative prior defined using a measure on a support intersection; the second is a minimum information loss prior expressed as a constraint on the conditioning and the coherence. An approximation of the measure for the discriminative prior is addressed, connecting it to a similarity concentration. In addition to quantifying the discriminative properties of the transform representation, a sensitivity analysis of the similarity concentration w.r.t. the parameters of the non-linear transform is given. Furthermore, a measure related to the similarity concentration, reflecting the discriminative properties and termed the discriminative power, is introduced and its bounds are presented.

The advantages and the potential of the proposed algorithm are evaluated by computer simulation.