Last week, I attended the NIPS (Neural Information Processing Systems) 2017 conference in Long Beach, CA. This was my first time attending NIPS.
What is NIPS?
NIPS is one of the largest machine learning (ML) / artificial intelligence (AI) conferences in the world. The conference consists of three programs: tutorials, the main event (which includes symposia), and workshops. This year, there were about 8,000 attendees, a roughly 30% increase over last year, showing the growing interest in machine learning.
I attended only the workshops and a symposium, but based on their contents, I can say that a workshop is a specialized track dedicated to a single topic for an entire day. Each workshop consists of 5-6 talks interleaved with poster sessions, and most workshops also had a panel discussion at the end. I attended the workshops on Bayesian ML and the symposium talks on interpretable ML.
Before talking about the conference talks, I first want you to look at the sponsor list, because it shows the current landscape of machine learning in industry. To name a few sponsors, Microsoft, IBM Research, Audi, and Intel Nervana were the Diamond sponsors ($80k sponsorship), followed by the Platinum sponsors ($40k) such as Apple, Uber, Facebook, Google, Baidu, Alibaba, Amazon, DeepMind, etc. For the comprehensive list, see here.
The workshops were spread over two days. Each workshop was an all-day track consisting of talks from academia and industry plus poster presentations. This year, there were about 50 workshops, and the topics covered a broad spectrum of disciplines and techniques/methodologies.
In terms of discipline, there were workshops dedicated to technical domains,
- Physical sciences
- IoT devices
- Software development
and to more human-centered fields,
- AI transparency/fairness
- ML applications in the developing world
- Creativity and design
The topics in terms of ML techniques varied widely as well:
- Deep learning (general)
- Deep reinforcement learning
- Bayesian Deep learning
- Semi-supervised learning
- Causal inference
- Natural language processing
- Audio signal processing
- Time series
- Visualization and communication
Symposium: Interpretable machine learning
I arrived a day before the workshops and was able to attend the symposium on interpretable ML.
Interpretable ML is beneficial not only for helping users understand the outcome of ML models but also for making AI safe and accountable. However, interpretable ML is challenging. In general, the presenters agreed that it is difficult to define the term “interpretability” because it is a human-centric term that depends on the type of user, and may even involve user-experience aspects. Plus, there are limitations in current interpretable-ML approaches. For now, there are roughly two ways to build an interpretable ML algorithm: 1) use a simple (e.g., linear) algorithm, or 2) use a “mimic” algorithm that describes a black-box model locally (e.g., LIME). Both have shortcomings: simple algorithms are likely to have lower accuracy, and mimic algorithms can fail easily when the data have complicated distributions. Regardless, the talks in this session attempted to address the problems in the current ML landscape and to examine different elements of interpretable ML.
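To make the “mimic” idea concrete, here is a minimal numpy sketch of the local-surrogate approach behind methods like LIME. The toy black box, the perturbation scale, and the proximity kernel are my own illustrative choices, not LIME's actual implementation:

```python
import numpy as np

# Toy "black box": a nonlinear function standing in for an opaque model.
# In practice this would be a trained neural network or ensemble.
def black_box(X):
    return np.sin(3 * X[:, 0]) + X[:, 1] ** 2

rng = np.random.default_rng(0)

# The point whose prediction we want to explain.
x0 = np.array([0.5, -0.2])

# 1) Sample perturbations around x0 and query the black box.
Xs = x0 + 0.1 * rng.standard_normal((500, 2))
ys = black_box(Xs)

# 2) Weight samples by proximity to x0 (closer points matter more).
w = np.exp(-np.sum((Xs - x0) ** 2, axis=1) / 0.02)

# 3) Fit a weighted linear model: the local, interpretable surrogate.
A = np.hstack([Xs, np.ones((len(Xs), 1))])  # add intercept column
sw = np.sqrt(w)
coef, *_ = np.linalg.lstsq(sw[:, None] * A, sw * ys, rcond=None)

# coef[:2] are the local feature effects around x0, an interpretable
# explanation of the black box near this one point (not globally).
print(coef[:2])
```

The surrogate is only trustworthy near x0, which is exactly the shortcoming the presenters raised: the local explanation says nothing about the model's global behavior.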
To use interpretable ML in image classification/recognition tasks, Kiri Wagstaff (JPL) showed how to visualize the interpretability and learning process of a deep neural network using DEMUD, which learns the residuals (differences) between consecutive pieces of information. Kilian Q. Weinberger (Cornell) raised a concern about modern neural networks, which have badly calibrated confidence (i.e., they are overly confident), a problem in practical applications such as autonomous vehicles and automated medical diagnosis.
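Miscalibration can be quantified with the expected calibration error (ECE): bin predictions by reported confidence and compare each bin's confidence to its observed accuracy. A small numpy sketch with simulated predictions (the equal-width binning is one common choice, not the only one):

```python
import numpy as np

def expected_calibration_error(conf, correct, n_bins=10):
    """Bin-weighted gap between reported confidence and observed accuracy."""
    ece = 0.0
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (conf > lo) & (conf <= hi)
        if mask.any():
            ece += mask.mean() * abs(conf[mask].mean() - correct[mask].mean())
    return ece

rng = np.random.default_rng(1)
# Simulate an overconfident model: it reports 95% confidence
# but is right only ~80% of the time.
conf = np.full(10_000, 0.95)
correct = rng.random(10_000) < 0.80
print(expected_calibration_error(conf, correct))  # ≈ 0.15
```

A perfectly calibrated model would score close to zero; the 0.15 here directly measures Weinberger's "overly confident" gap.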
Jenn Wortman Vaughan (Microsoft) suggested that interpretability should be understood across disciplines and that humans should be considered; depending on the user, interpretability may mean different things. Vaughan ran an interesting experiment on Amazon Mechanical Turk, testing whether human subjects understood simpler models better than more complicated ones (e.g., black boxes with more features). The subjects were able to follow the simpler model better, although they reported a similar level of confidence in both models (as opposed to the common belief that people trust black-box models less than simple models). This was preliminary work, but it was simple yet novel: the study actually included humans and measured their responses to ML models and their predictions.
There was also a panel discussion in which the presenters answered a series of questions. First, they acknowledged that defining “interpretability” is difficult. They agreed that Bayesian methods can assist black-box models by providing some information on the relationships between factors and on the uncertainty of predictions. They also confirmed that high-risk industry domains such as finance and medicine are more resistant to black-box models and prefer interpretability. There was also some criticism of “mimic models” because they provide not a fundamental explanation but a post hoc interpretation. To compare different interpretable-model methods, the panelists agreed that experiments involving humans would be required. Overall, they concluded that a breakthrough on this problem requires systematic human experiments and a better definition of interpretability.
There were about 50 workshops this year, covering a lot of interesting topics. It was difficult to choose which ones to attend, but I decided to go with the ones on Bayesian machine learning. My motivations were: 1) I studied Bayesian inference during my PhD, so I am somewhat familiar with the topic; 2) I’ve always been on the skeptical side of deep learning, and the Bayesian treatment of deep learning sounded like a remedy for the current hype; and 3) Bayesian machine learning is an emerging area. Although both workshops were heavy on technical details (as most Bayesian inference material is), I’d like to summarize a few key takeaways.
Day 1: Approximate Bayesian inference
Despite the remarkable advances in deep learning, deep neural networks lack the ability to express uncertainty in their predictions and do not utilize probability theory. Hence, researchers have recently been trying to merge the Bayesian approach with deep learning. The main challenge of Bayesian inference is approximating the intractable marginal probability distribution that arises when computing the posterior. There are two ways to tackle this: one is to devise an easy-to-estimate probability distribution that approximates the true distribution and to minimize the distance between the two; the other is to build up the posterior by sampling (Monte Carlo methods). The first approach is called variational inference (VI); probably the most popular method in the second family is Markov chain Monte Carlo (MCMC). Each method has its own pros and cons, but most talks were more interested in VI than in sampling, mostly because VI is faster and deterministic.
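The VI route can be shown in a few lines. The sketch below fits a Gaussian q = N(m, s²) to a target posterior by stochastic gradient ascent on the ELBO, using the reparameterization trick. The target is deliberately a Normal(2, 0.5²) so the right answer is known; a real posterior would be intractable, which is the whole point of the approximation:

```python
import numpy as np

rng = np.random.default_rng(0)

# Unnormalized target: a Normal(2, 0.5^2) posterior we pretend is intractable.
def grad_log_p(x):
    return -(x - 2.0) / 0.25           # d/dx log p(x)

m, log_s = 0.0, 0.0                    # variational parameters of q = N(m, s^2)
lr = 0.05
for _ in range(2000):
    eps = rng.standard_normal(64)
    s = np.exp(log_s)
    x = m + s * eps                    # reparameterization trick: x ~ q
    g = grad_log_p(x)
    m += lr * g.mean()                 # ascend ELBO in m
    log_s += lr * (s * (g * eps).mean() + 1.0)  # likelihood term + entropy term

print(m, np.exp(log_s))               # should approach 2.0 and 0.5
```

Because q is Gaussian and the target is Gaussian, VI is exact here; with a non-Gaussian target the same loop would converge to the best Gaussian fit, which is the bias MCMC avoids at the cost of speed.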
Many novel approaches were presented in the approximate Bayesian inference workshop. Josip Djolonga (ETH Zürich) suggested an approximation method that uses the idea of fooling a two-sample statistical test: “if we can fool a statistical test, the learned distribution should be a good model of the true data.” Yingzhen Li presented the idea of directly estimating the score function instead of approximating the optimization objective for gradient-based optimization, which can alleviate pitfalls of generative adversarial networks (GANs) such as overfitting and underestimating the loss function. Another brilliant talk was given by Kira Kempinska (University College London), who introduced an adversarial sequential Monte Carlo method that formulates the approximation problem as a two-player game, similar to a GAN.
Invited talks from industry were interesting as well. Dawen Liang from Netflix showed how to build a recommender system using a variational autoencoder (VAE), especially since the problem is more of a small-data problem and the user-item interaction matrix is only observed when the interaction is positive (i.e., there is no distinction between negative and missing). Andreas Damianou from Amazon introduced deep Gaussian processes, motivated (again) by bringing uncertainty into current deep learning methods (i.e., models should behave differently depending on how certain they are about their predictions) through better approximation of intractable distributions.
Day 2: Bayesian deep learning
Thematically, this workshop was very similar to the first one. However, the presenters here painted a slightly bigger picture, focusing on probabilistic programming and Bayesian neural networks (not just approximation methods).
The session started with a talk by Dustin Tran (Columbia University / Google), the lead developer of the probabilistic programming library Edward. His talk covered the library and gave a general overview of probabilistic programming. Edward enables Bayesian inference by supporting directed graphical models. It is built on TensorFlow, and even though I haven’t used Edward, based on the code screenshots in the presentation the syntax seems very similar to pymc3, which I’ve used before: you define the graph structure of your variables, whether they are stochastic or deterministic, which distributions the variables should have, etc. It supports both variational inference and Monte Carlo methods. Dustin predicted that even though probabilistic programming may carry a high cognitive burden, it will become much easier to use thanks to distributed, compiled, and accelerated systems, which allow probabilistic programming across multiple machines.
The special talk of this session was given by Max Welling (University of Amsterdam / Qualcomm), who has published numerous studies on Bayesian machine learning and is famous for his paper on the variational autoencoder. He gave a general overview of Bayesian deep learning. He opened his talk by reiterating the benefits of the Bayesian approach: 1) we can regularize models in a principled way without wasting data (no cross-validation needed), 2) we can measure uncertainty, and 3) the Bayesian approach enables rigorous model selection. He then raised an open-ended question: what should we do with quantified uncertainty when making a decision? To me this suggested that measuring uncertainty and making practical use of that quantified uncertainty might be two different things. At the end of the talk, he mentioned three lines of work in Bayesian deep learning: 1) deep Gaussian processes, 2) the information bottleneck, and 3) Bayesian dropout. Regarding the last one, he noted that the fast dropout method turned out not to be a Bayesian approach after all, and researchers are working on fully Bayesian methods.
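For readers unfamiliar with the Bayesian dropout line of work (Gal and Ghahramani's MC dropout is the best-known example): the idea is to keep dropout active at prediction time and read the spread of stochastic forward passes as a rough uncertainty estimate. A minimal sketch of the mechanics, using an untrained toy network purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# One-hidden-layer network with fixed random weights (untrained;
# this only demonstrates the mechanics, not a useful model).
W1 = rng.standard_normal((1, 50))
W2 = rng.standard_normal((50, 1))

def predict(x, p_drop=0.5):
    h = np.tanh(x @ W1)
    mask = rng.random(h.shape) > p_drop   # dropout stays ON at test time
    h = h * mask / (1 - p_drop)           # inverted-dropout rescaling
    return (h @ W2).item()

x = np.array([[0.3]])
preds = np.array([predict(x) for _ in range(200)])
# Mean = prediction; spread across stochastic passes = uncertainty proxy.
print(preds.mean(), preds.std())
```

Each stochastic pass samples a different sub-network, which is what gives the method its (approximate) Bayesian interpretation.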
An interesting technical talk was given by Gintare Karolina Dziugaite (University of Cambridge / Vector Institute), who, together with Daniel M. Roy (University of Toronto / Vector Institute), presented work on “entropy-SGD”, a modified stochastic gradient descent (SGD) method that mitigates SGD’s tendency to overfit. This overfitting problem was demonstrated by Zhang et al., whose paper won a best paper award at this year’s International Conference on Learning Representations (ICLR); it showed that deep neural networks trained with SGD can fit completely randomized labels in the training data.
Even though I attended only half of the full conference, I can say that I experienced two seemingly very different domains in the current landscape of machine learning at NIPS. The symposium on interpretable ML focused on practical applications and their effect on people, while the workshops focused on techniques for injecting a Bayesian flavor into modern neural networks. Even though these look quite different on the surface, I actually think they are converging: researchers ultimately want to build a “smarter” machine by extracting more information from a model (such as uncertainty), which can then be more accountable and transparent to humans.