# Variational Inference: Bayesian Neural Networks¶

- 2017 by Thomas Wiecki, updated by Maxim Kochurov

Original blog post: http://twiecki.github.io/blog/2016/06/01/bayesian-deep-learning/

## Current trends in Machine Learning¶

There are currently three big trends in machine learning:
**Probabilistic Programming**, **Deep Learning** and “**Big Data**”.
Inside of PP, a lot of innovation is in making things scale using
**Variational Inference**. In this blog post, I will show how to use
**Variational Inference** in
PyMC3 to fit a simple Bayesian
Neural Network. I will also discuss how bridging Probabilistic
Programming and Deep Learning can open up very interesting avenues to
explore in future research.

### Probabilistic Programming at scale¶

**Probabilistic Programming** allows very flexible creation of custom
probabilistic models and is mainly concerned with **insight** and
learning from your data. The approach is inherently **Bayesian** so we
can specify **priors** to inform and constrain our models and get
uncertainty estimation in form of a **posterior** distribution. Using
MCMC sampling
algorithms
we can draw samples from this posterior to very flexibly estimate these
models. PyMC3 and
Stan are the current state-of-the-art tools to
consruct and estimate these models. One major drawback of sampling,
however, is that it’s often very slow, especially for high-dimensional
models. That’s why more recently, **variational inference** algorithms
have been developed that are almost as flexible as MCMC but much faster.
Instead of drawing samples from the posterior, these algorithms instead
fit a distribution (e.g. normal) to the posterior turning a sampling
problem into and optimization problem.
ADVI – Automatic Differentation
Variational Inference – is implemented in
PyMC3 and
Stan, as well as a new package called
Edward which is mainly
concerned with Variational Inference.

Unfortunately, when it comes to traditional ML problems like classification or (non-linear) regression, Probabilistic Programming often plays second fiddle (in terms of accuracy and scalability) to more algorithmic approaches like ensemble learning (e.g. random forests or gradient boosted regression trees.

### Deep Learning¶

Now in its third renaissance, deep learning has been making headlines repeatadly by dominating almost any object recognition benchmark, kicking ass at Atari games, and beating the world-champion Lee Sedol at Go. From a statistical point, Neural Networks are extremely good non-linear function approximators and representation learners. While mostly known for classification, they have been extended to unsupervised learning with AutoEncoders and in all sorts of other interesting ways (e.g. Recurrent Networks, or MDNs to estimate multimodal distributions). Why do they work so well? No one really knows as the statistical properties are still not fully understood.

A large part of the innoviation in deep learning is the ability to train these extremely complex models. This rests on several pillars: * Speed: facilitating the GPU allowed for much faster processing. * Software: frameworks like Theano and TensorFlow allow flexible creation of abstract models that can then be optimized and compiled to CPU or GPU. * Learning algorithms: training on sub-sets of the data – stochastic gradient descent – allows us to train these models on massive amounts of data. Techniques like drop-out avoid overfitting. * Architectural: A lot of innovation comes from changing the input layers, like for convolutional neural nets, or the output layers, like for MDNs.

### Bridging Deep Learning and Probabilistic Programming¶

On one hand we Probabilistic Programming which allows us to build rather small and focused models in a very principled and well-understood way to gain insight into our data; on the other hand we have deep learning which uses many heuristics to train huge and highly complex models that are amazing at prediction. Recent innovations in variational inference allow probabilistic programming to scale model complexity as well as data size. We are thus at the cusp of being able to combine these two approaches to hopefully unlock new innovations in Machine Learning. For more motivation, see also Dustin Tran’s recent blog post.

While this would allow Probabilistic Programming to be applied to a much
wider set of interesting problems, I believe this bridging also holds
great promise for innovations in Deep Learning. Some ideas are: *
**Uncertainty in predictions**: As we will see below, the Bayesian
Neural Network informs us about the uncertainty in its predictions. I
think uncertainty is an underappreciated concept in Machine Learning as
it’s clearly important for real-world applications. But it could also be
useful in training. For example, we could train the model specifically
on samples it is most uncertain about. * **Uncertainty in
representations**: We also get uncertainty estimates of our weights
which could inform us about the stability of the learned representations
of the network. * **Regularization with priors**: Weights are often
L2-regularized to avoid overfitting, this very naturally becomes a
Gaussian prior for the weight coefficients. We could, however, imagine
all kinds of other priors, like spike-and-slab to enforce sparsity (this
would be more like using the L1-norm). * **Transfer learning with
informed priors**: If we wanted to train a network on a new object
recognition data set, we could bootstrap the learning by placing
informed priors centered around weights retrieved from other pre-trained
networks, like GoogLeNet. *
**Hierarchical Neural Networks**: A very powerful approach in
Probabilistic Programming is hierarchical modeling that allows pooling
of things that were learned on sub-groups to the overall population (see
my tutorial on Hierarchical Linear Regression in
PyMC3).
Applied to Neural Networks, in hierarchical data sets, we could train
individual neural nets to specialize on sub-groups while still being
informed about representations of the overall population. For example,
imagine a network trained to classify car models from pictures of cars.
We could train a hierarchical neural network where a sub-neural network
is trained to tell apart models from only a single manufacturer. The
intuition being that all cars from a certain manufactures share certain
similarities so it would make sense to train individual networks that
specialize on brands. However, due to the individual networks being
connected at a higher layer, they would still share information with the
other specialized sub-networks about features that are useful to all
brands. Interestingly, different layers of the network could be informed
by various levels of the hierarchy – e.g. early layers that extract
visual lines could be identical in all sub-networks while the
higher-order representations would be different. The hierarchical model
would learn all that from the data. * **Other hybrid architectures**:
We can more freely build all kinds of neural networks. For example,
Bayesian non-parametrics could be used to flexibly adjust the size and
shape of the hidden layers to optimally scale the network architecture
to the problem at hand during training. Currently, this requires costly
hyper-parameter optimization and a lot of tribal knowledge.

## Bayesian Neural Networks in PyMC3¶

### Generating data¶

First, lets generate some toy data – a simple binary classification problem that’s not linearly separable.

```
In [1]:
```

```
%matplotlib inline
import theano
floatX = theano.config.floatX
import pymc3 as pm
import theano.tensor as T
import sklearn
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings
filterwarnings('ignore')
sns.set_style('white')
from sklearn import datasets
from sklearn.preprocessing import scale
from sklearn.cross_validation import train_test_split
from sklearn.datasets import make_moons
```

```
In [2]:
```

```
X, Y = make_moons(noise=0.2, random_state=0, n_samples=1000)
X = scale(X)
X = X.astype(floatX)
Y = Y.astype(floatX)
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=.5)
```

```
In [3]:
```

```
fig, ax = plt.subplots()
ax.scatter(X[Y==0, 0], X[Y==0, 1], label='Class 0')
ax.scatter(X[Y==1, 0], X[Y==1, 1], color='r', label='Class 1')
sns.despine(); ax.legend()
ax.set(xlabel='X', ylabel='Y', title='Toy binary classification data set');
```

### Model specification¶

A neural network is quite simple. The basic unit is a perceptron which is nothing more than logistic regression. We use many of these in parallel and then stack them up to get hidden layers. Here we will use 2 hidden layers with 5 neurons each which is sufficient for such a simple problem.

```
In [4]:
```

```
def construct_nn(ann_input, ann_output):
n_hidden = 5
# Initialize random weights between each layer
init_1 = np.random.randn(X.shape[1], n_hidden).astype(floatX)
init_2 = np.random.randn(n_hidden, n_hidden).astype(floatX)
init_out = np.random.randn(n_hidden).astype(floatX)
with pm.Model() as neural_network:
# Weights from input to hidden layer
weights_in_1 = pm.Normal('w_in_1', 0, sd=1,
shape=(X.shape[1], n_hidden),
testval=init_1)
# Weights from 1st to 2nd layer
weights_1_2 = pm.Normal('w_1_2', 0, sd=1,
shape=(n_hidden, n_hidden),
testval=init_2)
# Weights from hidden layer to output
weights_2_out = pm.Normal('w_2_out', 0, sd=1,
shape=(n_hidden,),
testval=init_out)
# Build neural-network using tanh activation function
act_1 = pm.math.tanh(pm.math.dot(ann_input,
weights_in_1))
act_2 = pm.math.tanh(pm.math.dot(act_1,
weights_1_2))
act_out = pm.math.sigmoid(pm.math.dot(act_2,
weights_2_out))
# Binary classification -> Bernoulli likelihood
out = pm.Bernoulli('out',
act_out,
observed=ann_output,
total_size=Y_train.shape[0] # IMPORTANT for minibatches
)
return neural_network
# Trick: Turn inputs and outputs into shared variables.
# It's still the same thing, but we can later change the values of the shared variable
# (to switch in the test-data later) and pymc3 will just use the new data.
# Kind-of like a pointer we can redirect.
# For more info, see: http://deeplearning.net/software/theano/library/compile/shared.html
ann_input = theano.shared(X_train)
ann_output = theano.shared(Y_train)
neural_network = construct_nn(ann_input, ann_output)
```

That’s not so bad. The `Normal`

priors help regularize the weights.
Usually we would add a constant `b`

to the inputs but I omitted it
here to keep the code cleaner.

### Variational Inference: Scaling model complexity¶

We could now just run a MCMC sampler like
``NUTS`

<http://pymc-devs.github.io/pymc3/api.html#nuts>`__ which
works pretty well in this case but as I already mentioned, this will
become very slow as we scale our model up to deeper architectures with
more layers.

Instead, we will use the brand-new
ADVI variational
inference algorithm which was recently added to `PyMC3`

, and updated
to use the operator variational inference (OPVI) framework. This is much
faster and will scale better. Note, that this is a mean-field
approximation so we ignore correlations in the posterior.

```
In [5]:
```

```
from pymc3.theanof import set_tt_rng, MRG_RandomStreams
set_tt_rng(MRG_RandomStreams(42))
```

We run ADVI to estimate posterior means, standard deviations, and the
evidence lower bound (ELBO). The `cost_part_grad_scale`

argument is
specified to reduce the variance of the gradient in the
model.

```
In [6]:
```

```
%%time
with neural_network:
s = theano.shared(pm.floatX(1))
inference = pm.ADVI(cost_part_grad_scale=s)
# ADVI has nearly converged
pm.fit(n=20000, method=inference)
# It is time to set `s` to zero
s.set_value(0)
approx = pm.fit(n=30000)
```

```
Average Loss = 206.08: 100%|██████████| 20000/20000 [00:04<00:00, 4420.54it/s]
Finished [100%]: Average Loss = 206.04
Average Loss = 198.47: 100%|██████████| 30000/30000 [00:07<00:00, 4218.28it/s]
Finished [100%]: Average Loss = 198.42
```

```
CPU times: user 15.2 s, sys: 654 ms, total: 15.9 s
Wall time: 16.4 s
```

< 20 seconds on my older laptop. That’s pretty good considering that NUTS is having a really hard time. Further below we make this even faster. To make it really fly, we probably want to run the Neural Network on the GPU.

As samples are more convenient to work with, we can very quickly draw
samples from the variational approximation using the `sample`

method
(this is just sampling from Normal distributions, so not at all the same
like MCMC):

```
In [7]:
```

```
trace = approx.sample(draws=5000)
```

Plotting the objective function (ELBO) we can see that the optimization slowly improves the fit over time.

```
In [13]:
```

```
plt.plot(-inference.hist)
plt.ylabel('ELBO')
plt.xlabel('iteration')
```

```
Out[13]:
```

```
<matplotlib.text.Text at 0x12f368a58>
```

Now that we trained our model, lets predict on the hold-out set using a posterior predictive check (PPC).

- We can use
``sample_ppc()`

<http://pymc-devs.github.io/pymc3/api.html#pymc3.sampling.sample_ppc>`__ to generate new data (in this case class predictions) from the posterior (sampled from the variational estimation). - It is better to get the node directly and build theano graph using
our approximation (
`approx.sample_node`

) , we get a lot of speed up

```
In [15]:
```

```
# create symbolic input
x = T.matrix('X')
# symbolic number of samples is supported, we build vectorized posterior on the fly
n = T.iscalar('n')
# Do not forget test_values or set theano.config.compute_test_value = 'off'
x.tag.test_value = np.empty_like(X_train[:10])
n.tag.test_value = 100
_sample_proba = approx.sample_node(neural_network.out.distribution.p, size=n,
more_replacements={ann_input:x})
```

We can now compile the function, which uses an efficient vectorized form of sampling.

```
In [17]:
```

```
sample_proba = theano.function([x, n], _sample_proba)
```

Let’s create a couple of benchmark functions

```
In [18]:
```

```
def production_step1():
ann_input.set_value(X_test)
ann_output.set_value(Y_test)
ppc = pm.sample_ppc(trace, model=neural_network, samples=500, progressbar=False)
# Use probability of > 0.5 to assume prediction of class 1
pred = ppc['out'].mean(axis=0) > 0.5
def production_step2():
sample_proba(X_test, 500).mean(0) > 0.5
```

```
In [19]:
```

```
%timeit production_step1()
```

```
185 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```

```
In [20]:
```

```
%timeit production_step2()
```

```
43.4 ms ± 5.1 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
```

Let’s go ahead and generate predictions:

```
In [21]:
```

```
pred = sample_proba(X_test, 500).mean(0) > 0.5
```

```
In [22]:
```

```
fig, ax = plt.subplots()
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
sns.despine()
ax.set(title='Predicted labels in testing set', xlabel='X', ylabel='Y');
```

```
In [23]:
```

```
print('Accuracy = {}%'.format((Y_test == pred).mean() * 100))
```

```
Accuracy = 87.0%
```

Hey, our neural network did all right!

## Lets look at what the classifier has learned¶

For this, we evaluate the class probability predictions on a grid over the whole input space.

```
In [24]:
```

```
grid = np.mgrid[-3:3:100j,-3:3:100j].astype(floatX)
grid_2d = grid.reshape(2, -1).T
dummy_out = np.ones(grid.shape[1], dtype=np.int8)
```

```
In [25]:
```

```
ppc = sample_proba(grid_2d ,500)
```

### Probability surface¶

```
In [27]:
```

```
cmap = sns.diverging_palette(250, 12, s=85, l=25, as_cmap=True)
fig, ax = plt.subplots(figsize=(16, 9))
contour = ax.contourf(grid[0], grid[1], ppc.mean(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Posterior predictive mean probability of class label = 0');
```

### Uncertainty in predicted value¶

So far, everything I showed we could have done with a non-Bayesian Neural Network. The mean of the posterior predictive for each class-label should be identical to maximum likelihood predicted values. However, we can also look at the standard deviation of the posterior predictive to get a sense for the uncertainty in our predictions. Here is what that looks like:

```
In [28]:
```

```
cmap = sns.cubehelix_palette(light=1, as_cmap=True)
fig, ax = plt.subplots(figsize=(16, 9))
contour = ax.contourf(grid[0], grid[1], ppc.std(axis=0).reshape(100, 100), cmap=cmap)
ax.scatter(X_test[pred==0, 0], X_test[pred==0, 1])
ax.scatter(X_test[pred==1, 0], X_test[pred==1, 1], color='r')
cbar = plt.colorbar(contour, ax=ax)
_ = ax.set(xlim=(-3, 3), ylim=(-3, 3), xlabel='X', ylabel='Y');
cbar.ax.set_ylabel('Uncertainty (posterior predictive standard deviation)');
```

We can see that very close to the decision boundary, our uncertainty as to which label to predict is highest. You can imagine that associating predictions with uncertainty is a critical property for many applications like health care. To further maximize accuracy, we might want to train the model primarily on samples from that high-uncertainty region.

## Mini-batch ADVI: Scaling data size¶

So far, we have trained our model on all data at once. Obviously this won’t scale to something like ImageNet. Moreover, training on mini-batches of data (stochastic gradient descent) avoids local minima and can lead to faster convergence.

Fortunately, ADVI can be run on mini-batches as well. It just requires some setting up:

```
In [29]:
```

```
# Generator that returns mini-batches in each iteration
def create_minibatch(data):
rng = np.random.RandomState(0)
while True:
# Return random data samples of set size 100 each iteration
ixs = rng.randint(len(data), size=50)
yield data[ixs]
```

# Minibatch ADVI¶

To train the model with minibatches, we just need to wrap python
generators with the PyMC `generator`

function.

```
In [30]:
```

```
minibatch_x = pm.generator(create_minibatch(X_train))
minibatch_y = pm.generator(create_minibatch(Y_train))
neural_network_minibatch = construct_nn(minibatch_x, minibatch_y)
with neural_network_minibatch:
approx = pm.fit(40000, method=pm.ADVI())
```

```
Average Loss = 137.26: 100%|██████████| 40000/40000 [00:07<00:00, 5490.28it/s]
Finished [100%]: Average Loss = 137.12
```

```
In [31]:
```

```
plt.plot(inference.hist)
plt.ylabel('ELBO')
plt.xlabel('iteration');
```

As you can see, mini-batch ADVI’s running time is much lower. It also seems to converge faster.

For fun, we can also look at the trace. The point is that we also get uncertainty of our Neural Network weights.

```
In [32]:
```

```
pm.traceplot(trace);
```

## Summary¶

Hopefully this blog post demonstrated a very powerful new inference algorithm available in PyMC3: ADVI. I also think bridging the gap between Probabilistic Programming and Deep Learning can open up many new avenues for innovation in this space, as discussed above. Specifically, a hierarchical neural network sounds pretty bad-ass. These are really exciting times.

## Next steps¶

``Theano`

<http://deeplearning.net/software/theano/>`__, which is used
by `PyMC3`

as its computational backend, was mainly developed for
estimating neural networks and there are great libraries like
``Lasagne`

<https://github.com/Lasagne/Lasagne>`__ that build on top
of `Theano`

to make construction of the most common neural network
architectures easy. Ideally, we wouldn’t have to build the models by
hand as I did above, but use the convenient syntax of `Lasagne`

to
construct the architecture, define our priors, and run ADVI.

You can also run this example on the GPU by setting `device = gpu`

and
`floatX = float32`

in your `.theanorc`

.

You might also argue that the above network isn’t really deep, but note that we could easily extend it to have more layers, including convolutional ones to train on more challenging data sets.

I also presented some of this work at PyData London, view the video below:

Finally, you can download this NB here. Leave a comment below, and follow me on twitter.

## Acknowledgements¶

Taku Yoshioka did a lot of work on ADVI in PyMC3, including the mini-batch implementation as well as the sampling from the variational posterior. I’d also like to the thank the Stan guys (specifically Alp Kucukelbir and Daniel Lee) for deriving ADVI and teaching us about it. Thanks also to Chris Fonnesbeck, Andrew Campbell, Taku Yoshioka, and Peadar Coyle for useful comments on an earlier draft.