Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

David Wingate
BCS / LIDS, MIT
wingated@mit.edu

Noah D. Goodman
Psychology, Stanford
ngoodman@stanford.edu

Andreas Stuhlmüller
BCS, MIT
ast@mit.edu

Jeffrey M. Siskind
ECE, Purdue
qobi@purdue.edu
Abstract
Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient-based methods, and provenance tracking, which enables efficient construction of global proposals.
1 Introduction
Probabilistic programming simplifies the development of probabilistic models by allowing modelers to specify a stochastic process using syntax that resembles modern programming languages. These languages permit arbitrary mixing of deterministic and stochastic elements, resulting in tremendous modeling flexibility. The resulting programs define probabilistic models that serve as prior distributions: running the (unconditional) program forward many times results in a distribution over execution traces, with each trace being a sample from the prior. Examples include BLOG [13], Bayesian Logic Programs [10], IBAL [18], Church [6], Stochastic MATLAB [28], and HANSEI [11].

The primary challenge in developing such languages is scalable inference. Inference can be viewed as reasoning about the posterior distribution over execution traces conditioned on a particular program output, and is difficult because of the flexibility these languages present: in principle, an inference algorithm must behave reasonably for any program a user wishes to write. Sample-based MCMC algorithms are the state-of-the-art method, due to their simplicity, universality, and compositionality. But in probabilistic modeling more generally, efficient inference algorithms are designed by taking advantage of structure in distributions. How can we find structure in a distribution defined by a probabilistic program? A key observation is that some languages, such as Church and Stochastic MATLAB, are defined in terms of an existing (non-probabilistic) language. Programs in these languages may literally be executed in their native environments, suggesting that tools from program analysis and programming language theory can be leveraged to find and exploit structure in the program for inference, much as a compiler might find and exploit structure for performance.

Here, we show how nonstandard interpretations of probabilistic programs can help craft efficient inference algorithms. Information about the structure of a distribution (such as gradients, dependencies or bounds) is generated as a monad-like side computation while executing the program. This extra information can be used to, for example, construct good MH proposals, or search efficiently for a local maximum. We focus on two such interpretations, automatic differentiation and provenance tracking, and show how they can be used as building blocks to construct efficient inference algorithms. We implement nonstandard interpretations in two different languages (Church and Stochastic MATLAB), and experimentally demonstrate that while they typically incur some additional execution overhead, they dramatically improve inference performance.
2 Background and Related Work
Alg. 1: A Gaussian-Gamma mixture

1: for i=1:1000
2:   if ( rand > 0.5 )
3:     X(i) = randn;
4:   else
5:     X(i) = gammarnd;
6:   end;
7: end;

We begin by outlining our setup, following [28]. We define an unconditioned probabilistic program to be a parameterless function
f with an arbitrary mix of stochastic and deterministic elements (hereafter, we will use the terms function and program interchangeably). The function f may be written in any language, but our running example will be MATLAB. We allow the function to be arbitrarily complex inside, using any additional functions, recursion, language constructs or external libraries it wishes. The only constraint is that the function must be self-contained, with no external side-effects which would impact the execution of the function from one run to another.

The stochastic elements of f must come from a set of known, fixed elementary random primitives, or ERPs. Complex distributions are constructed compositionally, using ERPs as building blocks. In MATLAB, ERPs may be functions such as rand (sample uniformly from [0,1]) or randn (sample from a standard normal). Higher-order random primitives, such as nonparametric distributions, may also be defined, but must be fixed ahead of time. Formally, let T be the set of ERP types. We assume that each type t ∈ T is a parametric family of distributions p_t(x | θ_t), with parameters θ_t.

Now, consider what happens while executing f. As f is executed, it encounters a series of ERPs. Alg. 1 shows an example of a simple f written in MATLAB with three syntactic ERPs: rand, randn, and gammarnd. During execution, depending on the return value of each call to rand, different paths will be taken through the program, and different ERPs will be encountered. We call this path an execution trace. A total of 2000 random choices will be made when executing this f.
Let f_{k|x_1,···,x_{k−1}} be the k'th ERP encountered while executing f, and let x_k be the value it returns. Note that the parameters passed to the k'th ERP may change depending on previous x_k's (indeed, its type may also change, as well as the total number of ERPs). We denote by x all of the random choices which are made by f, so f defines the probability distribution p(x). In our example, x ∈ R^2000. The probability p(x) is the product of the probability of each individual ERP choice:

p(x) = ∏_{k=1}^{K} p_{t_k}(x_k | θ_{t_k}, x_1, ···, x_{k−1})    (1)

again noting explicitly that types and parameters may depend arbitrarily on previous random choices. To simplify notation, we will omit the conditioning on the values of previous ERPs, but again wish to emphasize that these dependencies are critical and cannot be ignored. By f_k, it should therefore be understood that we mean f_{k|x_1,···,x_{k−1}}, and by p_{t_k}(x_k | θ_{t_k}) we mean p_{t_k}(x_k | θ_{t_k}, x_1, ···, x_{k−1}).

Generative functions as described above are, of course, easy to write. A much harder problem, and our goal in this paper, is to reason about the posterior conditional distribution p(x | y), where we define y to be a subset of random choices which we condition on and (in an abuse of notation) x to be the remaining random choices. For example, we may condition f on the X(i)'s, and reason about the sequence of rand's most likely to generate the X(i)'s. For the rest of this paper, we will drop y and simply refer to p(x), but it should be understood that the goal is always to perform inference in conditional distributions.
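As a concrete illustration of the trace view and Eq. 1, Alg. 1 can be run while recording every ERP choice and accumulating log p(x). The following is our own Python sketch (the paper's setting is MATLAB); the function name `run_mixture` and the choice of Gamma(1,1) for `gammarnd` are assumptions for illustration only.

```python
import math
import random

def run_mixture():
    """Run Alg. 1, recording every ERP choice x_k and its log-probability.

    Returns the execution trace (the sequence of random choices) and
    log p(x), the sum of the log-likelihoods of the individual ERP
    choices, i.e. the log of the product in Eq. 1.
    """
    trace, log_p = [], 0.0
    X = []
    for i in range(1000):
        u = random.random()                    # ERP of type rand: uniform on [0,1]
        trace.append(('rand', u))
        log_p += 0.0                           # log of the uniform density on [0,1]
        if u > 0.5:
            x = random.gauss(0.0, 1.0)         # ERP of type randn
            trace.append(('randn', x))
            log_p += -0.5 * x * x - 0.5 * math.log(2 * math.pi)
        else:
            x = random.gammavariate(1.0, 1.0)  # ERP of type gammarnd (shape 1 assumed)
            trace.append(('gammarnd', x))
            log_p += -x                        # log of the Gamma(1,1) density e^{-x}
        X.append(x)
    return trace, log_p

trace, log_p = run_mixture()
# Exactly 2000 ERP choices: one rand plus one randn-or-gammarnd per iteration.
```

Note that which ERP appears second in each iteration depends on the value returned by the first, which is exactly the trace-dependent structure that Eq. 1 makes explicit.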
2.1 Nonstandard Interpretations of Probabilistic Programs
With an outline of probabilistic programming in hand, we now turn to nonstandard interpretations. The idea of nonstandard interpretations originated in model theory and mathematical logic, where it was proposed that a set of axioms could be interpreted by different models. For example, differential geometry can be considered a nonstandard interpretation of classical arithmetic.

In programming, a nonstandard interpretation replaces the domain of the variables in the program with a new domain, and redefines the semantics of the operators in the program to be consistent with the new domain. This allows reuse of program syntax while implementing new functionality. For example, the expression "a ∗ b" can be interpreted equally well if a and b are either scalars or matrices, but the "∗" operator takes on different meanings. Practically, many useful nonstandard interpretations can be implemented with operator overloading: variables are redefined to be objects with operators that implement special functionality, such as tracing, reference counting, or profiling.

For the purposes of inference in probabilistic programs, we will augment each random choice x_k with additional side information s_k, and replace each x_k with the tuple ⟨x_k, s_k⟩. The native interpreter for the probabilistic program can then interpret the source code as a sequence of operations on these augmented data types. For a recent example of this, we refer the reader to [24].
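A minimal sketch of such an augmented interpretation, using operator overloading to carry side information through unchanged program syntax (here the side information is a human-readable record of the computation, the "tracing" use mentioned above; the `Traced` class and all names are ours, purely illustrative):

```python
class Traced:
    """A value x paired with side information s: here, a string recording
    the computation that produced the value. A minimal sketch only."""

    def __init__(self, value, trace=None):
        self.value = value
        self.trace = trace if trace is not None else repr(value)

    def _coerce(self, other):
        # Lift plain numbers into the augmented domain on the fly.
        return other if isinstance(other, Traced) else Traced(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Traced(self.value + other.value,
                      f"({self.trace} + {other.trace})")

    def __mul__(self, other):
        other = self._coerce(other)
        return Traced(self.value * other.value,
                      f"({self.trace} * {other.trace})")

# Unchanged expression syntax, reinterpreted over the augmented domain:
a = Traced(2.0, "x1")      # a random choice tagged with its own name
b = Traced(3.0, "x2")
c = a * b + 4.0            # "a * b + 4.0" runs as-is; side info flows along
# c.value == 10.0 and c.trace == "((x1 * x2) + 4.0)"
```

The program text never mentions `Traced`; only the domain of the leaf values changes, which is the essence of a nonstandard interpretation.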
3 Automatic Differentiation
For probabilistic models with many continuous-valued random variables, the gradient of the likelihood ∇_x p(x) provides local information that can significantly improve the properties of Monte Carlo inference algorithms. For instance, Langevin Monte Carlo [20] and Hamiltonian MCMC [15] use this gradient as part of a variable-augmentation technique (described below). We would like to be able to use gradients in the probabilistic-program setting, but p(x) is represented implicitly by the program. How can we compute its gradient? We use automatic differentiation (AD) [3, 7], a nonstandard interpretation that automatically constructs ∇_x p(x). The automatic nature of AD is critical because it relieves the programmer from hand-computing derivatives for each model; moreover, some probabilistic programs dynamically create or delete random variables, making simple closed-form expressions for the gradient very difficult to find.

Unlike finite differencing, AD computes an exact derivative of a function f at a point (up to machine precision). To do this, AD relies on the chain rule to decompose the derivative of f into derivatives of its subfunctions: ultimately, known derivatives of elementary functions are composed together to yield the derivative of the compound function. This composition can be computed as a nonstandard interpretation of the underlying elementary functions.

The derivative computation as a composition of the derivatives of the elementary functions can be performed in different orders. In forward mode AD [27], computation of the derivative proceeds by propagating perturbations of the input toward the output. This can be done by a nonstandard interpretation that extends each real value to the first two terms of its Taylor expansion [26], overloading each elementary function to operate on these real "polynomials". Because the derivatives of f at c can be extracted from the coefficients of ε in f(c + ε), this allows computation of the gradient. In reverse mode AD [25], computation of the derivative proceeds by propagating sensitivities of the output toward the input. One way this can be done is by a nonstandard interpretation that extends each real value into a "tape" that captures the trace of the real computation which led to that value from the inputs, overloading each elementary function to incrementally construct these tapes. Such a tape can be postprocessed, in a fashion analogous to backpropagation [21], to yield the gradient. These two approaches have complementary computational tradeoffs: reverse mode (which we use in our implementation) can compute the gradient of a function f : R^n → R with the same asymptotic time complexity as computing f, but not the same asymptotic space complexity (due to its need for saving the computation trace), while forward mode can compute the gradient with the same asymptotic space complexity, but with a factor of O(n) slowdown (due to its need for constructing the gradient out of partial derivatives along each independent variable).

There are implementations of AD for many languages, including Scheme (e.g., [17]), Fortran (e.g., ADIFOR [2]), C (e.g., ADOL-C [8]), C++ (e.g., FADBAD++ [1]), MATLAB (e.g., INTLAB [22]), and Maple (e.g., GRADIENT [14]); see www.autodiff.org. Additionally, overloading and AD are well-established techniques that have been applied to machine learning, and even to application-specific programming languages for machine learning, e.g., Lush [12] and Dyna [4]. In particular, Dyna applies a nonstandard interpretation for ∧ and ∨ as a semiring (× and +, + and max, ...) in a memoizing Prolog to generalize Viterbi, forward/backward, inside/outside, etc., and uses AD to derive the outside algorithm from the inside algorithm and support parameter estimation; but unlike probabilistic programming, it does not model general stochastic processes and does not do general inference over such. Our use of overloading and AD differs in that it facilitates inference in complicated models of general stochastic processes formulated as probabilistic programs. Probabilistic programming provides a powerful and convenient framework for formulating complicated models and, more importantly, separating such models from orthogonal inference mechanisms. Moreover, overloading provides a convenient mechanism for implementing many such inference mechanisms (e.g., Langevin MC, Hamiltonian MCMC, provenance tracking, as demonstrated below) in a probabilistic programming language.
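Forward mode AD as described above can be sketched with a class holding the first two Taylor terms (a "dual number" v + v′ε with ε² = 0). This is the standard textbook construction, not any of the cited implementations; the names `Dual` and `dsin` are ours.

```python
import math

class Dual:
    """First two terms of a Taylor expansion: value + deriv * eps, eps^2 = 0."""

    def __init__(self, value, deriv=0.0):
        self.value, self.deriv = value, deriv

    def _coerce(self, other):
        return other if isinstance(other, Dual) else Dual(other)

    def __add__(self, other):
        other = self._coerce(other)
        return Dual(self.value + other.value, self.deriv + other.deriv)
    __radd__ = __add__

    def __mul__(self, other):
        other = self._coerce(other)
        # Product rule: (u + u'eps)(v + v'eps) = uv + (u v' + u' v) eps.
        return Dual(self.value * other.value,
                    self.value * other.deriv + self.deriv * other.value)
    __rmul__ = __mul__

def dsin(d):
    """sin, overloaded for the dual domain via the chain rule."""
    return Dual(math.sin(d.value), math.cos(d.value) * d.deriv)

# d/dx [x * sin(x) + x] at x = 2.0, with no symbolic manipulation:
x = Dual(2.0, 1.0)        # seed the input perturbation
y = x * dsin(x) + x
# y.value holds f(2); y.deriv holds f'(2) = sin(2) + 2*cos(2) + 1
```

For an n-dimensional input, this seeding must be repeated once per coordinate, which is the source of the O(n) slowdown of forward mode noted above.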
(define (perlin-pt x y keypt power)
  (* 255 (sum (map (lambda (p2 pow)
                     (let ((x0 (floor (* p2 x))) (y0 (floor (* p2 y))))
                       (* pow (2d-interp (keypt x0 y0) (keypt (+ 1 x0) y0)
                                         (keypt x0 (+ 1 y0)) (keypt (+ 1 x0) (+ 1 y0))))))
                   powers-of-2 power))))

(define (perlin xs ys power)
  (let ([keypt (mem (lambda (x y) (/ 1 (+ 1 (exp (- (gaussian 0.0 2.0)))))))])
    (map (lambda (x) (map (lambda (y) (perlin-pt x y keypt power)) xs)) ys)))

Figure 1: Code for the structured Perlin noise generator. 2d-interp is B-spline interpolation.
3.1 Hamiltonian MCMC

Alg. 2: Hamiltonian MCMC

1: repeat forever
2:   Gibbs step:
3:     Draw momentum m ∼ N(0, σ²)
4:   Metropolis step:
5:     Start with current state (x, m)
6:     Simulate Hamiltonian dynamics to give (x′, m′)
7:     Accept w/ p = min[1, e^(−H(x′,m′)+H(x,m))]
8: end;

To illustrate the power of AD in probabilistic programming, we build on Hamiltonian MCMC (HMC), an efficient algorithm whose popularity has been somewhat limited by the necessity of computing gradients, a difficult task for complex models. Neal [15] introduces HMC as an inference method which "produces distant proposals for the Metropolis algorithm, thereby avoiding the slow exploration of the state space that results from the diffusive behavior of simple random-walk proposals."

HMC begins by augmenting the state space with "momentum variables" m. The distribution over this augmented space is e^{−H(x,m)}, where the Hamiltonian function H is decomposed into the sum of a potential energy term U(x) = −ln p(x) and a kinetic energy K(m), which is usually taken to be Gaussian. Inference proceeds by alternating between a Gibbs step and a Metropolis step: fixing the current state x, a new momentum m is sampled from the prior over m; then x and m are updated together by following a trajectory according to Hamiltonian dynamics. Discrete integration of Hamiltonian dynamics requires the gradient of H, and must be done with a symplectic (i.e., volume-preserving) integrator (following [15] we use the leapfrog method). While this is a complex computation, incorporating gradient information dramatically improves performance over vanilla random-walk style MH moves (such as Gaussian drift kernels), and its statistical efficiency also scales much better with dimensionality than simpler methods [15]. AD can also compute higher-order derivatives. For example, Hessian matrices can be used to construct blocked Metropolis moves [9] or proposals based on Newton's method [19], or as part of Riemannian manifold methods [5].
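Alg. 2 with a leapfrog integrator can be sketched as follows, for a one-dimensional standard normal target with unit-mass kinetic energy. This is our own minimal illustration with hand-supplied `U` and `grad_U`; in the paper's setting, `grad_U` would be obtained automatically by AD over the program.

```python
import math
import random

def leapfrog(x, m, grad_U, step, n_steps):
    """Symplectic leapfrog integration of Hamiltonian dynamics for
    H(x, m) = U(x) + m^2 / 2 (1-D, unit-mass kinetic energy)."""
    m = m - 0.5 * step * grad_U(x)          # initial half step for momentum
    for i in range(n_steps):
        x = x + step * m                    # full step for position
        if i < n_steps - 1:
            m = m - step * grad_U(x)        # full step for momentum
    m = m - 0.5 * step * grad_U(x)          # final half step for momentum
    return x, m

def hmc_step(x, U, grad_U, step=0.1, n_steps=20):
    """One transition of Alg. 2: Gibbs step on momentum, then Metropolis."""
    m = random.gauss(0.0, 1.0)              # Gibbs step: draw fresh momentum
    x_new, m_new = leapfrog(x, m, grad_U, step, n_steps)
    H_old = U(x) + 0.5 * m * m              # Metropolis step: compare energies
    H_new = U(x_new) + 0.5 * m_new * m_new
    if random.random() < math.exp(min(0.0, H_old - H_new)):
        return x_new
    return x

# Target: standard normal, so U(x) = x^2 / 2 and grad U(x) = x.
random.seed(0)
x, xs = 3.0, []
for _ in range(2000):
    x = hmc_step(x, U=lambda z: 0.5 * z * z, grad_U=lambda z: z)
    xs.append(x)
# The empirical mean of xs should be near 0 and its variance near 1.
```

Because the leapfrog error is small for this target, nearly all proposals are accepted, yet each trajectory moves a long distance through the state space, which is exactly the "distant proposal" behavior quoted above.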
3.2 Experiments and Results
We implemented HMC by extending BHER [28], a lightweight implementation of the Church language which provides simple, but universal, MH inference. We used an implementation of AD based on [17] that uses hygienic operator overloading to do both forward and reverse mode AD for Scheme (the target language of the BHER compiler).

The goal is to compute ∇_x p(x). By Eq. 1, p(x) is the product of the individual choices made by each x_i (though each probability can depend on previous choices, through the program evaluation). To compute p(x), BHER executes the corresponding program, accumulating likelihoods. Each time a continuous ERP is created or retrieved, we wrap it in a "tape" object which is used to track gradient information; as the likelihood p(x) is computed, these tapes flow through the program and through appropriately overloaded operators, resulting in a dependency graph for the real portion of the computation. The gradient is then computed in reverse mode, by "backpropagating" along this graph.

We implement an HMC kernel by using this gradient in the leapfrog integrator. Since program states may contain a combination of discrete and continuous ERPs, we use an overall cycle kernel which alternates between a standard MH kernel for individual discrete random variables and the HMC kernel for all continuous random choices. To decrease burn-in time, we initialize the sampler by using annealed gradient ascent (again implemented using AD).

We ran two sets of experiments that illustrate two different benefits of HMC with AD: automated gradients of complex code, and good statistical efficiency.
Structured Perlin noise generation. Our first experiment uses HMC to generate modified Perlin noise with soft symmetry structure. Perlin noise is a procedural texture used by computer graphics artists to add realism to natural textures such as clouds, grass or tree bark. We generate Perlin-like noise by layering octaves of random but smoothly varying functions. We condition the result on approximate diagonal symmetry, forcing the resulting image to incorporate additional structure without otherwise skewing the statistics of the image. Note that the MAP solution for this problem is uninteresting, as it is a uniform image; it is the variations around the MAP that provide rich texture. We generated 48x48 images; the model had roughly 1000 variables.

Figure 2: On the left: samples from the structured Perlin noise generator (showing approximate diagonal symmetry). On the right: convergence of expected mean for a draw from a 3D spherical Gaussian conditioned on lying on a line (distance to true expectation vs. number of samples, for MCMC and HMC).

Fig. 2 shows the result via typical samples generated by HMC, where the approximate symmetry is clearly visible. A code snippet demonstrating the complexity of the calculations is shown in Fig. 1; this experiment illustrates how the automatic nature of the gradients is most helpful, as it would be time-consuming to compute these gradients by hand, particularly since we are free to condition using any function of the image.
Normal distribution noisily conditioned on line (2D projection)

1: x ∼ N(µ, σ)
2: k ∼ Bernoulli(e^{−dist(line, x)/noise})
3: Condition on k = 1

Complex conditioning. For our second example, we demonstrate the improved statistical efficiency of the samples generated by HMC versus BHER's standard MCMC algorithm. The goal is to sample points from a complex 3D distribution, defined by starting with a Gaussian prior, and sampling points that are noisily conditioned to be on a line running through R³. This creates complex interactions with the prior to yield a smooth, but strongly coupled, energy landscape.

Fig. 2 compares our HMC implementation with BHER's standard MCMC engine. The x-axis denotes samples, while the y-axis denotes the convergence of an estimator of certain marginal statistics of the samples. We see that this estimator converges much faster for HMC, implying that the samples which are generated are less autocorrelated, affirming that HMC is indeed making better distal moves. HMC is about 5x slower than MCMC for this experiment, but the overhead is justified by the significant improvement in the statistical quality of the samples.
4 Provenance Tracking for Fine-Grained Dynamic Dependency Analysis
One reason gradient-based inference algorithms are effective is that the chain rule of derivatives propagates information backwards from the data up to the proposal variables. But gradients, and the chain rule, are only defined for continuous variables. Is there a corresponding structure for discrete choices? We now introduce a new nonstandard interpretation based on provenance tracking (PT). In programming language theory, the provenance of a variable is the history of variables and computations that combined to form its value. We use this idea to track fine-grained dependency information between random values and intermediate computations as they combine to form a likelihood. We then use this provenance information to construct good global proposals for discrete variables as part of a novel factored multiple-try MCMC algorithm.
4.1 Deﬁning and Implementing Provenance Tracking
Like AD, PT can be implemented with operator overloading. Because provenance information is much coarser than gradient information, the operators in PT objects have a particularly simple form; most program expressions can be covered by considering a few cases. Let X denote the set {x_i} of all (not necessarily random) variables in a program. Let R(x) ⊂ X define the provenance of a variable x. Given R(x), the provenance of expressions involving x can be computed by breaking