Nonstandard Interpretations of Probabilistic Programs for Efficient Inference

David Wingate, BCS / LIDS, MIT, wingated@mit.edu
Noah D. Goodman, Psychology, Stanford, ngoodman@stanford.edu
Andreas Stuhlmüller, BCS, MIT, ast@mit.edu
Jeffrey M. Siskind, ECE, Purdue, qobi@purdue.edu

Abstract

Probabilistic programming languages allow modelers to specify a stochastic process using syntax that resembles modern programming languages. Because the program is in machine-readable format, a variety of techniques from compiler design and program analysis can be used to examine the structure of the distribution represented by the probabilistic program. We show how nonstandard interpretations of probabilistic programs can be used to craft efficient inference algorithms: information about the structure of a distribution (such as gradients or dependencies) is generated as a monad-like side computation while executing the program. These interpretations can be easily coded using special-purpose objects and operator overloading. We implement two examples of nonstandard interpretations in two different languages, and use them as building blocks to construct inference algorithms: automatic differentiation, which enables gradient-based methods, and provenance tracking, which enables efficient construction of global proposals.

1 Introduction

Probabilistic programming simplifies the development of probabilistic models by allowing modelers to specify a stochastic process using syntax that resembles modern programming languages. These languages permit arbitrary mixing of deterministic and stochastic elements, resulting in tremendous modeling flexibility. The resulting programs define probabilistic models that serve as prior distributions: running the (unconditional) program forward many times results in a distribution over execution traces, with each trace being a sample from the prior. Examples include BLOG [13], Bayesian Logic Programs [10], IBAL [18], CHURCH [6], Stochastic MATLAB [28], and HANSEI [11].

The primary challenge in developing such languages is scalable inference. Inference can be viewed as reasoning about the posterior distribution over execution traces conditioned on a particular program output, and is difficult because of the flexibility these languages present: in principle, an inference algorithm must behave reasonably for any program a user wishes to write. Sample-based MCMC algorithms are the state-of-the-art method, due to their simplicity, universality, and compositionality. But in probabilistic modeling more generally, efficient inference algorithms are designed by taking advantage of structure in distributions. How can we find structure in a distribution defined by a probabilistic program? A key observation is that some languages, such as CHURCH and Stochastic MATLAB, are defined in terms of an existing (non-probabilistic) language. Programs in these languages may literally be executed in their native environments, suggesting that tools from program analysis and programming language theory can be leveraged to find and exploit structure in the program for inference, much as a compiler might find and exploit structure for performance.

Here, we show how nonstandard interpretations of probabilistic programs can help craft efficient inference algorithms. Information about the structure of a distribution (such as gradients, dependencies or bounds) is generated as a monad-like side computation while executing the program.
This extra information can be used to, for example, construct good MH proposals, or search efficiently for a local maximum. We focus on two such interpretations: automatic differentiation and provenance tracking, and show how they can be used as building blocks to construct efficient inference algorithms. We implement nonstandard interpretations in two different languages (CHURCH and Stochastic MATLAB), and experimentally demonstrate that while they typically incur some additional execution overhead, they dramatically improve inference performance.

2 Background and Related Work

Alg. 1: A Gaussian-Gamma mixture
1: for i=1:1000
2:   if ( rand > 0.5 )
3:     X(i) = randn;
4:   else
5:     X(i) = gammarnd;
6:   end;
7: end;

We begin by outlining our setup, following [28]. We define an unconditioned probabilistic program to be a parameterless function f with an arbitrary mix of stochastic and deterministic elements (hereafter, we will use the term function and program interchangeably). The function f may be written in any language, but our running example will be MATLAB. We allow the function to be arbitrarily complex inside, using any additional functions, recursion, language constructs or external libraries it wishes. The only constraint is that the function must be self-contained, with no external side-effects which would impact the execution of the function from one run to another.

The stochastic elements of f must come from a set of known, fixed elementary random primitives, or ERPs. Complex distributions are constructed compositionally, using ERPs as building blocks. In MATLAB, ERPs may be functions such as rand (sample uniformly from [0,1]) or randn (sample from a standard normal). Higher-order random primitives, such as nonparametric distributions, may also be defined, but must be fixed ahead of time. Formally, let T be the set of ERP types. We assume that each type t ∈ T is a parametric family of distributions p_t(x | θ_t), with parameters θ_t.

Now, consider what happens while executing f. As f is executed, it encounters a series of ERPs. Alg. 1 shows an example of a simple f written in MATLAB with three syntactic ERPs: rand, randn, and gammarnd. During execution, depending on the return value of each call to rand, different paths will be taken through the program, and different ERPs will be encountered. We call this path an execution trace. A total of 2000 random choices will be made when executing this f.

Let f_{k | x_1, ..., x_{k-1}} be the k'th ERP encountered while executing f, and let x_k be the value it returns. Note that the parameters passed to the k'th ERP may change depending on previous x_k's (indeed, its type may also change, as well as the total number of ERPs). We denote by x all of the random choices which are made by f, so f defines the probability distribution p(x). In our example, x ∈ R^{2000}. The probability p(x) is the product of the probability of each individual ERP choice:

    p(x) = \prod_{k=1}^{K} p_{t_k}(x_k | \theta_{t_k}, x_1, ..., x_{k-1})    (1)

again noting explicitly that types and parameters may depend arbitrarily on previous random choices. To simplify notation, we will omit the conditioning on the values of previous ERPs, but again wish to emphasize that these dependencies are critical and cannot be ignored. By f_k, it should therefore be understood that we mean f_{k | x_1, ..., x_{k-1}}, and by p_{t_k}(x_k | \theta_{t_k}) we mean p_{t_k}(x_k | \theta_{t_k}, x_1, ..., x_{k-1}).

Generative functions as described above are, of course, easy to write.
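To make Eq. 1 concrete, here is a minimal MATLAB sketch, illustrative only and not part of the Stochastic MATLAB system, that runs the program of Alg. 1 while recording every random choice and summing the log-probability of each choice. The traced_* helper names are hypothetical; gammarnd is stood in for by gamrnd/gampdf from the Statistics Toolbox with an assumed unit shape and scale.

    % run_mixture_traced.m -- a sketch: execute the program of Alg. 1, recording
    % the value x_k of every ERP encountered and accumulating log p(x) as in Eq. 1.
    function [X, xs, logp] = run_mixture_traced()
        xs   = [];            % the execution trace x = (x_1, ..., x_K)
        logp = 0;             % running value of log p(x)
        X    = zeros(1, 1000);
        for i = 1:1000
            [u, lp] = traced_rand();            % ERP of type rand
            xs(end+1) = u;  logp = logp + lp;
            if u > 0.5
                [v, lp] = traced_randn();       % ERP of type randn
            else
                [v, lp] = traced_gamma(1, 1);   % stands in for gammarnd; shape/scale assumed
            end
            xs(end+1) = v;  logp = logp + lp;
            X(i) = v;
        end
    end

    function [x, lp] = traced_rand()
        x = rand();  lp = 0;                    % log of the Uniform(0,1) density
    end

    function [x, lp] = traced_randn()
        x = randn(); lp = -0.5*log(2*pi) - 0.5*x^2;   % log of the standard normal density
    end

    function [x, lp] = traced_gamma(a, b)
        x  = gamrnd(a, b);                      % requires the Statistics Toolbox
        lp = log(gampdf(x, a, b));
    end

As the text notes, each of the 1000 iterations makes two random choices, so a single run produces a trace of 2000 values together with its (unconditional) log-probability.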
A much harder problem, and our goal in this paper, is to reason about the posterior conditional distribution p(x | y), where we define y to be a subset of random choices which we condition on and (in an abuse of notation) x to be the remaining random choices. For example, we may condition f on the X(i)'s, and reason about the sequence of rand's most likely to generate the X(i)'s. For the rest of this paper, we will drop y and simply refer to p(x), but it should be understood that the goal is always to perform inference in conditional distributions.

2.1 Nonstandard Interpretations of Probabilistic Programs

With an outline of probabilistic programming in hand, we now turn to nonstandard interpretations. The idea of nonstandard interpretations originated in model theory and mathematical logic, where it was proposed that a set of axioms could be interpreted by different models. For example, differential geometry can be considered a nonstandard interpretation of classical arithmetic.

In programming, a nonstandard interpretation replaces the domain of the variables in the program with a new domain, and redefines the semantics of the operators in the program to be consistent with the new domain. This allows reuse of program syntax while implementing new functionality. For example, the expression "a * b" can be interpreted equally well if a and b are either scalars or matrices, but the "*" operator takes on different meanings. Practically, many useful nonstandard interpretations can be implemented with operator overloading: variables are redefined to be objects with operators that implement special functionality, such as tracing, reference counting, or profiling.

For the purposes of inference in probabilistic programs, we will augment each random choice x_k with additional side information s_k, and replace each x_k with the tuple ⟨x_k, s_k⟩. The native interpreter for the probabilistic program can then interpret the source code as a sequence of operations on these augmented data types. For a recent example of this, we refer the reader to [24].

3 Automatic Differentiation

For probabilistic models with many continuous-valued random variables, the gradient of the likelihood ∇_x p(x) provides local information that can significantly improve the properties of Monte-Carlo inference algorithms. For instance, Langevin Monte-Carlo [20] and Hamiltonian MCMC [15] use this gradient as part of a variable-augmentation technique (described below). We would like to be able to use gradients in the probabilistic-program setting, but p(x) is represented implicitly by the program. How can we compute its gradient? We use automatic differentiation (AD) [3, 7], a nonstandard interpretation that automatically constructs ∇_x p(x). The automatic nature of AD is critical because it relieves the programmer from hand-computing derivatives for each model; moreover, some probabilistic programs dynamically create or delete random variables, making simple closed-form expressions for the gradient very difficult to find.

Unlike finite differencing, AD computes an exact derivative of a function f at a point (up to machine precision). To do this, AD relies on the chain rule to decompose the derivative of f into derivatives of its sub-functions: ultimately, known derivatives of elementary functions are composed together to yield the derivative of the compound function. This composition can be computed as a nonstandard interpretation of the underlying elementary functions.
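As a minimal illustration of this composition, the MATLAB class below, a sketch rather than the AD package used in the paper (the name Dual and the choice of overloaded operations are assumptions), pairs each value with a derivative and overloads a few elementary operations so that the chain rule is applied automatically as a program runs; this corresponds to the forward-mode interpretation described next.

    % Dual.m -- a minimal forward-style AD sketch via operator overloading.
    % Each Dual carries a value and its derivative with respect to one chosen input.
    classdef Dual
        properties
            val   % primal value
            dot   % derivative of this value w.r.t. the chosen input
        end
        methods
            function obj = Dual(v, d)
                if nargin < 2, d = 0; end
                obj.val = v;  obj.dot = d;
            end
            function c = plus(a, b)          % a + b
                [a, b] = Dual.lift(a, b);
                c = Dual(a.val + b.val, a.dot + b.dot);
            end
            function c = mtimes(a, b)        % a * b  (product rule)
                [a, b] = Dual.lift(a, b);
                c = Dual(a.val * b.val, a.dot * b.val + a.val * b.dot);
            end
            function c = exp(a)              % known derivative of an elementary function
                c = Dual(exp(a.val), exp(a.val) * a.dot);
            end
        end
        methods (Static)
            function [a, b] = lift(a, b)     % treat plain numbers as constants
                if ~isa(a, 'Dual'), a = Dual(a, 0); end
                if ~isa(b, 'Dual'), b = Dual(b, 0); end
            end
        end
    end

For example, x = Dual(2, 1); y = exp(x * x); gives y.dot = 4*exp(4), the exact derivative of exp(x^2) at x = 2.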
The derivative computation as a composition of the derivatives of the elementary functions can be performed in different orders. In forward mode AD [27], computation of the derivative proceeds by propagating perturbations of the input toward the output. This can be done by a nonstandard interpretation that extends each real value to the first two terms of its Taylor expansion [26], overloading each elementary function to operate on these real "polynomials". Because the derivatives of f at c can be extracted from the coefficients of ε in f(c + ε), this allows computation of the gradient.

In reverse mode AD [25], computation of the derivative proceeds by propagating sensitivities of the output toward the input. One way this can be done is by a nonstandard interpretation that extends each real value into a "tape" that captures the trace of the real computation which led to that value from the inputs, overloading each elementary function to incrementally construct these tapes. Such a tape can be postprocessed, in a fashion analogous to backpropagation [21], to yield the gradient.

These two approaches have complementary computational tradeoffs: reverse mode (which we use in our implementation) can compute the gradient of a function f : R^n → R with the same asymptotic time complexity as computing f, but not the same asymptotic space complexity (due to its need for saving the computation trace), while forward mode can compute the gradient with the same asymptotic space complexity, but with a factor of O(n) slowdown (due to its need for constructing the gradient out of partial derivatives along each independent variable).

There are implementations of AD for many languages, including SCHEME (e.g., [17]), FORTRAN (e.g., ADIFOR [2]), C (e.g., ADOL-C [8]), C++ (e.g., FADBAD++ [1]), MATLAB (e.g., INTLAB [22]), and MAPLE (e.g., GRADIENT [14]). See www.autodiff.org. Additionally, overloading and AD are well-established techniques that have been applied to machine learning, and even to application-specific programming languages for machine learning, e.g., LUSH [12] and DYNA [4]. In particular, DYNA applies a nonstandard interpretation for ∧ and ∨ as a semiring (× and +, + and max, ...) in a memoizing PROLOG to generalize Viterbi, forward/backward, inside/outside, etc., and uses AD to derive the outside algorithm from the inside algorithm and support parameter estimation; but unlike probabilistic programming, it does not model general stochastic processes and does not do general inference over such processes. Our use of overloading and AD differs in that it facilitates inference in complicated models of general stochastic processes formulated as probabilistic programs. Probabilistic programming provides a powerful and convenient framework for formulating complicated models and, more importantly, separating such models from orthogonal inference mechanisms. Moreover, overloading provides a convenient mechanism for implementing many such inference mechanisms (e.g., Langevin MC, Hamiltonian MCMC, provenance tracking, as demonstrated below) in a probabilistic programming language.
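To make the reverse-mode "tape" idea concrete, the sketch below, again illustrative rather than the hygienic-overloading Scheme implementation of [17] used in the paper (the class name TapeNode is an assumption), has each overloaded operation record its inputs and local partial derivatives; a backward pass then propagates sensitivities from the output toward the inputs.

    % TapeNode.m -- a minimal reverse-mode AD sketch.  Overloaded operations build a
    % trace ("tape") of the computation; backprop then propagates sensitivities of
    % the output back toward the inputs.
    classdef TapeNode < handle
        properties
            val                 % primal value
            parents = {}        % nodes this value was computed from
            pgrads  = []        % local derivative w.r.t. each parent
            adj     = 0         % accumulated sensitivity d(output)/d(this node)
        end
        methods
            function obj = TapeNode(v, parents, pgrads)
                if nargin > 1, obj.parents = parents; obj.pgrads = pgrads; end
                obj.val = v;
            end
            function c = plus(a, b)
                [a, b] = TapeNode.lift(a, b);
                c = TapeNode(a.val + b.val, {a, b}, [1, 1]);
            end
            function c = mtimes(a, b)
                [a, b] = TapeNode.lift(a, b);
                c = TapeNode(a.val * b.val, {a, b}, [b.val, a.val]);
            end
            function c = log(a)
                c = TapeNode(log(a.val), {a}, 1 / a.val);
            end
            function backprop(obj, adj)
                % push sensitivities backwards through the recorded tape
                obj.adj = obj.adj + adj;
                for i = 1:numel(obj.parents)
                    backprop(obj.parents{i}, adj * obj.pgrads(i));
                end
            end
        end
        methods (Static)
            function [a, b] = lift(a, b)   % constants become leaf nodes
                if ~isa(a, 'TapeNode'), a = TapeNode(a); end
                if ~isa(b, 'TapeNode'), b = TapeNode(b); end
            end
        end
    end

For example, after x = TapeNode(3); y = log(x * x); backprop(y, 1); the field x.adj holds 2/3, the derivative of log(x^2) at x = 3.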
(define (perlin-pt x y keypt power)
  (* 255 (sum (map (lambda (p2 pow)
                     (let ((x0 (floor (* p2 x))) (y0 (floor (* p2 y))))
                       (* pow (2d-interp (keypt x0 y0) (keypt (+ 1 x0) y0)
                                         (keypt x0 (+ 1 y0)) (keypt (+ 1 x0) (+ 1 y0))))))
                   powers-of-2 power))))

(define (perlin xs ys power)
  (let ([keypt (mem (lambda (x y) (/ 1 (+ 1 (exp (- (gaussian 0.0 2.0)))))))])
    (map (lambda (x) (map (lambda (y) (perlin-pt x y keypt power)) xs)) ys)))

Figure 1: Code for the structured Perlin noise generator. 2d-interp is B-spline interpolation.

3.1 Hamiltonian MCMC

Alg. 2: Hamiltonian MCMC
1: repeat forever
2:   Gibbs step:
3:     Draw momentum m ∼ N(0, σ²)
4:   Metropolis step:
5:     Start with current state (x, m)
6:     Simulate Hamiltonian dynamics to give (x′, m′)
7:     Accept w/ p = min[1, e^{−H(x′,m′)+H(x,m)}]
8: end;

To illustrate the power of AD in probabilistic programming, we build on Hamiltonian MCMC (HMC), an efficient algorithm whose popularity has been somewhat limited by the necessity of computing gradients, a difficult task for complex models. Neal [15] introduces HMC as an inference method which "produces distant proposals for the Metropolis algorithm, thereby avoiding the slow exploration of the state space that results from the diffusive behavior of simple random-walk proposals."

HMC begins by augmenting the state space with "momentum variables" m. The distribution over this augmented space is e^{−H(x,m)}, where the Hamiltonian function H is decomposed into the sum of a potential energy term U(x) = −ln p(x) and a kinetic energy term K(m), which is usually taken to be Gaussian. Inference proceeds by alternating between a Gibbs step and a Metropolis step: fixing the current state x, a new momentum m is sampled from the prior over m; then x and m are updated together by following a trajectory according to Hamiltonian dynamics. Discrete integration of Hamiltonian dynamics requires the gradient of H, and must be done with a symplectic (i.e. volume-preserving) integrator (following [15] we use the Leapfrog method). While this is a complex computation, incorporating gradient information dramatically improves performance over vanilla random-walk style MH moves (such as Gaussian drift kernels), and its statistical efficiency also scales much better with dimensionality than simpler methods [15]. AD can also compute higher-order derivatives. For example, Hessian matrices can be used to construct blocked Metropolis moves [9] or proposals based on Newton's method [19], or as part of Riemannian manifold methods [5].

3.2 Experiments and Results

We implemented HMC by extending BHER [28], a lightweight implementation of the CHURCH language which provides simple, but universal, MH inference. We used an implementation of AD based on [17] that uses hygienic operator overloading to do both forward and reverse mode AD for Scheme (the target language of the BHER compiler).

The goal is to compute ∇_x p(x). By Eq. 1, p(x) is the product of the individual choices made by each x_i (though each probability can depend on previous choices, through the program evaluation). To compute p(x), BHER executes the corresponding program, accumulating likelihoods. Each time a continuous ERP is created or retrieved, we wrap it in a "tape" object which is used to track gradient information; as the likelihood p(x) is computed, these tapes flow through the program and through appropriately overloaded operators, resulting in a dependency graph for the real portion of the computation. The gradient is then computed in reverse mode, by "back-propagating" along this graph.

We implement an HMC kernel by using this gradient in the leapfrog integrator. Since program states may contain a combination of discrete and continuous ERPs, we use an overall cycle kernel which alternates between a standard MH kernel for individual discrete random variables and the HMC kernel for all continuous random choices. To decrease burn-in time, we initialize the sampler by using annealed gradient ascent (again implemented using AD).
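For concreteness, the following sketch shows one HMC transition in the style of Alg. 2 for a fixed-length vector of continuous choices. It assumes standard-normal momentum and user-supplied function handles logp and grad_logp (in the setting above the gradient would come from AD), and it is not BHER's cycle kernel.

    % One Hamiltonian Monte Carlo transition (cf. Alg. 2), assuming standard normal
    % momentum so that H(x, m) = -log p(x) + 0.5 * m' * m.  logp and grad_logp are
    % function handles; stepsize and nsteps control the leapfrog integration.
    function x = hmc_step(x, logp, grad_logp, stepsize, nsteps)
        m = randn(size(x));                                      % Gibbs step: resample momentum
        [x2, m2] = leapfrog(x, m, grad_logp, stepsize, nsteps);  % simulate Hamiltonian dynamics
        H_old = -logp(x)  + 0.5 * (m(:)'  * m(:));
        H_new = -logp(x2) + 0.5 * (m2(:)' * m2(:));
        if rand() < exp(H_old - H_new)                           % Metropolis accept/reject
            x = x2;
        end
    end

    function [x, m] = leapfrog(x, m, grad_logp, stepsize, nsteps)
        % symplectic (volume-preserving) leapfrog integration of Hamiltonian dynamics
        m = m + 0.5 * stepsize * grad_logp(x);
        for i = 1:nsteps
            x = x + stepsize * m;
            if i < nsteps
                m = m + stepsize * grad_logp(x);
            end
        end
        m = m + 0.5 * stepsize * grad_logp(x);
    end

For instance, x = hmc_step(x, @(z) -0.5*sum(z.^2), @(z) -z, 0.1, 20) leaves a standard Gaussian invariant.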
We ran two sets of experiments that illustrate two different benefits of HMC with AD: automated gradients of complex code, and good statistical efficiency.

Structured Perlin noise generation. Our first experiment uses HMC to generate modified Perlin noise with soft symmetry structure. Perlin noise is a procedural texture used by computer graphics artists to add realism to natural textures such as clouds, grass or tree bark. We generate Perlin-like noise by layering octaves of random but smoothly varying functions. We condition the result on approximate diagonal symmetry, forcing the resulting image to incorporate additional structure without otherwise skewing the statistics of the image. Note that the MAP solution for this problem is uninteresting, as it is a uniform image; it is the variations around the MAP that provide rich texture. We generated 48x48 images; the model had roughly 1000 variables.

Fig. 2 shows the result via typical samples generated by HMC, where the approximate symmetry is clearly visible. A code snippet demonstrating the complexity of the calculations is shown in Fig. 1; this experiment illustrates how the automatic nature of the gradients is most helpful, as it would be time-consuming to compute these gradients by hand, particularly since we are free to condition using any function of the image.

Figure 2: On the left: samples from the structured Perlin noise generator (approximate diagonal symmetry). On the right: convergence of the expected mean for a draw from a 3D spherical Gaussian conditioned on lying on a line (x-axis: samples; y-axis: distance to true expectation; curves: MCMC and HMC).

Complex conditioning. For our second example, we demonstrate the improved statistical efficiency of the samples generated by HMC versus BHER's standard MCMC algorithm. The goal is to sample points from a complex 3D distribution, defined by starting with a Gaussian prior, and sampling points that are noisily conditioned to be on a line running through R^3. This creates complex interactions with the prior to yield a smooth, but strongly coupled, energy landscape. The model is:

Normal distribution noisily conditioned on line (2D projection)
1: x ∼ N(µ, σ)
2: k ∼ Bernoulli(e^{−dist(line, x)/noise})
3: Condition on k = 1

Fig. 2 compares our HMC implementation with BHER's standard MCMC engine. The x-axis denotes samples, while the y-axis denotes the convergence of an estimator of certain marginal statistics of the samples. We see that this estimator converges much faster for HMC, implying that the samples which are generated are less autocorrelated, affirming that HMC is indeed making better distal moves. HMC is about 5x slower than MCMC for this experiment, but the overhead is justified by the significant improvement in the statistical quality of the samples.
4 Provenance Tracking for Fine-Grained Dynamic Dependency Analysis

One reason gradient-based inference algorithms are effective is that the chain rule of derivatives propagates information backwards from the data up to the proposal variables. But gradients, and the chain rule, are only defined for continuous variables. Is there a corresponding structure for discrete choices? We now introduce a new nonstandard interpretation based on provenance tracking (PT). In programming language theory, the provenance of a variable is the history of variables and computations that combined to form its value. We use this idea to track fine-grained dependency information between random values and intermediate computations as they combine to form a likelihood. We then use this provenance information to construct good global proposals for discrete variables as part of a novel factored multiple-try MCMC algorithm.

4.1 Defining and Implementing Provenance Tracking

Like AD, PT can be implemented with operator overloading. Because provenance information is much coarser than gradient information, the operators in PT objects have a particularly simple form; most program expressions can be covered by considering a few cases. Let X denote the set {x_i} of all (not necessarily random) variables in a program. Let R(x) ⊂ X define the provenance of a variable x. Given R(x), the provenance of expressions involving x can be computed by breaking
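The flavor of this case analysis can be sketched with the same overloading pattern used for AD: in the illustrative class below (the name PValue and the choice of overloaded operations are assumptions, not the paper's implementation), each value carries the set R of identifiers of the random choices that produced it, and each overloaded operation returns the union of its operands' provenance sets.

    % PValue.m -- a minimal provenance-tracking sketch via operator overloading.
    % Each value carries prov, a cell array of names of the random choices in R(x).
    classdef PValue
        properties
            val           % underlying value
            prov = {}     % provenance: identifiers of contributing random choices
        end
        methods
            function obj = PValue(v, prov)
                obj.val = v;
                if nargin > 1, obj.prov = prov; end
            end
            function c = plus(a, b)
                [a, b] = PValue.lift(a, b);
                c = PValue(a.val + b.val, union(a.prov, b.prov));
            end
            function c = mtimes(a, b)
                [a, b] = PValue.lift(a, b);
                c = PValue(a.val * b.val, union(a.prov, b.prov));
            end
        end
        methods (Static)
            function [a, b] = lift(a, b)   % constants carry empty provenance
                if ~isa(a, 'PValue'), a = PValue(a); end
                if ~isa(b, 'PValue'), b = PValue(b); end
            end
        end
    end

For instance, with x1 = PValue(rand(), {'x1'}) and x2 = PValue(randn(), {'x2'}), computing y = x1 * x2 + 3 yields y.prov equal to {'x1', 'x2'}, recording that the intermediate value depends on both random choices.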