Lecture 23: Probabilistic Programming Languages

Logistics:

  • slides for this lecture can be found here.
  • Today is a guest lecture by Sam Stites (me!)
  • I’m a Ph.D. Student working with Steven in the NeuPPL lab (under the PRL umbrella)
    • PRL research: racket development, semantics of PL, how languages interoperate, code LLMs, how software gets written.
    • NeuPPL: “A pretty cool PPL lab.” (more on what a PPL is today!)

      Continuations Recap

  • up till this point you’ve been working on continuation passing and using handlers.
  • An important aspect of CPS is that it is a way to model computation that is both simple and generalizable to a wide range of practical effects
  • Effects seen so far include exception handling, variable assignment, and general control flow; continuations can also encode printing, stateful computation, non-determinism, and modeling uncertainty.
  • Today we are going to look at modeling uncertainty, but we are going to backtrack a little and double down on the domain, as opposed to a CPS-style interpreter.

    Languages of Uncertainty

  • PL applications are ubiquitous. Outside of the general-purpose programming you are familiar with: interactive theorem provers, cryptography, networking, graphics, and even the law (see the Prolala workshop). One way we apply language design is by providing domain-specific languages that codify safety guarantees so that their users can’t perform undefined behavior.
  • Today we are going to talk about encoding a language for statisticians and data scientists.
  • At the core of these fields is reasoning about probability distributions (ie: making a model of what is going on in the world) and encoding evidence into that model (ie: data sets and things that are “observable”).
  • We’ll design a probabilistic programming language, which is a language-oriented approach to formalizing this informal mathematical language.
  • One of the big selling points of PPLs is that they democratize machine learning: they automate components of both the statistician’s job (some of the math they do) and the engineer’s job (operationalizing the modeling software).
  • To keep things simple, we’re going to stick to discrete probability and model probabilities exactly. Depending on how the lecture goes I can talk about what happens in the continuous setting with approximation, challenges, extensions to the language, or we can do a CPS translation.

    Uncertainty

  • Before, you saw programs whose outputs are fully determined by their inputs and environments. For instance, here’s a straightforward program in calc:
    let x = 3 in
    let y = 4 in
    return x + y
    
  • We will now look at a probabilistic program in a language we will call disc. In this setting we are returning a distribution over the outputs of the program. See, for instance, this example code that simulates flipping two coins:
    let x = flip 1/3 in
    let y = flip 1/4 in
    return x \/ y
    
  • A lot is happening here! We need to understand a few things:
    • flips: what is a distribution?
      • intuitively this is a “coin flip” but with a bias towards “heads” or “tails” (represented by true and false).
      • flip 1/3 is flipping a coin with a 1/3 probability of coming up heads, 2/3 probability of coming up tails (you can make any coin biased by bending it into a U-shape).
      • A distribution for some type associates its inhabitants with probabilities. The type of a distribution is dist : value -> [0, 1], where [0, 1] is a float between 0 and 1.
      • in the exact setting you can think of this type as an association map lookup. For flip 1/3 : value -> [0,1], with value being bool, you can imagine the distribution as a map {true = 1/3; false = 2/3} and we are just looking up keys of Booleans.
    • the program returns a disjunction of x and y (logical “or”).
      • so everything up to the return is a model of the behavior which includes probabilistic events and control flow.
      • the return itself is how we ask a query about the model: in this case we want the probability of “one of the coins coming up heads”.
    • We can model all possible outcomes of the above in a truth table, which also touches on probabilistic independence.
      • let’s start by looking at x || y represented via truth table:
x  |  y  | x \/ y
------------------
T  |  T  |    T         <-,
T  |  F  |    T         <-|--- these are things we want to query in the return
F  |  T  |    T         <-'
F  |  F  |    F 

We can annotate this table with probabilities to compute the weight of each row:

       P(x)       |      P(y)       |   Pr(x, y)
------------------------------------------------------
  Pr(x=T) = 1/3   |  Pr(y=T) = 1/4  |     1/12        <-,
  Pr(x=T) = 1/3   |  Pr(y=F) = 3/4  |     3/12        <-|--- sum these to get the probability: 6/12
  Pr(x=F) = 2/3   |  Pr(y=T) = 1/4  |     2/12        <-'
  Pr(x=F) = 2/3   |  Pr(y=F) = 3/4  |     6/12

Note that the coin flips are independent of each other: when we ask a question like “what is the probability of both x and y occurring at the same time?” we multiply the weights.
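The truth-table arithmetic above can be spelled out directly in OCaml (a sketch; `px`, `py`, and `joint` are illustrative names, not part of the language we build later): independence means a row’s joint weight is the product of the marginals, and the query sums the rows where x \/ y holds.

```ocaml
(* Marginal distributions of the two example flips *)
let px v = if v then 1.0 /. 3.0 else 2.0 /. 3.0   (* flip 1/3 *)
let py v = if v then 1.0 /. 4.0 else 3.0 /. 4.0   (* flip 1/4 *)

(* independence: the joint weight of a row is the product of the marginals *)
let joint x y = px x *. py y

(* sum the rows of the truth table where x \/ y is true:
   1/12 + 3/12 + 2/12 = 6/12 *)
let p_query =
  List.fold_left
    (fun acc (x, y) -> if x || y then acc +. joint x y else acc)
    0.0
    [ (true, true); (true, false); (false, true); (false, false) ]
```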

  • again: modeling, intuitively, is when we want to encode some data generating process into a program – it doesn’t mean that it is the “right” program, but it does mean that it should be able to encode observable information from the world into the model to better reflect the “true” data generating process.
  • Our two coinflips don’t encode any information, so if this is our program, we already know everything that is going on in the system. What if, instead, we observe that at least one of the coins came up heads? If this happens, then we can update the probabilities of x and y to better reflect what is going on:
let x = flip 1/3 in
let y = flip 1/4 in
observe x \/ y
return x

It’s as if we drew a line through the last row of our annotated truth table (which I can’t do in html, unfortunately):

   P(x)       |      P(y)       |   Pr(x, y)
--------------+-----------------+------------   .------- these are what we are querying for: x=T (they sum to 4/12)
Pr(x=T) = 1/3 |  Pr(y=T) = 1/4  |     1/12   <,/  <-,
Pr(x=T) = 1/3 |  Pr(y=F) = 3/4  |     3/12   <'   <-|--- sum these to get the normalizing constant: 6/12
Pr(x=F) = 2/3 |  Pr(y=T) = 1/4  |     2/12        <-'
Pr(x=F) = 2/3 |  Pr(y=F) = 3/4  |     6/12        <- this row is now impossible

Dividing the query weight by the normalizing constant gives Pr(x=T | x \/ y) = (4/12) / (6/12) = 2/3.
  • We have constructed the conditional probability of x, given that we’ve seen that x || y; the result is the exact distribution after incorporating the evidence: the posterior distribution.
  • intuitively, when we run a traditional program we go from “parameters -> program -> output” but in the language of statistics we go from “output -> model -> parameters”
  • Recall that the model of the program is two unfair coin-flips (you get these by bending the coins). What we are saying is that the model better reflects the evidence if we update x’s probability from 1/3 to 2/3.
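The conditioning step can be sketched the same way (again with illustrative names like `numer` and `denom`, not lecture API): strike out the rows inconsistent with the evidence x \/ y, then renormalize.

```ocaml
(* Marginals of the two flips from the observe example *)
let px v = if v then 1.0 /. 3.0 else 2.0 /. 3.0   (* flip 1/3 *)
let py v = if v then 1.0 /. 4.0 else 3.0 /. 4.0   (* flip 1/4 *)
let rows = [ (true, true); (true, false); (false, true); (false, false) ]

(* unnormalized weight of the query x=T, restricted to rows where x \/ y:
   1/12 + 3/12 = 4/12 *)
let numer =
  List.fold_left (fun acc (x, y) ->
    if (x || y) && x then acc +. (px x *. py y) else acc) 0.0 rows

(* normalizing constant: total weight of the surviving rows, 6/12 *)
let denom =
  List.fold_left (fun acc (x, y) ->
    if x || y then acc +. (px x *. py y) else acc) 0.0 rows

let posterior = numer /. denom   (* (4/12) / (6/12) = 2/3 *)
```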

an example distribution

  • We want to take the intuition from above and define some rules which let us operationally reason about exact probability.
  • Recall the type of a distribution: dist : value -> [0, 1]
  • we want our program to return a distribution so that when we query it, we can get out the probability of some event happening (in this case the boolean formula at the end of the program). Let’s imagine we have a program \(e\), then we would like to have our semantics turn this into a distribution:

    \[[\![e]\!] : \texttt{value} \to [0, 1]\]
  • let’s look at an example of a simple flip program: flip 1/3
  • we want this to represent the probability that something is true. So the output of this program will be bool -> [0, 1]. We could represent this with a data structure (as mentioned before): fun val -> {true = 1/3; false = 2/3}[val], but this won’t work with observe statements. To fix this, we’ll break things into two steps, where we first find the unnormalized distribution (ie: a probability which doesn’t add up to 1.0). We can always normalize a distribution by enumerating and summing over all possible outcomes (the law of total probability):
\[[\![e]\!]_{\text{norm}}(T) = \frac{[\![e]\!](T)}{[\![e]\!](T) + [\![e]\!](F) }\]

Now, our semantics of flip will just return the probability of an event occurring:

\[[\![\texttt{flip}~1/3]\!](v) = \begin{cases} 1/3& \quad \text{if }v = T\\ 2/3& \quad \text{if }v = F\\ \end{cases}\]

Which still gets us what we need when we want to find the normalized probability distribution:

\[[\![\texttt{flip}~1/3]\!]_{\text{norm}}(T) = \frac{[\![\texttt{flip}~1/3]\!](T)}{[\![\texttt{flip}~1/3]\!](T) + [\![\texttt{flip}~1/3]\!](F) }=\frac{1/3}{1/3 + 2/3}\]
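As a small OCaml sketch of these two steps (`flip_dist` and `normalize` are illustrative helper names, not part of the interpreter we build below): the unnormalized distribution is just a function from values to weights, and normalization divides by the sum over all outcomes.

```ocaml
(* flip theta as an unnormalized distribution bool -> [0, 1] *)
let flip_dist (theta : float) : bool -> float =
  fun v -> if v then theta else 1.0 -. theta

(* normalize by the law of total probability: divide by the sum of the
   weights of all outcomes (here just true and false) *)
let normalize (d : bool -> float) : bool -> float =
  let total = d true +. d false in
  fun v -> d v /. total

let d = flip_dist (1.0 /. 3.0)
let p = normalize d true   (* (1/3) / (1/3 + 2/3) = 1/3 *)
```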

semantics

  • let’s go over the rest of our semantics now
\[[\![\texttt{flip}~\theta]\!](v) = \begin{cases} \theta& \quad \text{if }v = T\\ 1-\theta& \quad \text{if }v = F\\ \end{cases}\]

\[[\![\texttt{return}~e]\!](v) = \begin{cases} 1\quad& \text{if }v = [\![e]\!]\\ 0\quad& \text{otherwise}\\ \end{cases}\]

\[[\![x \leftarrow e_1; e_2]\!](v) = \sum_{v'} [\![{e_1}]\!](v') \times [\![{e_2[x \mapsto v']}]\!](v)\]

\[[\![\texttt{observe}~e_1; e_2]\!](v) = \begin{cases} [\![e_2]\!](v)\quad& \text{if }[\![e_1]\!] = T\\ 0\quad& \text{otherwise}\\ \end{cases}\]
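To see the bind rule in action, we can unroll the semantics on the two-coin program from earlier; the result agrees with the truth-table calculation:

\[\begin{aligned}
[\![x \leftarrow \texttt{flip}~1/3;~y \leftarrow \texttt{flip}~1/4;~\texttt{return}~x \vee y]\!](T)
&= \sum_{v'} [\![\texttt{flip}~1/3]\!](v') \times [\![y \leftarrow \texttt{flip}~1/4;~\texttt{return}~v' \vee y]\!](T)\\
&= \tfrac{1}{3}\left(\tfrac{1}{4}\cdot 1 + \tfrac{3}{4}\cdot 1\right) + \tfrac{2}{3}\left(\tfrac{1}{4}\cdot 1 + \tfrac{3}{4}\cdot 0\right)\\
&= \tfrac{1}{3} + \tfrac{2}{12} = \tfrac{6}{12}
\end{aligned}\]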

A probabilistic recursive-style interpreter

  • we’re going to stick to our recursive-style interpreter today.
  • as a warm-up we’ll write a little Boolean language. this is a straightforward, deterministic language, which evaluates a Boolean formula.

```ocaml
exception Runtime of string

type expr =
  | Bool of bool
  | And  of expr * expr
  | Or   of expr * expr
  | Let  of string * expr * expr
  | Var  of string

(* e[x |-> e'] *)
let rec subst (e : expr) (x : string) (e' : expr) : expr =
  match e with
  | And(e1, e2)    -> And(subst e1 x e', subst e2 x e')
  | Or(e1, e2)     -> Or(subst e1 x e', subst e2 x e')
  | Let(s, e1, e2) ->
    if s = x then Let(s, e1, e2)
    else Let(s, subst e1 x e', subst e2 x e')
  | Var(s)         -> if s = x then e' else Var(s)
  | e              -> e

let rec interp (e : expr) : bool =
  match e with
  | Bool(b)     -> b
  | Var(_)      -> raise (Runtime "unbound")
  | And(e1, e2) -> interp e1 && interp e2
  | Or(e1, e2)  -> interp e1 || interp e2
  | Let(v, bound, body) -> interp (subst body v bound)

let () =
  assert (let expected = false in
          let actual = interp (Let("x", Bool(true),
                               Let("x", Bool(false),
                               Var("x")))) in
          expected = actual);
  assert (let expected = false in
          let actual = interp (Let("x", Let("x", Bool(true), Var("x")),
                               Let("y", Bool(false),
                               And(Var("x"), Var("y"))))) in
          expected = actual);
  assert (let expected = true in
          let actual = interp (Let("x", Bool(true),
                               Let("y", Bool(false),
                               Or(Var("x"), Var("y"))))) in
          expected = actual)
```


  • Now we add some probabilistic terms:

```ocaml
type expr =
  | Bool of bool
  | And  of expr * expr
  | Or   of expr * expr
  (* two new probabilistic expressions *)
  | Observe of expr * expr
  | Flip    of float
  (* expressions lifted into a distribution context *)
  | Let of string * expr * expr
  | Var of string
  | Ret of expr

(* e[x |-> e'] *)
let rec subst (e : expr) (x : string) (e' : expr) : expr =
  match e with
  | And(e1, e2)     -> And(subst e1 x e', subst e2 x e')
  | Or(e1, e2)      -> Or(subst e1 x e', subst e2 x e')
  | Let(s, e1, e2)  ->
    if s = x then Let(s, e1, e2)
    else Let(s, subst e1 x e', subst e2 x e')
  | Var(s)          -> if s = x then e' else Var(s)
  | Observe(e1, e2) -> Observe(subst e1 x e', subst e2 x e')
  | Ret(e)          -> Ret(subst e x e')
  | e               -> e
```

  • we still need to be able to evaluate a formula, but because of substitution there won’t be any variables.

```ocaml
let rec eval (e:expr) : bool =
  match e with
  | Bool(b) -> b
  | And(e1, e2) -> eval(e1) && eval(e2)
  | Or(e1, e2) -> eval(e1) || eval(e2)
  | _ -> raise(Runtime("not a boolean expression!"))
```
  • we can directly encode our semantics, which is a straightforward translation:
    let rec interp (e:expr) (v : bool): float =
    match e with
    | Var(x) -> raise(Runtime("unbound"))
    | Flip(p) -> if v then p else 1.0 -. p
    | Ret(e) -> if v = eval(e) then 1.0 else 0.0
    | Let(x, bound, exp) ->
      let btrue = interp bound true in
      let etrue = interp (subst exp x (Bool true)) v in
      let bfalse = interp bound false in
      let efalse = interp (subst exp x (Bool false)) v in
      btrue *. etrue +. bfalse *. efalse
    | Observe(e1, e2) -> if eval(e1) = true then interp e2 v else 0.0
    | _ -> raise(Runtime("needs to be lifted into a return!"))
    
  • note that instead of returning a distribution, we rely on currying: the query event v is available directly in the interpreter.
  • normalizing is just enumeration over Booleans:
    let normalized (e:expr) (q : bool) : float = (interp e q) /. ((interp e true) +. (interp e false))
    
  • and we can write some tests to validate this.
    let approx a e = Float.abs (a -. e) < 0.00000001
    let ptrue (e:expr) : float = normalized e true
    let tests =
    assert (let expected = 0.6 in
            let actual = ptrue(Flip(0.6)) in
            approx expected actual);
    assert (let expected = 0.6 in
            let actual = ptrue(Let("x", Flip(0.6), Ret(Var("x")))) in
            approx expected actual);
    assert (let expected = 0.68 in
            let actual = ptrue(
                Let("x", Flip(0.6),
                Let("y", Flip(0.2),
                Ret(Or(Var("x"), Var("y")))))) in
            approx expected actual);
    assert (let expected = 0.5 in
            let actual = ptrue(
                Let("x", Flip(0.2),
                Let("y", Flip(0.25),
                Observe(Or(Var("x"), Var("y")),
                Ret(Var("x"))
                )))) in
            approx expected actual);
    assert (let expected = 0.833333333 in
            let actual = ptrue(
                Let("x", Flip(0.6),
                Let("y", Flip(0.3),
                Observe(Or(Var("x"), Var("y")),
                Ret(Var("x"))
                )))) in
            approx expected actual);;
    
  • P.S. we also touched on if statements. Extending this interpreter with if-statements looks like this:
type expr =
  ...
  | If      of expr * expr * expr

let rec subst (e : expr) (x : string) (e' : expr) : expr = match e with
  ... 
  | If(g, e1, e2)   -> If(subst g x e', subst e1 x e', subst e2 x e')

let rec interp (e:expr) (v : bool): float =
  match e with
  ...
  | If(guard, truthy, falsey) ->
    let gtrue  = interp (Ret(guard)) true in
    let gfalse = interp (Ret(guard)) false in
    let ptrue  = interp truthy v in
    let pfalse = interp falsey v in
    gtrue *. ptrue +. gfalse *. pfalse
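To check the If rule end-to-end, here is a self-contained sketch (only the constructors needed for this example are reimplemented; names follow the lecture): the guard is scored as a 0/1 Ret, so If mixes the two branch distributions by the guard’s weight.

```ocaml
exception Runtime of string

type expr =
  | Bool of bool
  | Var  of string
  | Flip of float
  | Ret  of expr
  | Let  of string * expr * expr
  | If   of expr * expr * expr

let rec subst (e : expr) (x : string) (e' : expr) : expr =
  match e with
  | Let(s, e1, e2) ->
    if s = x then Let(s, e1, e2)
    else Let(s, subst e1 x e', subst e2 x e')
  | Var(s)        -> if s = x then e' else Var(s)
  | Ret(e)        -> Ret(subst e x e')
  | If(g, e1, e2) -> If(subst g x e', subst e1 x e', subst e2 x e')
  | e             -> e

let eval (e : expr) : bool =
  match e with
  | Bool(b) -> b
  | _       -> raise (Runtime "not a boolean expression!")

let rec interp (e : expr) (v : bool) : float =
  match e with
  | Flip(p) -> if v then p else 1.0 -. p
  | Ret(e)  -> if v = eval e then 1.0 else 0.0
  | Let(x, bound, body) ->
    (interp bound true  *. interp (subst body x (Bool true))  v)
    +. (interp bound false *. interp (subst body x (Bool false)) v)
  | If(g, t, f) ->
    (* the guard is pure after substitution, so it scores as 0 or 1 *)
    (interp (Ret g) true *. interp t v)
    +. (interp (Ret g) false *. interp f v)
  | _ -> raise (Runtime "unhandled")

let normalized (e : expr) (q : bool) : float =
  interp e q /. (interp e true +. interp e false)

(* if x then surely true, else a fair coin:
   0.3 *. 1.0 +. 0.7 *. 0.5 = 0.65 *)
let p =
  normalized
    (Let("x", Flip(0.3), If(Var("x"), Ret(Bool true), Flip(0.5))))
    true
```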