  • Outcomes for today:
    • You should be able to identify if a language is memory safe, and be able to write programs that demonstrate memory safety errors in languages that are not memory safe.
    • You should understand what the purpose of a garbage-collector is and how to implement a simple tracing-based garbage collection scheme.

Micro-C and memory safety

  • Today our goal is to study memory safety. To do this, we will introduce a tiny C-like language with free and malloc:
type microc =
  | Let of { id: string; binding: microc; body: microc }
  | Var of string
  | Malloc
  | Free of microc
  | Set of {loc: microc; value: microc}
  | Deref of {loc: microc}
  | Num of int
  • The semantics of Micro-C are given as follows:
(* The state of a Micro-C program is similar to that of the heap-manipulating programs we've seen so far,
 except that we now maintain a **free-list** that tracks which portions of the heap are occupied. *)
                env   heap            freelist
let x = 10 in x, [],  [-1, -1, ...],  [Free, Free, Free, ...]
--> x, [x ↦ 10], [-1, -1, ...], [Free, Free, Free, ...]
--> 10

(* malloc generates a fresh (unused) location but does *not* initialize the memory and sets that
   location to occupied in the free-list *)
malloc, [], [-1, -1, ...], [Free, Free, Free, ...]
--> loc 0x0, [], [-1, -1, ...], [Occupied, Free, Free, ...]

(* free deallocates a location to make it available for later. evaluates to the
   number 0, for completeness *)
free 0x0, [], [-1, -1, ...], [Occupied, Free, Free, ...]
--> 0, [], [-1, -1, ...], [Free, Free, Free, ...]

(* set updates a location in memory, returns 0 for completeness *)
set 0x0 10, [], [-1, -1, ...], [Free, Free, Free, ...]
--> 0, [], [10, -1, ...], [Free, Free, Free, ...]

(* deref fetches a value stored in the heap *)
deref 0x0, [], [-1, -1, ...], [Free, Free, Free, ...]
--> -1
  • Given these semantics, we can make an interpreter:
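let heap_size = 100

(* runtime values: numbers and heap locations *)
type value = VNum of int | VLoc of int

(* free-list entries and the heap record; these definitions are assumed from how they are used below *)
type status = Free | Occupied
type heap = { free_list : status array; heap : value array }

let empty_heap () = { free_list = Array.make heap_size Free; heap = Array.make heap_size (VNum (-1))}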

module StringMap = Map.Make(String)

type microcenv = value StringMap.t

(* out of memory exception *)
exception Oom

(* runtime error exception, e.g. an unbound variable or a number used as a location *)
exception Runtime

(* find a free index in the free_list or raise an Oom exception *)
let find_free_idx (h:heap) : int =
  let rec helper (h:heap) (idx:int) : int =
    if (idx >= Array.length h.free_list) then raise Oom else
      if (Array.get h.free_list idx) = Free then idx else helper h (idx + 1) in
  helper h 0

let rec interp_c (heap:heap) (env: microcenv) (microc:microc) : value =
  match microc with
  | Let {id; binding; body} ->
    let bindv = interp_c heap env binding in
    let new_env = StringMap.add id bindv env in
    interp_c heap new_env body
  | Var(s) ->
    (match StringMap.find_opt s env with
     | Some(v) -> v
     | None -> raise Runtime)
  | Malloc ->
    let idx = find_free_idx heap in
    (* update free list *)
    Array.set heap.free_list idx Occupied;
    (* evaluate to the freshly allocated location *)
    VLoc(idx)
  | Set {loc; value} ->
    let l = (match interp_c heap env loc with
          VLoc(v) -> v
        | VNum(_) -> raise Runtime) in
    let v = interp_c heap env value in
    Array.set heap.heap l v;
    VNum(0)
  | Deref { loc } ->
    let l = (match interp_c heap env loc with
       VLoc(v) -> v
     | VNum(_) -> raise Runtime) in
    Array.get heap.heap l
  | Num(n) -> (VNum(n))
  | Free(exp) ->
    let l = (match interp_c heap env exp with
          VLoc(v) -> v
        | VNum(_) -> raise Runtime) in
    Array.set heap.free_list l Free;
    VNum(0)
  • Notice the semantics of Micro-C do not prevent us from doing the following:
    • Reading from memory before it is initialized.
    • Writing to memory that has not been allocated.
    • Freeing the same memory location more than once.
    • Continuing to use a memory location after it has been freed.
  • Broadly speaking, the above kinds of errors are called memory safety errors: errors that occur due to accessing or updating invalid regions of memory.
    • Memory safety is a critical component of a software system, and failure to maintain memory safety is the source of countless bugs and woes. See here and here for more details.
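  • As a sketch of how such errors look in practice, here are four small Micro-C programs run through a compact version of the interpreter. The `status` type, the helpers (`seq`, `with_ptr`, `free_p`, `write`, `read`, `run`) and the four test programs are assumptions made for illustration; they are not part of the original notes:

```ocaml
(* Micro-C syntax, as defined at the top of these notes. *)
type microc =
  | Let of { id : string; binding : microc; body : microc }
  | Var of string
  | Malloc
  | Free of microc
  | Set of { loc : microc; value : microc }
  | Deref of { loc : microc }
  | Num of int

(* Assumed runtime types: the interpreter uses these names without defining them. *)
type value = VNum of int | VLoc of int
type status = Free | Occupied
type heap = { free_list : status array; heap : value array }

exception Oom
exception Runtime
module StringMap = Map.Make (String)

let find_free_idx (h : heap) : int =
  let rec go idx =
    if idx >= Array.length h.free_list then raise Oom
    else if h.free_list.(idx) = Free then idx
    else go (idx + 1) in
  go 0

let rec interp (h : heap) (env : value StringMap.t) (e : microc) : value =
  let loc_of e' =
    match interp h env e' with VLoc l -> l | VNum _ -> raise Runtime in
  match e with
  | Num n -> VNum n
  | Var x ->
    (match StringMap.find_opt x env with Some v -> v | None -> raise Runtime)
  | Let { id; binding; body } ->
    let v = interp h env binding in
    interp h (StringMap.add id v env) body
  | Malloc ->
    let idx = find_free_idx h in
    h.free_list.(idx) <- Occupied;
    VLoc idx
  | Free e' ->
    h.free_list.(loc_of e') <- Free;
    VNum 0
  | Set { loc; value } ->
    let l = loc_of loc in
    h.heap.(l) <- interp h env value;
    VNum 0
  | Deref { loc } -> h.heap.(loc_of loc)

(* Sequencing helper: evaluate a, discard its result, then evaluate b. *)
let seq (a : microc) (b : microc) : microc = Let { id = "_"; binding = a; body = b }
(* Bind p to a freshly malloc'd location around body. *)
let with_ptr (body : microc) : microc = Let { id = "p"; binding = Malloc; body }
let free_p : microc = Free (Var "p")
let write (x : string) (n : int) : microc = Set { loc = Var x; value = Num n }
let read (x : string) : microc = Deref { loc = Var x }

(* 1. Read before initialization: yields whatever the cell already held. *)
let uninit_read = with_ptr (read "p")
(* 2. Use after free: the read still succeeds once p has been freed. *)
let use_after_free = with_ptr (seq (write "p" 7) (seq free_p (read "p")))
(* 3. Double free: the second free is silently accepted. *)
let double_free = with_ptr (seq free_p free_p)
(* 4. Dangling alias: q reuses p's freed cell, so a write through p clobbers q. *)
let dangling_alias =
  with_ptr (seq free_p
    (Let { id = "q"; binding = Malloc;
           body = seq (write "q" 1) (seq (write "p" 99) (read "q")) }))

(* Run a program on a fresh 100-cell heap with an empty environment. *)
let run (p : microc) : value =
  let h = { free_list = Array.make 100 Free; heap = Array.make 100 (VNum (-1)) } in
  interp h StringMap.empty p
```

  • Under these assumptions none of the four programs raises an exception: `run uninit_read` evaluates to `VNum (-1)`, `run use_after_free` to `VNum 7`, `run double_free` to `VNum 0`, and `run dangling_alias` to `VNum 99` even though the program only ever wrote 1 through q.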
  • It may seem unbelievable, but real C code does not prevent any of these kinds of errors!
  • Look at the following C program (use this website to run it):
#include <stdio.h>

int main() {
    int x = 10;
    int z = 20;
    int* y = &x;
    /* writes one int past x: undefined behavior that may overwrite z */
    *(y+1) = 5;
    printf("x: %d, z: %d", x, z);

    return 0;
}
  • What do you think this program prints? The answer will probably surprise you!
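    • It can print x: 10, z: 5!!! The value for z was changed even though we never explicitly created a pointer to z. Strictly speaking, writing through y+1 is undefined behavior in C, so the exact output depends on the compiler and on how x and z happen to be laid out on the stack.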
  • If memory-unsafe operations are so dangerous, why do our languages permit them in the first place?
    • By far the primary reason is performance: sometimes it is critical to do low-level memory updates to speed up certain algorithms or computations.

Compiling with memory safety

  • So, how do we design languages with dynamic allocation and pointers that avoid memory errors? This is a language design problem.
    • We call languages that prevent all memory-safety errors memory-safe languages. This is a broad definition and is difficult to make precise (and many purportedly memory-safe languages do permit ways of performing memory-unsafe computation)
  • To study this question, we will study how to compile a memory-safe language into a memory-unsafe language and the challenges involved with doing that.
  • Here is our tiny memory safe language:
type ast =
    Let of  string * ast * ast
  | Unbox of ast
  | Box of ast
  | Set of string * ast
  | Var of string
  | Num of int
  • What makes it a memory safe language?
    • Type-system (not shown, but assume it has the usual type-system for mutable references we’ve seen in class) prevents unboxing of non-references
    • No ability to manipulate pointers to dereference invalid memory addresses
    • Syntactic structure of box requires providing an initial value for heap allocation
  • A simple compiler:
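(* generate a fresh variable name from a counter: "fresh0", "fresh1", ...
   (helper assumed by the compiler; its output matches the example below) *)
let fresh_name (f:int ref) : string =
  let n = !f in
  f := n + 1;
  "fresh" ^ string_of_int n

let rec compile_leaky (ast:ast) (f:int ref) : microc =
  match ast with
  | Box(e) ->
    let e_v = compile_leaky e f in
    (* allocate a fresh location, store the compiled value in it, and return the location *)
    let loc_name = fresh_name f in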
    Let{id=loc_name; binding=Malloc;
       body = Let {id="_"; binding=Set { loc=Var(loc_name); value=e_v };
          body=Var(loc_name) }}
  | Num(n) -> Num(n)
  | Var(s) -> Var(s)
  | Let(id, binding, body) ->
    Let {id=id; binding=compile_leaky binding f; body= compile_leaky body f}
  | Set(loc, value) ->
    Set { loc = Var(loc); value = compile_leaky value f }
  | Unbox(e) ->
    Deref { loc = compile_leaky e f }
  • Example outputs:
> compile_leaky (Let("x", Box(Num(10)), Unbox(Var("x")))) (ref 0);;
- : microc =
Let
 {id = "x";
  binding =
   Let
    {id = "fresh0"; binding = Malloc;
     body =
      Let
       {id = "_"; binding = Set {loc = Var "fresh0"; value = Num 10};
        body = Var "fresh0"}};
  body = Deref {loc = Var "x"}}
  • Notice: this compilation procedure only ever allocates memory, it never frees it! Memory that is allocated but never freed is called a memory leak.
    • Memory leaks are not generally regarded as memory safety errors since they generally do not lead to dangerous systems behavior; they mostly affect performance (and, if you eventually run out of memory due to a memory leak, they can cause your application or even system to crash)
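  • To make the leak concrete, here is a minimal sketch that compiles a chain of nested lets with the leaky compiler and runs it on a 100-cell heap. The runtime types, `fresh_name`, and the helpers `churn`, `leaked` and `runs_out` are assumptions for illustration; `Box` is taken to wrap an expression, and the `Free` command is left out of this copy of Micro-C because the compiler never emits it:

```ocaml
(* Micro-C without Free: compile_leaky never emits it. *)
type microc =
  | Let of { id : string; binding : microc; body : microc }
  | Var of string
  | Malloc
  | Set of { loc : microc; value : microc }
  | Deref of { loc : microc }
  | Num of int

(* Assumed runtime types. *)
type value = VNum of int | VLoc of int
type status = Free | Occupied
type heap = { free_list : status array; heap : value array }

exception Oom
exception Runtime
module StringMap = Map.Make (String)

let find_free_idx (h : heap) : int =
  let rec go idx =
    if idx >= Array.length h.free_list then raise Oom
    else if h.free_list.(idx) = Free then idx
    else go (idx + 1) in
  go 0

let rec interp (h : heap) (env : value StringMap.t) (e : microc) : value =
  let loc_of e' =
    match interp h env e' with VLoc l -> l | VNum _ -> raise Runtime in
  match e with
  | Num n -> VNum n
  | Var x ->
    (match StringMap.find_opt x env with Some v -> v | None -> raise Runtime)
  | Let { id; binding; body } ->
    let v = interp h env binding in
    interp h (StringMap.add id v env) body
  | Malloc ->
    let idx = find_free_idx h in
    h.free_list.(idx) <- Occupied;
    VLoc idx
  | Set { loc; value } ->
    let l = loc_of loc in
    h.heap.(l) <- interp h env value;
    VNum 0
  | Deref { loc } -> h.heap.(loc_of loc)

(* Source language, with Box wrapping an expression. *)
type ast =
  | Let of string * ast * ast
  | Unbox of ast
  | Box of ast
  | Set of string * ast
  | Var of string
  | Num of int

(* Hypothetical fresh-name helper: "fresh0", "fresh1", ... *)
let fresh_name (f : int ref) : string =
  let n = !f in
  incr f;
  "fresh" ^ string_of_int n

let rec compile_leaky (a : ast) (f : int ref) : microc =
  match a with
  | Box e ->
    let e_v = compile_leaky e f in
    let loc_name = fresh_name f in
    Let { id = loc_name; binding = Malloc;
          body = Let { id = "_"; binding = Set { loc = Var loc_name; value = e_v };
                       body = Var loc_name } }
  | Num n -> Num n
  | Var s -> Var s
  | Let (id, binding, body) ->
    Let { id; binding = compile_leaky binding f; body = compile_leaky body f }
  | Set (loc, value) -> Set { loc = Var loc; value = compile_leaky value f }
  | Unbox e -> Deref { loc = compile_leaky e f }

(* n nested lets that each box a number and shadow the previous x,
   so only the newest box is reachable at any time. *)
let rec churn (n : int) : ast =
  if n = 0 then Num 0 else Let ("x", Box (Num n), churn (n - 1))

(* Run churn n on a 100-cell heap and count the cells still marked Occupied. *)
let leaked (n : int) : int =
  let h = { free_list = Array.make 100 Free; heap = Array.make 100 (VNum (-1)) } in
  ignore (interp h StringMap.empty (compile_leaky (churn n) (ref 0)));
  Array.fold_left (fun k s -> if s = Occupied then k + 1 else k) 0 h.free_list

(* Does churn n exhaust the heap? *)
let runs_out (n : int) : bool =
  match leaked n with _ -> false | exception Oom -> true
```

  • In this sketch `leaked 50` is 50 even though at most one box is reachable at the end, and `runs_out 101` is true: the 101st allocation raises `Oom` because nothing was ever reclaimed.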

Garbage collection

  • One of the performance benefits of low-level memory management provided by micro-C is the ability to have fine-grained control over the amount of allocated memory by freeing memory when it is no longer needed.
  • Manually freeing memory is a common source of memory safety issues, so our memory-safe language does not expose it.
  • How do we use memory efficiently in memory-safe languages? The answer is garbage collection, which refers to schemes for automatically reclaiming memory once it is no longer needed.
  • There are many ways of implementing garbage collection, and it is a very active area of language development. We will cover a simple strategy called tracing (or mark-and-sweep) garbage collection.
  • First, we will extend the AST of Micro-C with a Gc command that triggers a garbage collection when it is executed:
type microc =
  | Let of { id: string; binding: microc; body: microc }
  | Var of string
  | Malloc
  | Gc
  | Free of microc
  | Set of {loc: microc; value: microc}
  | Deref of {loc: microc}
  | Num of int
  • When compiling to Micro-C, a memory-safe language can emit a Gc command to trigger a garbage collection cycle and reclaim unused memory.
    • In practice, there are many heuristics and strategies for choosing when to trigger garbage collection, and the performance of memory-safe languages can be very sensitive to this choice.
  • We can modify our compiler to emit garbage collection commands whenever a boxed value is allocated:
let rec compile_gc (ast:ast) (f:int ref) : microc =
  match ast with
  | Box(e) ->
    (* whenever we allocate a new value, garbage collect *)
    let e_v = compile_gc e f in
    let loc_name = fresh_name f in
    let res : microc = Let{id=loc_name; binding=Malloc;
                           body = Let {id="_"; binding=Set { loc=Var(loc_name); value=e_v };
                                       body=Var(loc_name) }} in
    let body : microc = Let { id = "_"; binding = Gc; body = Var("res")} in
    Let { id = "res"; binding = res; body=body }
  | Num(n) -> Num(n)
  | Var(s) -> Var(s)
  | Let(id, binding, body) ->
    Let {id=id; binding=compile_gc binding f; body= compile_gc body f}
  | Set(loc, value) ->
    Set { loc = Var(loc); value = compile_gc value f }
  | Unbox(e) ->
    Deref { loc = compile_gc e f }
  • Now, how do we implement the Gc instruction?
  • Key idea:
    • At the point of garbage collection, identify the roots: the values bound to in-scope variables in the program.
    • Then, rebuild the free-list: mark every location as free, and then mark as occupied every location that is reachable from the roots.
  • In Micro-C, root identification is quite simple: we can inspect the current environment to get all in-scope variables.
    • In languages more like assembly, it can be much trickier to do root identification!
  • Ultimately, this is a straightforward graph reachability algorithm, which we can implement as follows:

let rec gc (root:value) (heap:heap) : unit =
  (* traverse the heap beginning from `root` and mark every reachable address as occupied *)
  match root with
  | VNum(_) -> ()
  | VLoc(l) ->
    (* only recurse if this location is not yet marked occupied; otherwise we have already visited it *)
    if (Array.get heap.free_list l) = Free then
      (Array.set heap.free_list l Occupied;
       gc (Array.get heap.heap l) heap)

let rec interp_c (heap:heap) (env: microcenv) (microc:microc) : value =
  match microc with
  | Let {id; binding; body} ->
    let bindv = interp_c heap env binding in
    let new_env = StringMap.add id bindv env in
    interp_c heap new_env body
  | Var(s) ->
    (match StringMap.find_opt s env with
     | Some(v) -> v
     | None -> raise Runtime)
  | Malloc ->
    let idx = find_free_idx heap in
    (* update free list *)
    Array.set heap.free_list idx Occupied;
    (* evaluate to the freshly allocated location *)
    VLoc(idx)
  | Set {loc; value} ->
    let l = (match interp_c heap env loc with
          VLoc(v) -> v
        | VNum(_) -> raise Runtime) in
    let v = interp_c heap env value in
    Array.set heap.heap l v;
    VNum(0)
  | Deref { loc } ->
    let l = (match interp_c heap env loc with
       VLoc(v) -> v
     | VNum(_) -> raise Runtime) in
    Array.get heap.heap l
  | Num(n) -> (VNum(n))
  | Gc ->
    (* mark all heap locations as free *)
    Array.iteri (fun idx _ -> Array.set heap.free_list idx Free) heap.free_list;
    (* then, for every root in our environment, mark it as occupied *)
    StringMap.iter (fun _ value -> gc value heap) env;
    VNum(0)
  | Free(exp) ->
    let l = (match interp_c heap env exp with
          VLoc(v) -> v
        | VNum(_) -> raise Runtime) in
    Array.set heap.free_list l Free;
    VNum(0)
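
  • To see the collector pay off end to end, the following sketch compiles a long chain of allocations with a Gc after every box and runs it on a 100-cell heap. The runtime types, `fresh_name`, `mark` (playing the role of `gc` above), `churn` and `run` are assumptions for illustration; `Box` is taken to wrap an expression, and the `Free` command is left out of this copy of Micro-C because the compiler never emits it:

```ocaml
(* Micro-C with Gc; Free is omitted because compile_gc never emits it. *)
type microc =
  | Let of { id : string; binding : microc; body : microc }
  | Var of string
  | Malloc
  | Gc
  | Set of { loc : microc; value : microc }
  | Deref of { loc : microc }
  | Num of int

(* Assumed runtime types. *)
type value = VNum of int | VLoc of int
type status = Free | Occupied
type heap = { free_list : status array; heap : value array }

exception Oom
exception Runtime
module StringMap = Map.Make (String)

let find_free_idx (h : heap) : int =
  let rec go idx =
    if idx >= Array.length h.free_list then raise Oom
    else if h.free_list.(idx) = Free then idx
    else go (idx + 1) in
  go 0

(* Mark phase: flag every cell reachable from root as Occupied. *)
let rec mark (root : value) (h : heap) : unit =
  match root with
  | VNum _ -> ()
  | VLoc l ->
    if h.free_list.(l) = Free then begin
      h.free_list.(l) <- Occupied;
      mark h.heap.(l) h
    end

let rec interp (h : heap) (env : value StringMap.t) (e : microc) : value =
  let loc_of e' =
    match interp h env e' with VLoc l -> l | VNum _ -> raise Runtime in
  match e with
  | Num n -> VNum n
  | Var x ->
    (match StringMap.find_opt x env with Some v -> v | None -> raise Runtime)
  | Let { id; binding; body } ->
    let v = interp h env binding in
    interp h (StringMap.add id v env) body
  | Malloc ->
    let idx = find_free_idx h in
    h.free_list.(idx) <- Occupied;
    VLoc idx
  | Gc ->
    (* reset the free-list, then re-mark everything reachable from the environment *)
    Array.fill h.free_list 0 (Array.length h.free_list) Free;
    StringMap.iter (fun _ v -> mark v h) env;
    VNum 0
  | Set { loc; value } ->
    let l = loc_of loc in
    h.heap.(l) <- interp h env value;
    VNum 0
  | Deref { loc } -> h.heap.(loc_of loc)

(* Source language, with Box wrapping an expression. *)
type ast =
  | Let of string * ast * ast
  | Unbox of ast
  | Box of ast
  | Set of string * ast
  | Var of string
  | Num of int

(* Hypothetical fresh-name helper: "fresh0", "fresh1", ... *)
let fresh_name (f : int ref) : string =
  let n = !f in
  incr f;
  "fresh" ^ string_of_int n

let rec compile_gc (a : ast) (f : int ref) : microc =
  match a with
  | Box e ->
    let e_v = compile_gc e f in
    let loc_name = fresh_name f in
    let res : microc =
      Let { id = loc_name; binding = Malloc;
            body = Let { id = "_"; binding = Set { loc = Var loc_name; value = e_v };
                         body = Var loc_name } } in
    Let { id = "res"; binding = res;
          body = Let { id = "_"; binding = Gc; body = Var "res" } }
  | Num n -> Num n
  | Var s -> Var s
  | Let (id, binding, body) ->
    Let { id; binding = compile_gc binding f; body = compile_gc body f }
  | Set (loc, value) -> Set { loc = Var loc; value = compile_gc value f }
  | Unbox e -> Deref { loc = compile_gc e f }

(* n nested lets (n >= 1) that each box a number and shadow the previous x. *)
let rec churn (n : int) : ast =
  if n = 0 then Unbox (Var "x") else Let ("x", Box (Num n), churn (n - 1))

(* Run churn n on a 100-cell heap; return the result and the occupied-cell count. *)
let run (n : int) : value * int =
  let h = { free_list = Array.make 100 Free; heap = Array.make 100 (VNum (-1)) } in
  let v = interp h StringMap.empty (compile_gc (churn n) (ref 0)) in
  (v, Array.fold_left (fun k s -> if s = Occupied then k + 1 else k) 0 h.free_list)
```

  • In this sketch `run 1000` evaluates to `(VNum 1, 2)`: a thousand boxes are allocated on a 100-cell heap without running out of memory, and only two cells remain occupied at the end, because each Gc keeps just the boxes still bound in the environment.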

  • Question: what happens if we don’t do the check to see if a heap location has already been marked as occupied?
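  • One way to experiment with this question is to run the marking traversal on a small hand-built heap. The sketch below assumes the same `value`, `status` and `heap` types used by the interpreter; `mark`, `collect` and `demo` are hypothetical names, and the heap contents (including a two-cell cycle) are made up for illustration:

```ocaml
(* Assumed runtime types, matching the names used by the interpreter. *)
type value = VNum of int | VLoc of int
type status = Free | Occupied
type heap = { free_list : status array; heap : value array }

(* Mark phase: same traversal as gc, including the "already marked" check. *)
let rec mark (root : value) (h : heap) : unit =
  match root with
  | VNum _ -> ()
  | VLoc l ->
    if h.free_list.(l) = Free then begin
      h.free_list.(l) <- Occupied;
      mark h.heap.(l) h
    end

(* One collection cycle: mark everything free, then mark from each root. *)
let collect (roots : value list) (h : heap) : unit =
  Array.fill h.free_list 0 (Array.length h.free_list) Free;
  List.iter (fun r -> mark r h) roots

(* Build a 6-cell heap, collect with a single location root, and report
   which cells are still occupied afterwards. *)
let demo () : bool list =
  let h = { free_list = Array.make 6 Occupied; heap = Array.make 6 (VNum (-1)) } in
  h.heap.(0) <- VLoc 1;  (* cell 0 points to cell 1 *)
  h.heap.(1) <- VLoc 0;  (* cell 1 points back to cell 0: a cycle *)
  h.heap.(2) <- VNum 5;  (* unreachable *)
  h.heap.(3) <- VLoc 4;  (* unreachable chain 3 -> 4 *)
  h.heap.(4) <- VNum 9;
  collect [VLoc 0; VNum 42] h;
  Array.to_list (Array.map (fun s -> s = Occupied) h.free_list)
```

  • Here `demo ()` returns `[true; true; false; false; false; false]`: only the two cells of the cycle survive. Removing the `if` in `mark` and re-running `demo` is a quick way to explore the question.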

Beyond tracing garbage collection

  • Many of today’s languages deploy sophisticated garbage collection schemes
    • Python’s is documented here
    • Java’s is documented here
    • OCaml’s is documented here
    • JavaScript’s V8 garbage collector is documented here
    • This textbook is considered authoritative on standard algorithms and is a good starting point:
  • Knowledge of the subtleties and nuances of your language’s garbage collection scheme can sometimes be critical for performance in practice.
  • There are many challenges with practical garbage collection that our little implementation here hides:
    • Compaction and fragmentation: freeing up used space and handling heterogeneous allocation sizes
    • Concurrency: handling multi-threaded programs that allocate memory
    • Latency: handling real-time applications
    • Finalizers: cleaning up dynamically allocated resources besides memory
  • We’ll briefly discuss some of these things, but for more details, you are encouraged to study existing garbage collectors at the links above

Reference-counted garbage collection

  • Reference-counting garbage collection is a simple and lightweight approach to garbage collection that stores alongside each heap-allocated value how many references to it exist
  • Implementation is relatively simple in principle:
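    • Every time a new reference to a location is created (from a variable or from another heap location), increment its reference count.
    • Every time a reference goes out of scope or is overwritten, decrement the reference count of the location it pointed to.
    • Collect a location once its count reaches 0, decrementing the counts of any locations it points to.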
  • Benefits of reference-counting:
    • Does not require “stopping the world” to perform mark-and-sweep: allocations can be collected as soon as reference count hits zero.
    • Does not require root identification: makes it easy to embed and interact with unsafe languages
    • Predictable clean-up time: you can provide (some) guarantees about when collection will be performed.
  • Downsides of reference counting:
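    • Requires constant updating and maintenance of counters: this adds runtime overhead to every pointer copy and is especially costly for multithreading, where counter updates must be synchronized.
    • Leaks cycles: locations that refer to each other keep each other's counts above zero, so they are never collected even when nothing else can reach them.
  • The Swift language uses reference counting for memory management.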
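  • The scheme above can be sketched on a toy heap. Everything here is an assumption for illustration rather than how any particular runtime works: each cell holds at most one outgoing pointer, and the names `cell`, `alloc`, `retain`, `release` and `set_next` are hypothetical:

```ocaml
(* A toy heap cell: a liveness flag, a reference count, and at most one
   outgoing pointer to another cell. *)
type cell = {
  mutable live : bool;
  mutable count : int;
  mutable next : int option;
}

let make_heap (n : int) : cell array =
  Array.init n (fun _ -> { live = false; count = 0; next = None })

(* Allocate the first unused cell; the caller holds the one initial reference. *)
let alloc (h : cell array) : int =
  let rec go i =
    if i >= Array.length h then failwith "out of memory"
    else if not h.(i).live then i
    else go (i + 1) in
  let i = go 0 in
  h.(i).live <- true;
  h.(i).count <- 1;
  h.(i).next <- None;
  i

(* A new reference to cell i was created. *)
let retain (h : cell array) (i : int) : unit =
  h.(i).count <- h.(i).count + 1

(* A reference to cell i went away; at zero, reclaim the cell and release
   whatever it points to. *)
let rec release (h : cell array) (i : int) : unit =
  let c = h.(i) in
  c.count <- c.count - 1;
  if c.count = 0 then begin
    c.live <- false;
    match c.next with
    | Some j -> c.next <- None; release h j
    | None -> ()
  end

(* Store a pointer to j inside cell i, adjusting both counts. *)
let set_next (h : cell array) (i : int) (j : int) : unit =
  retain h j;
  (match h.(i).next with Some old -> release h old | None -> ());
  h.(i).next <- Some j

let live_count (h : cell array) : int =
  Array.fold_left (fun n c -> if c.live then n + 1 else n) 0 h

(* a -> b, then both local references are dropped. *)
let chain_demo () : int =
  let h = make_heap 4 in
  let a = alloc h in
  let b = alloc h in
  set_next h a b;
  release h b;
  release h a;
  live_count h

(* a -> b -> a, then both local references are dropped. *)
let cycle_demo () : int =
  let h = make_heap 4 in
  let a = alloc h in
  let b = alloc h in
  set_next h a b;
  set_next h b a;
  release h a;
  release h b;
  live_count h
```

  • In this sketch `chain_demo ()` returns 0 (both cells are reclaimed the moment the last reference disappears), while `cycle_demo ()` returns 2: the two cells keep each other's counts at 1 and are never collected.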

Generational garbage collection

  • Generational garbage collection is probably the most widely-used garbage collection scheme
    • Used by Python, Java, OCaml, JavaScript, etc.
  • Leverages the principle that recent allocations are more likely to be short-lived, so maintains multiple generations: separate heaps that have different expectations around how long allocations live.
  • The nursery is where new values are allocated: it is assumed that most values in the nursery will be collected quickly.
  • During garbage collection, allocations in the nursery that are still live are moved (promoted) into a long-lived store, which is collected much less frequently.
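  • A heavily simplified sketch of the two-generation idea follows. It is not how any production collector is implemented: objects are just integer ids, liveness is supplied by a caller-provided predicate instead of being traced from roots, and pointers between generations are ignored. The names `gen_heap`, `minor_gc` and `major_gc` are hypothetical:

```ocaml
(* Two generations of object ids. *)
type gen_heap = {
  mutable nursery : int list;  (* recently allocated objects *)
  mutable old_gen : int list;  (* objects that survived a minor collection *)
}

let create () : gen_heap = { nursery = []; old_gen = [] }

(* New objects always start in the nursery. *)
let alloc (g : gen_heap) (id : int) : unit =
  g.nursery <- id :: g.nursery

(* Minor collection: scan only the nursery, promote survivors, drop the rest. *)
let minor_gc (g : gen_heap) ~(is_live : int -> bool) : unit =
  let survivors = List.filter is_live g.nursery in
  g.old_gen <- survivors @ g.old_gen;
  g.nursery <- []

(* Major collection: also scan the old generation; meant to run rarely. *)
let major_gc (g : gen_heap) ~(is_live : int -> bool) : unit =
  minor_gc g ~is_live;
  g.old_gen <- List.filter is_live g.old_gen

(* Returns the old generation before and after a major collection. *)
let demo () : int list * int list =
  let g = create () in
  List.iter (alloc g) [1; 2; 3; 4];
  (* objects 1 and 3 die young; 2 and 4 are promoted *)
  minor_gc g ~is_live:(fun id -> id = 2 || id = 4);
  List.iter (alloc g) [5; 6];
  (* object 4 is now dead, but a minor collection does not look at old_gen *)
  minor_gc g ~is_live:(fun id -> id = 2 || id = 6);
  let before = List.sort compare g.old_gen in
  major_gc g ~is_live:(fun id -> id = 2 || id = 6);
  (before, List.sort compare g.old_gen)
```

  • In this sketch `demo ()` returns `([2; 4; 6], [2; 6])`: the dead object 4 lingers in the old generation until the infrequent major collection runs, which is the price paid for keeping the frequent minor collections cheap.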