- Logistics:
- Homework due tonight, next one coming out
- Outcomes for today:
- You should be able to identify if a language is memory safe, and be able to write programs that demonstrate memory safety errors in languages that are not memory safe.
- You should understand what the purpose of a garbage-collector is and how to implement a simple tracing-based garbage collection scheme.
Micro-C and memory safety
- Today our goal is to study memory safety. To do this, we will introduce a tiny C-like language with free and malloc:
type microc =
| Let of { id: string; binding: microc; body: microc }
| Var of string
| Malloc
| Free of microc
| Set of {loc: microc; value: microc}
| Deref of {loc: microc}
| Num of int
- The semantics of Micro-C are given as follows:
(* the state of a micro-c program is similar to that of the heap-manipulating programs we've seen so far,
except now we maintain a **free-list** which tracks which portions of the heap are occupied. *)
env heap freelist
let x = 10 in x, [], [-1, -1, ...], [Free, Free, Free, ...]
--> x, [x ↦ 10], []
--> 10
(* malloc generates a fresh (unused) location but does *not* initialize the memory and sets that
location to occupied in the free-list *)
malloc, [], [-1, -1, ...], [Free, Free, Free, ...]
--> loc 0x0, [], [-1, -1, ...], [Occupied, Free, Free, ...]
(* free deallocates a location to make it available for later. evaluates to the
number 0, for completeness *)
free 0x0, [], [-1, -1, ...], [Occupied, Free, Free, ...]
--> 0, [], [-1, -1, ...], [Free, Free, Free, ...]
(* set updates a location in memory, returns 0 for completeness *)
set 0x0 10, [], [-1, -1, ...], [Free, Free, Free, ...]
--> 0, [], [10, -1, ...], [Free, Free, Free, ...]
(* deref fetches a value stored in the heap *)
deref 0x0, [], [-1, -1, ...], [Free, Free, Free, ...]
--> -1
- Given these semantics, we can make an interpreter:
let heap_size = 100
let empty_heap () = { free_list = Array.make heap_size Free; heap = Array.make heap_size (VNum (-1))}
module StringMap = Map.Make(String)
type microcenv = value StringMap.t
(* out of memory exception *)
exception Oom
(* runtime error exception, e.g. an unbound variable or using a number as a location *)
exception Runtime
(* find a free index in the free_list or raise an Oom exception *)
let find_free_idx (h:heap) : int =
let rec helper (h:heap) (idx:int) : int =
if (idx >= Array.length h.free_list) then raise Oom else
if (Array.get h.free_list idx) = Free then idx else helper h (idx + 1) in
helper h 0
let rec interp_c (heap:heap) (env: microcenv) (microc:microc) : value =
match microc with
| Let {id; binding; body} ->
let bindv = interp_c heap env binding in
let new_env = StringMap.add id bindv env in
interp_c heap new_env body
| Var(s) ->
(match StringMap.find_opt s env with
| Some(v) -> v
| None -> raise Runtime)
| Malloc ->
let idx = find_free_idx heap in
(* update free list *)
Array.set heap.free_list idx Occupied;
VLoc(idx)
| Set {loc; value} ->
let l = (match interp_c heap env loc with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
let v = interp_c heap env value in
Array.set heap.heap l v;
VNum(0)
| Deref { loc } ->
let l = (match interp_c heap env loc with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
Array.get heap.heap l
| Num(n) -> (VNum(n))
| Free(exp) ->
let l = (match interp_c heap env exp with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
Array.set heap.free_list l Free;
VNum(0)
- Notice the semantics of Micro-C do not prevent us from doing the following:
- Reading from memory before it is initialized.
- Writing to memory that has not been allocated.
- Freeing the same memory location more than once.
- Continuing to use a memory location after it has been freed.
- Broadly speaking, the above kinds of errors are called memory safety errors: errors that occur due to accessing or updating invalid regions of memory.
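To make one of these errors concrete, here is a minimal standalone sketch of a use-after-free in Micro-C. It re-implements a compact version of the interpreter so that it can run on its own. The 4-cell heap, the bool array `occupied` (standing in for the free-list), and the helpers `as_loc` and `seq` are assumptions made for this illustration, not part of the notes' interpreter.

```ocaml
(* A compact, standalone re-implementation of Micro-C for experimentation.
   Assumption: the free-list is modelled as a bool array (true = occupied). *)
type microc =
  | Let of { id : string; binding : microc; body : microc }
  | Var of string
  | Malloc
  | Free of microc
  | Set of { loc : microc; value : microc }
  | Deref of { loc : microc }
  | Num of int

type value = VNum of int | VLoc of int

module StringMap = Map.Make (String)

exception Oom
exception Runtime

let heap_size = 4
let cells = Array.make heap_size (VNum (-1))
let occupied = Array.make heap_size false

(* return the first unoccupied index, or raise Oom *)
let find_free_idx () =
  let rec go i =
    if i >= heap_size then raise Oom
    else if not occupied.(i) then i
    else go (i + 1)
  in
  go 0

(* hypothetical helper: extract a location or fail *)
let as_loc = function VLoc l -> l | VNum _ -> raise Runtime

let rec interp env = function
  | Num n -> VNum n
  | Var s ->
      (match StringMap.find_opt s env with
       | Some v -> v
       | None -> raise Runtime)
  | Let { id; binding; body } ->
      let v = interp env binding in
      interp (StringMap.add id v env) body
  | Malloc ->
      let i = find_free_idx () in
      occupied.(i) <- true;
      VLoc i
  | Free e ->
      occupied.(as_loc (interp env e)) <- false;
      VNum 0
  | Set { loc; value } ->
      let l = as_loc (interp env loc) in
      cells.(l) <- interp env value;
      VNum 0
  | Deref { loc } -> cells.(as_loc (interp env loc))

(* hypothetical helper for sequencing: evaluate `first`, discard it, then evaluate `rest` *)
let seq first rest = Let { id = "_"; binding = first; body = rest }

(* let p = malloc in set p 10; free p; let q = malloc in set q 99; deref p *)
let use_after_free =
  Let { id = "p"; binding = Malloc; body =
    seq (Set { loc = Var "p"; value = Num 10 })
      (seq (Free (Var "p"))
        (Let { id = "q"; binding = Malloc; body =
          seq (Set { loc = Var "q"; value = Num 99 })
            (Deref { loc = Var "p" }) })) }

let result = interp StringMap.empty use_after_free
```

In this sketch `result` is `VNum 99` rather than the `10` that was stored through `p`: after `free p`, the next `malloc` hands out the same cell to `q`, so the stale pointer `p` silently observes `q`'s data.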
- It may seem unbelievable, but real C code does not prevent any of these kinds of errors!
- Look at the following C program (use this website to run it):
#include <stdio.h>
int main()
{
int x = 10;
int z = 20;
int* y = &x;
*(y+1) = 5;
printf("x: %d, z: %d", x, z);
return 0;
}
- What do you think this program prints? The answer will probably surprise you!
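- It prints x: 10, z: 5! The value for z was changed even though we never explicitly created a pointer to z. Writing through y+1 is undefined behavior in C, so the exact output depends on the compiler and on how it lays out the stack.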
- If memory-unsafe operations are so dangerous, why do our languages permit them in the first place?
- By far the primary reason is performance: sometimes it is critical to do low-level memory updates to speed up certain algorithms or computations.
Compiling with memory safety
- So, how do we design languages with dynamic allocation and pointers that avoid memory errors? This is a language design problem
- We call languages that prevent all memory-safety errors memory-safe languages. This is a broad definition and is difficult to make precise (and many purportedly memory-safe languages do permit ways of performing memory-unsafe computation)
- To study this question, we will study how to compile a memory-safe language into a memory-unsafe language and the challenges involved with doing that.
- Here is our tiny memory safe language:
type ast =
Let of string * ast * ast
| Unbox of ast
| Box of ast
| Set of string * ast
| Var of string
| Num of int
- What makes it a memory safe language?
- The type system (not shown, but assume it is the usual type system for mutable references we’ve seen in class) prevents unboxing of non-references
- No ability to manipulate pointers to dereference invalid memory addresses
- The syntactic structure of box requires providing an initial value for heap allocation
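The type system is not shown in these notes, so the following is only a sketch of what such a checker might look like. The type `ty`, the exception `Type_error`, and the function `typecheck` are hypothetical names introduced for illustration, and the rules are one plausible reading of "the usual type system for mutable references".

```ocaml
type ast =
  | Let of string * ast * ast
  | Unbox of ast
  | Box of ast
  | Set of string * ast
  | Var of string
  | Num of int

(* hypothetical types: integers and references to values of some type *)
type ty = TNum | TRef of ty

exception Type_error of string

module StringMap = Map.Make (String)

let rec typecheck (ctx : ty StringMap.t) (e : ast) : ty =
  match e with
  | Num _ -> TNum
  | Var x ->
      (match StringMap.find_opt x ctx with
       | Some t -> t
       | None -> raise (Type_error ("unbound variable " ^ x)))
  | Let (x, binding, body) ->
      let t = typecheck ctx binding in
      typecheck (StringMap.add x t ctx) body
  | Box inner ->
      (* boxing always comes with an initial value, so the cell is never uninitialized *)
      TRef (typecheck ctx inner)
  | Unbox inner ->
      (* only references may be unboxed *)
      (match typecheck ctx inner with
       | TRef t -> t
       | TNum -> raise (Type_error "cannot unbox a number"))
  | Set (x, v) ->
      (* the variable must hold a reference whose contents match the new value *)
      (match StringMap.find_opt x ctx, typecheck ctx v with
       | Some (TRef t), tv when t = tv -> TNum (* set evaluates to the number 0 *)
       | _ -> raise (Type_error "ill-typed set"))

(* a well-typed program: let x = box 10 in unbox x *)
let ok = typecheck StringMap.empty (Let ("x", Box (Num 10), Unbox (Var "x")))

(* an ill-typed program: unbox 3 *)
let rejected =
  try ignore (typecheck StringMap.empty (Unbox (Num 3))); false
  with Type_error _ -> true
```

Under these assumed rules, `unbox 3` is rejected before it can ever be compiled into a `Deref` of a non-location.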
- A simple compiler:
let rec compile_leaky (ast:ast) (f:int ref) : microc =
match ast with
| Box(e) ->
let e_v = compile_leaky e f in
(* allocate a fresh location and store the compiled value in it; nothing is ever freed *)
let loc_name = fresh_name f in
Let{id=loc_name; binding=Malloc;
body = Let {id="_"; binding=Set { loc=Var(loc_name); value=e_v };
body=Var(loc_name) }}
| Num(n) -> Num(n)
| Var(s) -> Var(s)
| Let(id, binding, body) ->
Let {id=id; binding=compile_leaky binding f; body= compile_leaky body f}
| Set(loc, value) ->
Set { loc = Var(loc); value = compile_leaky value f }
| Unbox(e) ->
Deref { loc = compile_leaky e f }
- Example outputs:
> compile_leaky (Let("x", Box(Num(10)), Unbox(Var("x")))) (ref 0);;
- : microc =
Let
{id = "x";
binding =
Let
{id = "fresh0"; binding = Malloc;
body =
Let
{id = "_"; binding = Set {loc = Var "fresh0"; value = Num 10};
body = Var "fresh0"}};
body = Deref {loc = Var "x"}}
- Notice: this compilation procedure only ever allocates memory, it never frees it! Memory that is allocated but never freed is called a memory leak.
- Memory leaks are not generally regarded as memory safety errors since they generally do not lead to dangerous systems behavior; they mostly affect performance (and, if you eventually run out of memory due to a memory leak, they can cause your application or even system to crash)
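As a rough illustration of how a leak eventually turns into a crash, the sketch below simulates a program that keeps allocating and never frees. The 3-cell heap, the bool array `occupied`, and the helper `allocate_n` are assumptions made for this example only.

```ocaml
exception Oom

(* Assumption: a deliberately tiny heap so that exhaustion is easy to observe *)
let heap_size = 3
let occupied = Array.make heap_size false

(* allocate the first unoccupied cell, or raise Oom if none is left *)
let malloc () =
  let rec go i =
    if i >= heap_size then raise Oom
    else if not occupied.(i) then (occupied.(i) <- true; i)
    else go (i + 1)
  in
  go 0

(* simulate a program that boxes a value n times and never frees anything *)
let allocate_n n =
  try
    for _i = 1 to n do ignore (malloc ()) done;
    "ok"
  with Oom -> "out of memory"

let first = allocate_n 3   (* exactly fills the heap *)
let second = allocate_n 1  (* no cell was ever released, so this fails *)
```

Here `first` succeeds but `second` runs out of memory, even though none of the earlier allocations are still in use by the program.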
Garbage collection
- One of the performance benefits of low-level memory management provided by micro-C is the ability to have fine-grained control over the amount of allocated memory by freeing memory when it is no longer needed.
- Freeing memory is a common source of memory safety issues, so it is not compatible with our memory-safe language.
- How do we efficiently utilize memory in memory-safe languages? The answer is garbage collection, which refers to schemes for automatically reclaiming memory when it is no longer needed
- There are many ways of implementing garbage collection, and it is a very active area of language development. We will cover a simple strategy called tracing (or mark-and-sweep) garbage collection.
- First, we will extend the AST of Micro-C with a Gc command that triggers a garbage collection when it is executed:
type microc =
| Let of { id: string; binding: microc; body: microc }
| Var of string
| Malloc
| Gc
| Free of microc
| Set of {loc: microc; value: microc}
| Deref of {loc: microc}
| Num of int
- When compiling to Micro-C, a memory-safe language can emit a Gc command to trigger a garbage collection cycle and reclaim unused memory
- In practice, there are many heuristics and strategies for choosing when to trigger garbage collection, and the performance of memory-safe languages can be very sensitive to this choice.
- We can modify our compiler to emit garbage collection commands whenever a boxed value is allocated:
let rec compile_gc (ast:ast) (f:int ref) : microc =
match ast with
| Box(e) ->
(* whenever we allocate a new value, garbage collect *)
let e_v = compile_gc e f in
let loc_name = fresh_name f in
let res : microc = Let{id=loc_name; binding=Malloc;
body = Let {id="_"; binding=Set { loc=Var(loc_name); value=e_v };
body=Var(loc_name) }} in
let body : microc = Let { id = "_"; binding = Gc; body = Var("res")} in
Let { id = "res"; binding = res; body=body }
| Num(n) -> Num(n)
| Var(s) -> Var(s)
| Let(id, binding, body) ->
Let {id=id; binding=compile_gc binding f; body= compile_gc body f}
| Set(loc, value) ->
Set { loc = Var(loc); value = compile_gc value f }
| Unbox(e) ->
Deref { loc = compile_gc e f }
- Now, how do we implement the Gc instruction?
- Key idea:
- At the point of garbage collection, identify roots, which are in-scope locations in the program
- Then, update the free-list so that every location reachable from the roots is marked occupied and every other location is marked free
- In micro-c, root identification is quite simple: we can inspect the current environment to get all in-scope variables
- In languages more like assembly, it can be much trickier to do root identification!
- Ultimately, this is a straightforward graph reachability algorithm, which we can implement as follows:
let rec gc (root:value) (heap:heap) : unit =
(* traverse the heap beginning from `root` and mark every reachable address as occupied *)
match root with
| VNum(_) -> ()
| VLoc(l) ->
(* check whether we've already marked this location as occupied; if not, mark it and recurse *)
if (Array.get heap.free_list l) = Free then
(Array.set heap.free_list l Occupied;
gc (Array.get heap.heap l) heap)
let rec interp_c (heap:heap) (env: microcenv) (microc:microc) : value =
match microc with
| Let {id; binding; body} ->
let bindv = interp_c heap env binding in
let new_env = StringMap.add id bindv env in
interp_c heap new_env body
| Var(s) ->
(match StringMap.find_opt s env with
| Some(v) -> v
| None -> raise Runtime)
| Malloc ->
let idx = find_free_idx heap in
(* update free list *)
Array.set heap.free_list idx Occupied;
VLoc(idx)
| Set {loc; value} ->
let l = (match interp_c heap env loc with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
let v = interp_c heap env value in
Array.set heap.heap l v;
VNum(0)
| Deref { loc } ->
let l = (match interp_c heap env loc with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
Array.get heap.heap l
| Num(n) -> (VNum(n))
| Gc ->
(* mark all heap locations as free *)
Array.iteri (fun idx _ -> Array.set heap.free_list idx Free) heap.free_list;
(* then, for every root in our environment, mark it as occupied *)
StringMap.iter (fun _ value -> gc value heap) env;
VNum(0)
| Free(exp) ->
let l = (match interp_c heap env exp with
VLoc(v) -> v
| VNum(_) -> raise Runtime) in
Array.set heap.free_list l Free;
VNum(0)
- Question: what happens if we don’t do the check to see if a heap location has already been marked?
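One way to explore this question is to run the mark phase on a heap that contains a cycle. The sketch below is standalone. The 4-cell heap `cells`, the bool array `marked` (playing the role of Occupied in the free-list), and the single root are assumptions chosen for illustration.

```ocaml
type value = VNum of int | VLoc of int

(* Assumption: a 4-cell heap in which cells 0 and 1 point at each other *)
let cells = [| VLoc 1; VLoc 0; VNum 7; VNum (-1) |]

(* marked.(i) = true plays the role of Occupied in the free-list *)
let marked = Array.make (Array.length cells) false

(* mark everything reachable from `root`; a cell that is already marked
   is not visited again *)
let rec mark (root : value) : unit =
  match root with
  | VNum _ -> ()
  | VLoc l ->
      if not marked.(l) then begin
        marked.(l) <- true;
        mark cells.(l)
      end

(* the only root is a variable holding location 0 *)
let () = List.iter mark [ VLoc 0 ]

let reachable = Array.to_list marked
```

With the check in place, the traversal visits cells 0 and 1 once each and stops, leaving cells 2 and 3 unmarked. Without it, `mark` would follow the pointer from 0 to 1 and back to 0 indefinitely.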
Beyond tracing garbage collection
- Many of today’s languages deploy sophisticated garbage collection schemes
- Python’s is documented here
- Java’s is documented here
- OCaml’s is documented here
- JavaScript’s V8 garbage collector is documented here
- This textbook is considered authoritative on standard algorithms and is a good starting point:
- Knowledge of the subtleties and nuances of your language’s garbage collection scheme can sometimes be critical for performance in practice.
- There are many challenges with practical garbage collection that our little implementation here hides:
- Compaction and fragmentation: freeing up used space and handling heterogeneous allocation sizes
- Concurrency: handling multi-threaded programs that allocate memory
- Latency: handling real-time applications
- Finalizers: cleaning up dynamically allocated resources besides memory
- We’ll briefly discuss some of these things, but for more details, you are encouraged to study existing garbage collectors at the links above
Reference-counted garbage collection
- Reference-counting garbage collection is a simple and lightweight approach to garbage collection that stores alongside each heap-allocated value how many references to it exist
- Implementation is relatively simple in principle:
- Every time a new reference to a location is created (by a variable or by another heap value that points to it), increment its reference count.
- Every time a reference goes out of scope, decrement its reference count.
- Collect the location when its count reaches 0.
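The three steps above can be sketched as follows. This is a simplified model, not how any particular runtime implements it. The record type `cell` and the functions `alloc`, `incr_ref`, and `decr_ref` are hypothetical names, and the increments and decrements are called by hand here, whereas a real compiler would insert them automatically.

```ocaml
(* hypothetical heap cell: a payload plus the number of references to it *)
type cell = { mutable payload : int; mutable count : int; mutable live : bool }

(* Array.init gives every slot its own record (Array.make would share one) *)
let heap : cell array =
  Array.init 4 (fun _ -> { payload = -1; count = 0; live = false })

exception Oom

(* allocate a cell; the returned reference is the first one, so count starts at 1 *)
let alloc (v : int) : int =
  let rec go i =
    if i >= Array.length heap then raise Oom
    else if not heap.(i).live then begin
      heap.(i).payload <- v;
      heap.(i).count <- 1;
      heap.(i).live <- true;
      i
    end
    else go (i + 1)
  in
  go 0

(* a new reference to `l` was created *)
let incr_ref (l : int) : unit = heap.(l).count <- heap.(l).count + 1

(* a reference to `l` went out of scope; reclaim the cell when the count hits zero *)
let decr_ref (l : int) : unit =
  heap.(l).count <- heap.(l).count - 1;
  if heap.(l).count = 0 then heap.(l).live <- false

let a = alloc 10                          (* count = 1 *)
let () = incr_ref a                       (* an alias is created: count = 2 *)
let () = decr_ref a                       (* alias dropped: count = 1 *)
let live_after_one_drop = heap.(a).live
let () = decr_ref a                       (* last reference dropped: reclaimed immediately *)
let live_after_all_drops = heap.(a).live
let b = alloc 20                          (* reuses the cell that was just reclaimed *)
```

Note that reclamation happens inside `decr_ref` itself, with no separate collection pass.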
- Benefits of reference-counting:
- Does not require “stopping the world” to perform mark-and-sweep: allocations can be collected as soon as reference count hits zero.
- Does not require root identification: makes it easy to embed and interact with unsafe languages
- Predictable clean-up time: you can provide (some) guarantees about when collection will be performed.
- Downsides of reference counting:
- Requires constant updating and maintenance of counters: terrible for multithreading, adds runtime overhead.
- Leaks cycles: values that reference each other keep their counts above zero, so they are never collected.
- The Swift language uses reference counting for memory management
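The cycle leak mentioned among the downsides can be reproduced in a few lines. This is a sketch under assumptions: the record type `cell`, the two-cell heap, and the hand-written count updates are all hypothetical and only model the bookkeeping.

```ocaml
(* hypothetical cell: a reference count plus an optional pointer to another cell *)
type cell = { mutable count : int; mutable next : int option; mutable live : bool }

let heap = Array.init 2 (fun _ -> { count = 0; next = None; live = true })

(* drop one reference to `l`; at zero, reclaim the cell and release what it pointed to *)
let rec decr_ref (l : int) : unit =
  let c = heap.(l) in
  c.count <- c.count - 1;
  if c.count = 0 then begin
    c.live <- false;
    match c.next with
    | Some n -> decr_ref n
    | None -> ()
  end

let () =
  (* two variables: x refers to cell 0 and y refers to cell 1 *)
  heap.(0).count <- 1;
  heap.(1).count <- 1;
  (* make the cells point at each other: each gains one more reference *)
  heap.(0).next <- Some 1;
  heap.(1).count <- 2;
  heap.(1).next <- Some 0;
  heap.(0).count <- 2;
  (* both variables go out of scope *)
  decr_ref 0;
  decr_ref 1

(* both cells are unreachable from the program, yet neither count reached zero *)
let leaked = Array.for_all (fun c -> c.live && c.count = 1) heap
```

A tracing collector would reclaim both cells here, since neither is reachable from a root.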
Generational garbage collection
- Generational garbage collection is probably the most widely-used garbage collection scheme
- Used by Python, Java, OCaml, JavaScript, etc.
- Leverages the principle that recent allocations are more likely to be short-lived, so maintains multiple generations: separate heaps that have different expectations around how long allocations live.
- The nursery is where new values are allocated: it is assumed that most values in the nursery will be collected quickly.
- During garbage collection, allocations in the nursery that are still live are moved into a long-lived store that has more infrequent garbage collection cycles.
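The nursery and promotion idea can be sketched with a toy model. Everything here is an assumption made for illustration: the `obj` record, the explicit `reachable` flag (a real collector would compute reachability by tracing from roots), and the names `nursery`, `old_gen`, `allocate`, and `minor_gc`.

```ocaml
(* hypothetical object: an id plus a flag saying whether the program still references it *)
type obj = { id : int; mutable reachable : bool }

let nursery : obj list ref = ref []   (* young generation: collected often *)
let old_gen : obj list ref = ref []   (* old generation: collected rarely *)

(* new objects always start in the nursery *)
let allocate (id : int) : obj =
  let o = { id; reachable = true } in
  nursery := o :: !nursery;
  o

(* minor collection: survivors are promoted and everything else is dropped;
   the old generation is not scanned at all *)
let minor_gc () : unit =
  let survivors = List.filter (fun o -> o.reachable) !nursery in
  old_gen := survivors @ !old_gen;
  nursery := []

let a = allocate 1
let b = allocate 2
let c = allocate 3

(* short-lived values die young: b and c are no longer referenced *)
let () = b.reachable <- false; c.reachable <- false

let () = minor_gc ()

let promoted = List.map (fun o -> o.id) !old_gen
let nursery_size = List.length !nursery
```

In this model a minor collection only looks at the three nursery objects, which is where the expected savings come from when most young objects are already dead.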