20 Memory Management
We continue on to a very important component of many programming languages: garbage collection. First we will define what garbage is in the context of our CESK machine. Then, we will explore the simple mark-and-sweep algorithm for finding and reclaiming garbage from the CESK machine. Finally, we will discuss the core concepts of more sophisticated garbage collection strategies found in realistic languages and discuss their design tradeoffs.
20.1 What is Memory Management?
So far we have not explored languages that free memory. Obviously this will not do: computers do not have infinite RAM, so we need a way to reclaim allocated memory to use for different things later on.
Allocated memory that is no longer needed by the program is called garbage. There are three general strategies that programming languages deploy to manage garbage:
Manual memory management. This is found in languages like C and C++, where all memory must be manually allocated and freed by the programmer. These languages typically also give programmers low-level access to memory representations (i.e., you can inspect and manipulate raw pointers).
Automated garbage collection. This is found in languages like Java, Python, and Racket. In these languages, a separate routine called a garbage collector is called at runtime to find and clean up memory that is no longer needed by the program.
Static garbage collection. This is found in languages like Rust, where a compile-time analysis is performed to determine when memory can safely be freed.
First we will discuss automated garbage collection, then touch on static garbage collection.
20.2 Mark and Sweep
The core idea of automated garbage collection (GC) is to inspect the state of the program to dynamically locate which portions of memory are or are not garbage. For us, the state of the program is nicely represented by the state of the CESK machine. So, we can use the SafeRefLang to explore garbage collection:
(define-type SafeRefLang (letE [id : Symbol] [e1 : SafeRefLang] [e2 : SafeRefLang]) (identE [id : Symbol]) (boxE [e : SafeRefLang]) (addE [e1 : SafeRefLang]) (unboxE [e : SafeRefLang]) (numE [n : Number]) (locE [l : Number])) (define-type Frame ;;; frame for (let x e1 e2) when e1 is not a value (let1F [E : Env] [id : Symbol] [e : RefLang]) (boxF [E : Env]) (unboxF [E : Env]) ;;; frame for (addE e1 e2) when e1 is not a number (add1E [E : Env] [e2 : RefLang]) ;;; frame for (addE n e2) when n is a number (add2E [E : Env] [n : Number])) (define-type State (state [C : RefLang] [E : Env] [S : (Vectorof Number)] [K : (Listof Frame)]))
Garbage collectors are usually run in a fully-automated fashion: at heuristically-chosen points during the program’s evaluation they will trigger and automatically run. The simplest kind of garbage collection strategy is called mark and sweep. The idea is relatively simple and has two stages:
First, mark all reachable locations in the memory searching the call stack and control register for any locations. The locations that are in-scope within the control register and call stack are called roots.
Then, sweep over the store to delete all allocations that are not reachable from a root.
Implementing this also requires a few other small changes, which you can explore in the implementation here.
20.3 A Taste of Static Garbage Collection
There is an alternative approach to memory management used in Rust that relies (essentially) on scope to determine when it is safe to collect memory. This may seem intuitive at first, but it is surprisingly tricky to get right!
Let’s see some small examples of Rust programs to get a feel for how it deals with memory management. Here is a tiny Rust program that implements one of our favorite examples, the calculator language:
// this #[derive(Debug)] mumbu-jumbo makes it so that a data structure can be |
// easily printed in Rust. Ignore it for now. |
#[derive(Debug)] |
enum Calc { |
Num(i64), |
Add(Box<Calc>, Box<Calc>) |
} |
|
fn interp(c : Calc) -> i64 { |
match c { |
Calc::Num(n) => n, |
Calc::Add(l, r) => { |
let l_v = interp(*l); // Rust uses the `*` operator to unbox / dereference a dynamically allocated value |
let r_v = interp(*r); |
return l_v + r_v |
} |
} |
} |
|
fn main() { |
let my_calc = Calc::Num(10); |
// we use the {:?} to trigger a debug print |
println!("simple interp: {:?}", interp(my_calc)); |
|
let my_calc_2 = Calc::Add(Box::new(Calc::Num(10)), Box::new(Calc::Num(20))); |
println!("simple interp 2: {:?}", interp(my_calc_2)); |
} |
This example shows that Rust has many of the features we’re familiar with in Plait, albeit with syntax that looks quite similar to C, C++, or Java. As usual we define a definitional interpreter by casing on the abstract syntax tree.
The above program compiles and runs with no problem. But, here’s a surprise: what if we modify the main function to call interp twice?
fn main() { |
let my_calc = Calc::Num(10); |
// we use the {:?} to trigger a debug print |
println!("simple interp: {:?}", interp(my_calc)); |
|
// let's try printing the same thing again! |
println!("simple interp: {:?}", interp(my_calc)); |
|
let my_calc_2 = Calc::Add(Box::new(Calc::Num(10)), Box::new(Calc::Num(20))); |
println!("simple interp 2: {:?}", interp(my_calc_2)); |
} |
|
Quite surprisingly, this raises a compile error!
Compiling playground v0.0.1 (/playground) |
error[E0382]: use of moved value: `my_calc` |
--> src/main.rs:26:44 |
| |
21 | let my_calc = Calc::Num(10); |
| ------- move occurs because `my_calc` has type `Calc`, which does not implement the `Copy` trait |
22 | // we use the {:?} to trigger a debug print |
23 | println!("simple interp: {:?}", interp(my_calc)); |
| ------- value moved here |
... |
26 | println!("simple interp: {:?}", interp(my_calc)); |
| ^^^^^^^ value used here after move |
| |
note: consider changing this parameter type in function `interp` to borrow instead if owning the value isn't necessary |
--> src/main.rs:9:15 |
| |
9 | fn interp(c : Calc) -> i64 { |
| ------ ^^^^ this parameter takes ownership of the value |
| | |
| in this function |
Rust provides a very interesting and informative error message here: it suggests that we "change the parameter in interp to borrow if owning the value isn’t necessary". What on earth does that mean? The answer has to do with how Rust performs automatic safe memory management without garbage collection.
How does Rust know when to collect heap-allocated values like those created by Box::new? There are three rules:
Each value in Rust is bound to a particular identifier called an owner.
There can only be one owner for a particular value at a time, and the owner has special privileges for interacting with the data (which we will see).
When the owner goes out of scope, the value will be dropped (which, depending on its type, may mean different things: when a store-allocated value is dropped, that means its memory is freed).
Notice that when a value is dropped in rust is a lexical property of its owner: this means it can be determined by the compiler.
Let’s see what this means by inspecting a smaller example. Consider the following snippet:
fn main() { |
let x = Box::new(10); |
println!("my boxed value: {}", *x); // prints 10 |
// x is freed at the end of the function because it goes out of scope |
} |
In the above example x is the owner of the store-allocated boxed value 10.
20.3.1 Move Semantics
Suppose we want to refer to the same data with more than 1 name. We need to do this all the time, for example when calling functions. What are our options?
1. Copy the data. This can be expensive, especially for large data-structures.
2. Pass a location. This sounds great, but then whose responsibility will it be to collect the memory? Should it be the function that is called, or will it be the responsibility of the caller?
Rust allows us to decide which of the above two strategies to use. Passing the responsibility to collect is called moving: you move the owner from one identifier to another. Once you move a value, you can no longer use it with the old name. This makes sense; the allocated memory may have been collected. See how the compiler raises an error for the following code, and think about where memory is allocated and freed:
fn my_func(y : Box<i64>) -> () { |
println!("gobble! {}", *y); |
// at this point y is freed |
} |
|
fn main() { |
let x = Box::new(10); |
// pass ownership of x to y in my_func |
my_func(x); |
println!("my boxed value: {}", *x); // we tried to access x, but it was freed by `my_func`! This is a memory error. |
} |
If we try to run this program, Rust will complain and fail to compile with a very helpful message. Why did Rust complain? Let’s break down what happened:
Initially x owns the location allocated by Box::new(10)
Then, when we call my_func, we move the ownership from x to y. Now, the memory is collected when the new owner y goes out of scope.
After my_func returns, back in main, we try to use x again. This raises an error, since x refers to a location that has been collected.
Suppose we want to keep x around after calling my_func. There are two things we can do: (1) Make a copy of the heap-allocated value and pass that as an argument to my_func, or (2) Retain ownership of the location. The first approach is easy:
fn my_func(y : Box<i64>) -> () { |
println!("gobble! {}", *y); |
// at this point y is freed |
} |
|
fn main() { |
let x = Box::new(10); |
let x2 = Box::new(*x); // create a copy of the value |
// pass ownership of x to y in my_func |
my_func(x2); |
println!("my boxed value: {}", *x); |
// at this point x is freed |
} |
Implementing option (2) requires a new language feature that is unique to Rust called borrowing.
20.3.2 Borrowing
Borrowing lets you preserve the original owner while having access to the value that it owns. How does it work? We need to change the signature for my_func and write our code slightly differently:
fn my_func(y : &i64) -> () { |
println!("gobble! {}", *y); |
// at this point y is *not freed* since `y` does not own the location |
} |
|
fn main() { |
let x = Box::new(10); |
// this function call **does not** transfer ownership: x still owns the location allocated by Box::new(10) |
my_func(&x); |
println!("my boxed value: {}", *x); |
// at this point x is freed |
} |
The type signature &i64 is a reference to an i64; think of it like a pointer. It can be dereferenced. However, passing a reference does not transfer ownership: this means you can create multiple references to the same value How do we know that the above code is safe? We know that the borrowed context does not outlive the owner’s scope. The Rust compiler performs fairly sophisticated analysis to enforce this liveness requirement
What is an example where a reference does not live long enough? Here is one, which if you’ve programmed in C or C++ will look familiar:
fn broken(x: int) -> &i64 { |
return &x; |
} |
This is an example of returning a reference to a local variable. Since x has type int, it is owned by this function call, meaning that it cannot be referred to outside of the scope of this function. The rust compiler will reject this program with the following error message:
this function's return type contains a borrowed value, but there is no value |
for it to be borrowed from |
So, you can see how static garbage collection requires a fair bit more effort than automated garbage collection, but is not error-prone like manual memory management is: the compiler will ensure that memory unsafe operations are prevented (at least in the portions of rust that do not use unsafe code; google that if you’re interested!).