Files and Text

Restrictions

You must work on this homework with your assigned partner.
You may use either ISL or ISL with Lambda as the language for this assignment.
You will need to load the following libraries and nothing else:
```
(require 2htdp/batch-io)
(require "io-extra.rkt")
```
The “io-extra.rkt” library is a file that you need to download (click here) and place in the same directory as your homework file.

You must not use other libraries without explicit permission from the instructor.

Introduction

Project Gutenberg is the oldest, free digital library, and pre-dates the modern Internet. In this assignment, we will work with the Complete Works of William Shakespeare, downloaded from Project Gutenberg. You should download and save this file (which is named [pg100.txt], i.e., Project Gutenberg Book #100). We have made some minor changes to this file to simplify the assignment, so do not download it directly from the Project Gutenberg website.

We strongly recommend reading Chapter 6 of Speech and Language Processing by Martin and Jurafsky. The book is freely available online. At the conclusion of this assignment, we will build something similar to a term-document matrix (Section 6.3) and calculate cosine-similarity (Section 6.4). You can skip Section 6.5 and beyond.

Homework 4 (Part 1): Several Helper Functions

Complete the following function designs. We have provided signatures, purpose statements, and some test cases. You should test all functions.

;; take-n : Int [List-of X] -> [List-of X]
;; (take-n n alist) produces the first n elements of the list in the order they appear.
;; n must be non-negative, and it produces the whole list if n exceeds the length of the
;; list.

;; drop-n : Int [List-of X] -> [List-of X]
;; (drop-n n alist) produces alist but with the first n elements removed.
;; n must be non-negative, and drop-n produces the empty list if n exceeds the length of the
;; list.

;; take-while : (X -> Bool) [List-of X] -> [List-of X]
;; (take-while pred alist) produces the longest prefix of alist, where all
;; elements satisfy pred.

;; drop-while : (X -> Bool) [List-of X] -> [List-of X]
;; (drop-while pred alist) produces a suffix of alist that we get by dropping
;; the longest prefix of alist that satisfies pred.

;; group-by : (X X -> Boolean) [List-of X] -> [List-of [NE-List-of X]]
;; (group-by same-group? alist) splits alist into groups by comparing
;; consecutive elements to check if they should be in the same group.
   
(check-expect (group-by = (list 10 10 20 20 20 10))
              (list (list 10 10) (list 20 20 20) (list 10)))

Submit Homework 4 before proceeding to Homework 5.

Homework 5 (Part 2): Text Processing

Reading and Writing Text

We will now design several functions to split pg100.txt into several files – one file per work. In the following functions:

You must not use direct recursion / the list template. You may use builtin list abstractions or the list abstractions you designed in Part 1. You may also define helper functions, but they must follow the same restrictions.
You must define small, hand-written test cases. To do so, you will need to craft small example inputs that are similar in shape to the actual input file. One way to do this is to take the input file and delete most plays, and shorten each play to a single line.

Design the following functions:

;; table-of-contents : [List-of String] -> [List-of String]
;; Extracts the table of contents from the lines of text of a book, formatted
;; similarly to pg100.txt.

(define-struct work [title contents])
;; A Work is a (make-work String [List-of String])
;; A (make-work title lines) represents a work of literature with the given
;; title and lines of text.
   
;; work-template : Work -> ?
(define (work-template w)
  (... (work-title w) ... (work-contents w) ...))

(define EX-WORK-1 (make-work "Hamlet" (list "To be or not to be.")))
(define EX-WORK-2 (make-work "Romeo and Juliet" (list "What's in a name?")))
   
;; extract-works : [List-of String] [List-of String] -> [List-of Work]
;; (extract-works toc lines) produces the works in a collection, given the list
;; of works extracted from the table of contents. The lines are formatted 
;; similar to pg100.txt.

```
;; extract-works-to-files : String -> [List-of String]
;; Given the path to a collected works, writes several files -- one for each
;; work -- and produces the list of file names.
```
You do not have to write unit tests for this function.

Hint: The write-lines function produces the name of the file it writes. We strongly recommend appending .txt to the file name so that your operating system knows to open it with a text editor (e.g., Notepad or TextEdit).

You can use the following cleaned plays to move on with the Homework. We will still be grading you on the above functions.

Text Analysis Functions

We will now design several functions to analyze works for similarity. At this point, there are no restrictions on what you can do, but we have some strong recommendations written below.

;; top-n-words : Int [List-of String] -> [List-of String]
;; (top-n-words n word-list) produces the n most frequent words in the given
;; word-list

We strongly recommend not using direct recursion. Use the list abstractions you wrote earlier and builtin list abstractions. Sorting is helpful.

;; word-frequency : [List-of String] [List-of String] -> [List-of Int]
;; (word-frequency top-words word-list) produces a list that counts
;; the number of occurrences of each word in top-words in word-list.
;; The list.

Hint: It will help to write a helper function that performs a parallel traversal between top-words and another list. The latter list cannot be word-list, but something derived from word-list.

;; document-distance : [List-of String] [List-of String] [List-of String] -> Number
;; (document-distance top-words word-list-1 word-list-2) calculates the
;; distance between the word frequency lists of the given files.

When the word frequency lists are (list x_0 x_1 ... x_n) and (list y_0 y_1 ... y_n), the distance is defined as (sqrt (+ (expt (- x_0 y_0) 2) (expt (- x_1 y_1) 2) ... (expt (- x_n y_n) 2)))

Given the functions above, we can use the following functions to work directly with files:

;; top-n-words/file : Int PathString -> [List-of String]
;; Applies top-n-words to the contents of a file.
(define (top-n-words/file n path)
  (top-n-words n (read-words path)))

;; document-distance/file : [List-of String] PathString PathString -> Number
;; Uses document-distance and word-frequency to calculate the distance between
;; two files.
(define (document-distance/file top-words path-1 path-2)
  (document-distance top-words 
    (read-words path-1)
    (read-words path-2)))

Analyzing Document Similarity

Shakespeare’s plays are typically classified as comedies, histories, or tragedies, e.g., as listed on Wikipedia. We might expect that histories will be most similar to other histories, etc.

Select a play from each category and calculate the distance between that play and all others (you may include the play itself). Produce the list of works sorted by distance order.
To solve (1), you have to pick a value for n to calculate distances. Try at least one other value.
Optional, recommended, not graded: Examine the list that you get from top-n-words. You may get better performance if you exclude some very common words (e.g., articles and prepositions).
Optional, not graded: Try this on a different collection of works.