HW10: Timing Collections

25 points; due Fri 5/2 @ 11am.

Goals

To study the performance characteristics of a handful of Java's built-in ADT implementations.

Your Task

You may optionally do this assignment with a partner. This assignment builds on the ideas of the previous timing assignment to examine four different ADT implementations, as provided by the Java standard library.

We would like to compare the performance of four different ways of implementing the functionality of the Set ADT. In fact, we're going to repurpose other ADTs — specifically, Lists and Dictionaries — to mimic one aspect of the behavior of sets: namely, checking whether a given item is in the set.

While our purpose is to compare the performance of these different implementations, we need some “benchmark” task that will put them to the test. Our chosen task is a mundane one: spell-checking. We'd like to populate our structure with $n$ “valid” words, and then look up each of $k$ words from some test text to see which of them is valid. (We use the term “valid” here just to indicate that they're members of the search structure, nothing more. Later on, you'll see that we need to add nonsense words to the valid set in order to get good timing data.) The basic task for a single implementation is the following:

  1. Read in a (potentially enormous) list of $n$ unique valid words.
  2. Create a “set” containing those valid words.
  3. Read in a separate (enormous) list of $k$ test words, possibly with repeats.
  4. For each of the $k$ test words, check to see whether the set contains that word.

You will time this last step (and only this last step!), and investigate the relationship between $n$, $k$, and search time.
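The steps above, with timing around only the lookup step, might be sketched as follows. Here the word lists are synthesized in code for the sake of a self-contained example; in your real program you would read them from files, and the choice of HashMap as the "set" is just one of the four structures you'll test:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimingSketch {
    // Step 4: look up each test word in the "set"; returns how many were found.
    // Only this loop should be inside the timed region.
    public static int countValid(Map<String, Object> validSet, List<String> testWords) {
        int found = 0;
        for (String w : testWords) {
            if (validSet.containsKey(w)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Steps 1-3: synthesize n valid words and k test words.
        // In the real program, read these from your input files instead.
        Map<String, Object> validSet = new HashMap<>();
        int n = 10000;
        for (int i = 0; i < n; i++) {
            validSet.put("word" + i, null);   // only the keys matter
        }
        List<String> testWords = new ArrayList<>();
        int k = 5000;
        for (int i = 0; i < k; i++) {
            testWords.add("word" + (i * 3));  // some valid, some not (once i*3 >= n)
        }

        long start = System.nanoTime();       // time ONLY the lookup step
        int found = countValid(validSet, testWords);
        long elapsedNs = System.nanoTime() - start;
        System.out.println("n=" + n + " k=" + k + " HashMap time=" + elapsedNs
                + " ns (found " + found + ")");
    }
}
```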

The Implementations

Again, note that for this assignment you will work with data structures defined in the Java standard library, so there's no additional helper code to download. However, the interfaces are a little different from those we've worked with in class so far, so it'll be important to read the documentation. Here are the four structures you will use to simulate the set of valid strings:

ADT Interface       | Implementation Class     | Notes
List<String>        | ArrayList<String>        | Entries in random order, using linear search to find entries.
List<String>        | ArrayList<String>        | Entries in sorted order, using binary search to find entries.
Map<String, Object> | TreeMap<String, Object>  | Store entries as keys, and use the containsKey() method to find entries.
Map<String, Object> | HashMap<String, Object>  | Store entries as keys, and use the containsKey() method to find entries.
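The four lookup strategies might be sketched like this (the method names are just illustrative; note that Collections.binarySearch is only valid on a list that has already been sorted):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class Lookups {
    // Unsorted ArrayList: contains() performs a linear scan.
    public static boolean linearSearch(List<String> words, String w) {
        return words.contains(w);
    }

    // Sorted ArrayList: Collections.binarySearch requires sorted input;
    // a non-negative return value means the word was found.
    public static boolean binarySearch(List<String> sortedWords, String w) {
        return Collections.binarySearch(sortedWords, w) >= 0;
    }

    // TreeMap or HashMap: entries stored as keys, found via containsKey().
    public static boolean mapSearch(Map<String, Object> dict, String w) {
        return dict.containsKey(w);
    }
}
```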

The dictionary types mentioned above are of course designed to map keys to values. But really, we just care about the keys, and how long it takes to find a particular one. Thus you should set all the values in your dictionaries to null.
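For example, a map-backed "set" can be built like this; containsKey() is the right call because it distinguishes a key whose value is null from a key that's absent (get() returns null in both cases):

```java
import java.util.HashMap;
import java.util.Map;

public class NullValueSet {
    // Build a "set" of words as the keys of a map; all values are null.
    public static Map<String, Object> asMapSet(Iterable<String> words) {
        Map<String, Object> m = new HashMap<>();
        for (String w : words) {
            m.put(w, null);   // value is irrelevant; only the key matters
        }
        return m;
    }
}
```

Swapping in TreeMap for HashMap requires no other changes, since both implement Map<String, Object>.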

What to Hand In

Hand in via Moodle:

  1. A report describing the timing experiments you performed, your timing data, and your interpretation of what your results say about the relationship between $n$, $k$, choice of implementation / search algorithm, and running time. Your report should be a PDF file no longer than 5 pages including charts and graphs, if any.

    You don't need to look too deeply into why the particular structures give you the performance that you see, but you should try to determine, from your observations, what the asymptotic running-time complexity of a single lookup is in each of the four implementations, as well as the complexity of $k$ lookups in a row. (That is, you should conjecture about the “big O” of each of these operations, and provide arguments from the timing data in support of your conjectures.)

  2. Source code you used to collect your timing data. Your source code should be well-commented and provide ample guidance to an outside user/reader about how it's structured.

    Your program, called with no arguments, must run the full suite of timing tests on which your report is based — every combination of $n$ and $k$ that you measured. (The only exception is this: if any tests take more than a minute, please include their invocations but comment them out, so that the grader can see how you got the data but doesn't have to wait for a long time for your program to run.) This means that your “test suite” should be hard-coded into the program, which is sort of contrary to the software design practice that you've learned in 111 and 201 so far. In an instance like this, though, it's okay.

    The only output of your program (when called with no command-line arguments) must be the timing-data summary; for each test, report the values of $n$ and $k$ and which structure you're testing, followed by the running time.
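One way to satisfy both requirements — a hard-coded suite and a uniform output format — is a pair of nested loops over your chosen values of $n$ and $k$. The values below and the runTest helper are placeholders, not prescribed by the assignment:

```java
public class SuiteSketch {
    // Format one line of the required timing-data output.
    public static String formatResult(String structure, int n, int k, long nanos) {
        return "n=" + n + " k=" + k + " structure=" + structure + " time=" + nanos + " ns";
    }

    public static void main(String[] args) {
        int[] ns = {1000, 10000, 100000};   // placeholder n values
        int[] ks = {1000, 10000};           // placeholder k values
        for (int n : ns) {
            for (int k : ks) {
                // runTest(...) is a hypothetical helper that would build the
                // structure with n words, time k lookups, and return nanoseconds.
                // long t = runTest("HashMap", n, k);
                // System.out.println(formatResult("HashMap", n, k, t));
            }
        }
        // A test slower than a minute: keep its invocation, but commented out.
        // System.out.println(formatResult("ArrayList-linear", 1000000, 1000000,
        //         runTest("ArrayList-linear", 1000000, 1000000)));
    }
}
```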

You will be graded on the depth of your investigation and the clarity of your writeup, along with the quality of your code.

Submission and Grading

Create a zip file called hw10.zip containing your writeup (which must be a PDF file, no exceptions!), your source code, and any input files your source code depends on. Submit the zip file on Moodle.

Start early, ask lots of questions, and have fun!

Acknowledgment

This assignment was originally designed by Jeff Ondich. Thanks for sharing, Jeff!