HW10: Timing Collections

25 points; due Fri 5/2 @ 11am.

Goals

To study the performance characteristics of a handful of Java's built-in ADT implementations.

Your Task

You may optionally do this assignment with a partner. This assignment builds on the ideas of the previous timing assignment to examine four different ADT implementations, as provided by the Java standard library.

We would like to compare the performance of four different ways of implementing the functionality of the Set ADT. In fact, we're going to repurpose other ADTs — specifically, Lists and Dictionaries — to mimic one aspect of the behavior of sets: namely, checking whether a given item is in the set.

While our purpose is to compare the performance of these different implementations, we need some “benchmark” task that will put them to the test. Our chosen task is a mundane one: spell-checking. We'd like to populate our structure with $n$ “valid” words, and then look up each of $k$ words from some test text to see which of them is valid. (We use the term “valid” here just to indicate that they're members of the search structure, nothing more. Later on, you'll see that we need to add nonsense words to the valid set in order to get good timing data.) The basic task for a single implementation is the following:

  1. Read in a (potentially enormous) list of $n$ unique valid words.
  2. Create a “set” containing those valid words.
  3. Read in a separate (enormous) list of $k$ test words, possibly with repeats.
  4. For each of the $k$ test words, check to see whether the set contains that word.

You will time this last step (and only this last step!), and investigate the relationship between $n$, $k$, and search time.
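The steps above, with timing around only the lookup step, might be sketched as follows. Here the word lists are synthesized in code for the sake of a self-contained example; in your real program you would read them from files, and the choice of HashMap as the "set" is just one of the four structures you'll test:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class TimingSketch {
    // Step 4: look up each test word in the "set"; returns how many were found.
    // Only this loop should be inside the timed region.
    public static int countValid(Map<String, Object> validSet, List<String> testWords) {
        int found = 0;
        for (String w : testWords) {
            if (validSet.containsKey(w)) {
                found++;
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Steps 1-3: synthesize n valid words and k test words.
        // In the real program, read these from your input files instead.
        Map<String, Object> validSet = new HashMap<>();
        int n = 10000;
        for (int i = 0; i < n; i++) {
            validSet.put("word" + i, null);   // only the keys matter
        }
        List<String> testWords = new ArrayList<>();
        int k = 5000;
        for (int i = 0; i < k; i++) {
            testWords.add("word" + (i * 3));  // some valid, some not (once i*3 >= n)
        }

        long start = System.nanoTime();       // time ONLY the lookup step
        int found = countValid(validSet, testWords);
        long elapsedNs = System.nanoTime() - start;
        System.out.println("n=" + n + " k=" + k + " HashMap time=" + elapsedNs
                + " ns (found " + found + ")");
    }
}
```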

The Implementations

Again, note that for this assignment you will work with data structures defined in the Java standard library, so there's no additional helper code to download. However, the interfaces are a little different from those we've worked with in class so far, so it'll be important to read the documentation. Here are the four structures you will use to simulate the set of valid strings:

ADT Interface       | Implementation Class     | Notes
List<String>        | ArrayList<String>        | Entries in random order, using linear search to find entries.
List<String>        | ArrayList<String>        | Entries in sorted order, using binary search to find entries.
Map<String, Object> | TreeMap<String, Object>  | Store entries as keys, and use the containsKey() method to find entries.
Map<String, Object> | HashMap<String, Object>  | Store entries as keys, and use the containsKey() method to find entries.
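The four lookup strategies might be sketched like this (the method names are just illustrative; note that Collections.binarySearch is only valid on a list that has already been sorted):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;

public class Lookups {
    // Unsorted ArrayList: contains() performs a linear scan.
    public static boolean linearSearch(List<String> words, String w) {
        return words.contains(w);
    }

    // Sorted ArrayList: Collections.binarySearch requires sorted input;
    // a non-negative return value means the word was found.
    public static boolean binarySearch(List<String> sortedWords, String w) {
        return Collections.binarySearch(sortedWords, w) >= 0;
    }

    // TreeMap or HashMap: entries stored as keys, found via containsKey().
    public static boolean mapSearch(Map<String, Object> dict, String w) {
        return dict.containsKey(w);
    }
}
```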

The dictionary types mentioned above are of course designed to map keys to values. But really, we just care about the keys, and how long it takes to find a particular one. Thus you should set all the values in your dictionaries to null.
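For example, a map-backed "set" can be built like this; containsKey() is the right call because it distinguishes a key whose value is null from a key that's absent (get() returns null in both cases):

```java
import java.util.HashMap;
import java.util.Map;

public class NullValueSet {
    // Build a "set" of words as the keys of a map; all values are null.
    public static Map<String, Object> asMapSet(Iterable<String> words) {
        Map<String, Object> m = new HashMap<>();
        for (String w : words) {
            m.put(w, null);   // value is irrelevant; only the key matters
        }
        return m;
    }
}
```

Swapping in TreeMap for HashMap requires no other changes, since both implement Map<String, Object>.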

What to Hand In

Hand in via Moodle:

  1. A report describing the timing experiments you performed, your timing data, and your interpretation of what your results say about the relationship between $n$, $k$, choice of implementation / search algorithm, and running time. Your report should be a PDF file no longer than 5 pages including charts and graphs, if any.

    You don't need to look too deeply into why the particular structures give you the performance that you see, but you should try to determine, from your observations, what the asymptotic running-time complexity of a single lookup is in each of the four implementations, as well as the complexity of $k$ lookups in a row. (That is, you should conjecture about the “big O” of each of these operations, and provide arguments from the timing data in support of your conjectures.)

  2. Source code you used to collect your timing data. Your source code should be well-commented and provide ample guidance to an outside user/reader about how it's structured.

    Your program, called with no arguments, must run the full suite of timing tests on which your report is based — every combination of $n$ and $k$ that you measured. (The only exception is this: if any tests take more than a minute, please include their invocations but comment them out, so that the grader can see how you got the data but doesn't have to wait for a long time for your program to run.) This means that your “test suite” should be hard-coded into the program, which is sort of contrary to the software design practice that you've learned in 111 and 201 so far. In an instance like this, though, it's okay.

    The only output of your program (when called with no command-line arguments) must be the timing-data summary; for each test, report the values of $n$ and $k$ and which structure you're testing, followed by the running time.
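One way to satisfy both requirements — a hard-coded suite and a uniform output format — is a pair of nested loops over your chosen values of $n$ and $k$. The values below and the runTest helper are placeholders, not prescribed by the assignment:

```java
public class SuiteSketch {
    // Format one line of the required timing-data output.
    public static String formatResult(String structure, int n, int k, long nanos) {
        return "n=" + n + " k=" + k + " structure=" + structure + " time=" + nanos + " ns";
    }

    public static void main(String[] args) {
        int[] ns = {1000, 10000, 100000};   // placeholder n values
        int[] ks = {1000, 10000};           // placeholder k values
        for (int n : ns) {
            for (int k : ks) {
                // runTest(...) is a hypothetical helper that would build the
                // structure with n words, time k lookups, and return nanoseconds.
                // long t = runTest("HashMap", n, k);
                // System.out.println(formatResult("HashMap", n, k, t));
            }
        }
        // A test slower than a minute: keep its invocation, but commented out.
        // System.out.println(formatResult("ArrayList-linear", 1000000, 1000000,
        //         runTest("ArrayList-linear", 1000000, 1000000)));
    }
}
```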

You will be graded on the depth of your investigation and the clarity of your writeup, along with the quality of your code.

Submission and Grading

Create a zip file called hw10.zip containing your writeup (which must be a PDF file, no exceptions!), your source code, and any input files your source code depends on. Submit the zip file on Moodle.

Start early, ask lots of questions, and have fun!

Acknowledgment

This assignment was originally designed by Jeff Ondich. Thanks for sharing, Jeff!