HW18: Word Frequencies

25 points; due Mon 6/2 @ 11am.

Goals

To write and use a Dictionary implementation.

Setup and Requirements

Download Dictionary.java, DictPair.java, and Set.java.

This is optionally a partner assignment.

Your Tasks

  1. Write a hash-table–based Dictionary implementation, using separate chaining, called HashDictionaryImplementation. A few notes:

    1. Two methods defined in the Dictionary interface return sets, so you'll need to have a Set implementation around for that. You have a few options:
      • If you're confident in the BST-based Set implementation you wrote for the last assignment, you may use that. Note that you may run into trouble with the fact that your implementation requires Comparable (ordered) keys, while the Dictionary interface places no such restriction on key types. It is acceptable, if you'd like, for your Dictionary implementation class to require Comparable keys (even though that's not needed for hashing), just so you can deal with your Set implementation.
      • Instead, you may write a simple “wrapper class” that implements the Set interface. Under the hood, your class should just store a single private instance variable: a Java HashSet. All the methods you write will simply be calls to the Java class's methods. Call this class JavaSetWrapper.
      • You may also write a hash-based Set implementation of your own, if you'd like.
    2. You will need to create a private inner class that implements the DictPair interface, in order to return the correct type of object for getEntrySet().
    3. In order to implement separate chaining, you'll also need to write your own private inner class for nodes. This can get really ugly when it comes to dealing with Java's funky handling of generics, so here's what you should do:

      • Declare and initialize your hash table as an array of Objects, even though it's actually an array of nodes.
      • Whenever you read a value out of the array, in order to use it as a node, you must cast it as such (since Java will just think of it as an Object by default). You still need to precede the assignment with @SuppressWarnings("unchecked") in order to avoid compiler warnings.
      • When you want to write a value (a node) to the array, you don't have to do anything special.

      I've written some example code to demonstrate how this should work.

    4. When it comes to resizing your dictionary, your table size must always be prime, and should approximately double every time you resize (or otherwise the amortized cost of resizing is no longer constant). The right way to do this is to just store a precomputed sequence of prime table sizes, worked out in advance by some smart person, as a static constant. Here's the code for that (make sure to attribute your sources!):

          // Nice prime numbers for growing array sizes, from
          // http://planetmath.org/goodhashtableprimes
          // (augmented with a final entry computed by Wolfram Alpha).
          private static final int[] PRIMES = {
              13, 23, 53, 97, 193, 389, 769, 1543, 3079, 6151, 12289, 24593, 49157,
              98317, 196613, 393241, 786433, 1572869, 3145739, 6291469, 12582917,
              25165843, 50331653, 100663319, 201326611, 402653189, 805306457,
              1610612741, 3221225473 };
      

      Note that the final value (which I computed myself) is almost equal to $1.5 \cdot 2^{31}$. This is ¾ of the way to $2^{32} - 1$, which is the maximum value of an unsigned integer, and therefore the maximum possible size of an array on a 32-bit machine architecture. On a 32-bit architecture, Java uses at least 64 bits to store each Integer (32 for the value, and 32 more for the reference to the value), so you'd run out of memory before you could fill an array of size 3221225473. Therefore this value is definitely big enough.

  2. Write a main method in HashDictionaryImplementation to test the functionality of your Dictionary implementation. In particular, make sure that you test edge cases, and that you add enough data at some point to force your hash table to grow.
  3. Use your Dictionary to analyze the text of a document.

    Write a separate class, WordFreqs, whose main method takes a filename from the command line: a text file to analyze. Read through the file word-by-word, ignoring case, punctuation, etc. just like in the last assignment (so “Don't”, for example, would be treated as two words: “don” and “t”). Use your dictionary to keep track of the number of occurrences of each unique word in the file.

    Then print out the top 20 most common words in the file, each paired with its normalized frequency: the proportion of all the words in the file that were occurrences of this particular word. (So, for example, if “don” shows up 5 times in a file with 500 words, its normalized frequency is 0.01.)

    Make this a polished program: it should fail gracefully and provide helpful feedback to the user where necessary.

Submission and Grading

Zip up your HashDictionaryImplementation.java and WordFreqs.java (and any additional ADT or implementation files needed for either of these) into hw18.zip and submit it on Moodle.