In the remaining problems, you will develop a realistic program which implements a text-retrieval system similar to Google.
You will extend a basic program called Minipedia that maintains a database of articles on various subjects. We supply a folder of >2500 articles culled from Wikipedia (first paragraphs only), two test articles Test1 and Test2, and also the code to access this collection. The basic user interaction is provided, as well as a bare-bones implementation of the database as a dumb unordered list. You must extend the functionality in several interesting ways, and also provide a more intelligent data structure using a hash table.
For this problem, you will develop a hash-table based version of the DumbList.java
code.
Step One: Download the Code Distribution and unzip it in the folder where you will create your solution. It has the following files (the files marked with asterisk (*) you should NOT modify in any way):
articles.zip* -- A zipped archive of a folder called "articles" which contains 2511 Wikipedia articles in text form;
Article.java* -- A class to hold an article, including the title and the body of the article;
DatabaseIterator.java* -- A class which reads the text files in the articles folder which returns one article after another when next() is called;
DumbList.java -- An unordered array of articles, which you can add, delete, and lookup.
Minipedia.java -- The client code which provides a user interface and interacts with the other classes to provide basic functionality for adding, deleting, and looking up articles.
First, you should unzip the articles folder and browse around and get a sense for what an article looks like. Then say "wow, that's a lot of articles" (this step optional). Then compile and run the Minipedia.java client, and experiment with the interface, looking up articles (look at the folder to see titles---you will need to get the title exactly right to look it up). Finally, look through the code and understand what the program does.
Note on running Minipedia with the articles folder:
If you're having issues compiling Minipedia.java in OS X, there is a hidden file in the articles folder that must be deleted. Please open up Terminal.app and execute the following commands:
Type in 'cd ' (just two letters and a space) and then drag the articles folder into the Terminal app. Then (only after dragging the folder in) hit enter.
Then type 'rm .DS_Store
' and hit enter.
After this, Minipedia.java should compile.
Step Two: Next you must rewrite the DumbList.java
program as a program ArticleTable.java
which will store articles in a hash table using the technique of separate chaining, instead of using a "dumb list." It should have the same functionality (as specified below). You should initialize the table by simply inserting each article from the list returned by getArticleList(db) in line 65.
Your hash table should be implemented as described in lecture and in the Wiki article. Please use a table size which is a prime number around 2500 (Google "prime numbers" to see a list). You should use the article titles as the keys to insert into the table.
You may use any "good" hash function for String, e.g., the two found here work well:
Your public methods should include the following for inserting, deleting, and retrieval an article based on its exact title:
public void insert(Article a) { // insert a into the table using the title of a as the hash key .... } public void delete(String title) { .... } public Article lookup(String title) { // returns the article with the given title or null if not found ..... }
Hash Table Iterator In addition, you should write your ArticleTable class to include the following public methods for an iterator for the articles.
public void reset() { // initialize the iterator ...... } public boolean hasNext() { ..... } public Article next() { ...... }
These methods provide a way of traversing all the entries in the table, and will be used when searching the content of articles. For example, if you wanted to simply print out all articles from an ArticleTable T
, you could write:
T.reset(); while(T.hasNext()) { Article a = T.next(); System.out.println(a); }
To implement the iterator, you should add a pointer Node next2
to each Node:
private class Node{ String key; Article datum; Node next; // ptr to next node in bucket Node next2; // ptr to global list of all nodes .... constructor here .... }
Then you will have linked lists threaded as buckets with the next
pointer, and a "master list" of all nodes from a global head
pointer to the linked list threaded through the next2
pointers.
When you add a new Node to a linked list in a bucket, if it were added (not a duplicate) then you should add it to the front of the list pointed to by head:
void insert(String k, .....) { if(!member(k)) { A[hash(k)] = insert(A[hash(k), ....); head = new Node(....., head); // add new article to head of list // linked by next2 pointer } // else do nothing } void delete(String k, .....) { A[hash(k)] = delete(A[hash(k), ....); // delete from bucket head = deleteML(head, k); // remove node with k from master list }
Now you just have to write the iterator methods to "chain along" this "master list" of all nodes. Here is an example of an empty hash table (for simplicity, I've used ints for keys), and what happens after you insert a 5:
and here is the same table after adding 33, 45, and 16 (note how the master list is in the reverse order that we added key, since we just shove them in the front of the list, having already checked if they are in the data structure):
You, of course, will write a hash table which has String keys. Write your class in a file ArticleTable.java
and provide a unit test that verifies that it works properly, and can add, delete, and lookup articles, as well as enumerate all articles in the table using the iterator. You MUST test all the features of your code by writing your own unit test, which will be part of the grade. I strongly suggest that you modify the Minipedia code to use this new implementation and test it that works (this is not a requirement).
Problem B.4 Cosine Similarity Calculator (50 points)
In this problem you must write a program file TermFrequencyTable.java
which will store the words from two Strings (i.e., documents) in such a way that you can easily calculate the "cosine" similarity of the two.
Calculating the Cosine Similarity (we will cover this in lecture on 4/5)
To calculate the similarity of the two documents, you must take the union of the two word lists A and B (this is the total vocabulary of the two documents, minus any "black-listed words"---very common words are often not used to calculate similarity measures) and then calculate the new term frequency vector of each document with respect to this total list of words. For example, if document A is "the man with the hat ran up to the man with the dog" and document B is "a man with a hat approached a dog and a man", and the black list is {"the", "a", "to", "with", "up"}, then we have the following word lists which record how many times each word occurred in each document, and then compile a list of the total vocabulary (minus black-listed words):
A: (man,2), (up,1), (hat,1), (dog,1), (ran,1) B: (man,2), (approached,1), (hat,1), (dog,1) Total: approached, dog, hat, ran, man
Now we must calculate the term frequency vector for each test, indicating how many times each word in the total list occurs in each of the two documents (of course, some entries will be 0):
A: [0,1,0,1,2] B: [1,1,1,0,2]
Finally, we calculate the cosine similarity using the formula below and determine it to be 0.7715. This means the documents are very similar, but not identical (with respect to non-black-listed words). The cosine similarity is literally the cosine of the angle between two vectors in N-dimensional space (you'll learn about this in Linear Algebra), so it is always between 0 and 1.
(Note that the indices here start at 1 and end at n, instead of starting at 0 and ending at n-1; this is just a mathematician's way of writing sequences; you should always think in terms of 0 to n-1.) More explanation of the cosine similarity may be found here.
You should write this class using a separate-chaining table with a table size which is a relatively small prime around size 100 (the articles are relatively short and you need to store only the vocabulary of two articles). Your buckets should be composed of Nodes using something similar to the following:
// bucket node private class Node{ String term; int[] termFreq = new int[2]; // this gives the term frequency in each of two documents for this term Node next; ...... }
Your class should have the following two methods:
public void insert(String term, int docNum) {.... } // insert a term from a document docNum (= 0 or 1) into the table; if the term is not already present, add it // to the table with a termFreq of 1 for docNum, i.e., if p is the new node added, then p.termFreq[docNum] = 1. // If the term IS already there, just increment the appropriate termFreq value. double cosineSimilarity() { .... } // return the cosine similarity of the terms for the two documents stored in this table;
The basic idea here is to put all the terms in two strings (documents) into the table, and count how the number of occurrences of each term in each document. The total list of all words is just the total of all words in the hash table. At the end, you may return the cosine similarity of the terms contained in it; if you go through the entire table, the list of all (non-duplicate) terms is the term list, and the list of the term frequencies contained in termFreq[] contains the term frequency vectors.
In the formula above, the sequence Ai is the sequence of integers found in p.termFreq[0]
for all nodes p
, and the sequence Bi is the sequence found in p.termFreq[1]
. You need to calculate the product of these integers (for the numerator) and the squares of each (for the denominator). You must therefore use a method similar to that you developed in the last problem for iterating through all nodes in the table; you do not need to provide an interface to do this, but you will need to iterate through all the entries (two for loops should do it!).
Write your class in a file TermFrequencyTable.java
and provide a unit test that verifies that it works properly. Test three examples, one with the exact same terms in each document (e.g., "A B" and "A A B B", the c.s. should be 1.0), one where there is no common term between the two documents ("A B" and "C D", the c.s. should be 0.0), and also an example which produces a c.s. between 0 and 1, perhaps "CS112 HW10" and "CS112 HW10 HW10" which should be 0.9487.
Problem B.5: MiniGoogle (50 points)
For this problem you must use Minipedia.java from the code distribution (which currently is implemented using the DumbList.java data structure to search for titles, insert, and delete articles) as a template to write a programMiniGoogle.java,
which will allow for searching the text of an article using keywords, using Cosine Similarity (see previous). The steps in this process are as follows:
To find articles most similar to a search phrase P:
Use the ArticleTable iterator methods to enumerate all the articles in the
ArticleTable
;For each article, compare the body of the article with the search phrase P (as if they are two documents), by first preprocessing the strings (described below), then using the TermFrequencyTable class to calculate the Cosine Similarity (calling it once for each article to compare with the search phrase);
Insert any articles with a cosine similarity greater than 0.001 into a binary heap (implementing a max priority queue).
Call getMax() for this heap to print out the top three matching articles; if no articles are found, print out "No matching articles found!" but if only 1 or 2 are found, simply print them out.
In order to do this, you should write the following methods.
private static String preprocess(String s) {.....} // take a string, turn it into all lower case, and remove all characters except for letters, digits, and whitespace // (use Character.isWhitespace(..) and similar methods) private static boolean blacklisted(String s) { .... } // determine if the string s is a member of the blacklist (given at the bottom of this assignment); if so do no process it! private static double getCosineSimilarity(String s, String t) {.... } // take two strings (e.g., the search phrase and the body of an article) and // preprocess each to remove all but letters, digits, and whitespace, and then // use the StringTokenizer class to extract each of the terms; create a TermFrequencyTable and // insert each of the terms which is NOT in the blacklist into the table with its docNum // (String s being document 0 and String t being document 1); // finally extract the cosine similarity and return it. public static String phraseSearch(String phrase, ArticleTable T) { .... } // Take an ArticleTable and search it for articles most similar to // the phrase; return a string response that includes the top three // as shown in the sample session shown below
Note that you will construct a TermFrequencyTable for each article that you compare with the search phrase.
For the heap, you may use the code from the class web site, modifying to use the Article (or your own) class instead of integers. I have included a field cosineSimilarity (and two accessor methods) in the Article class if you want to use these to order the heap. Remember that you should only insert articles with a cosine similarity greater than 0.001 with the search phrase into the heap.
Create your heap as a file MaxHeap.java
.
Grading
We will be grading this using a grading client that builds an ArticleTable as in the main method of MiniGoogle, and then calls phraseSearch(...) for various test cases. . You should develop your code as an interactive program which uses Minipedia as a basis to create the program MiniGoogle, and make sure that it performs as you expect.
What to Submit
Submit your files
hw09.txt
TrieNode.java
WordSearcher.java
Lab10.txt
ArticleTable.java (with your own unit test to verify your program works!)
TermFrequencyTable.java (with your own unit test to verify that it works!)
MiniGoogle.java
MaxHeap.java
Article.java (if you made changes to it)
Any other files you need to run your code.
Do NOT submit any of the other files in the distribution, especially do not submit the large folder of articles. You are free to use any of the code in the book or provided in the labs, as long as you credit it in your documentation. When put together with the files Articles.java, DatabaseIterator.java
, and the folder articles
, it should provide the functionality required by the assignment. We will be running your program on the article data base which is included in the code distribution, and also
The Blacklist
private final String [] blackList = { "the", "of", "and", "a", "to", "in", "is", "you", "that", "it", "he", "was", "for", "on", "are", "as", "with", "his", "they", "i", "at", "be", "this", "have", "from", "or", "one", "had", "by", "word", "but", "not", "what", "all", "were", "we", "when", "your", "can", "said", "there", "use", "an", "each", "which", "she", "do", "how", "their", "if", "will", "up", "other", "about", "out", "many", "then", "them", "these", "so", "some", "her", "would", "make", "like", "him", "into", "time", "has", "look", "two", "more", "write", "go", "see", "number", "no", "way", "could", "people", "my", "than", "first", "water", "been", "call", "who", "oil", "its", "now", "find", "long", "down", "day", "did", "get", "come", "made", "may", "part" };
Sample Session
NOTE: Your calculation of the cosine similarity measure may be slightly different, but should be very close to what I show here.
Read 2259 articles from disk. Created in-memory hash table of articles. Welcome to Mini-Google! ======================= Make a selection from the following options: 1. add a new article 2. remove an article 3. Search by article title 4. Search by phrase (list of keywords) Enter a selection (1-4, or 0 to quit): [4 ] // blue is what I typed in Search by article content ========================= Enter search phrase: [dogs cats ] // The red text here is what would be returned as one String by phraseSearch(....)Top 3 Matches: Match 1 with cosine similarity of 0.514: Cat === Cats, also called domestic cats or house cats (Felis catus), are carnivorous (meat-eating) mammals, of the family Felidae. Cats have been domesticated (tame) for nearly 10,000 years. They are currently the most popular pets in the world. Their origin is probably the African Wildcat Felis silvestris lybica. Cats were probably first kept because they ate mice, and this is still their main 'job' in farms throughout the world. Later they were kept because they are friendly and good companions. A young cat is called a kitten. Cats are sometimes called kitty or pussycat. An entire female cat is a queen, and an entire male cat is a tom. Domestic cats are found in shorthair and longhair breeds. Cats which are not specific breeds can be referred to as 'domestic shorthair' (DSH) or 'domestic longhair' (DLH). The word 'cat' is also used for other felines. Felines are usually classed as either big cats or small cats. The big cats are well known: lions, tigers, leopards, jaguars, pumas, and cheetahs. There are small cats in most parts of the world, such as the lynx in northern Europe. The big cats and wild cats are not tame, and can be very dangerous. Match 2 with cosine similarity of 0.4607: Dog === The dog (Canis lupus familiaris) is a mammal from the family Canidae. It has been domesticated by humans for a long time. It was the first animal ever to be domesticated. Dogs are used by humans for many different things, for work and as pets. They are a popular pet because they are usually playful, friendly, and listen to humans. Thirty million dogs in the United States are registered as pets. Dogs eat both meat and vegetables, often mixed together and sold in stores as dog food. Dogs often have jobs, including as police dogs, army dogs, seeing eye dogs, fire dogs, messenger dogs, hunting dogs and sheepdogs. They are sometimes called "canines" from the Latin word for dog - canis. Sometimes people also use "dog" to describe other canids, such as wolves. A baby dog is called a pup or puppy. A dog is called a puppy until it is about one year old. Dogs are sometimes referred to as "man's best friend" because they are kept as domestic pets and are usually loyal and like being around humans. Match 3 with cosine similarity of 0.2712: Pet === The polymer used for plastic bottles, clothes and other things is at Polyethylene terephtalate A pet is a domesticated animal that lives with people, but is not forced to work and is not eaten, in most instances. In most cases, a pet is kept to entertain people or for companionship. Some pets such as dogs and cats are placed in an animal shelter if there is no one willing to take care of it. If no one adopts it or the pet is too old/sick, the pet may be killed. Dogs, cats, fish, birds are the most common pets in North America. Horses, elephants, oxen, and donkeys are usually made to work, so they are not usually called pets. Some dogs also do work for people, and it was once common for some birds (like falcons and carrier pigeons) to work for humans. Rodents are also very popular pets. The most common are guinea pigs, rabbits, hamsters (especially Syrians and Dwarfs), mice and rats. The cap'tchi tribe in Sudan is known for the ritual burning of domesticated animals that are considered too sacred to eat. Welcome to Mini-Google! ======================= Make a selection from the following options: 1. add a new article 2. remove an article 3. Search by article title 4. Search by phrase (list of keywords) Enter a selection (1-4, or 0 to quit): [DrJava Input Box] Search by article content ========================= Enter search phrase: [CS112 HW10 ] // Again, red is what is returned by phraseSearch(....) Top 2 Matches: Match 1 with cosine similarity of 1.0: Test2 ===== CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 CS112 HW10 Match 2 with cosine similarity of 0.9487: Test1 ===== CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 CS112 HW10 HW10 Welcome to Mini-Google! ======================= Make a selection from the following options: 1. add a new article 2. remove an article 3. Search by article title 4. Search by phrase (list of keywords) Enter a selection (1-4, or 0 to quit): [4 ] Search by article content ========================= Enter search phrase: [zymurgy ]There are no matching articles.Welcome to Mini-Google! ======================= Make a selection from the following options: 1. add a new article 2. remove an article 3. Search by article title 4. Search by phrase (list of keywords) Enter a selection (1-4, or 0 to quit): [0 ] Bye! >