This project also deals with the Dictionary abstract data type (ADT). The objectives of the project are to:
Program a binary search tree (BST) implementation of the Dictionary ADT.
This project also deals with the Dictionary abstract data type (ADT). The objectives of the project are to:
Program a binary search tree (BST) implementation of the Dictionary ADT.
CSC 316 - Data Structures
Programming Project
This project also deals with the Dictionary abstract data type (ADT). The objectives of the project are to:
Program a binary search tree (BST) implementation of the Dictionary ADT.
Gain experience with recursion by writing recursive procedures to (1) construct a BST from its preorder
traversal, and (2) implement the Dictionary ADT methods.
Conduct experiments to evaluate the performance of the BST under various key frequency distributions.
The Dictionary ADT
As in the previous project, we consider a dictionary data structure to store key-value pairs (k; v), where k is a
9-digit integer key and v is a string associated with key k. We will again assume that key values are unique;
i.e., the dictionary does not contain multiple entries with the same key.
Binary Search Tree Implementation
You will implement the Dictionary ADT as a BST that can be searched by the key value, and you will carry
out an experimental study to determine the eciency of your implementation in terms of the cost of the find,
insert, and remove operations.
For this project, you may not use built-in Java classes. You need to implement your own classes.
A node in your BST implementation will be a compound object consisting of four elds:
1. A key eld holding the (unique) 9-digit integer key upon which a search is based.
2. A value eld of type string holding the information associated with the key.
3. A leftChild eld which is a reference (pointer) to the node in the tree designated as the left child of this
node.
4. A rightChild eld which is a pointer to the node in the tree designated as the right child of this node.
Constructing a BST from its Preorder Traversal
In the rst part of the assignment you will be given the preorder traversal of a BST, and your task will be
to develop and implement a recursive procedure to construct the actual tree. Let us discuss this part in some
detail.
Consider an unknown tree with preorder traversal (the values shown are the keys at the various nodes):
10 5 9 6 8 45 32 24 36 58 49
1We know that the root of the tree is 10, since it is rst in preorder, and that the tree has 11 nodes. So we
need to gure out how to break the sequence
5 9 6 8 45 32 24 36 58 49
into subtrees. Clearly, 5 is the root of one of the subtrees of 10. Since 5 < 10, we conclude that 5 is the
root of the left subtree of 10. Recall now that, according to the preorder traversal, the right subtree of 10 is
visited after the left subtree is visited. Furthermore, all key values in the right subtree of 10 are greater than
10, and greater than any key in the left subtree of 10. So, we conclude, that 45 is the root of the right subtree
of 10 (this involves scanning the keys in the traversal to the right of 10 until we nd a key, if any, larger than
10). We also conclude that the left subtree of 10 consists of four nodes, and its preorder traversal is
5 9 6 8
while its right subtree consists of six nodes, with preorder traversal
45 32 24 36 58 49
Let us consider the left subtree of 10. Node 5 is the root of that subtree. Since the next node in the traversal
is 9, with a larger key than 5, we conclude that 5 has no left subtree, but only a right subtree rooted at 9, with
three nodes and preorder traversal
9 6 8
In eect, we have taken the problem of constructing a tree from its preorder traversal and subdivided it into
a number of similar problems on shorter traversals. When the traversals have length one (1), the answer will be
obvious (i.e., the subtree consists of only one node with no children). Therefore, this approach is the basis of
a recursive procedure to construct the tree. The initial preorder traversal would cause a representation of the
tree shown in Figure 1 to be constructed.
Suggestion: After you read the preorder traversal from the input le, write the following recursive function:
function BuildTree (int size, int start): reference to Root node of the subtree
where size is the number of nodes in the subtree to be built, and start is the place in the preorder traversal
of the whole tree where the preorder traversal of this subtree begins.
10
5 45
32 58 9
49 36 24 6
8
Figure 1: Binary search tree with preorder traversal 10 5 9 6 8 45 32 24 36 58 49
2Binary Search Tree Operations
In the second part of this project you will implement the find, insert, and remove operations on binary search
trees, as discussed in class and in Chapter 10.1 of the textbook. These operations should be implemented
as recursive procedures. Make sure that your programs handle insertions to an initially empty tree, insertions
of keys already in the tree (no action should be taken in this case), deletions from an empty tree, deletions
of keys not in the tree (again, no action should be taken), and other special cases. Finally, you should write
a recursive procedure to perform an inorder traversal of the tree. You will use this procedure to print the
key-value pairs in the tree in increasing order of key.
Input
Your program should accept input in the form of one key-value pair per line, as in the rst project. The rst
input le will consist of three sequences, each sequence terminating with an invalid key value, -1, as shown
below.
571511168 IGBBBFBHF
123686464 EGEGIGDCB
218246560 AGFGECIBC
998763264 EGCDGHIJJ
697724736 GDHECHHJG
... ...
838448960 AGJIEEIDI
849677696 GJGHHGJEI
-1
458349032 EKYTLWOPF
672954823 GEWRTQUEK
... ...
218246560 AGFGECIBC
-1
998763264
458349032
...
123686464
459372419
-1
The rst sequence represents the preorder traversal of a binary search tree. Using this sequence, you will
reconstruct the BST, as we discussed earlier. The second sequence represents a sequence of key-value pairs
to be inserted into the BST constructed using the preorder traversal. Note that this sequence may contain
key-value pairs that already exist in the tree, in which case no action is taken (your program should be able
to handle this special case). Finally, the third sequence represents a sequence of keys to be removed from the
tree obtained after the insertions have been made. This sequence may contain keys that do not appear in the
dictionary (again, your program should handle this case and not modify the tree in this case). You can assume
that the total number of nodes in the tree will not exceed 20,000.
Two les similar to the example shown above are available in the Project2 directory: le small.dat contains
three small sequences (mainly for debugging purposes, so that you can trace all the operations by hand), and
le large.dat contains three large sequences of keys.
The second input le will contain a sequence of keys, one per line, which you will use as a sequence of find
operations to evaluate the cost of searching the BST under dierent key access distributions. Files Uniform.dat,
Zipf.dat, and Geometric.dat in the Project2 directory contain sequences of keys drawn from the respective
distributions we described in the rst project. Note that a certain key may appear multiple times in each
sequence, implying that the corresponding find operation must be performed multiple times. Also, a key in a
3sequence may or may not appear in the dictionary.
Deliverables: Source Code To Be Tested By The TA
You will implement a Java program named BST to perform the operations described above. For your implemen-
tation, you may assume that the number of distinct keys contained in the dictionary will not exceed 20,000,
while the number of keys in an input sequence will not exceed 100,000.
The program you submit will prompt the user for the names of the input and output les. Your program
will read all the key-value pairs from input le input.dat, whose format will be identical to that of the les
small.dat and large.dat in the Project2 directory. The number of key-value pairs in each sequence in the
le will not be provided to you in advance, but will not exceed 100,000.
Your program will build the BST from the rst sequence in the input le (the preorder traversal), and will
then perform the insert and remove operations corresponding to the next two sequences, as explained above.
Once all insertions and deletions have been performed, your program will write to the output the keys of the
tree inorder (recall: this means in increasing order), one key-value pair per line, as follows: the 9-digit integer
key is printed rst, left-justied, followed by four spaces, followed by the string representing the value. The TA
will compile and run your programs, and use the diff utility to check the output. Make sure that you follow
the above format, to spare the TA the time and eort to deal with misaligned les of hundreds of lines. The
output of your programs will be tested for an input sequence that will not be made available in advance.
Deliverables: O-Line Experiments, Graphs and Report
[Note: This part of your code will not be tested by the TA, you will only use it to run experiments o line
and generate performance graphs and answer the questions below.] We are interested in knowing how the BST
implementation performs in terms of the cost of the find operations. The cost of find(k) will be taken as
the number of key comparisons needed to locate k in the dictionary. To compute this cost, you will regard the
sequence of keys in the les Uniform.dat, Zipf.dat, and Geometric.dat as a sequence of find operations for
the corresponding key. You will perform these find operations on the BST resulting after you build the tree
and perform the insert and remove operations in input le large.dat. For each find(k) operation, you will
count the number of comparisons it takes to locate the entry with key k in the dictionary (or determine that
key k is not in the dictionary).
You will regard the rst 1,000 keys in the sequence of find operations as an input size of 1,000, the rst
2,000 keys as an input size of 2,000, and so on. Your program must output the total number of comparisons
(cost) so far after every 1,000-th key. You can then plot the cost against the input size using a utility of your
choice. You should also determine the minimum, maximum, and average cost of any find operation.
Create one graph plotting the cost against the input size for each of the three key access distributions,
using the les Uniform.dat, Zipf.dat, and Geometric.dat as input. Include the graph in your report, and
discuss the behavior of the curves. Is it possible to express the minimum, maximum, and average cost of a find
operation as a function of the number of keys in the dictionary (BST)? Does the key distribution aect the
total cost or the minimum, maximum, and average costs? Why or why not? How do the graphs (specically,
their trend, or behavior) compare to the ones you obtained from the three implementations of Project 1? How
do the average, minimum, and maximum cost of a find operation compare to the corresponding costs of the
previous three implementations? Overall, how do the four Dictionary ADT implementations compare?
Submission
Submit all your programs, as well as your report (in PDF), using the corresponding submission locker on the
course Web page. The deadline for submission is midnight (Eastern Time) on July 06, 2014.
4Grading
Source code: build tree, insert, remove, nd, inorder traversal 90 Points
Graphs and report 10 Points
100 Points
Important reminder: this programming project is an individual, not a group, assignment. Students may
discuss approaches to problems together, but each student should write up and fully understand his/her own
solution. Students may be asked to explain solutions orally if necessary.
5