Presented by: Team 20
Don Han, g9dhan@cdf
Jeff Xing, g0xinzhi@cdf
Carol Cheung, g9carolc@cdf
Presented to: Professor
Dave Warman
CSC 408
Software Engineering
Department Of Computer Science
University of Toronto
Presented Date: Feb 27, 2001
TABLE OF CONTENTS
Introduction
......................................................................................................
Class BigStringdhan
......................................................................................................
Class BTreeString
......................................................................................................
Class LeafNode
......................................................................................................
Class ParNode
......................................................................................................
Summary
......................................................................................................
Brief Description on Testing ......................................................................................................
Appendix
......................................................................................................
INTRODUCTION
The implementation of phaseB
contains four classes: BigString, BTreeString, LeafNode and Parnode. It
has also several test drivers, but is not included in documentation part
(see test part for details of the test drivers).
Class BigString
will manipulate the following operations specfified by the BigStringinterface:
constructors, length(), toString(), charAt(), concat(), print(). It works
by communicating with class BTreeString.
Class BTreeString
is
the core of the implementation. It supports all the operations requested
by class BigString. It uses a B+ tree data structure, and
stores the long string only in the leaves. It works by communicating with
classes LeafNode and ParNode.
Class LeafNode supports
the those operations requested by class BTreeString that
are related to the leaves. A leaf in Class LeafNode holds
a key and a segment of long string.
Class ParNode supports
the those operations requested by class BTreeString that
are related to the parent nodes. A parent node in Class ParNode
holds three arrays: keys and pointers to next level (either parent nodes
or leaves).
Class: BIGSTRING
The Structure of a BigString
is as follows: a BigString contains a reference to a B+ tree which stores
the first part of the string, and a vector to store other BigStrings which
resulted from possible concatenations.
Features of a BigString:
'original' BigStrings only have a B+ tree storing their string; the vector
is empty because they are not the result of concatenations or substrings
of other BigStrings. Although, substrings of other BigStrings may also
have empty vectors, if their substring is entirely within a particular
tree.
the vector of a BigString will only contain other BigStrings with empty
vectors (contain only B+ trees, essentially). There will be no recursive
references to BigStrings (ie. in this BigString's vector, it contains a
reference to a BigString, which in it's vector, contains another reference
to another BigString, which, ... etc.) So, concatenations may involve copying
of the contents of vectors of BigStrings, instead of just storing a reference
to the BigString parameter into the new BigString's vector.
BigStrings resulting from concatenations and requests for substrings do
not involve copying the contents of the string. Instead, they reference
parts of the BigStrings that were used as arguments to creating the new
BigString, with some data variables to keep track of the attributes of
the BigString.
Instance variables:
bt -- reference to the tree of this BigString
stringLength -- the length of the BigString; that is, the
number of characters in the string value of this BigString.
start -- the index of where this BigString begins; It is
0 from BigStrings that are not substrings. It may be > 0 if the > BigString
is a substring (since a substring references parts of other BigStrings).
vect -- vector to hold BigStrings that are a part of this
BigString, as a result of concatenations.
treeLimit -- index of the final character in the B+ tree
that this BigString reaches. This is needed for substrings, because some
substrings do not use the entire string held in the tree. The following
condition always holds: 0 <= start <= treeLimit <=
bt.getStringLength().
Constructors:
BigString() -- initializes a new BigString with the empty
string as its string value.
BigString(BigString bigValue)-- initializes a new BigString
so that it will reference the same tree as bigValue does. A NullPointerException
will be thrown if bigValue is null.
BigString(String value)-- initializes a new BigString so
that its string value will be that of value.
BigString(char [] value) -- initializes a new BigString so
that its string value will be the string value of value.
BigString(StringBuffer value) -- initializes a new BigString
so that its string value will be the string value of value.
Operations:
length() -- returns the length of this BigString. Everytime
a new BigString is created, the length of the BigString is stored, so there
is no need to recalculate the length. It just returns the instance variable
stringLength.
concat(BigString bigString) -- returns a reference to a BigString,
which is the result of appending bigString to this BigString. If
bigString
is null, a NullPointerException is thrown.
Algorithm:
BigString b; // the BigString to return
if(bigString has an empty vector)
set b's instance variables;
set b's tree to be this
BigString's tree;
copy the contents of this
BigString's vector into b's vector;
add bigString to
b's vector;
else
set b's instance variables;
set b's tree to be this
BigString's tree;
copy the contents of this
BigString's vector into b's vector;
create a new BigString temp
which has an empty vector and whose tree references bigString's
tree;
add temp to b's vector
copy contents of bigString's
vector into b's vector
end if
return b;
charAt(long index) -- gets the character located at
index,
provided that 0<= index <= stringLength. Otherwise,
a StringIndexOutOfBoundsException is thrown.
Algorithm:
if(start is non-zero) set the new start to ns;
if(the character we are looking for is within the
tree) return the character in the tree with that index;
else // we have to look at the
contents of the vector
loop
if(the length of the next BigString in the vector < ns)
decrement the length of that BigString from ns;
go to the next element in the vector;
else break
end loop
return the character in
the particular element in the vector that is located at ns;
Decrementing is needed: for example if you want to locate character
250 in the following BigString:
tree:
starting index = 0, treeLimit = 99
Vector:
element 0 - starting index = 0, length = 100
element 1 - starting index = 0, length = 100
then the character we want is the character at index 50 (250 - 100
- 100) in element 1.
toString()-- return the substring from start to start
+ stringLength
print(OutputStream os) -- print the characters onto
os
from start to start + stringLength. An IOException will be
thrown if there are any errors with os.
Operations for phase C:
substring(long begin) -- returns a BigString, with string
value as the substring of this BigString, starting at begin and
ending where this BigString ends. A StringIndexOutOfBoundsException will
be thrown if the condition 0<= begin <= stringLength
is not met.
Algorithm:
BigString b;
if(begin indexes a location in this BigString's
tree)
set start of b to begin;
set b's length to be this
BigString's length - begin;
set b's treeLimit
to be this BigString's treeLimit;
set b's tree to reference
this BigString's tree;
copy the contents of this
BigString into b's vector;
return b;
else // look at the contents of
the vector
locate the element in this
BigString's vector that the substring should start at;
let that be element k;
set b's instance variables;
set b's tree to be element
k's tree;
copy the contents of this
BigString's vector into b's vector;
return b;
end
substring(long begin, long end) -- returns a BigString with
string value as the substring of this BigString, starting at begin
and ending at end. A StringIndexOutOfBoundsException will be thrown
if the condition 0 <= begin <= end <= stringLength
is not met.
Algorithm:
BigString b;
set b's length to end - begin if(begin indexes a location in this
BigString's tree)
set start of b to begin;
set b's tree to reference
this BigString's tree
if(end - 1
<= this BigString's treeLimit)
set b's treeLimit = end;
return b;
else
set b's treeLimit to be this BigString's treeLimit ;
let k be the last element in this BigString's vector that the substring
needs;
copy all elements in this BigString's vector up to k into b's vector;
return b;
end if
else
let k be the element in
this BigString's vector that contains the start of the substring;
let j be the element in
this BigString's vector that contains the end of the substring;
set b's tree to element
k's tree;
set the instance variables
of b;
copy all elements in this
BigString from elements k+1 to element j, inclusive;
return b;
end
Optimize() -- improves the efficiency of BigString. Implementation
is up to you. One possibility: "undoing" all the concatenations in the
vector, and putting all the contents of the BigString into the tree.
Algorithm:
get the string value of this BigString;
create a new tree using that string value;
empty the vector;
update instance variables;
equals(Object obj) -- returns true iff obj is a non-null
BigString which has the same string value, case sensitive, that this BigString
has, false otherwise.
compareTo(BigString bigString) -- returns: -1 if this BigString
is "smaller" than bigString. Smaller meaning lexicographically precedes;
0 if this BigString is "equal" to bigString. Equal meaning lexicographically
equivalent; 1 otherwise (lexicographically follows).
compareToIgnoreCase(BigString bigString) -- same as
compareTo,
except in the comparisons, the case of the character, if any, will be ignored.
equalsIgnoreCase(BigString bigString) -- returns true if
bigString
is non-null and has the same string value, case insensitive, that this
BigString has, false otherwise.
indexOf(String string)-- returns the index
(long int) of the character of the first location where string is
found in this BigString, -1 if not found. A NullPointerException is thrown
if
string is null.
indexOf(BigString bigString) -- returns the index (long
int) of the first occurence of bigString in this BigString, -1 if
bigString
is not found in this BigString. A NullPointerException is thrown if bigString
is null.
indexOf(char ch) -- returns the index (long int) of the first
occurence the character ch in this BigString, -1 if this character
is not found.
Helper methods (other than the ones specified in handout):
substr(long a, long b) -- returns a String that is the value
between indices [a,b) of this BigString; inclusive on a,
exclusive on b. Note that substr is a helper method of toString().
Methods of BTreeString class called within this method may throw StringIndexOutOfBoundsException.
Algorithm:
String s = "";
if(the my B+ tree contains all the characters from
index a to index b)
return the substring from
the B+ tree in the range [a,b)
else if(the substring to return contains some tail
end of the tree)
concatenate that end of
the tree to s;
keep track of how many characters
are in s so far.
if(the number of characters in s so far < (b-a))
go through the vector, starting
at element 0,
looking at the BigStrings
and concatenating enough substrings to s so that s is of length (b-a)
output(long a, long b, OutputStream os) -- prints the characters
from [a,b) of this BigString onto the stream os; inclusive
on a, exclusive on b. Note that output is a helper method
of print. An IOException will be thrown if there are any errors with os.
Also, a StringIndexOutOfBoundsException is thrown if the condition start
<= a <= b <= start + stringLength
is not met.
Algorithm:
if(the my B+ tree contains all the characters from index
a to index b)
output those characters
in that range from the B+ tree onto the specified OutputStream
return
else if(some of the characters are contained
in the tail end of the tree)
output those characters
onto specified OutputStream
keep track of how many characters
are outputed so far.
if(the number of characters in s so far < (b-a))
go through the vector, starting
at element 0,
looking at the BigStrings
and output putting characters from the BigStrings so that we have outputed
(b-a)
characters
getTreeLim() -- returns the value of the treeLimit
of this BigString
getVect() -- returns a reference to this BigString's vector
getStart() -- returns the value of the starting index of
this BigString
getMyBTree() -- returns a reference to this BigStrings's
B+ tree
Private helper methods:
addElement(BigString bStr) -- appends bStr to the
end of this BigString's vector.
setMyContents(BigString b1, BigString b2) -- method to have
parts of this BigString reference parts of BigStrings b1 and b2.
Class: BTREESTRING
The fundamental data structure
of the BigString library is the B+ tree. It is implemented in class BTreeString.
The leaves of the B+ tree stores substrings (held in character arrays)
of the original string. That is, the original string is broken up into
substrings. Each piece is stored in a leaf. The maximum possible length
of a substring in a leaf is fixed in SMALL_STRING_LENGTH (SSL). The leaves
of the tree are also held together in a linked list. This linked structure
will increase the performance significantly when we want retrieve (or partially
retrieve) the tree (e.g., the toString() or print() operation) the tree,
because we can simply find the beginning leaf, and follow the linked leaves
until the ending leaf.
All leaves in the linked
list will contain strings with length SSL, except the last whose length
is less than or equal SSL. The internal nodes of the B+ tree contain an
array of pointers to children nodes. The maximum number of children an
internal has is also fixed. This is referred to as the ORDER. The height
of the tree is dependent on the number of characters (n), the number of
children (ORDER) and the length of a substring (SSL). That is
If the height of the tree is 1 (there is only one leaf node), then
the root of the B+ tree is a LeafNode. Otherwise, the root of the B+ tree
is a ParNode.
The tree will always be built so that there is as little space wasted
as possible. And the tree in this design is not as complicated as typical
B+
trees because, from the specs of BigString interface, only append is involved
during the building of the tree and there are no insertions or deletions
after the tree is built.
The tree is built as follows:
if there is room in the existing tree, just append the new leaf to the
end of the linked list of leaves and make appropriate updates in all the
ancestors so that the new leaf can be accessed. That is, the immediate
parent of the leaf will have to have its pointers to children updated,
as well as the keys keeping track of the leaves. All other ancestors will
have to have their keys updated only. Otherwise, the existing tree is full,
so the existing tree will become the leftmost subtree of a new root. The
right subtree of the new root is the tree of ancestors to the new leaf.
Also, the new leaf will be appended to the end of the linked list of leaves.
The building of string to a B+ tree is shown in Figure 1.
Trees (and BigStrings) will
be stored totally in memory. But the OS of UNIX can handle this even if
the long string does not fit the main memory, because current implementation
(linked structure) does not use contiguous memory.
Constructor:
BTreeString(String string) -- constructs a B+ tree to store
string.
Instance variables:
stringLength-- the length of string
height -- the height of the tree
leafRoot -- the root of the tree which has only one node
(leaf)
parRoot -- the root of the tree which has more than one nodes
isRootAParNode -- boolean variable, true if the root is a
ParNode
Public Methods:
getString(long beginIndex, long endIndex) -- returns a string
that is stored in the tree, containing the characters from beginIndex
to endIndex-1
printTree(long beginIndex, long endIndex, OutputStream os)
- outputs the string, that is stored in the tree containing the characters
from beginIndex to endIndex - 1, one leaf by one to stdout.
An IOException will be thrown if there are any errors with os
getChar(long index)- returns the character of the string
stored in the tree at position index, beginning with 0.
getStringLength() - returns the length of this string value
of this tree.
Private Methods:
searchLeaf(long index) - returns the leaf containing the
characters at index.
findLeaf(long beginKey) - returns a path (vector) that holds
all the nodes whose keys are equal to beginKey, starting from the
root.
updateKey(Vector v, int curLevel, long key) - updates the
last key of the most right node of each level (held in v) to be
key,
beginning from curLevel and ending at the root.
updateTree(Vector v, LeafNode leaf, ParNode par, int curLevel, long
key) - update the tree when a new leaf is appended to the tree.
Algorithms:
BTreeString(string) feeds the first characters of string to a
char array A with length min{SSL, stringLength)
create a new leaf with (A, 0) and let the leaf be
the root
loop until the end of string feeds the string to a charArray
A and get a key k
append(A, k);
Append(A, k) create a new leaf L (calls LeafNode(A, k))
find the path (a vector) from root to the last leaf
M (call LeafNode(A, k))
let M point to L (calls M.setNext(L))
updateTree(path, null, L, height-1, k)
updateTree(path, P, L, level, k) parameters check, ParNode P and leaf L can not be
both null or both not-null
if current root is a leaf
// height = 1
create a new ParNode and
let it to be the new root R (calls ParNode())
let R point to both the
old root and leaf L (calls R.setLeafNode(leaf, key) twice)
set the leafRoot be null,
increment the height and return
else if the current level is one level below the
root
// level = 1
if R still has at least
one more slot
if L is not null
then let R point to L (calls R.setLeafNode(k, L))
else let R point to par (call R.setParNode(k, par))
return
else // we need
a new root
create a new root R? and a new Parnode Q
if L is not null
then let R point to L (calls Q.setLeafNode(k, L))
else let Q point to P (call Q.setParNode(k, P))
let R? point to R and Q, set R? to be the root
increment the height and return
else if L is not null and L?s parent Par is not
the root
// level = height-1 AND level > 1
if Par still has at least
one more slot
then let Par point to L
updates only the last key
of Par?s parent (calls updateKey(path, level-1, key),
this method works recursively)
else
creates a new ParNode Q and let Q point to L
recursively calls updateTree(path, null, Q, level-1, k)
else
// P is not null and P?s parent Par is not the root
if Par still has at least
one more slot
then let Par point to P
updates only the last key of Par?s parent (calls updateKey(path, level-1,
key))
else
creates a new ParNode Q
and let Q point to P
recursively calls updateTree(path, null, Q, level-1,
k)
Class: LEAFNODE
The leaves of the B+ tree.
Leaves contain a character
array, which contains a small segment of the string represented by the
BigString. Each leaf contains a pointer, so that it may reference another
LeafNode. This is because the leaves of a B+ tree are linked together in
a linked list (singly linked, in our case). The pointer to the LeafNode
from this leaf is a reference to this leaf's successor in this list. As
well, a leaf has a unique key (unique compared to the keys of other leaves
in the same B+ tree), which indicate how far from the start of the linked
list this leaf is. Keys are ordered starting from 0 and increasing by 1
each time.
The constructor takes a
reference to a character array and a key, which the leaf's instance variables
will be set to.
Constructor:
LeafNode(char [] val, long key) -- constructs a leaf which
contains
val
and
key
Instance variables:
elements -- array of characters.
nextNode -- pointer to the successor, if any.
key -- the key for the leaf.
SMALL_STRING_LENGTH -- the maximum number of characters in
this leaf's array, it is declared to be protectedfinal static
Methods:
getNext() -- returns a reference to the leaf that follows
this leaf, null if none.
setNext(LeafNode leaf) -- sets this leaf's successor to be
the leaf.
getElementAt(int i) -- returns the ith element of
the character array held in this leaf.
toString() -- returns the string value of the character array
in this leaf.
stringBetween(int a, int b) -- returns the string value of
the substring found in the character array between indices a,
b;
inclusive on a, exclusive on b.
Class: PARNODE
The internal node, or root,of
the B+ tree
The ParNode merely contains
information and pointers about it's children nodes. Its children may be
leaves, or other internal nodes. A ParNode contains two array of pointers:
an array of pointers to leaves, and an array of pointers to other internal
nodes. If a ParNode points to at least one leaf, then it only points to
leaves, it does not point to other internal nodes. Likewise, if a ParNode
has at least one child that is another internal node, then all its children
are internal nodes; it does not have any leaf children. Before adding any
children to a ParNode, one should check what kind of children it potentially
has. Also, a ParNode contains an array of keys: if the ith element
in this array of keys contains the value K, then it is the ancestor of
leaves with keys <= K. All keys >= 0.
The constructor takes no
parameters, and initializes children-pointer-array elements to null and
key-array elements to -1, as it is unknown at the time of initialization
what kind of children a ParNode will have.
Constructor:
ParNode() -- constructs a parent node with all its contents
to be null or 0
Instance variables:
ORDER -- the maximum possible number of children for a ParNode,
it is declared to be protected final static
EMPTY -- a flag to indicate that this element has no children.
leaves -- array of pointers to children leaves.
nextLevel -- array of pointers to children internal nodes.
keys -- array of long ints.
Methods:
nextAvailable() -- returns the first available index of the
array that contains a free slot for a child to be inserted, -1 if this
ParNode is full of children.
updateLastKey(long k) -- resets the the last non-negative
key in keys to be k.
getLastKey() -- returns the value of final element in keys
array.
getLeafAt(int i) -- returns a pointer to the leaf referenced
by the ith element of the array leaves.
getNextLevelAt(int i)-- returns a pointer to the next internal
node referenced by the ith element of the array of internal node
children (nextLevel).
getKeyAt(int i) -- returns the value of the ith element
of keys
setParNode(long k, ParNode p) -- locates the first available
empty slot for a child ParNode and inserts p into that slot and
updates the corresponding element of keys with k.
setLeafNode(long k, LeafNode leaf) -- locates the first available
empty slot for a child leaf and inserts leaf into that slot and
updates the corresponding element of keys with k.
Figure 2 gives a snapshot
that how a big string is being stored in a B+ tree in our implementation.
TESTING
Testing in BigString project
is divided into four phases: Unit testing, Integration testing, Validation
testing, and System testing (see test part for details).
Unit testing tests moduls separately, Intergration testing tests how good
modules fit together, Validation testing tests how well software meets
requirements and System testing tests overall system's functionality in
real-life situations. In all testing cases, our implementation eith worked
correctly or throwed exceptions where applicapable.
SUMMARY
Our implementation satisfies
the specs on completion, and satisfies the design requirements on information
hiding, low coupling and high cohesion.
All the instance variables are private.
Each class (module) focuses on its task.
The only two shared variables by classes BTreeString, LeafNode
and ParNode - ORDER and SMALL_STRING_LENGTH - are declared
to be protected final static, so that we need only change once if we have
to change the value of them.
APPENDIX
There are several changes from
the original design (Phase A):
Superclass Node was removed since there is not much in common
between classes LeafNode and ParNode.
In class BTreeString, method add() was removed
since a new leaf is always appended to the most right bottom and a append()
method is sufficient for this purpose.
In class LeafNode, we added five methods: getNext(),
setNext(). getElementAt(), getStringBetween() and toString().
In class ParNode, we added two instance variables: leaves
and keys (both are array) and eight methods: nextAvailable(),
updateLastKey, getLastKey(), getNextLevelAt(), getkeyAt(), setParNode()
and
setLeafNode().