This is part three in a series of blog posts for my ‘IT Professional Skills’ module
In my previous post, I described how my tree-based word sorting program will operate on its input. I hope, at this point, that my explanations have given you a high-level understanding on what my program will be doing. In this post, I will begin implementing my algorithm as a small command-line based C utlity.
As with any problem, it’s helpful to break it down into its component parts and
tackle each part in turn. In programming, this naturally translates into writing
small, simple functions that do one thing, and do it well. Let’s begin in
main ()
, where the skeleton of my program will be laid out.
int
main (int argc,
char *argv[])
{
FILE *fp;
char w[MAX];
Node *root = NULL;
fp = fopen ("../data/keats.txt", "r");
while (getword (w, fp, MAX) != EOF)
{
if (strlen (w))
root = addtree (root, w);
}
treeprint (root);
treefree (root);
fclose (fp);
return 0;
}
I won’t be exploring every detail of my function–rather, just the parts that are central to its operation. Note, that I’ve hardcoded the file to use: “keats.txt”. A more useful implementation would allow the user to specify it, but that’s not essential for this example.
The meat of this function is the while
loop. Here, the function getword ()
will keep running until we’ve reached the end of the input file (signified in C
by the EOF
define). Let’s look at that function:
int
getword (char *w,
FILE *fp,
int lim)
{
int c;
while (isspace (c = getc (fp)) || ispunct (c))
;
if (c == EOF)
return EOF;
if (isalnum (c))
*w++ = tolower (c);
while (!isspace (c = getc (fp)) && --lim)
{
if (isalnum (c))
*w++ = tolower (c);
else if (c == '\'' || c == '-')
*w++ = c;
}
*w = '\0';
return 0;
}
As its name indicates, my getword ()
function reads the file word by word.
However, how should it define a word? In this case, I’ve chosen to define a word
as a string of alphanumeric characters that may also contain an apostrophe or a
dash. The function skips any leading whitespace or punctuation characters, and
stops once it hits another whitespace character. It’s quite a naive algorithm,
but it will work for my purposes. Perhaps you can find some text where it’ll
fail, or implement a better algorithm yourself?
Inside the loop, if the string that getword ()
gives us is not empty, it’s
added to my binary tree by the addtree ()
function. Let’s examine that next:
Node *
addtree (Node *n,
char *w)
{
int cond;
if (!n)
{
n = malloc (sizeof (n));
n->data = strdup (w);
n->left = n->right = NULL;
}
else if ((cond = strcmp (w, n->data)) < 0)
{
n->left = addtree (n->left, w);
}
else if (cond > 0)
{
n->right = addtree (n->right, w);
}
return n;
}
The most interesting aspect of this function is its recursive nature. Note how
the function calls itself. Otherwise, it’s relatively simple. If n
is NULL
,
memory is set aside for it and its data
member is pointed at memory containing
w
, the word that getword ()
read and stored earlier.
If n
is already exists in memory, the function checks if w
is
lexicographically greater or smaller than what’s store in n
. To avoid
duplicate words, nothing is done if the values are equal. This set of operations
is performed recursively, the tree is traversed until an empty spot is found,
where the new Node
will then be slotted in.
I hope this has illuminated the concepts I’ve discussed so far. In the next post, I will finish this program and demonstrate it in action.