Implementing the sorter

This is part three in a series of blog posts for my ‘IT Professional Skills’ module

In my previous post, I described how my tree-based word sorting program will operate on its input. I hope, at this point, that my explanations have given you a high-level understanding on what my program will be doing. In this post, I will begin implementing my algorithm as a small command-line based C utlity.

As with any problem, it’s helpful to break it down into its component parts and tackle each part in turn. In programming, this naturally translates into writing small, simple functions that do one thing, and do it well. Let’s begin in main (), where the skeleton of my program will be laid out.

int
main (int   argc,
      char *argv[])
{
  FILE *fp;
  char w[MAX];
  Node *root = NULL;

  fp = fopen ("../data/keats.txt", "r");

  while (getword (w, fp, MAX) != EOF)
    {
      if (strlen (w))
        root = addtree (root, w);
    }

  treeprint (root);
  treefree (root);

  fclose (fp);

  return 0;
}

I won’t be exploring every detail of my function–rather, just the parts that are central to its operation. Note, that I’ve hardcoded the file to use: “keats.txt”. A more useful implementation would allow the user to specify it, but that’s not essential for this example.

The meat of this function is the while loop. Here, the function getword () will keep running until we’ve reached the end of the input file (signified in C by the EOF define). Let’s look at that function:

int
getword (char *w,
         FILE *fp,
         int   lim)
{
  int c;

  while (isspace (c = getc (fp)) || ispunct (c))
    ;

  if (c == EOF)
    return EOF;

  if (isalnum (c))
    *w++ = tolower (c);

  while (!isspace (c = getc (fp)) && --lim)
    {
      if (isalnum (c))
        *w++ = tolower (c);
      else if (c == '\'' || c == '-')
        *w++ = c;
    }
  *w = '\0';

  return 0;
}

As its name indicates, my getword () function reads the file word by word. However, how should it define a word? In this case, I’ve chosen to define a word as a string of alphanumeric characters that may also contain an apostrophe or a dash. The function skips any leading whitespace or punctuation characters, and stops once it hits another whitespace character. It’s quite a naive algorithm, but it will work for my purposes. Perhaps you can find some text where it’ll fail, or implement a better algorithm yourself?

Inside the loop, if the string that getword () gives us is not empty, it’s added to my binary tree by the addtree () function. Let’s examine that next:

Node *
addtree (Node *n,
         char *w)
{
  int cond;

  if (!n)
    {
      n = malloc (sizeof (n));
      n->data = strdup (w);
      n->left = n->right = NULL;
    }
  else if ((cond = strcmp (w, n->data)) < 0)
    {
      n->left = addtree (n->left, w);
    }
  else if (cond > 0)
    {
      n->right = addtree (n->right, w);
    }

  return n;
}

The most interesting aspect of this function is its recursive nature. Note how the function calls itself. Otherwise, it’s relatively simple. If n is NULL, memory is set aside for it and its data member is pointed at memory containing w, the word that getword () read and stored earlier.

If n is already exists in memory, the function checks if w is lexicographically greater or smaller than what’s store in n. To avoid duplicate words, nothing is done if the values are equal. This set of operations is performed recursively, the tree is traversed until an empty spot is found, where the new Node will then be slotted in.

I hope this has illuminated the concepts I’ve discussed so far. In the next post, I will finish this program and demonstrate it in action.