Using Python to Generate HTML Pages

Introduction

I have waited for a long time to set up my own Web site, mostly
because I didn’t know what to put there that others may want to
see. Then I got an idea. Since I’m an avid reader and an aviation
enthusiast, I decided to create pages with a list of aviation books I
have read. My initial intention was to write reviews for each book.

Setting up the pages was easy to start with, but as I added more books
the maintenance became tedious. I had to update a couple of indices with
the same data and I had to sort them by hand, and alphabetizing was
never my strong suit. I needed to find a better way.

Around the same time I became interested in the programming language
Python and it seemed that Python would be a good tool to automatically
generate the various HTML pages from a simple text file. This would
greatly simplify the updates of my book pages, as I would only add one
entry to one file and then create complete pages by running a Python
script.

I was attracted to Python for two main reasons: it’s very good at
processing strings and it’s object oriented. Of course the fact that the
Python interpreter is free and that it runs on many different systems
helped. At first I installed Python on my Win95 machine, but I just
couldn’t force myself to do any programming in the Windows
environment, even in Python. Instead I installed Linux and moved all
my Web projects there.

The Problem

The main goal of the program is to generate three different book
indices, by author, by title and by subject, from a single input
file. I started by defining the format of this file. Here is what a
typical entry describing one book looks like:

	title: Zero Three Bravo
	author: Gosnell, Mariana
	subject: General Aviation
	url: 3zb.htm
	# this is a comment

Each line starts with a keyword (e.g. “title:” or “author:”) and is
followed by a value that will be shown in the final HTML
page. The description of each book must start with the “title:” line,
there must be at least one “author:” tag, and the “url:” entry points to a
review of the book, if there is one.

Since Python is object-oriented, we begin program design by
looking for “objects”. In a nutshell, object-oriented (OO) programming
is a way to structure your code around the things, that is “objects”,
that the program is working with. This rather simple idea of
organizing software around what it works with (objects), rather than
what it does (functions), turns out to be surprisingly powerful.

Within an OO program similar objects are grouped into “classes” and the
code we write describes each class. Objects that belong to a given
class are called “instances of the class”.

I hope it is pretty obvious to you that since the program will
manipulate “book” objects, we need a Python class that will represent
a single book. Just knowing this is enough to let us suspend design
and write some code.

The Book Class

Before we start looking at the code we need to consider briefly how
Python programs are organized. Each program consists of a number of
modules, each module is contained in a file (usually named with the
extension “.py”) and the name of the file (without the “.py”) serves
as the module name. A module can contain any number of routines or
classes. Typically things that are related are kept in one module. For
example, there is a string module that contains functions that
operate on strings. To access functions or classes from another module
we use the import statement. For example the first line of
the Book module is:

    from string import split, strip

which says that the routines split and strip are
obtained from the string module.
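
For readers trying this with a current interpreter: the import line above is the Python 1.4 form. In Python 3 the string module no longer offers split and strip as functions; they are methods of every string object instead. A minimal sketch, assuming Python 3 and borrowing the author value from the sample entry shown earlier:

```python
# Python 3 sketch: split and strip are methods of str objects,
# so nothing needs to be imported from the string module.
new_author = "Gosnell, Mariana"

names = new_author.split(",")      # split the text at the comma
last_name = names[0].strip()       # strip removes surrounding whitespace
first_name = names[1].strip()

print(last_name, first_name)
```

The “from module import name” statement itself is unchanged in Python 3; only the location of these two routines has moved.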

Next, I have to point out a few syntactic features of Python that are
not immediately obvious from the code. The most important is the fact that
in Python indentation is part of the syntax. To see which statements
will be executed following an “if”, all you need to look at is
indentation – there is no need for curly braces, BEGIN/END
pairs or “fi” statements.

Here is a typical “if” statement extracted from the set_author
routine in the Book class:

	if new_author:
	    names = split (new_author, ",")
	    self.last_name.append (strip (names[0]))
	    self.first_name.append (strip (names[1]))
	else:
	    self.last_name = []
	    self.first_name = []

The three statements following the “if” are executed if the “new_author”
variable contains a non-empty value. The amount of indentation is not
important, but it must be consistent. Also note the colon (“:”) which
is used to terminate the header of each compound statement.

The Book class turns out to be very simple. It consists
of routines that set the values for author, title, subject and the URL
for each book. For example, here is the set_title routine:

    def set_title (self, new_title):
        self.title = new_title

The first argument to the “set_title” method (that is a routine which
belongs to a class) is “self”. This argument always refers to the
instance to which the method is applied. Furthermore, the attributes
(i.e. the data contained in each object) must be qualified with “self”
when referenced within the body of a method. In the example above the
attribute “title” of a “Book” object is set to the value of “new_title”.

If in another part of a program we have a variable “b” that references an
instance of the “Book” class, this call would set the book’s title:

    b.set_title ("Fate is the Hunter")

Note that the “self” argument is not present in the call,
instead the object to which the method is applied (i.e. the object
before the “.”, “b” above) becomes the “self” argument.
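
To make the role of “self” concrete, here is a small sketch, assuming Python 3 and a Book class reduced to the single set_title method. The second call, which goes through the class name and passes the object explicitly, is added purely for illustration; it does the same work as the usual form.

```python
class Book:
    # Reduced to one method for this illustration
    def set_title(self, new_title):
        self.title = new_title


b = Book()
c = Book()

# Usual form: the object before the "." becomes the "self" argument.
b.set_title("Fate is the Hunter")

# Same method called through the class, with "self" passed explicitly.
Book.set_title(c, "Fate is the Hunter")

print(b.title == c.title)
```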

At this point a reasonable question to ask is “Where do the objects
come from?” Each object is created by a special call that uses the
class name as the name of a function. In addition a class can define a
method with the name __init__ which will automatically be
called to initialize the new object’s attributes (in C++ such a
routine is called a constructor).

Here is the __init__ routine for the Book class:

    def __init__ (self, t="", a="", s="", u=""):
        #
        # Create an instance of Book
        #
        self.title = t
        self.last_name = []
        self.first_name = []
        self.set_author (a)
        self.subject = s
        self.url = u

The main purpose of the above routine is to create all the attributes
of the new “Book” object. Note that the arguments to “__init__” are
specified with default values, so that the caller needs only to pass the
arguments that differ from the default.

Here are some examples of calls to create “Book” objects:

    a = Book()
    b = Book ("Fate is the Hunter")
    c = Book ("Some book", "First, Author")

There is one small complication in the “Book” class. It is possible
for a book to have more than one author. That’s why the attributes
“first_name” and “last_name” are actually lists. We’ll look more at
lists in the next section.

The complete Book class is shown in
Listing #1. To test the class we add a little piece of code at the end
of the file that checks whether the code is running as the __main__
module, that is, whether execution started in this file. If so, the code
to test the Book class will run.
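
The pieces described in this section can be put together roughly as follows. This is a sketch in Python 3, not a copy of Listing #1: the bodies of set_subject and set_url are assumed to mirror set_title, and the test at the bottom is only an example of the __main__ check.

```python
class Book:
    """One book: a title, one or more authors, a subject and a review URL."""

    def __init__(self, t="", a="", s="", u=""):
        self.title = t
        self.last_name = []
        self.first_name = []
        self.set_author(a)
        self.subject = s
        self.url = u

    def set_title(self, new_title):
        self.title = new_title

    def set_author(self, new_author):
        # "Last, First" adds one author; an empty value clears the lists.
        if new_author:
            names = new_author.split(",")
            self.last_name.append(names[0].strip())
            self.first_name.append(names[1].strip())
        else:
            self.last_name = []
            self.first_name = []

    def set_subject(self, new_subject):
        # Assumed to mirror set_title
        self.subject = new_subject

    def set_url(self, new_url):
        # Assumed to mirror set_title
        self.url = new_url


if __name__ == "__main__":
    # Runs only when this file is executed directly, not when it is imported.
    c = Book("Some book", "First, Author")
    c.set_author("Second, Writer")
    print(c.title, c.last_name, c.first_name)
```

Calling set_author a second time appends to the two lists, which is how a book ends up with more than one author.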

The Book_List Class

Once the Book is tested we can go back to designing. The next
obvious object is a list which will contain all the “book”
objects. For the purposes of our program we have to be able to create
the book list from the input file and we have to sort the books in the
list by author, title or subject. The sorted list will then be used as
input into the code that actually generates HTML pages.

As it turns out one of Python’s built-in data structures is a list. Here is
a snippet of code showing creation of a list and addition of some items
(this example was produced by running Python interactively):

 
Python 1.4 (Dec 18 1996)  [GCC 2.7.2.1]
Copyright 1991-1995 Stichting Mathematisch Centrum, Amsterdam
>>> s = []
>>> s.append ("a")
>>> s.append ("hello")
>>> s.append (1)
>>> print s
['a', 'hello', 1]

Above we create a list called “s” and add three items to it. Lists
allow “slicing” operations, which let you pull out pieces of a list by
specifying element numbers. These examples illustrate the idea:

>>> print s[1]
hello
>>> print s[1:]
['hello', 1]
>>> print s[:2]
['a', 'hello']
>>> print s[0]
a

s[1] denotes the second element of the list (indexing starts
at zero), s[1:] is the slice from the second element to the
end of the list, s[:2] goes from the start up to, but not including,
the third element, and s[0] is the first item.
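
The same session as a script, sketched for Python 3 (where print is a function). The last line is an addition of mine to show that the upper bound of a slice is excluded: the two halves split at index 2 add up to the whole list.

```python
s = []
s.append("a")
s.append("hello")
s.append(1)

print(s[1])     # second element
print(s[1:])    # from the second element to the end
print(s[:2])    # first two elements; the element at index 2 is left out
print(s[0])     # first element

# The upper bound is exclusive, so the two slices never overlap.
print(s[:2] + s[2:] == s)
```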

Finally, lists have a “sort” method which sorts the elements according to
a user-supplied comparison function.

Armed with the knowledge of Python lists, writing the Book_List class
is easy. The class will have a single attribute, “contents”, which will be a
list of books.

The constructor for the Book_List class simply creates a
“contents” attribute and initializes it to be an empty list. The
routine that parses the input file and creates list elements is called
“make_from_file” and it begins with the code:

    def make_from_file (self, file):
        #
        # Read the file and create a book list
        #
        lines = file.readlines ()
        self.contents = []

The “file” argument is a handle to an open text file that contains the
descriptions of the books. The first step this routine performs is to
read the entire file into a list of strings, each string representing
one line of text. Next, using Python’s “for” loop we step through this
list and examine each line of text:

        #
        # Parse each line and create a list of Book objects
        #
        for one_line in lines:
            # Process the line only if it is not a comment or an empty line
            if (len(one_line) > 0) and (one_line[0] != "#"):
                # Split into tokens
                tokens = string.split (one_line)

If the line is not empty and is not a comment (that is, its first
character is not a “#”), then we split the line into words, a word
being a sequence of characters without spaces. The call “tokens =
string.split (one_line)” uses the “split” routine from the “string”
module. “split” returns the words it found in a list.

		    
                if len (tokens) > 0:
                    if (tokens[0] == "title:"):
                        current_book = book.Book (string.join (tokens[1:]))
                        self.contents.append (current_book)
                    elif (tokens[0] == "author:"):
                        current_book.set_author (string.join (tokens[1:]))
                    elif (tokens[0] == "subject:"):
                        current_book.set_subject (string.join (tokens[1:]))
                    elif (tokens[0] == "url:"):
                        current_book.set_url (string.join (tokens[1:]))

The first token (i.e. word) on the line is the keyword that tells us
what to do. If it is “title:” then we create a new Book
object and append it to the list of books, otherwise we just set the
proper attributes. Note that the remaining tokens found on each line
are joined together into a string (using the “string.join” routine). There
is probably a more efficient way to code this, but for my purposes
this code works fast enough.
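
For comparison, here is how the same parsing loop might look in Python 3. Treat it as a sketch under stated assumptions rather than a translation of the listing: the Book class is a pared-down stand-in with an “authors” list, the comment test looks at the first word instead of the first character (so indented comments are skipped too), and lines that appear before any “title:” line are ignored.

```python
import io


class Book:
    """Pared-down stand-in for the article's Book class (assumed shape)."""

    def __init__(self, title=""):
        self.title = title
        self.authors = []
        self.subject = ""
        self.url = ""


class Book_List:
    def __init__(self):
        self.contents = []

    def make_from_file(self, file):
        """Read book entries from an open text file."""
        self.contents = []
        current_book = None
        for one_line in file:
            tokens = one_line.split()
            # Skip blank lines and comment lines
            if not tokens or tokens[0].startswith("#"):
                continue
            keyword = tokens[0]
            value = " ".join(tokens[1:])
            if keyword == "title:":
                # A title line starts a new book
                current_book = Book(value)
                self.contents.append(current_book)
            elif current_book is None:
                continue  # no "title:" line seen yet, nothing to attach to
            elif keyword == "author:":
                current_book.authors.append(value)
            elif keyword == "subject:":
                current_book.subject = value
            elif keyword == "url:":
                current_book.url = value


# The sample entry from earlier in the article, fed in as an in-memory file
sample = io.StringIO(
    "title: Zero Three Bravo\n"
    "author: Gosnell, Mariana\n"
    "subject: General Aviation\n"
    "url: 3zb.htm\n"
    "# this is a comment\n"
)
books = Book_List()
books.make_from_file(sample)
print(books.contents[0].title, "/", books.contents[0].authors)
```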

The other interesting parts of the Book_List class are the sort
routines. Here is how the list is sorted by title:

    def sort_by_title (self):
        #
        # Sort book list by title
        #
        self.contents.sort (lambda x, y: cmp (x.title, y.title))

We simply call the “sort” routine on the list. To get proper ordering we
need to supply a function that compares two Book objects. For
sorting by title we supply an anonymous function, which is
introduced with the keyword “lambda” (those of you familiar with Lisp
or other functional languages should recognize this construct). The definition:

      lambda x, y: cmp (x.title, y.title)

simply says that this is a function of two arguments and that the
function result comes from calling the Python built-in function “cmp”
(i.e. compare) on the “title” attribute of the two objects.

The other sort routines are similar, except that in “sort_by_author” I
used a local function instead of a “lambda”, because the comparison
was a little more complicated – I wanted to have all the books with the
same author appear alphabetically by title.
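
A note for readers on Python 3: the built-in “cmp” and the comparison-function argument to “sort” no longer exist there, and sorting is driven by a key function instead. The sketch below shows both orderings in that style. It is not the code from the listing; the Book stand-in, the helper name author_key and the three placeholder books are made up for the example, and only the first listed author is used for ordering.

```python
class Book:
    # Stand-in with just the attributes needed for sorting
    def __init__(self, title, last_name, first_name):
        self.title = title
        self.last_name = [last_name]
        self.first_name = [first_name]


def author_key(book):
    # Order by the first listed author, then by title, so that books
    # by the same author come out alphabetically by title.
    return (book.last_name[0], book.first_name[0], book.title)


# Placeholder data, invented for this example
books = [
    Book("Zebra Flight", "Adams", "Ann"),
    Book("Alpha Flight", "Baker", "Bob"),
    Book("Alpha Approach", "Adams", "Ann"),
]

by_title = sorted(books, key=lambda b: b.title)
by_author = sorted(books, key=author_key)

print([b.title for b in by_title])
print([b.title for b in by_author])
```

Tuples compare element by element, which is what makes the title act as a tie-breaker when the author names are equal.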

Generating Pages

Now that we have constructed a list of books, the next step is to create
the HTML pages. We begin by creating a class, called Html_Page, that
generates the basic outline of a page and then we extend that class to create
the titles, authors and subjects pages.

The idea that existing code can be extended yet not changed is the
second most important idea of OO programming. The mechanism for doing
this is called “inheritance” and it allows the programmer to create a
new class by adding new properties to an old class and the old class
does not have to change. A way to think about inheritance is as
“programming by differences”. In our program we will create three
classes that inherit from Html_Page.

Html_Page is quite simple. It consists of routines that
generate the header and the trailer tags for an HTML page. It also
contains an empty routine for generating the body of the page. This
routine will be defined in descendant classes. The __init__
routine lets the user of this class specify a title and a top level
heading for the page.

When I first tested the output of the HTML generators I simply printed
it to the screen and manually saved it into a file, so I could see the
page in a browser. But once I was happy with the appearance, I had to
change the code to save the data into a file. That’s why in Html_Page
you will see code like this:

	self.f.write ("<html>\n")
	self.f.write ("<head>\n")

for writing the output to a file referenced by the attribute “f”.

However, since the actual output file will be different for each page,
the opening of the file is deferred to a descendant class.

You can see the complete code for Html_Page in Listing #3.
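
As a rough picture of the shape described above, here is a sketch in Python 3. It is not Listing #3: the method names generate_header and generate, the exact tags written, and the Demo_Page subclass (which writes to an in-memory buffer rather than a real file) are all assumptions made for the illustration.

```python
import io


class Html_Page:
    def __init__(self, title, heading):
        self.title = title
        self.heading = heading
        self.f = None  # descendant classes open the output file

    def generate_header(self):
        self.f.write("<html>\n")
        self.f.write("<head>\n")
        self.f.write("<title>" + self.title + "</title>\n")
        self.f.write("</head>\n")
        self.f.write("<body>\n")
        self.f.write("<h1>" + self.heading + "</h1>\n")

    def generate_body(self):
        # Empty here; each descendant class supplies its own body
        pass

    def generate_trailer(self):
        self.f.write("</body>\n")
        self.f.write("</html>\n")

    def generate(self):
        # Hypothetical driver that writes the three parts in order
        self.generate_header()
        self.generate_body()
        self.generate_trailer()


class Demo_Page(Html_Page):
    # Hypothetical descendant used only to exercise the base class
    def __init__(self):
        Html_Page.__init__(self, "Demo", "Demo heading")
        self.f = io.StringIO()  # in-memory stand-in for an opened file

    def generate_body(self):
        self.f.write("<p>Body goes here</p>\n")


page = Demo_Page()
page.generate()
page_text = page.f.getvalue()
print(page_text)
```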

The three classes Authors_Page, Titles_Page and
Subjects_Page are used to create the final HTML pages. Since these
classes belong together I put them in one module, called books_pages.
Because the code for these classes is very similar we will only look at
the first one.

Here is how Authors_Page begins:

    class Authors_Page (Html_Page):

        def __init__ (self):
            Html_Page.__init__ (self, "Aviation Books: by Author",
                                "<i>Aviation Books: indexed by Author</i>")
            self.f = open ("books_by_author.html", "w")
            print "Authors page in--> " + self.f.name

To start with, note that the class heading lists the name of the class from
which Authors_Page inherits, namely Html_Page. Next
notice that the constructor invokes the constructor of the parent
class by calling the __init__ routine qualified by the class
name. Finally, the constructor names and opens the output file. To keep
things simple, I decided not to make the file name a parameter.

Since the book list is needed to generate the body of each page I added
a book_list attribute to each page class. This attribute is set
before HTML generation starts.

The generate_body routine redefines the empty routine from
the parent class. Although fairly long, the code is pretty easy to
understand once you know that the book list is represented as an HTML
table and the “+” is the concatenation operator for strings.

In addition to replacing the generate_body routine we also redefine
the generate_trailer routine in order to put a back link to the book index
at the bottom of each page:

    def generate_trailer (self):
        self.f.write ("<hr>\n")
        self.f.write ("<center><a href=books.html>Back to Aviation Books Top Page</a></center>\n")
        self.f.write ("<hr>\n")
        Html_Page.generate_trailer (self)

Notice how right after we generate the back link, we include a call to
the parent’s generate_trailer routine to finish off the page with
correct terminating tags.
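
The two overrides can be tried out in isolation with a sketch like the one below, written for Python 3. It is not the code from Listing #4. The parent class is cut down to the two routines being redefined, the output goes to an in-memory buffer instead of a file, the book list is simplified to (author, title) pairs, and the table layout and the use of html.escape are my own choices.

```python
import html
import io


class Html_Page:
    # Cut-down stand-in for the parent class
    def __init__(self):
        self.f = io.StringIO()

    def generate_body(self):
        pass

    def generate_trailer(self):
        self.f.write("</body>\n")
        self.f.write("</html>\n")


class Authors_Page(Html_Page):
    def __init__(self):
        Html_Page.__init__(self)
        self.book_list = []  # set before HTML generation starts

    def generate_body(self):
        # One table row per book; values are escaped for safe HTML output
        self.f.write("<table>\n")
        for author, title in self.book_list:
            self.f.write("<tr><td>" + html.escape(author) + "</td><td>"
                         + html.escape(title) + "</td></tr>\n")
        self.f.write("</table>\n")

    def generate_trailer(self):
        # Add the back link, then let the parent close the page
        self.f.write("<hr>\n")
        self.f.write('<center><a href="books.html">'
                     "Back to Aviation Books Top Page</a></center>\n")
        self.f.write("<hr>\n")
        Html_Page.generate_trailer(self)


page = Authors_Page()
page.book_list = [("Gosnell, Mariana", "Zero Three Bravo")]
page.generate_body()
page.generate_trailer()
out = page.f.getvalue()
print(out)
```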

The complete listing for the three page generating classes is found in
Listing #4.

The main line of the entire program is shown in Listing #5. By now the code there
should be self-explanatory.

Summary

As you can see this particular program was not hard to write. Python is
well suited for these types of tasks: you can quickly put together
a useful program with minimal fuss.

After I got the program to work I realized that its design
is not the best. For example, the HTML generating code could be more
general, and perhaps the Book class should generate its own
HTML table entries. For now the program fits my purposes, but
I will modify it if I need to create other HTML generating applications.

If you would like to see the results of this script, visit my book page.

To learn more about Python you should start with the Python Home Page which will point you
to many Python resources on the net. I also found the O’Reilly book
Programming Python by Mark Lutz extremely helpful.

Finally, any mistakes in the description of Python features are
my own fault, as I’m still a Python novice.


 

Copyright © 1997, Richie Bielak

Published in Issue 19 of the Linux Gazette, July 1997

All source for this program is now available on GitHub here.