Pretty HTML

Assignment Overview

In this assignment you'll be writing a program that converts an HTML file into a format that is easier to read.

Background

When one "pretty-prints" a file, the contents of the file are reformatted to produce a more attractive-looking document, usually for printing or reading purposes. (See Wikipedia for more information.)

Program Specification

Write a Python program prettyprint.py which:

  1. reads an HTML document from a text file
  2. converts its HTML tags and content into tokens
  3. uses a binary tree and a stack (imported from atds.py) to analyze the tokens
  4. writes the new version of the document to a file

Deliverables

prettyprint.py

Assignment Notes

An HTML document, which is used to describe the contents of a webpage, consists of a series of markup "tags"--easily identified by angle brackets that surround them--and content. A simple example:

<html>
    <head>
        <title>
            My favorite equation of all time
        </title>
    </head>
    <body>
        <p>
            c<sup>2</sup> >= a<sup>2</sup> + b<sup>2</sup>
        </p>
    </body>
</html>

Each tag has an opening and closing angle bracket, and tags themselves occur in pairs, with the second tag of a pair including a forward-slash (/) indicating the closing of that part of the document.

So, <html> indicates the beginning of an HTML document, and </html> indicates its end. <p> indicates the beginning of a paragraph, and </p> indicates the end of the paragraph, and so on.
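Because tags come in pairs, a stack is a natural way to check that each closing tag matches the most recently opened one. The following is only a minimal sketch: it assumes the document has already been split into a list of tag and content strings, that every tag is paired as described above, and it uses a plain Python list in place of the stack from atds.py, whose interface is not shown here. The names tag_name and tags_are_nested are made up for illustration.

```python
def tag_name(token):
    """Return the bare name of a tag token, e.g. "</p>" -> "p"."""
    return token.strip("</>").split()[0].lower()


def tags_are_nested(tokens):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # assumption: a plain list stands in for a stack class
    for token in tokens:
        if not (token.startswith("<") and token.endswith(">")):
            continue  # content token, nothing to match
        if token.startswith("</"):
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != tag_name(token):
                return False
        else:
            stack.append(tag_name(token))
    return not stack  # leftover entries mean unclosed tags


print(tags_are_nested(["<p>", "c", "<sup>", "2", "</sup>", "</p>"]))
print(tags_are_nested(["<p>", "<sup>", "</p>", "</sup>"]))
```

The first call reports a properly nested fragment; the second closes <p> while <sup> is still open, so the check fails.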

Between any opening and closing tags are the contents of that block.
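One possible way to turn a document into tokens is to split it on anything that looks like a tag. The sketch below is an illustration, not the required approach: the pattern and the function name tokenize are assumptions, and a tag is taken to be "<", an optional "/", a letter, and then everything up to the next ">". Under that assumption, the stray ">" in ">=" stays inside a content token.

```python
import re

# Assumed tag shape: "<", optional "/", a letter, then anything up to ">".
# The capture group makes re.split keep the tags in its result.
TAG_PATTERN = re.compile(r"(</?[A-Za-z][^<>]*>)")


def tokenize(html_text):
    """Split HTML text into a list of tag tokens and content tokens."""
    tokens = []
    for piece in TAG_PATTERN.split(html_text):
        # Collapse each run of whitespace (including newlines) to one space.
        piece = re.sub(r"\s+", " ", piece)
        # Skip pieces that are empty or contain only whitespace.
        if piece.strip():
            tokens.append(piece)
    return tokens


sample = "<p>\n    c<sup>2</sup> >= a<sup>2</sup> + b<sup>2</sup>\n</p>"
for token in tokenize(sample):
    print(repr(token))
```

Content tokens keep a single leading or trailing space where the source had one, so that spacing around in-line tags such as <sup> is not lost. Whitespace that sits alone between two tags is dropped, which is a simplification.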

To "prettify" an HTML document, we want to be able to convert it, regardless of its original formatting, to the style demonstrated above. That style follows three rules:

  1. Each "block-level" tag occurs on its own line. These include the <html>, <head>, <title>, <body>, <p>, and <div> tags.
  2. Other tags (such as <sup> above) can occur in-line.
  3. Content enclosed by a pair of block-level tags must be indented 4 spaces relative to those tags.
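The three rules can be sketched as a single pass over a token list. This is a hedged illustration rather than the expected solution: it assumes the tokens are already available as strings, that tags are properly nested, and it uses a plain Python list as the stack of open block-level tags instead of the class from atds.py. The names BLOCK_TAGS, INDENT, tag_name, and prettify are invented for this example.

```python
BLOCK_TAGS = {"html", "head", "title", "body", "p", "div"}  # rule 1
INDENT = "    "  # rule 3: four spaces per nesting level


def tag_name(token):
    """Return the bare tag name, or None for a content token."""
    if token.startswith("<") and token.endswith(">"):
        return token.strip("</>").split()[0].lower()
    return None


def prettify(tokens):
    """Return the tokens laid out one block-level tag per line."""
    lines = []
    open_blocks = []  # assumption: a plain list used as a stack
    inline = ""       # content and in-line tags waiting to be written

    def flush():
        # Write any pending in-line text at the current indentation.
        nonlocal inline
        text = inline.strip()
        if text:
            lines.append(INDENT * len(open_blocks) + text)
        inline = ""

    for token in tokens:
        name = tag_name(token)
        if name not in BLOCK_TAGS:
            inline += token  # rule 2: content and in-line tags share a line
        elif token.startswith("</"):
            flush()
            open_blocks.pop()  # assumes a matching opening tag exists
            lines.append(INDENT * len(open_blocks) + token)
        else:
            flush()
            lines.append(INDENT * len(open_blocks) + token)
            open_blocks.append(name)
    flush()
    return "\n".join(lines)


tokens = ["<body>", "<p>", " c", "<sup>", "2", "</sup>", " >= a",
          "<sup>", "2", "</sup>", " + b", "<sup>", "2", "</sup>",
          "</p>", "</body>"]
print(prettify(tokens))
```

The depth of the stack gives the indentation level directly, so no separate counter is needed. A malformed document with an unmatched closing tag would raise an IndexError here; a fuller version would check for that first.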

Note that a browser displays a webpage according to its HTML tags, regardless of how the HTML document itself is formatted. So, this code:

<html><head><title>My favorite equation of all time</title></head><body>
        
        
        
            <p>c<sup>2</sup> >= a<sup>2</sup> + b<sup>2</sup></p></body></html>

will display exactly the same in a webpage as the "pretty" code above. This code is just harder for a programmer to read and work with.

The prettyprint.py program takes a file with ugly code and converts it to pretty code, as shown in the first example above.