
HTML Parser

The HTML Parser component of the filtering system parses HTML documents and converts them into HTML Token objects, which are then sent to HTML filters for processing. HTML Token objects are an extended version of ByteArrays. Each Token object represents one of four types of tokens found in HTML documents.

When an HTML filter receives a Token object from its object stream, it can obtain the token type from the object and process the token as needed. Tokens that represent HTML tags can be converted to HTML Tag objects, which provide easy access to the tag name and its attributes. Filters can also use regular expressions to compare tags and to search for tags and tag attributes by pattern. All HTML parsing is performed by the HTML Parser, which offloads the parsing task from the individual filters; HTML filters only need to be concerned with the pre-parsed stream of HTML tokens.
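As a rough illustration of converting a tag token into a tag object, the following Python sketch extracts the tag name and attributes from the raw bytes of a token. All names here (`Tag`, `token_to_tag`, `TAG_RE`, `ATTR_RE`) are hypothetical and chosen for illustration only; they are not Muffin's actual classes or methods, and the attribute pattern is deliberately simplified.

```python
import re
from dataclasses import dataclass

# Hypothetical names throughout; this is not Muffin's actual API.
TAG_RE = re.compile(r"<\s*(/?[A-Za-z][A-Za-z0-9]*)")
# An attribute value may be double-quoted, single-quoted, or unquoted.
ATTR_RE = re.compile(r"""([A-Za-z][-A-Za-z0-9]*)\s*=\s*("[^"]*"|'[^']*'|[^\s>]+)""")


@dataclass
class Tag:
    name: str          # lower-cased tag name, e.g. "a" or "/a"
    attributes: dict   # lower-cased attribute name -> value without quotes


def token_to_tag(token: bytes) -> Tag:
    """Convert a tag token (raw bytes such as b'<A HREF="x">') to a Tag."""
    text = token.decode("latin-1")
    match = TAG_RE.match(text)
    if match is None:
        raise ValueError("token is not an HTML tag")
    attributes = {}
    for name, value in ATTR_RE.findall(text[match.end():]):
        if value[:1] in "\"'":
            value = value[1:-1]  # strip the surrounding quotes
        attributes[name.lower()] = value
    return Tag(match.group(1).lower(), attributes)
```

A filter could then test a tag with an ordinary regular expression, for example `re.search(r"^http://", tag.attributes.get("href", ""))`, without doing any parsing of its own.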

The HTML Parser encountered a few problems when parsing the wide variety of HTML documents found on the web. The most frequent problem was missing quotes in HTML tag attributes. Figure 4.3 illustrates this with three HTML anchors. The HREF value in the first anchor is fully quoted and can be parsed easily by looking first for the start quote and then for the end quote. The HREF in the second anchor is unquoted and is slightly more difficult to parse, since it is harder to determine where the value starts and ends. The HREF in the third anchor is missing its end quote and presents a real challenge: the parser must decide how long to keep searching for the end quote while still being able to recover from the error and continue parsing the document.

Figure 4.3: Missing Quotes
\begin{figure}
\begin{center}
\begin{verbatim}<A HREF=''http://www.example.com/'...
...ttp://www.example.com/> missing quote </A>\end{verbatim}\end{center}\end{figure}

The HTML Parser first used a very simple parsing algorithm that assumed all HTML documents were syntactically correct according to the HTML standards, and in particular that all tag attributes were correctly quoted. In practice, improvements were needed because several popular web sites could not be parsed correctly. The next step was to recover from bad HTML syntax, specifically from missing quotes in HTML tag attributes.

A new parser algorithm was developed to detect when a quote is missing and continue parsing the remainder of the document. While looking for an end quote, the algorithm keeps track of the characters that have been read. If the three characters >, <, and > are read in that order, with any number of other characters in between, the parser assumes a quote was missing and continues parsing. That sequence was chosen because these are the characters used to delimit HTML tags. The first > is intended to match the end of the tag currently being parsed. The search could stop at this point, but > characters can also appear inside tag attribute values. Once the parser sees a < followed by another >, it is almost certain that a complete HTML tag has just been read and that a quote was indeed missing.
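The detection step can be sketched in Python as follows. This is a minimal illustration of the heuristic as described, not the Muffin source: the function name `scan_quoted_value`, its return values, and the choice to report the position of the first > as a candidate resume point are all assumptions, as is the "unterminated" result for input that ends before a decision is reached.

```python
def scan_quoted_value(text, start, quote='"'):
    """Scan for the end quote of an attribute value that begins at `start`.

    Returns ("closed", i) if the end quote is found at index i.
    Returns ("missing", j) once '>', '<', '>' have been read in that order,
    where j is the index of the first '>' (assumed end of the current tag).
    Returns ("unterminated", j) if the input ends first (j is -1 if no '>'
    was seen); this case is an assumption, not described in the text.
    """
    pattern = "><>"
    matched = 0      # how many characters of `pattern` have been seen so far
    first_gt = -1    # index of the first '>' read while searching
    for i in range(start, len(text)):
        ch = text[i]
        if ch == quote:
            return ("closed", i)
        if ch == pattern[matched]:
            if matched == 0:
                first_gt = i
            matched += 1
            if matched == len(pattern):
                return ("missing", first_gt)
    return ("unterminated", first_gt)
```

For an anchor modeled on the third case, `<A HREF="http://www.example.com/> missing quote </A>`, the scan reads the > that ends the opening tag, then the < and > of `</A>`, and reports a missing quote. A value such as `ALT="a > b"` is still handled as quoted, because the end quote arrives before the sequence completes.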

Detecting and recovering from missing quotes solved almost all of the HTML parsing problems. A remaining problem was that the parser was not JavaScript aware. When JavaScript was first released, it could not be inserted as-is into HTML documents because it would break every HTML parser: the JavaScript language uses characters that are normally reserved for HTML tags, namely < and >. The workaround was to enclose JavaScript in HTML comments, which HTML parsers should not parse. Figure 4.4 shows an example of how this is done.

Figure 4.4: JavaScript Commenting
\begin{figure}
\begin{center}
\begin{verbatim}<script><!-- Hide script from o...
...ite(''Hello world'');// --></script>\end{verbatim}\end{center}\end{figure}

The HTML 4.0 specification includes support for scripting languages and defines the SCRIPT tag. HTML parsers that follow this specification do not need the commenting workaround above, since they know how to handle SCRIPT tags properly. Some web sites were found to assume HTML 4.0 compliance and therefore omit the HTML comments that provide backward compatibility. Because the Muffin HTML Parser did not have special support for SCRIPT tags, these pages could not be parsed correctly. Special SCRIPT tag support was therefore added to the Muffin HTML Parser so that SCRIPT tag contents are handled just like HTML comments.
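One way to picture this behavior is a tokenizer that, after emitting a SCRIPT start tag, copies everything up to the matching end tag as a single opaque token, exactly as it does for a comment. The Python sketch below is an assumption-laden illustration rather than the Muffin implementation: the function `tokenize`, the kind labels ("tag", "text", "comment", "script"), and the naive search for > to end a tag are all hypothetical simplifications.

```python
import re

COMMENT_END = "-->"
SCRIPT_START_RE = re.compile(r"<script\b[^>]*>", re.IGNORECASE)
SCRIPT_END_RE = re.compile(r"</script\s*>", re.IGNORECASE)


def tokenize(html):
    """Yield (kind, text) pairs; comments and script bodies are never scanned for tags."""
    pos = 0
    while pos < len(html):
        if html.startswith("<!--", pos):
            # A comment runs to the next "-->" (or to end of input).
            end = html.find(COMMENT_END, pos + 4)
            end = len(html) if end < 0 else end + len(COMMENT_END)
            yield ("comment", html[pos:end])
            pos = end
        elif html[pos] == "<":
            # Simplified: a tag ends at the next '>' (quotes are ignored here).
            end = html.find(">", pos)
            end = len(html) if end < 0 else end + 1
            tag = html[pos:end]
            yield ("tag", tag)
            pos = end
            if SCRIPT_START_RE.fullmatch(tag):
                # Treat the script body like a comment: one opaque token.
                close = SCRIPT_END_RE.search(html, pos)
                body_end = close.start() if close else len(html)
                if body_end > pos:
                    yield ("script", html[pos:body_end])
                pos = body_end
        else:
            end = html.find("<", pos)
            end = len(html) if end < 0 else end
            yield ("text", html[pos:end])
            pos = end
```

With this approach, a script body such as `if (a < b && b > c) ...` passes through untouched even when the page omits the comment wrapper, because the < and > inside it are never interpreted as tag delimiters.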


Mark R. Boyns
1999-01-12