HTML Validation: MultiTidy


I have recently been involved in a project that produces a massive set of HTML pages from an XML database. These pages feature all sorts of automation; for example, documentation sections are selected and arranged on an HTML page by reading possibly diverse pieces of the input XML file. More than that I cannot say explicitly because of NDA considerations.

The problem was that I was generating well over a thousand HTML pages, and they needed to be checked for conformance to HTML syntax. I looked around and found a number of pieces of code; the best is the "tidy" library, found at SourceForge.net. This is a very nice library, but all of the applications associated with it followed the primitive model of Unix apps. One of the wrappers allowed specifying wildcards, but when it issued error messages, it didn't even identify the file that contained the error! (A Unix person would find this completely acceptable, but I have never liked the Unix philosophy that raw byte streams make sense.)

So I wrote my own wrapper around the Tidy library, called MultiTidy. I can select as many files as I want, and it will run all of them through the tidy subroutine, and I will get a summary of all the errors, warnings, etc.
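
The idea of running many files through tidy and keeping the filename attached to every message can be sketched in a few lines. This is not the MultiTidy code, which calls the tidy library directly from MFC; it is a minimal sketch that instead drives the command-line `tidy` program, assuming it is on the PATH and that it reports diagnostics on stderr in the form `line N column M - Level: text`. The names `Diagnostic`, `parse_tidy_line`, and `check_files` are hypothetical.

```python
import re
import subprocess
from typing import List, NamedTuple, Optional


class Diagnostic(NamedTuple):
    path: str      # file the message belongs to (what the plain wrappers omit)
    line: int
    column: int
    level: str     # e.g. "Warning" or "Error"
    text: str


# Assumed message shape: "line 12 column 5 - Warning: some text"
_TIDY_LINE = re.compile(r"^line (\d+) column (\d+) - (\w+): (.*)$")


def parse_tidy_line(path: str, raw: str) -> Optional[Diagnostic]:
    """Turn one line of tidy output into a Diagnostic tagged with its file."""
    m = _TIDY_LINE.match(raw.strip())
    if m is None:
        return None  # summary lines and blank lines are skipped
    return Diagnostic(path, int(m.group(1)), int(m.group(2)),
                      m.group(3), m.group(4))


def check_files(paths: List[str]) -> List[Diagnostic]:
    """Run tidy on every file and collect all diagnostics in one list."""
    results: List[Diagnostic] = []
    for path in paths:
        # -q suppresses nonessential output, -e reports errors and warnings only
        proc = subprocess.run(["tidy", "-q", "-e", path],
                              capture_output=True, text=True)
        for raw in proc.stderr.splitlines():
            diag = parse_tidy_line(path, raw)
            if diag is not None:
                results.append(diag)
    return results
```

Because every `Diagnostic` carries its `path`, the combined list can be grouped by file or by message without losing track of where each message came from.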

The problem with tidy is that it is very, very persnickety; it demands conformance to standards that are in many cases unrealistic; for example, it will not recognize the huge set of color names supported by every browser I am aware of. This generates a lot of unnecessary diagnostic messages. So I added the ability to filter out messages that are not relevant.

You can download the complete source, or just a VS6 MFC executable. Go to the downloads section.

Program overview

A sample of the screen, as applied to my own Web site, is shown here. On the left is a tree view of the directory structure being checked. Each file is shown with a green checkmark (file has no errors, warnings, or messages), a caution icon (warnings present), an error icon (errors present), or an information icon (informational messages present). Clicking on a file expands it, and the details of the information notes, warnings, or errors are shown in the subtree. Note that the icon displayed for the file is the "max" of the icons for individual messages.

If a message is felt to be inappropriate, you can right-click on the message and add it to the list of filtered messages. A message that matches an active filter is not displayed. For example, FrontPage does not put a <!DOCTYPE> declaration in the files, so this would appear as a warning on every file. I simply right-clicked the message and added it to the filter list; to see the effect, I could then right-click and select the option "re-tidy" to reapply tidy to the file in question.

Double-clicking on an error message will locate the error text in the original file and highlight it. The tidy library reports where an error starts but not where it ends, so I highlight everything to the end of the line. Note that all errors are highlighted in both red and boldface, although the same overkill to the end of the line applies. When the document is printed, this boldface text jumps out and makes the location of the errors apparent.
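
The "max" rule for the file icon can be illustrated with an ordered enumeration. This is only a sketch: the names are hypothetical, and the ordering of information below warning below error is my assumption about what "max" means here.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    # Assumed ordering; a higher value wins when choosing the file icon.
    CLEAN = 0     # green checkmark
    INFO = 1      # information icon
    WARNING = 2   # caution icon
    ERROR = 3     # error icon


def file_icon(message_severities: Iterable[Severity]) -> Severity:
    """The file's icon is the most severe icon among its messages."""
    return max(message_severities, default=Severity.CLEAN)
```

A file with one informational note and one error therefore shows the error icon, and a file with no messages at all shows the checkmark.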

The dropdown marked "Error Summary" at the top just gives a list of all the error messages, sorted alphabetically. This is convenient if you are looking for patterns of errors and want to get a sense of what is going wrong.

The system validated all 118 of my HTML files in seven seconds, on a 2GHz Pentium 4.

Main Menu Documentation

File

Add file to list...

Brings up a standard file dialog and adds one or more files to the list of files to be processed.

Close file

Closes the file which is open in the right-hand panel.

Save current file

Saves the file displayed in the right-hand pane. Disabled until the file is modified.

Save current file as...

Saves the file under a different name.

Print HTML...

Prints the HTML text out. If a color printer is specified, the color highlighting of the errors will also be printed.

Exit

Exits the program.

Filters

Edit filter list...

Brings up the filter-edit dialog. In this dialog, each filter can be individually enabled or disabled. The filter list is retained between runs.

✓ Use filter list

Indicates that the filters should be used. If unchecked, this effectively disables all filters, whether enabled or not.
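
The interaction between the per-filter enable flag and the global Use filter list switch can be sketched as follows. The `Filter` class and `visible_messages` function are hypothetical, and matching a filter by exact message text is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Filter:
    text: str             # message text to suppress (exact match assumed)
    enabled: bool = True  # the per-filter checkbox in the filter-edit dialog


def visible_messages(messages: Iterable[str],
                     filters: Iterable[Filter],
                     use_filter_list: bool) -> List[str]:
    """Return the messages that survive filtering."""
    if not use_filter_list:
        # The global switch is off: every filter is ignored, enabled or not.
        return list(messages)
    blocked = {f.text for f in filters if f.enabled}
    return [m for m in messages if m not in blocked]
```

With the switch off, nothing is hidden; with it on, only the filters that are individually enabled take effect.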

Save Filters...

Saves the filters as an XML text file. The current filter filename is used. If there is no current filter filename, the command is the same as Save Filters As...

Save Filters As...

Saves the filters as an XML text file. The user is given a standard file dialog to supply the name of the filter file.

Load filters...

Loads a set of filter definitions from a file. These filter definitions completely replace the existing set of filter definitions.
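
The layout of the filter file is not documented here, so the following is only an illustration of saving and loading a filter list as XML. The element name `filter`, the root `filters`, and the `enabled` attribute are invented for the example and are not the actual MultiTidy schema.

```python
import xml.etree.ElementTree as ET
from typing import List, Tuple

FilterEntry = Tuple[str, bool]  # (message text, enabled flag)


def filters_to_xml(filters: List[FilterEntry]) -> str:
    """Serialize the filter list; markup characters in messages are escaped."""
    root = ET.Element("filters")  # hypothetical root element
    for text, enabled in filters:
        item = ET.SubElement(root, "filter",
                             enabled="true" if enabled else "false")
        item.text = text
    return ET.tostring(root, encoding="unicode")


def filters_from_xml(xml_text: str) -> List[FilterEntry]:
    """Parse a filter file into a fresh list."""
    root = ET.fromstring(xml_text)
    return [(item.text or "", item.get("enabled") == "true")
            for item in root.iter("filter")]
```

Loading returns a new list rather than merging into an existing one, which mirrors the replace-on-load behavior described above. Messages such as `missing <!DOCTYPE> declaration` contain angle brackets, so the escaping done by the XML writer matters.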

Configuration

Set configuration file name...

Sets the name of the configuration file to be used when the program starts up. This filename is retained in the Registry.

Edit Configuration...

Brings up the configuration editor. This provides an interface to most of the features of the tidy library.

Line Wrap

This applies to the wrapping of lines if you ask the tidy library to "rewrite" your HTML to be conforming. It has no effect on the display of the raw HTML that came in.

Option   Effect
Wrap margin   Sets the margin at which tidy wraps lines for display purposes. The default is 72.
Wrap attribute values   Allows attribute values to be split.
Wrap script literals   Allows script literals to be split across lines.
Break before <br>   Places a linebreak before each <br>.
Wrap lines in ASP elements (<%...%>)   Allows lines within an ASP element to be line-wrapped.
Wrap lines in JSTE elements (<#...#>)   Allows lines within a JSTE element to be line-wrapped.
Wrap lines in section tags (<![...]>)   Allows lines within section tags to be line-wrapped.
Wrap lines in PHP elements   Allows lines within PHP elements to be line-wrapped.

Indentation

This applies to the indentation rules applied if you ask the tidy library to "rewrite" your HTML to be conforming. It has no effect on the display of the raw HTML that came in.

Option   Effect
Block-Level Tags   Controls the indentation of block-level tags. Yes, No and Auto are as specified by the tidy library.
Indent attributes   Allows attributes to be automatically indented if the line is too long; the attributes continue on the following lines, indented under the enclosing element.
Indent CDATA   Indents CDATA elements.
Indentation   The number of spaces of indentation to be used. The default is 3.

Tags

This feature is not currently implemented.

Misc

Other features

Option   Effect
Show warnings   Shows warning and information messages.
Add Tidy meta element   If tidy rewrites the HTML, a meta-element is added indicating that tidy has processed the file.
Replace hex color codes by names   Replaces the hex color codes, such as #FF0000, with the appropriate color name.
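
To make the last option concrete, here is a simplified illustration of replacing hex color codes with names. It is not how the tidy library does it: the table holds only a handful of standard HTML color names, and the substitution is applied to any six-digit hex code in the text rather than only to color attributes. Both `HEX_TO_NAME` and `replace_hex_colors` are hypothetical names.

```python
import re

# A few of the standard HTML color names; a real table would be longer.
HEX_TO_NAME = {
    "#000000": "black",
    "#ffffff": "white",
    "#ff0000": "red",
    "#00ff00": "lime",
    "#0000ff": "blue",
}

_HEX_COLOR = re.compile(r"#[0-9a-fA-F]{6}\b")


def replace_hex_colors(html: str) -> str:
    """Replace known six-digit hex colors with names; leave others untouched."""
    def swap(match):
        code = match.group(0)
        return HEX_TO_NAME.get(code.lower(), code)
    return _HEX_COLOR.sub(swap, html)
```

Codes with no entry in the table are left exactly as they were, so the rewrite never changes the color that is displayed.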

Tree

Expand all

Expands all the entries that have warnings, info, or errors, so it is easy to see all the messages.

Expand Errors

Expands only those nodes which have either warning icons or error icons. Informational nodes are not expanded.

Collapse errors

Collapses all the expanded nodes.

Drop cleaned files

All files which are marked with a check mark are dropped from the list. This leaves only the files with some form of annotation in the list.

Clear File List

Removes all names from the file list.

Tidy All!

Applies the tidy operation to every file in the list.

Help

About...

Displays the usual About box.

Right-click menu items

Filename menu

Delete file from set of files

Removes the file from consideration. Note that this does not delete the file from the disk, only from the set of files shown in the left pane of the window. Since the most common reason to remove a file is that it is fully validated, see also Drop cleaned files.

Re-tidy text

Reapplies the tidy function to this file only. This is typically done after a number of filters have been added. To re-tidy all files, use the Tidy All! item of the main menu.

Replace with tidied text

The tidy library has the ability to rewrite the HTML to be conforming. This item will replace the contents of the text with the rewritten text. This text may then be saved replacing the current file or as a new file. If the text is not saved, the original file is not modified.

Save text

This is the same as File → Save current file.

Save text as...

This is the same as File → Save current file as...

Error Message menu

Show Source

This is the same as double-clicking the error message. The line containing the error is displayed in the right panel. The file position indicated by the [line, char] information in the message is highlighted. Because the tidy library does not provide a length, everything to the end of the line will be highlighted. In most cases, this is sufficient to pick out the error, but it may highlight more material than the error itself. Note that the window will occasionally mess up and highlight everything in red; in this case, just close the file by clicking the [X] icon in the title bar of the text window (not the one on the main caption bar) and reload the file.
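
The highlight-to-end-of-line rule amounts to turning a [line, char] pair into a range of character offsets. A minimal sketch, assuming both numbers are 1-based and using the hypothetical name `highlight_span`:

```python
from typing import Tuple


def highlight_span(text: str, line: int, column: int) -> Tuple[int, int]:
    """Return (start, end) offsets covering the error position to end of line.

    line and column are assumed to be 1-based, as in the message text.
    """
    lines = text.splitlines(keepends=True)
    if not 1 <= line <= len(lines):
        raise ValueError("line number is outside the file")
    # Offset of the first character of the requested line.
    line_start = sum(len(prev) for prev in lines[:line - 1])
    # The highlight stops before the line terminator.
    content = lines[line - 1].rstrip("\r\n")
    start = line_start + min(max(column - 1, 0), len(content))
    end = line_start + len(content)
    return start, end
```

For the text `"<html>\n<p><b>bold</p>\n</html>\n"`, a message at line 2, char 4 yields the span covering `<b>bold</p>`: it begins at the reported position and runs to the end of that line, which is more than the offending tag alone.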

Add to filter list

The text of this warning is added to the filter list. If Filters → Use filter list is checked, and the filter is checked in the filter list, this message will be ignored if issued for any file. After a number of filters have been added, it is common to re-tidy the text for this file to see their effect.

Ignore All

The message is removed from all files, but is not actually added to the filter list.

Delete Message

The message is deleted. No other change occurs. A re-tidying will cause the message to reappear.

Downloads

Executable only. MFC, VS6, requires MFC42.DLL.
Complete source, including the tidy library.


The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.

Send mail to newcomer@flounder.com with questions or comments about this web site.
Copyright © 1999-2011, The Joseph M. Newcomer Co.
Last modified: June 17, 2011