HTML Validation: MultiTidy
I have recently been involved in a project that produces a massive set of HTML pages from an XML database. These pages feature all sorts of automation; for example, documentation sections are selected and arranged on an HTML page by reading possibly diverse pieces of the input XML file. More than that I cannot say explicitly because of NDA considerations.
The problem was that I was generating well over a thousand HTML pages, and they needed to be checked for conformance to HTML syntax. I looked around and found a number of pieces of code; the best is the "tidy" library, found at SourceForge.net. This is a very nice library, but all of the applications associated with it followed the primitive model of Unix command-line apps. One of the wrappers allowed specifying wildcards, but when it issued error messages, it didn't even identify the file that contained the error! (A Unix person would find this completely acceptable, but I have never liked the Unix philosophy that raw byte streams make sense.)
So I wrote my own wrapper around the Tidy library, called MultiTidy. I can select as many files as I want, and it will run all of them through the tidy subroutine, and I will get a summary of all the errors, warnings, etc.
The problem with tidy is that it is very, very persnickety; it demands conformance to standards that are in many cases unrealistic; for example, it will not recognize the huge set of color names supported by every browser I am aware of. This generates a lot of unnecessary diagnostic messages. So I added the ability to filter out messages that are not relevant.
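The filtering idea can be sketched in a few lines of C++. This is only an illustration under stated assumptions, not MultiTidy's actual code: the `Filter` and `FilterList` types and the function names are hypothetical, and a filter is assumed to match a message by its exact text. The per-filter `enabled` flag and the global `useFilters` switch mirror the per-filter checkboxes and the "use filter list" option described in the menu reference further on.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical types for illustration only; the article does not show
// MultiTidy's own data structures.
struct Filter {
    std::string text;  // message text to suppress (exact match is assumed)
    bool enabled;      // per-filter checkbox in the filter-edit dialog
};

struct FilterList {
    std::vector<Filter> filters;
    bool useFilters = true;  // the global "use filter list" switch

    // Adding a message to the filter list creates a new, enabled filter.
    void add(const std::string& text) {
        assert(!text.empty());
        filters.push_back(Filter{text, true});
    }

    // True if the message should be hidden from the display.
    bool suppresses(const std::string& message) const {
        if (!useFilters) return false;  // switch off: no filter applies
        for (const Filter& f : filters) {
            if (f.enabled && f.text == message) return true;
        }
        return false;
    }
};

// Keeps only the messages that no active filter suppresses.
std::vector<std::string> visibleMessages(const FilterList& list,
                                         const std::vector<std::string>& messages) {
    std::vector<std::string> out;
    for (const std::string& m : messages) {
        if (!list.suppresses(m)) out.push_back(m);
    }
    return out;
}
```

Matching on the full message text is the simplest choice; a real tool might prefer to match on a message code or a pattern, which this sketch does not attempt.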
You can download the complete source, or just a VS6 MFC executable. Go to the downloads section.
A sample of the screen, as applied to my own Web site, is shown here. On the left is a tree view of the directory structure being checked. Each file is shown with a green checkmark (the file has no errors, warnings, or messages), a caution icon (warnings present), an error icon (errors present), or an information icon (informational messages present). Clicking on a file expands it, and the details of the informational notes, warnings, or errors are shown in the subtree. Note that the icon displayed for the file is the "max" of the icons for the individual messages.
If a message is felt to be inappropriate, you can right-click on the message and add it to the list of filtered messages. A filtered message is no longer displayed. For example, FrontPage does not put a <!DOCTYPE> declaration in the files, so this would appear as a warning on every file. I simply right-clicked the message and added it to the filter list; to see the effect, I could then right-click and select the "re-tidy" option to reapply tidy to the file in question.
Double-clicking on an error message locates the error text in the original file and highlights it; the tidy library does not report where the error ends, so I highlight everything to the end of the line, as shown above. Note that all errors are highlighted in both red and boldface, although the same overkill to the end of the line applies. When the document is printed, this boldface text jumps out and makes the location of the errors apparent.
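The "max" rule for the file icon might look like the following sketch. The `Severity` enum and its ordering (clean, then informational, then warning, then error) are assumptions made for illustration; the text only says that the file icon is the maximum of the message icons.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Assumed ordering from least to most severe. The article does not spell
// out the order; this one treats informational notes as milder than warnings.
enum class Severity { Clean = 0, Info = 1, Warning = 2, Error = 3 };

// The file's icon is the highest severity among its messages. A file with
// no messages keeps the green checkmark (Clean).
Severity fileSeverity(const std::vector<Severity>& messages) {
    Severity worst = Severity::Clean;
    for (Severity s : messages) {
        worst = std::max(worst, s);
    }
    return worst;
}

// Small usage check: an informational note plus a warning gives the warning icon.
inline void fileSeverityExample() {
    assert(fileSeverity({Severity::Info, Severity::Warning}) == Severity::Warning);
}
```

If filtered messages are excluded before this function is called, a file whose only messages are filtered would fall back to the checkmark.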
The dropdown marked "Error Summary" at the top just gives a list of all the error messages, sorted alphabetically. This is convenient if you are looking for patterns of errors and want to get a sense of what is going wrong.
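A summary of this kind could be built as sketched below. The `FileResult` type and the function name are hypothetical, and counting how often each message occurs is an addition for illustration; the text only states that the messages are listed alphabetically.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Hypothetical per-file result; not MultiTidy's actual type.
struct FileResult {
    std::string path;
    std::vector<std::string> messages;
};

// Maps each distinct message text to the number of times it occurs.
// std::map keeps its keys sorted, so iterating the result yields the
// messages in alphabetical (byte-wise) order.
std::map<std::string, int> errorSummary(const std::vector<FileResult>& files) {
    std::map<std::string, int> summary;
    for (const FileResult& f : files) {
        for (const std::string& m : f.messages) {
            ++summary[m];
        }
    }
    return summary;
}

// Small usage check with made-up messages: the same text in two files
// collapses into one summary entry with a count of 2.
inline void errorSummaryExample() {
    std::vector<FileResult> files = {
        {"a.htm", {"sample message"}},
        {"b.htm", {"sample message"}},
    };
    assert(errorSummary(files)["sample message"] == 2);
}
```

Note that the default `std::string` ordering is case-sensitive, so messages starting with an uppercase letter sort ahead of lowercase ones.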
The system validated all 118 of my HTML files in seven seconds, on a 2GHz Pentium 4.
Brings up a standard file dialog and adds one or more files to the list of files to be processed.
Closes the file which is open in the right-hand panel.
Saves the file displayed in the right-hand pane. Disabled until the file is modified.
Saves the file under a different name.
Prints the HTML text out. If a color printer is specified, the color highlighting of the errors will also be printed.
Exits the program.
Brings up the filter-edit dialog. In this dialog, each filter can be individually enabled or disabled. The filter list is retained between runs.
Indicates that the filters should be used. If unchecked, this effectively disables all filters, whether enabled or not.
Saves the filters as an XML text file. The current filter filename is used. If there is no current filter filename, the command is the same as Save Filters As...
Saves the filters as an XML text file. The user is given a standard file dialog to supply the name of the filter file.
Loads a set of filter definitions from a file. These filter definitions completely replace the existing set of filter definitions.
Sets the name of the configuration file to be used when the program starts up. This filename is retained in the Registry.
Brings up the configuration editor. This provides an interface to most of the features of the tidy library.
This applies to the wrapping of lines if you ask the tidy library to "rewrite" your HTML to be conforming. It has no effect on the display of the raw HTML that came in.
|Wrap margin||Sets the margin at which tidy wraps lines for display purposes. The default is 72.|
|Wrap attribute values||Allows attribute values to be split.|
|Wrap script literals||Allows script literals to be split across lines.|
|Break before <br>||Places a line break before each <br>.|
|Wrap lines in ASP elements (<%...%>)||Allows lines within an ASP element to be line-wrapped.|
|Wrap lines in JSTE elements (<#...#>)||Allows lines within a JSTE element to be line-wrapped.|
|Wrap lines in section tags (<![...]>)||Allows lines within section tags to be line-wrapped.|
|Wrap lines in PHP elements||Allows lines within PHP elements to be line-wrapped.|
This applies to the indentation rules applied if you ask the tidy library to "rewrite" your HTML to be conforming. It has no effect on the display of the raw HTML that came in.
|Block-Level Tags||The indentation of the tags. Yes, No and Auto are as specified by the tidy library.|
|Indent attributes||Allows attributes to be automatically indented if the line is too long; the attributes will continue, but indented under the enclosing environment.|
|Indent CDATA||Indents CDATA elements.|
|Indentation||The number of spaces of indentation to be used. The default is 3.|
This feature is not currently implemented.
|Show warnings||Shows warning and information messages.|
|Add Tidy meta element||If tidy rewrites the HTML, a meta-element is added indicating that tidy has processed the file.|
|Replace hex color codes by names||Replaces the hex color codes, such as #FF0000, with the appropriate color name.|
Expands all the entries that have warnings, info, or errors, so it is easy to see all the messages.
Expands only those nodes which have either warning icons or error icons. Informational nodes are not expanded.
Collapses all the expanded nodes.
All files which are marked with a check mark are dropped from the list. This leaves only the files with some form of annotation in the list.
Removes all names from the file list.
Applies the tidy operation to every file in the list.
Displays the usual About box.
Removes the file from consideration. Note that this does not delete the file from the disk, only from the set of files shown in the left pane of the window. Since the most common reason to remove a file is that it is fully validated, see also Drop cleaned files.
Reapplies the tidy function to this file only. This is typically done after a number of filters have been added. To re-tidy all files, use the Tidy All! item of the main menu.
The tidy library has the ability to rewrite the HTML to be conforming. This item will replace the contents of the text with the rewritten text. This text may then be saved replacing the current file or as a new file. If the text is not saved, the original file is not modified.
This is the same as File → Save current file.
This is the same as File → Save current file as...
This is the same as double-clicking the error message. The line containing the error message is displayed in the right panel. The file position indicated by the [line, char] information in the message is highlighted. Because the tidy library does not provide a length, everything to the end of the line will be highlighted. In most cases, this is sufficient to identify only the error, but it may highlight more material. Note that the window will occasionally mess up and highlight everything in red; in this case, just close the file by clicking the [X] icon in the title bar of the text window (not the one on the main caption bar) and reload the file.
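The highlight-to-end-of-line rule can be sketched as follows. This is an illustration under assumptions rather than the program's actual code: the function name is hypothetical, the line and column are taken to be 1-based, and lines are assumed to end in `'\n'`.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Returns the half-open character range [start, end) to highlight in `text`
// for a message reported at a 1-based (line, column) position. No length is
// available, so the range runs to the end of that line. If the line does not
// exist, an empty range at the end of the text is returned.
std::pair<std::size_t, std::size_t> highlightRange(const std::string& text,
                                                   int line, int column) {
    assert(line >= 1 && column >= 1);
    std::size_t lineStart = 0;
    for (int i = 1; i < line; ++i) {
        std::size_t nl = text.find('\n', lineStart);
        if (nl == std::string::npos) return {text.size(), text.size()};
        lineStart = nl + 1;  // first character of the next line
    }
    std::size_t lineEnd = text.find('\n', lineStart);
    if (lineEnd == std::string::npos) lineEnd = text.size();
    std::size_t start = lineStart + static_cast<std::size_t>(column - 1);
    if (start > lineEnd) start = lineEnd;  // clamp a column past the line end
    return {start, lineEnd};
}
```

For the text `"<html>\n<body bgcolor=puce>\n</html>\n"` and the position line 2, column 7, the range covers `bgcolor=puce>`, which shows the "overkill" effect: the closing `>` is highlighted along with the offending attribute. With Windows line endings the trailing `'\r'` would also fall inside the range unless it is trimmed separately.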
The text of this warning is added to the filter list. If Filters → Use filter list is checked, and the filter is checked in the filter list, this message will be ignored if issued for any file. After adding a number of filters, it is often useful to re-tidy the file to see their effect.
The message is removed from all files, but is not actually added to the filter list.
The message is deleted. No other change occurs. A re-tidying will cause the message to reappear.
|Executable only. MFC, VS6, requires MFC42.DLL.|
|Complete source, including the tidy library.|
The views expressed in these essays are those of the author, and in no way represent, nor are they endorsed by, Microsoft.