Lessons by Jon

Parsing Data in a CGI

As you saw in Lesson 1, the information passed to your CGI is pretty messy. That's because the data from all of the fields and buttons of the form is concatenated into one long block of text. To let you decipher the text, two key characters, "=" and "&", are used as delimiters: "&" separates one field's entry from the next, while "=" separates the name of the field from the data that was in it. The structure of the returned text looks something like this (the field names and data here are placeholders):

   field1=data1&field2=data2&field3=data3

In order to make sense of this mess, you will need to break it back up into its various pieces of information about each field. To do this I recommend using the Tokenize OSAX. The Tokenize OSAX is designed to take in data that has key characters as delimiters and return a list of items with each item representing the data between two delimiters (read the documentation that came with Tokenize for a better explanation).

This lesson is the first one where you use an OSAX (Open Scripting Architecture eXtension, also known as a scripting addition). If you are not familiar with how they work, get a book on AppleScript and read up about it. As a basic explanation, OSAX act as extensions to the AppleScript language. They either allow you to do things you couldn't do in AppleScript (such as listing all aliases on a disk) or they do something very fast that would take many lines in AppleScript. I may provide some quick information later, but it probably won't be nearly as nice as the BMUG guide or Danny Goodman's book.

Required OSAX

Tokenize

NOTE: if you have not yet installed this OSAX, then do it before starting this lesson. The script will not compile without it. Go back to the Requirements section to download the OSAX if you need it.

Script4.txt - Parsing Data

Here is the entire script for this lesson. The comments have been removed so you see only the lines that actually get compiled. The full script, including comments and special characters, is in the archive with the name "Script4.txt".

property crlf : (ASCII character 13) & (ASCII character 10)
property http_10_header : "HTTP/1.0 200 OK" & crlf & "Server: MacHTTP" & crlf & ¬
	"MIME-Version: 1.0" & crlf & "Content-type: text/html" & crlf & crlf
property idletime : 300
property datestamp : 0

set datestamp to current date

on «event WWWΩsdoc» path_args ¬
   given «class kfor»:http_search_args, ¬
      «class post»:post_args, «class meth»:method, ¬
      «class addr»:client_address, «class user»:username, ¬
      «class pass»:password, «class frmu»:from_user, ¬
      «class svnm»:server_name, «class svpt»:server_port, ¬
      «class scnm»:script_name, «class ctyp»:content_type
 try
   set datestamp to current date

   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Parsed Results</TITLE></HEAD>" ¬
      & "<BODY><H1>Parsed Results</H1>" & return
   set return_page to return_page & "<H4>post_args</H4><PRE>" & return

   set postarglist to tokenize post_args with delimiters {"&"}

   set postargtext to ""
   repeat with curritem in postarglist
      set postargtext to postargtext & curritem & return
   end repeat

   set return_page to return_page ¬
      & postargtext & "</PRE>" & return
   set return_page to return_page ¬
      & "<HR><I>Results generated at: " & (current date) ¬
      & "</I>" & "</BODY></HTML>"
   return return_page

 on error errMsg number errNum
   set return_page to http_10_header ¬
      & "<HTML><HEAD><TITLE>Error Page</TITLE></HEAD>" ¬
      & "<BODY><H1>Error Encountered!</H1>" & return ¬
      & "An error was encountered while trying to run this script." & return
   set return_page to return_page ¬
      & "<H3>Error Message</H3>" & return & errMsg & return ¬
      & "<H3>Error Number</H3>" & return & errNum & return ¬
      & "<H3>Date</H3>" & return & (current date) & return
   set return_page to return_page ¬
      & "<HR>Please notify Jon Wiederspan at " ¬
      & "<A HREF=\"mailto:jonwd@tjp.washington.edu\">jonwd@tjp.washington.edu</A>" ¬
      & " of this error." & "</BODY></HTML>"
   return return_page
 end try
end «event WWWΩsdoc»

on idle
   if (current date) > (datestamp + idletime) then
      quit
   end if
   return 5
end idle

on quit
   continue quit
end quit

Step By Step

This script is almost identical to that in Lesson 3. The big difference is that we are building the return_page a little differently. Instead of returning the data in post_args in raw form, the way we did in previous lessons, we are now using the Tokenize OSAX to break the data in post_args into a list and doing some formatting of the list before returning it. Because the other arguments are pretty much useless at this point, I am not doing anything with them. Our only focus is the post_args.

The first line to look at is this one:

   set postarglist to tokenize post_args with delimiters {"&"}

This line takes in the data in post_args and separates it into a list using the "&" as the item delimiter (Note: As a side effect, all of the "&" characters in the text are removed). The output (postarglist) is a list which looks something like this (the field names and data here are placeholders):

   {"field1=data1", "field2=data2", "field3=data3"}

(Note: This is not text. This is a list, which is a special type in AppleScript. If you don't know what lists are, refer back to your AppleScript guide).
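If lists are new to you, here is a minimal sketch of the basic things you can do with one. The name sampleList and the data in it are made up for illustration; in the real script the list comes from tokenizing post_args.

```applescript
-- sampleList is hypothetical data standing in for a tokenized post_args
set sampleList to {"name=Jon", "color=blue", "size=large"}

-- how many items are in the list
set itemCount to count of sampleList

-- items are numbered starting at 1
set firstItem to item 1 of sampleList
set lastItem to last item of sampleList

-- a list is its own class, distinct from string
set listClass to class of sampleList

return {itemCount, firstItem, lastItem, listClass}
```

Run in the Script Editor, this should return the count, the first and last items, and the class of the list, which shows that each piece of form data can be reached on its own.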

This is a very workable format now for getting at the real information that the user typed into the form. The next step is to run through the list one item at a time and make a nice, readable format of the data. The following code takes care of that:

   set postargtext to ""
   repeat with curritem in postarglist
      set postargtext to postargtext & curritem & return
   end repeat

This code uses the "repeat with" control structure to repeat an action for each item in a list. All that is done is to append each item to postargtext as text with a return at the end. You will see the difference when you test the script below.

I've chosen to use an OSAX here to help with parsing the data. There are two reasons why this is a good thing to do.

  1. OSAX are almost always much, much faster than equivalent code written in AppleScript. They take what would be several long, slow steps in AppleScript and write them in some other language, like C or Pascal, where they can be executed very quickly. This isn't like the difference between the tortoise and the hare, though. It's more like the tortoise and the Lamborghini Countach.
  2. Using an OSAX can greatly reduce the size and complexity of your script. Usually an OSAX takes a long section of code and replaces it with one (hopefully) easy to read command.

Of course, no one is forcing you to use these OSAXen. Perhaps you're one of those people who still try to write major applications for the Apple II just to show that it can be done. If so, you may be interested in the code segment below. This is what you would have had to type in AppleScript to do the same parsing without using Tokenize:

   set oldDelim to AppleScript's text item delimiters
   set AppleScript's text item delimiters to {"&"}
   set postarglist to text items of post_args
   set AppleScript's text item delimiters to oldDelim

I know what you're thinking. You're saying "That wasn't so bad. What a wimp he is, running off to find an OSAX to replace four measly lines". Well, I haven't timed the script to see how much faster it is, but I expect the difference to be noticeable if you start parsing a large form with 30 or 50 different items in it. (If you do try this and it isn't faster, I don't want to know.)
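If you do go the plain AppleScript route, it may help to wrap those four lines in a handler so the delimiters are always put back, even if something goes wrong. The following is only a sketch; the handler name splitText and the sample text are my own inventions, not part of AppleScript or Tokenize.

```applescript
-- splitText is a hypothetical helper name used only for this sketch
on splitText(theText, theDelimiter)
   set oldDelim to AppleScript's text item delimiters
   try
      set AppleScript's text item delimiters to {theDelimiter}
      set theItems to text items of theText
      set AppleScript's text item delimiters to oldDelim
   on error errMsg number errNum
      -- put the delimiters back before passing the error along
      set AppleScript's text item delimiters to oldDelim
      error errMsg number errNum
   end try
   return theItems
end splitText

-- made-up sample data in the same format as post_args
splitText("name=Jon&color=blue", "&")
```

With this in place, the parsing line in the script could read "set postarglist to splitText(post_args, "&")" instead of calling tokenize.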

Test the Script

There are two forms to try this time. Both of them have many more fields and other elements than did the previous forms. One returns the data unparsed and the other returns the parsed data. Compare the results of the two until you have a good feel for what is being done to the data. Also, be sure to look at the structure of each item in the parsed data. You will notice the "name=data" format there.
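To see how the "name=data" format could be taken apart one step further, here is a rough sketch that splits a single item at the "=" character. The variable names and the sample item are assumptions for illustration, and the sketch relies on the item containing at most one "=".

```applescript
-- sampleItem is made-up data in the "name=data" format
set sampleItem to "color=blue"

set oldDelim to AppleScript's text item delimiters
set AppleScript's text item delimiters to {"="}
set parts to text items of sampleItem
set AppleScript's text item delimiters to oldDelim

-- the part before "=" is the field name
set fieldName to item 1 of parts

-- the part after "=" is the data; it may be missing or empty
if (count of parts) > 1 then
   set fieldData to item 2 of parts
else
   set fieldData to ""
end if

return {fieldName, fieldData}
```

For the sample item above this gives the field name "color" and the data "blue".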

Wrap It Up

Now that you have learned something of how to parse the information that is passed in post_args, you should be able to see why there are some problems with the use of certain characters in the form fields. If someone types a "&" into one of the fields, or if you have a field with "&" in its name, that could mess up the parsing above. To avoid such problems, the client software (NOT the server, remember?) converts all "&" to "%26" before passing the data back to the server. The clients also do the same for "=" characters, which are used to separate the field_name and field_data; these are converted to "%3D". This means that people can still use those two characters in the form fields, although it takes a little extra processing to convert them back to readable form.
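As a rough idea of what that extra processing might look like, the sketch below converts the encoded forms of "&" and "=" back to the real characters. The handler name replaceText and the sample data are hypothetical. Note that this has to happen after the text has been split on "&" and "=", or the decoded characters would be mistaken for delimiters.

```applescript
-- replaceText is a hypothetical helper name used only for this sketch
on replaceText(theText, searchString, replacementString)
   set oldDelim to AppleScript's text item delimiters
   -- split the text wherever searchString appears
   set AppleScript's text item delimiters to {searchString}
   set theItems to text items of theText
   -- join the pieces again with replacementString between them
   set AppleScript's text item delimiters to {replacementString}
   set newText to theItems as string
   set AppleScript's text item delimiters to oldDelim
   return newText
end replaceText

-- made-up field data as it might arrive from a client
set fieldData to "Tom%26Jerry%3Dfriends"
set fieldData to replaceText(fieldData, "%26", "&")
set fieldData to replaceText(fieldData, "%3D", "=")
return fieldData
```

For the sample data this should give back "Tom&Jerry=friends". A client that writes the hex digits in lower case (for example "%3d") may need its own pass.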

There is another similar problem. As I may have said several times now, there are some clients, notably NCSA Mosaic and Netscape, that encode spaces as "+" instead of as "%20". This is a pain in the neck not only because it creates a special case while decoding, but also because it means there is another special character (the + sign) that needs to be encoded by the client. We will address this in later lessons, but if it bugs you as much as it does me, send a note to the developers and ask them to please stop doing this.
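Until then, here is a small sketch of why the "+" case needs care. It assumes the client encodes a real plus sign as "%2B", and it turns each "+" into a space before decoding "%2B", so that genuine plus signs are not mistaken for spaces. The handler name replaceText and the sample data are made up for illustration.

```applescript
-- replaceText is a hypothetical helper name used only for this sketch
on replaceText(theText, searchString, replacementString)
   set oldDelim to AppleScript's text item delimiters
   set AppleScript's text item delimiters to {searchString}
   set theItems to text items of theText
   set AppleScript's text item delimiters to {replacementString}
   set newText to theItems as string
   set AppleScript's text item delimiters to oldDelim
   return newText
end replaceText

-- made-up field data: the user typed "1 + 1 = 2"
set fieldData to "1+%2B+1+%3D+2"

-- first turn the encoded spaces back into spaces
set fieldData to replaceText(fieldData, "+", " ")
-- only then decode the real plus and equals signs
set fieldData to replaceText(fieldData, "%2B", "+")
set fieldData to replaceText(fieldData, "%3D", "=")
return fieldData
```

Doing the steps in the other order would turn the user's own plus sign into a space along with all the others.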

Jon Wiederspan
Last Edited: December 11, 1994