Here is a short description of some projects I worked on in the period 1988-1999. These are roughly in order of when I began them. Practically all of these were projects driven by my own needs, or those of my friends. In all but one or two cases, the programs, and the approaches and algorithms they employed, were all of my own construction. None of this work was ever published, though some of these things did appear in Usenet newsgroups, principally in (the now defunct) comp.sources.unix.
In addition to these projects (which range in size from about 1000 lines to 15,000 lines), I have produced any number of smaller programs, mainly written in C, perl, Tcl/Tk and Emacs lisp. I have also produced several other large programs which have been published or are described elsewhere (principally Echo and AVS), and between 10 and 30 commercial web sites, depending on what you count.
A front-end for the vi editor. Kept a per-directory list of the most
recently visited files and provided fast access to these (by number,
by abbreviated pattern, by history). Editing the last file or second
last file visited in a directory was very fast. Provided simple
spelling correction (transposed, extra, or missing characters) on
file names. Detected and avoided
the vi modelines security problem. This program was written up in
the /bin column of the August 1990 edition of BYTE as one of the
author's favorite pieces of free software. The program was in
wide use around the world from 1989 onwards, after appearing in
comp.sources.unix.
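The correction amounted to asking whether a typed name was within a
single simple edit of a known file name. A minimal sketch of that
test (my reconstruction, not the original code; the name off_by_one
is illustrative):

    #include <string.h>

    /* Return 1 if `typed' differs from `name' by one transposition,
     * one extra character, or one missing character. */
    static int off_by_one(const char *typed, const char *name)
    {
        size_t tl = strlen(typed), nl = strlen(name);
        size_t i = 0;

        while (typed[i] && typed[i] == name[i])
            i++;

        if (tl == nl)                   /* Adjacent transposition. */
            return typed[i] && typed[i] == name[i + 1] &&
                   typed[i + 1] == name[i] &&
                   strcmp(typed + i + 2, name + i + 2) == 0;
        if (tl == nl + 1)               /* One extra character typed. */
            return strcmp(typed + i + 1, name + i) == 0;
        if (nl == tl + 1)               /* One missing character. */
            return strcmp(typed + i, name + i + 1) == 0;
        return 0;
    }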
A set of memory management algorithm simulation tools. Together with
Peter Buhr of the University of Waterloo, I implemented
approximately 6 memory management algorithms. I built a system to
run these and measure various aspects of their performance. The
system took files of memory size requests, delivered them to each
algorithm under controlled conditions, and calculated performance
results. The measurement system contained a timing package that
allowed for the timing of multiple simultaneous code sections. This
was done using the UNIX setitimer calls.
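For example, overlapping sections can be timed by counting periodic
timer ticks against every section currently open. A rough sketch of
that idea (the names here are illustrative, not the package's API):

    #include <signal.h>
    #include <sys/time.h>

    #define MAXSECTIONS 32

    static volatile sig_atomic_t ticks[MAXSECTIONS];
    static volatile sig_atomic_t active[MAXSECTIONS];

    /* Charge each timer tick to every section that is open. */
    static void tick(int sig)
    {
        int i;

        (void)sig;
        for (i = 0; i < MAXSECTIONS; i++)
            if (active[i])
                ticks[i]++;
    }

    void timing_init(void)
    {
        struct itimerval it;

        signal(SIGALRM, tick);
        it.it_interval.tv_sec = 0;
        it.it_interval.tv_usec = 10000;    /* One tick every 10ms. */
        it.it_value = it.it_interval;
        setitimer(ITIMER_REAL, &it, NULL);
    }

    void section_start(int s) { active[s] = 1; }
    void section_stop(int s)  { active[s] = 0; }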
A thread-safe replacement for BSD malloc. When I discovered how
horribly inefficient Berkeley malloc was, I wrote my own malloc.
This pre-allocated a large piece of memory (its size determined by an
initialization call) and set up special queues for frequently used
sizes. Initialization looked like

    lmmm_init(64 * 1024, sizeof(struct1), sizeof(struct2), ...);

The reasoning was that many programs allocate many structures
whose size they know ahead of time. The allocation algorithm first
consulted these fixed-size lists and, failing a reasonable match,
reverted to something that looked like BSD malloc (without the gross
repeated-sbrk inefficiencies and internal fragmentation caused by a
4K page size on our VAX). This was designed and written in 1989, and is
still in use in Waterloo's uKernel.
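The fixed-size queues can be sketched like so (a reconstruction of
the idea only: it ignores the pre-allocated region, locking, and the
BSD-style fallback, and the names are illustrative):

    #include <stddef.h>
    #include <stdlib.h>

    #define NSIZES 8

    struct chunk { struct chunk *next; };

    static size_t class_size[NSIZES];        /* From lmmm_init(). */
    static struct chunk *free_list[NSIZES];  /* One queue per size. */

    /* Assumes every registered size is >= sizeof(struct chunk *). */
    void *lmmm_alloc(size_t n)
    {
        int i;

        for (i = 0; i < NSIZES && class_size[i]; i++)
            if (class_size[i] == n && free_list[i]) {
                struct chunk *c = free_list[i]; /* Reuse a queued chunk. */
                free_list[i] = c->next;
                return c;
            }
        return malloc(n);                   /* General-purpose fallback. */
    }

    void lmmm_free(void *p, size_t n)
    {
        int i;

        for (i = 0; i < NSIZES && class_size[i]; i++)
            if (class_size[i] == n) {       /* Queue for fast reuse. */
                struct chunk *c = p;
                c->next = free_list[i];
                free_list[i] = c;
                return;
            }
        free(p);
    }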
A little language for the specification of graphs. This allowed the
simple production of a paper that required approximately 250
graphs. Provided for multiple edges, did layout of multiple graphs
in a fixed horizontal space and was generally useful. The program
produced pic output. 1989.
I wrote a package for the rapid construction of UNIX filters. I was
processing many multi-megabyte files of words and needed more speed
than awk. Perl barely existed. My package turned command-line
options into a C program that was compiled and executed on the
fly. The resulting executable could, of course, be reused. It also
provided the C code as a basis for the
construction of more advanced filters. The package took advantage of
/dev/tty, tricking the C compiler into compiling standard input,
removing the need for an intermediate C file. Written in C in 1988.
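The generated program was essentially a read-process-print loop with
the user's expression spliced into the middle; something of this
shape (a guess at the generated code, not the real output):

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        char line[BUFSIZ];

        while (fgets(line, sizeof(line), stdin)) {
            /* Code built from the command-line options went here,
             * e.g. print only lines longer than 40 characters: */
            if (strlen(line) > 40)
                fputs(line, stdout);
        }
        return 0;
    }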
A full Pascal compiler. This was written with a partner, and
completed in 13 weeks. We did it all, except for the final steps in
implementing sets. It produced VAX assembler and was more efficient
than the UNIX Pascal compiler in terms of size of code produced,
time to compile, and run time of the resulting code (a requirement
to pass the course). Written in C in 1988.
A novel peephole optimizer for the above compiler. This was based on
Aho & Corasick's algorithm used in fgrep. The optimizer was provided
with a file of patterns indicating the kinds of inefficiencies the
compiler was known to produce. These took the form of matching rules
and replacement actions. The optimizer (as a result of the
algorithm) did just a single pass over the assembler code, and ran
very quickly. Written in C in 1988.
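Conceptually, each rule paired an instruction sequence with a
replacement, along these lines (the patterns are invented, VAX-ish
illustrations, not rules from the actual file):

    /* A matching rule and its replacement action. */
    struct rule {
        const char *pattern;       /* Instruction sequence to match. */
        const char *replacement;   /* What to emit instead. */
    };

    static const struct rule rules[] = {
        /* A push immediately popped into a register is just a move. */
        { "pushl r0\npopl r1", "movl r0,r1" },
        /* Adding zero is a no-op. */
        { "addl2 $0,r0", "" },
    };

Because the Aho & Corasick machine matches all patterns
simultaneously, the whole rule set could be applied in that single
pass.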
M, a mail/man/mkdir/more/make substitute. This was based on the
observation that which of these common commands was intended could
usually be determined by examining the command's arguments. M worked
very well, but was not widely adopted. Written in C in 1989.
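The dispatch can be sketched as follows (these particular heuristics
are illustrative guesses, not M's actual rules):

    #include <string.h>
    #include <sys/stat.h>

    /* Guess which command was intended from M's first argument. */
    static const char *guess(const char *arg)
    {
        struct stat st;

        if (strchr(arg, '@'))
            return "mail";   /* Looks like a mail address. */
        if (*arg && arg[strlen(arg) - 1] == '/')
            return "mkdir";  /* Trailing slash: a new directory. */
        if (stat(arg, &st) == 0 && S_ISREG(st.st_mode))
            return "more";   /* An existing file: page it. */
        return "man";        /* Otherwise, try it as a manual page. */
    }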
A mail server that allowed a remote user to send UNIX commands in
the mail and receive the output back via mail. This was before the
days of telnet and ftp access to all machines. Security was minimal,
based on the incoming address and a password contained in the mail. I
used the system extensively to access my account at Waterloo when I
was traveling. Written in C in 1988.
Strsed, a sed-like C function for strings. This was written for inclusion
in a TCL-like interpreted language, but evolved into a separate C
function that appeared in comp.sources.unix in 1990. Allowed
substitution operations with full regular expressions (using either
the GNU or Henry Spencer regexp code) and transliteration, a la
tr. Understood a
full range of backslashed escape codes. Functionality was similar to what
perl provides with its substitution and transliteration operators. As of
2007, strsed is still used in some parts of Apple's Darwin Ports
collection. Written in C in 1990.
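Typical use looked something like this (the prototype is an
assumption from memory, not a quotation of the posted interface):

    #include <stdio.h>

    char *strsed(char *string, char *pattern, int *range);

    int main(void)
    {
        /* Regexp substitution, as in sed. */
        printf("%s\n", strsed("colour", "s/ou/o/g", NULL));
        /* Transliteration, as in tr. */
        printf("%s\n", strsed("hello", "y/a-z/A-Z/", NULL));
        return 0;
    }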
A company-wide telephone and address database. Built for a company
of 300 people, allowed immediate access to names, phone numbers,
addresses, etc., and also produced compact PostScript output for the
printing of phone lists. This was used by PCS GmbH in Munich. 1991.
A hashing package which I have used in many applications since
writing it. This does simple hashing with collision resolution by
examining a linked list of items in a bucket (most recently accessed
first). Written in C in 1990.
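The move-to-front detail is what keeps hot keys cheap: a hit is
unlinked and reinserted at the head of its bucket. A sketch (names
are illustrative, not the package's API):

    #include <string.h>

    #define NBUCKETS 1024

    struct item {
        char *key;
        void *value;
        struct item *next;
    };

    static struct item *buckets[NBUCKETS];

    static unsigned hash(const char *key)
    {
        unsigned h = 0;

        while (*key)
            h = h * 31 + (unsigned char)*key++;
        return h % NBUCKETS;
    }

    void *lookup(const char *key)
    {
        unsigned h = hash(key);
        struct item **ip, *found;

        for (ip = &buckets[h]; *ip; ip = &(*ip)->next)
            if (strcmp((*ip)->key, key) == 0) {
                found = *ip;
                *ip = found->next;        /* Unlink the item... */
                found->next = buckets[h]; /* ...and move it to the */
                buckets[h] = found;       /* front of its bucket. */
                return found->value;
            }
        return NULL;
    }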
A package for the processing of UNIX command-line options. Allows
long and short names and abbreviations, converts arguments and
places the converted results at caller-supplied addresses,
detects option ambiguities, allows mandatory options, does simple
range checking on numeric options, allows reading options from
files, and produces a usage function that shows documentation of all
known options. This has evolved over the years and is still, I
think, superior to the options processing in almost all UNIX
utilities I'm aware of. Written in 1990.
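The flavour of the interface might be imagined as a table of option
specifications (purely illustrative; this is not the package's
actual API):

    /* One entry per known option; converted values are written to
     * the caller-supplied address, with range checks for numbers. */
    static int verbose = 0;
    static int count = 1;
    static char *outfile = NULL;

    static struct option_spec {
        const char *name;   /* Long name; unique prefixes accepted. */
        char type;          /* 'b'oolean, 'i'nteger or 's'tring. */
        void *address;      /* Where the converted value is stored. */
        long lo, hi;        /* Legal range for numeric options. */
        int mandatory;      /* Must this option be given? */
    } specs[] = {
        { "verbose", 'b', &verbose, 0, 0,   0 },
        { "count",   'i', &count,   1, 100, 0 },
        { "output",  's', &outfile, 0, 0,   1 },
    };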
A dynamic trie-based escape sequence parser for a new version of
xterm. This allowed for the recognition of escape sequences sent to
a terminal emulator. New sets of sequences could be added or removed
without disturbing parsing. This was used in an xterm replacement
(emu) that four or five of us worked on at PCS in Munich. This was in use
towards the end of 1990. I believe it is still used by a
communications program (seyon). Written in C.
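Escape sequences map naturally onto a trie in which each byte
selects a child; adding or removing a sequence touches only its own
path. A sketch of the insertion side (names are mine, not emu's
code):

    #include <stdlib.h>

    struct node {
        struct node *child[256];  /* One slot per possible byte. */
        void (*action)(void);     /* Set at the end of a sequence. */
    };

    static void add_sequence(struct node *root, const unsigned char *seq,
                             void (*action)(void))
    {
        while (*seq) {
            if (root->child[*seq] == NULL)
                root->child[*seq] = calloc(1, sizeof(struct node));
            root = root->child[*seq++];
        }
        root->action = action;
    }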
A replacement for the UNIX man command. I wrote a nicely designed
program for more controlled access to UNIX man pages. This
presented the user with a list of matching pages and allowed rapid
selection. It allowed partial command names to be given (very useful
to see all pages starting with a string). Written in 1990 in C and
still in daily use.
A program to take a source tree and recursively check it in to
RCS. Added Header and Log fields to files and understood commenting
conventions. Figured out file types based on extensions and contents
and was generally VERY careful about its job. Used extensively to
import large source code distributions (including multiple copies of
the X Window System) by PCS in Munich. Written as an sh shell script in 1989.
An automated mailing list kit with news-to-mail and mail-to-news
gateways. This was used to administer the internet newsgroups
rec.juggling and rec.unicycling over the period 1991-93
(approximately). The mailing list software allowed for individual
message delivery, or digest form. The system also maintained a
newsgroup archive, which is the reason the entire archives of
rec.juggling are now available. Written in 1990 in C and sh shell
scripts.
A system for scheduling large numbers of processes across machines.
A controlling process was run periodically by cron to gather
outstanding jobs and allocate them to available machines. The
controlling process would also kill jobs on machines that had
interactive users, and schedule them for execution elsewhere. It had
a simple mechanism for distributing load according to client
processor power. The client process running on each of these
machines took jobs from its spool area until done. Simple commands
allowed a user to add jobs to the master process' spooling
area. Jobs could be split into multiple (identical) sub-jobs that
were run on different machines and their results would be collected
later for delivery to the user. This system was used by 2 PhD
researchers (including the author) to manage tens of thousands of
experiment runs on networks of up to 30 workstations. Written in
1991, used in 1991, 1994 and 1995.
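The client side can be sketched as draining a spool directory, one
job file at a time (my assumptions about the layout: one shell
script per job; all names are illustrative):

    #include <dirent.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Run jobs from the spool until it is empty. */
    static void run_spool(const char *spooldir)
    {
        DIR *d;
        struct dirent *e;
        char cmd[1024];
        int ran;

        do {
            ran = 0;
            if ((d = opendir(spooldir)) == NULL)
                return;
            while ((e = readdir(d)) != NULL) {
                if (e->d_name[0] == '.')
                    continue;      /* Skip . and .. (and dotfiles). */
                snprintf(cmd, sizeof(cmd), "sh %s/%s && rm %s/%s",
                         spooldir, e->d_name, spooldir, e->d_name);
                system(cmd);       /* Run the job, then remove it. */
                ran = 1;
            }
            closedir(d);
        } while (ran);             /* Stop when a pass finds nothing. */
    }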
A tool for building and running feed-forward neural networks. This
had a simple network specification language and provided a command
line interpreter for training and using the network. 1990.
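The core of any such tool is the layer evaluation; one layer of a
feed-forward pass might look like this (a generic sketch, not the
tool's code):

    #include <math.h>

    /* Evaluate one layer: nout sigmoid units over nin inputs, with
     * weights stored row by row, bias last in each row. */
    static void layer_forward(int nin, int nout, const double in[],
                              double out[], const double w[])
    {
        int i, j;

        for (j = 0; j < nout; j++) {
            double sum = w[j * (nin + 1) + nin];       /* Bias. */

            for (i = 0; i < nin; i++)
                sum += w[j * (nin + 1) + i] * in[i];
            out[j] = 1.0 / (1.0 + exp(-sum));          /* Sigmoid. */
        }
    }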
A package for the implementation of genetic algorithms. This was
very general and very fast. Contained many operators for crossover,
mutation and selection, and allowed the user to add more; operators
could be selected from the command line. It did good logging and had
many options for output. Written in C in 1991-1994 and used by many
people on the
internet.
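As an example of such an operator, one-point crossover on
fixed-length strings is only a few lines (a generic sketch, not the
package's API):

    #include <stdlib.h>

    /* Children copy one parent up to a random point, the other after. */
    static void crossover(const char *p1, const char *p2,
                          char *c1, char *c2, int len)
    {
        int point = rand() % len, i;

        for (i = 0; i < len; i++) {
            c1[i] = i < point ? p1[i] : p2[i];
            c2[i] = i < point ? p2[i] : p1[i];
        }
    }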
An implementation of Kauffman's NK landscapes. This employed an
algorithm I devised based on a "jump table" to randomize
behavior. This was a difficult nut to crack; I had made several
earlier attempts. In the end my algorithm used linear space (all
previous solutions I'm aware of had used space that grew
exponentially with k), which allowed for landscapes with very large
k to be constructed. In use by various genetic and evolutionary
algorithm researchers around the world. Written in 1994 in C.
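To see where the space goes: each locus nominally needs a table of
2^(k+1) random contributions, one per setting of itself and its k
neighbours. A linear-space scheme avoids tabulating them by deriving
each contribution on demand; the sketch below does this with a
seeded hash (shown only to illustrate the space saving - it is not
the jump-table algorithm):

    #include <stdint.h>

    /* Derive the contribution for `locus' with packed neighbourhood
     * bits `state', deterministically from a fixed seed. */
    static double contribution(uint64_t seed, int locus, uint64_t state)
    {
        uint64_t h = seed ^ ((uint64_t)locus * 0x9e3779b97f4a7c15ULL)
                          ^ (state * 0xff51afd7ed558ccdULL);

        h ^= h >> 33;
        h *= 0xc4ceb9fe1a85ec53ULL;
        h ^= h >> 33;
        return (double)(h >> 11) / (double)(1ULL << 53);
    }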
I have fixed two problems in GNU emacs. The first was the
long-standing problem of sensitivity to command-line option
ordering. My solution was to sort the command line - emacs is
probably one of the only programs that sorts its command line. This
function is called on the first line of executable code in emacs'
main() function. I also
provided code to examine the emacs load-path variable to detect
instances of elisp files shadowing each other. This alerts people
installing new versions of emacs to potential conflicts from
previously installed site elisp. This function is run automatically
as the last thing that is done whenever emacs is installed. I thus
wrote the (current) first line of executable code in emacs and the
last code that is executed when emacs is installed. Written in C and
emacs lisp in 1994 and 1995.
A package allowing the simple treatment of ranges of real numbers.
This allowed a program to efficiently manage a set of real intervals
and test for membership, insert holes, etc. Useful for programs
that, for example, print a range of pages from a document and need
to parse a range specification from the command line before using
it. Written in C in 1995.
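The central operation is a membership test against a set of
intervals; stripped to its essentials (names illustrative; the real
package also handled hole insertion and parsing):

    struct range { double lo, hi; };

    /* Is x inside any of the n intervals? */
    static int member(const struct range *r, int n, double x)
    {
        int i;

        for (i = 0; i < n; i++)
            if (r[i].lo <= x && x <= r[i].hi)
                return 1;
        return 0;
    }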
A system for the automatic monitoring of topics of interest in a
news feed. Maintained a set of regular expressions on a per-user
basis, fetched new news from a server via NNTP, and built HTML for
the user to allow the retrieval of articles of interest. The system
contained over 20 administrator and user commands. This was used to
process an incoming news feed of 130,000 articles a day. Written in
perl and C in 1995 at the University of New South Wales.
wwwtable - a perl program for the production of HTML tables. Described
at http://fluidinfo.com/wwwtable/. In use by many people and apparently
included in a Japanese Linux distribution. Written in perl in 1996.
The spamometer, a program for detecting incoming spam mail. Described
at http://fluidinfo.com/spamometer/. This took a probabilistic
approach, suggested Bayesian filtering (which I did not implement),
and had a plugin architecture allowing others to write spam tests
that returned probabilities. It pre-dates Paul Graham's famous A Plan
for Spam article and SpamAssassin (which has a similar approach) by 5
years. Gary Robinson mentions the spamometer as "the earliest spam
filter work I've heard of" and it is mentioned in the page on the
pre-history of SpamAssassin. The spamometer is mainly of historical
interest now, especially for its potential use as prior art in
preventing patents. Written in perl in 1997.
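A plugin returned a probability that a message was spam; treating
those as independent pieces of evidence, they combine in the classic
way (an illustration of the approach, not necessarily the
spamometer's exact rule):

    /* Combine n independent spam probabilities:
     * p = (p1...pn) / (p1...pn + (1-p1)...(1-pn)). */
    static double combine(const double p[], int n)
    {
        double num = 1.0, den = 1.0;
        int i;

        for (i = 0; i < n; i++) {
            num *= p[i];
            den *= 1.0 - p[i];
        }
        return num / (num + den);
    }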
Htm4l, a set of over 500 m4 macros for the production of HTML.
Heavily used by Teclata S.L. in Spain and by various other people
around the world. An early version is described in
http://fluidinfo.com/htm4l/htm4l/. Written in m4
during 1996-1998.
A collection of utilities for getting web pages, analyzing their
contents, submitting them to search engines, tricking search
engines, analyzing web logs, checking web page links, etc.