To understand this document

This document is about how to avoid recurring errors in your site's configuration management: in policy, in procedure, and in the structure of the systems you deploy.

If you've ever made a mistake, you're good to go for this document. I also like the descriptions in the Field Guide to Human Error; reading that first might help you understand this document.

What I've tried to provide is enough feedback and cross-checks in the operational processes and development work-flows to enable workers to compare their intent to the actual status of the automation they are driving. This is a balance that requires a large measure of diligence and a trust-but-verify attitude at all times.

No complex system is inherently safe, and the master source structure is just as dangerous as any other. The intent is to balance the power to make broad changes with the assurance that your actions are changing what you meant to change.

The goal of any configuration management structure is to "build what you promised from what you have, without errors, and in a timely manner"; see the master source top-level document. Mistakes in execution are more likely to break ongoing production than errors in content. Since our goal is always to make the structure better, we should take steps to avoid either type of failure.

To some extent, every configuration management structure is clumsy and complex. People balance and mitigate these issues to differentiate success from failure. This document explains the reasoning and tactics I use to train people, maintain site policy, and justify my salary.

Types of errors

I'm going to break errors into 5 common groups. I do this not to build a clever taxonomy, but to focus my mitigation efforts on specific cases. You might find other types and find better mitigations, but this is my document:
Missing a change in status, metrics, paths, names, or values
Errors caused by this include sending an uncommitted change to production hosts, sending the wrong change to the target hosts, or sending the correct source to the wrong population.
The proximal cause of this type of error is almost always a missing close-the-loop step.
Mode error (people losing context)
The safety was off and you pulled the wrong trigger. This includes activation of the wrong target by being in the wrong directory, or running as the wrong login, or running a step out-of-order (aka at the wrong time).
The proximal cause of this type of error is usually interruption of the process or lack of a secondary checking your work. Sometimes a root cause is the normalization of deviance.
Data overload (aka people losing state)
Getting lost in a graphic display architecture, getting no (meaningful) feedback as a process takes action, and being unable to filter a signal from the noise all contribute to data overload. This leads to bad decisions or inaction at the most critical points.
The cause of overload is usually an unfamiliar transition to a new display: for example, when error reports cause the operator to lose context because the normal display has a different layout (than the error report screen). The next most common cause is not understanding the steps the automation is taking, so error reports appear disjoint from the (operator's conceptual understanding of the) process.
Lack of coordination when changing common configurations or specifications.
This results in a known-good input producing bad output.
The common cause of this is usually sharing configuration data between authoritative entities, or using multiple distinct policies that usually result in the same answer, but do not in some rare cases.
Priority mistakes
Choosing to fix the well-understood error rather than the more subtle or complex error which may be a larger threat.
This is usually caused by a misapprehension that all possible errors have already been seen: fixing the well-understood error led to success in the past, so it should this time. This lengthens the time between observation and successful recovery from an error by the time to fix the trivial error, then observe that it had no effect on the failure. (Possibly extended by the time to confirm that the normal fix didn't work.)

Proximal versus distal causes

Proximal causes are those closest to the trigger event. Those include keystroke errors, perspective and context errors, and timing errors. Distal causes are those set up before (or away from) the trigger event. Those include partial updates to input data (code, tables, databases, really any bits in-play), ineffective site policy or procedure, and lack of cross-group coordination. The terms "proximal" and "distal" are also used to describe the point of attachment to, and the part of an appendage furthest from, the body of an animal.

In the new version of the master source I've tried to make all of the data proximal and available to the driver of each structure. The local recipe file (Makefile or Msrc.mk) and the platform recipe file (Makefile.host or Makefile) are both kept in the current working directory. No data is stored in a non-text format (we strongly prefer text files to database tables). There are command-line options to display the derived values of each step in the process, and options to dry-run most every operation.
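
Plain make already offers one such dry run. A minimal sketch, assuming an illustrative master source directory:

$ cd /usr/msrc/local/bin/rcsvg
$ make -n install	# print the commands "install" would run, without running any of them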

These give the driver feedback. That feedback must be taken seriously by every driver. A push from the wrong directory, or with the wrong version of a key file is just about the worst thing you can do to any production operation. I also include the current working directory in my shell prompt, PS1.
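
A minimal sketch of that prompt, assuming a bash login shell:

$ PS1='\w \$ '		# \w shows the current working directory; \$ prints # for root, $ otherwise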

Proximal pitfalls remain

For example, specifying a production configuration file when a test file was required. This has actually happened to a valued team-member of mine. A request to send a configuration update to test systems was mistakenly sent to live production hosts, which resulted in a serious service failure. The two commands differed in exactly 4 characters:
$ msrc -Ctest.cf -P10 -E ....
vs:
$ msrc -Cprod.cf -P10 -E ....

To mitigate that we added a step to the procedure to stress that running a "do nothing update" before any push that might damage production is mandatory:

$ msrc -C.... : test
The : test command is selected because forgetting the colon runs an empty test command (which fails silently), and missing the test word doesn't hurt anything either. (Omitting the space fails to find the command :test, which is also harmless.)

The output from that command includes the list of instances updated, which gives the driver two items that might trigger an abort reaction: an unexpected set of hostnames, or a host list that is too long or too short for the expected change targets. The attempt here is to offer feedback before the actual commit, and with history editing replacing the : test with make install is trivial.
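
A hedged sketch of that sequence (the configuration file and options are only illustrative):

$ msrc -Ctest.cf -P10 -E .... : test		# review the host list this prints
$ msrc -Ctest.cf -P10 -E .... make install	# then commit, recalled with history editing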

The fix for the aborted update is also clear: if you got the wrong set of hosts, then you should use efmd to produce the correct list. That gives you the updated options for msrc, since they take identical options.

Keystroke errors

Each command is a step towards success or failure: choose with care.

Use the recipe file to record all the update commands you intend to run. Testing a recipe file's install target which updates 20 files on a test host is great. Keying in 20 install commands in a change window is insane; I don't know how anyone can justify the odds of a mistake in the latter case.

For similar reasons I avoid punned recipe files. When a single make recipe named Makefile serves as both the master and platform recipe file, one might activate an update in the wrong context.

If you don't want a make recipe, embed the command in a comment using mk markup. Never type a utility command of more than a few words. I try to avoid quoting shell meta-characters for the remote command as well: if you need a pipe, put it in the recipe. There may be an occasional need for a remote shell meta-character (usually &&), which is why msrc passes them quoted to each target host.
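
A hedged sketch of keeping a pipeline in the recipe rather than on the command-line (the target name and pattern are illustrative):

# report only when the service is not running; the [r] keeps grep from matching itself
running:
	ps -ef | grep '[r]csvg' >/dev/null || echo "`hostname`: rcsvg: not running" >&2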

At the other end of the typing errors: recipe names should be long enough to avoid single-letter mistakes. Steps named n and m are easy to mistype and mistake for each other.

Always have at least 3 eyes on every change

Two people should both check any keyboard-entered commands run for a change. Nobody I've ever seen has 3 eyes, so the rule is there should be at least 3 eyes on every step of a change that is entered on a keyboard.

People rely on their experience to recognize key patterns that indicate if things are going according to plan, or not. The idea is that the two people making a change share a common mental model of what is supposed to happen, and are constantly checking their mental model against what is actually happening and against each other. This situational awareness is core to preventing mistakes. (This is also a core concept in pair programming, for the same reasons.)

I am assuming that all scripts used to make production change were reviewed and tested on a non-production set of hosts, well before the change window. If that is not the case, fix that first. There is little-to-no excuse to run any change without prior testing.

Checkouts that verify success

It is a great idea to have a checkout target in any update recipe. This should produce no output if all is well, and a statement of what is missing or wrong which includes the instance name and an absolute path to at least one out-of-phase file.
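
A minimal sketch of such a target (the configuration file names are illustrative):

checkout:
	@test -f /usr/local/etc/prog.conf || \
		echo "`hostname`: missing /usr/local/etc/prog.conf"
	@cmp -s prog.conf /usr/local/etc/prog.conf || \
		echo "`hostname`: /usr/local/etc/prog.conf is out-of-phase with the master copy"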

This gives the person running the change a close-the-loop metric which enables them to close the change ticket with a positive assertion ("checkout complete") rather than an observation that they didn't see any obvious errors.

Note that the checkout recipe should never be a step in the update recipe. It might be run before the update to verify that the update has not been done, and it may well be run multiple times (by the third eye requirement) as a post-update step.

Processes which can't stop, are modal, or lack progress

Recovery from common operations should never be difficult. It always comes as a terrible surprise to an operator when something like a keyboard interrupt puts the structure in a nearly unrecoverable state. Clearly mark any step that must run to completion.

Similarly a structure that changes modes without a clear request from the operator is really bad. The old complaint from emacs users that vi's modes were bad is ironic, in that emacs has even more modes, and it can change modes without keyboard input.

Just as bad are processes that offer no feedback. This is why file transfer programs (like scp) offer an updated progress metric as they copy data over the network. Show status for long processes. Status is more important than behavior: don't tell the operator about internal details they didn't already know about. People do not deal well with extra information they do not understand.

Half-time

A fair summary of the last two sections would be:
Knowledge of the current situation prevents mistakes.
This is true for errors in execution. It is not true for errors caused by some distant events.

The biological term for "the part of an organism that is distant from the point of inspection (or connection)" is distal. Failures that result from external forces, or actions taken by parties outside of the driver's work-group (or span-of-control), are therefore distal sources of error.

Preventing impacts from distal sources of error

This is always harder. Policy makers that are disconnected from the situation in production operations come up with some fabulously painful and degrading blame-bombs. It is far easier to blame the person at the keyboard for the four character error as the "weak link" in the process. It is also patently unfair to do so. Several distant sources contributed to making that spelling error have far greater impact than it might have had. For example, the test hardware was different in many ways from the production hardware. This lack of alignment saved a little up-front money, and created an on-going operational tax on every change to the system.

So I don't subscribe to that doctrine: the weak link is almost always a process that provided little feedback or visibility (e.g. a GUI) or a procedure that had no useful cross-checks before the commit action was taken. The cross-check in this case was the cost of on-going changes and the added risk to those changes, versus the small savings in capital costs for the slightly larger disks.

Distant sources of data need to be observable: just as the list of hosts we are about to update needs to be visible to the driver (as above), the reasons for each step in the process need to be just as clear to the driver. Steps which add no certainty to the process are of little value to the driver. What gives each step in the process value? Here is a list I would start with:

Clear results

The output from the process is organized and fairly easy to read (possibly with some training). Failure messages are distinct from success. For example, the UNIX™ norm is to say nothing for success and something (on stderr) only for a failure. These are quite distinct.

Actionable messages

The basic UNIX™ shell commands have a common pattern for error messages:

command: operation: noun: error-message
For example, I'll spell the null device wrong:
$ mk /dev/mull
mk: stat: /dev/mull: No such file or directory
Less informative applications elide the operation (or verb):
$ cat /dev/mull
cat: /dev/mull: No such file or directory

Such error messages don't explain to the driver which component of the path is wrong, but they give her a finite number of places to inspect.

Cut points

If a key step fails, then any automation should stop as soon as possible. Never depend on the driver to interrupt the process from the keyboard. The failure should be as close to the last line of output as you can make it, and include a key phrase like "noun failed to verb". The best thing about these errors is that they are common across many tools, and the error messages are available in most locales. They are also clearly spelled out for each system call in the manual pages for sections 2 and 3. That is not to say they are clear to a novice, but they are consistent and can be learned.

And nearly every base tool exits with a non-zero exit code when it fails. So check the status of any command that matters, and don't run commands that don't matter. When coding applications, include sysexits.h or require sysexits.ph.
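
A hedged sketch of such a cut point in a driving script (the host and file names are illustrative):

#!/bin/sh
# stage the update, and stop the whole process as soon as this key step fails
scp prog.conf web01:/var/tmp/prog.conf || {
	echo "prog.conf: scp: web01: failed to stage the update" >&2
	exit 74		# EX_IOERR from sysexits.h
}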

Investigation of failures should include cross-checks from the point-of-view of any distal inputs. Any distal part that has a way to cross-check our work should have an interface to test it ad hoc. Use these to recover from failures and to prevent them. Provide data sources for all client services that supply the data needed to check their sanity.

Restart points

When there is a possible termination-point in the process, there should be a clear on-ramp to resume the process after the issue is resolved. This may require a return to a previous step. This may even require a wholesale back-out of the work already done. Live with that, and learn to accept temporary failure as long-term success.

This is also called "learning". Note that the opposite (long-term failure) is the result of temporary success.

Verification of results

In the physical world we are bombarded by our senses with input data, so much so that we have to ignore most of it. In the digital world one must request data to see it.

Actions to prevent mistakes require not assuming that others have a similar understanding of the situation. Verification steps assure that the driver and their secondary agree on the status of the change.

Ignoring a chance to check a verification reduces the driver's situational awareness. This awareness is key to avoiding "normalization of deviance". If any output in the process looks funny, then stop to confirm that output was (in some way) expected. Viewing all available data before taking actions (unplanned or planned) is the key to stable operations.

Configuration engineering is data-driven; ignoring data is always going to break your processes.

Mid-course plan changes are usually a sign of trouble

When we plan a change we predict a set of pathways the change will take: some ordered, time-boxed windows to achieve milestones. We also anticipate the resources available before the change. The success of the change depends on these plans and factors.

So avoid dealing with newly emerging requirements or lack of resources in an event-driven or uncoordinated way. Discovering "new knowledge" as part of a planned change takes you out of the envelope of pathways to a safe outcome.

When events are not flowing as you expected, you must find out why: stop, plan again, fix the metrics, and retry only if you are still in the safe window. Temporary failure is an option; this is not a short-term game.

Assure that distant errors make it home

We have that base of best practice to build on, so we should add value and carry as much information to the driver as possible. That means we might prefix a standard error message with the name of the instance that produced the deviance:
instance: command: operation: noun: error-message

We also should carry exit codes back from remote commands.

We should build a structure to examine exit codes from each update, and take action for unexpected results.
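
A hedged sketch of the idea with plain ssh (the host name and remote command are illustrative):

$ ssh web01 'cd /usr/local/lib/prog && make checkout'; status=$?
$ test $status -eq 0 || echo "web01: make checkout: exited $status" >&2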

Preventing systemic errors

To prevent systemic errors, look at our configuration management structure from the operator's point of view. Each touch-point needs a close-the-loop operation: look from the operator's position to locate the data that would be most useful for each of their decision points.

People running operations, development, and change management are working under rules that make sense given the context, indicators, operational status, and their organizational norms. Always look for the outcomes and messages that will cause them to take the best action after each step. Make them aware of something that is not "normal" and they may take action to avoid making it worse. Hiding failures, cross-checks, or other related data from them gives them no context to take compensatory actions.

Offered, produced, and requested information

Information is available in different measures. Some information is offered in the course of standard operating procedure, some is produced as part of the process, and other data is only available by explicit request.

The common GNU build is a great example of this. A README file in each product source directory is visible in the output of ls. This is offered to the builder in a culturally normal way, because the most common action of an operator after unpacking a source directory is to cd to it then run ls. In fact a source directory without one is quite rare.

Along the same lines, the file configure in the directory is usually a script built by autoconf. If the README instructs the operator to run that script, then they will usually do that, the expectation being that the operation of that script does no harm.

Moreover, that configuration script produces the information it finds as part of the process of execution, and fails if it finds a missing facility.

After the product is built (installed) the operator may request the version of the application under a common command-line option, usually -V or --version, less often other options. This is compared to the expected version to assure that the update did the right thing.

This canonical chain (README to configure to -V) has changed very little in the last 20 years.
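
That chain, as a hedged transcript (the package name and version are illustrative):

$ cd prog-2.17 && ls
README  configure  Makefile.in  src
$ ./configure && make && make install
$ prog --version
prog 2.17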

Your local site policy should call-out which style of information each application should provide. At some point the volume of information offered creates only meaningless mismatch failures; these unexpected outputs indicate expected (unavoidable) changes in irrelevant facts such as compile-times or OS level version details.

Close-the-loop checks take advantage of the version output. See hostlint, which may poll hundreds of tools for their installed version information. It uses a very simple table which lists the version, program path, and other details. Here is an example line:

2.17:bin/rcsvg -V:opt:local/bin/rcsvg/rcsvg.m
...
2.26:man1/rcsvg.1:opt:local/bin/rcsvg/rcsvg.man
This can actually do a 4-way check: the installed program, the master source version, and the platform source, all against the table itself. The last line reports out-of-date revisions for the rcsvg manual page. Others could be added for the HTML documents, or any configuration files.

In fact very few files allow no comments or embedded markup for revision or version information. Having the same version identifier doesn't mean the files are identical: the binary for rcsvg is clearly not the same on an i386 as it is on a SPARC. A text configuration file may have been processed through a macro processor (m4, cpp, or the like) before installation.

Request for changes

Changes to production systems are always triggered by some need for the change. I would never update a production system just because the clock changed, or the up-stream sources incremented a version, release, or distribution number.

Changes happen because we need them. I have run production machines with uptimes of more than 2,800 days. There was no compelling reason to update the operating system, so no need to reboot them.

Someone must request each change. That's not to say that the same group issues every change request; some changes are triggered by different stakeholders than others. Local site policy should state as clearly as possible who requests different changes.

Autonomous changes by automation

The first time a global update removes 100% of your production services will be the last automatic update ever performed. So we are assuming your updates are nearly perfectly tested, and carry virtually no risk of failure. In that case, one could apply updates to large instance populations automatically.

Auto-updates are the key to speedy changes for many free software services, as well as a few buggy OS vendors. Take the case of a consumer application program: we can accept that the provider has checked the update with great care. For example, a failed update to a browser (viz. Chrome) allows the fall-back to a different browser, which is not terrible, or the reinstallation of the original package.

The common case in a data-center is less clear: mission critical services may create down-stream impacts that are harder to recover. Machines that don't boot make for very long recovery times. I don't apply configuration updates without people in the loop: the automation checks the updates -- it doesn't autonomously apply them.

How changes are requested

The only fixed part of the process is a clear audit trail. That could be in e-mail, a request queue, minutes of a meeting, or paper records. To cover any audit requests each site needs to keep a secure backup copy of the history of requests and their outcomes.

This might be as easy as setting up a pair of nntp news servers and publishing a log of each request and the log of each change supporting that request. I've done that for more than 25 years, with absolutely no regrets. This solution also allows operators to Cc: an e-mail gateway to the news service in customer correspondence.

We can sum the last section:

Only observation protects us from the mistakes distal events trigger.

Stability versus the speed of change

When a tactical change is more important than operational stability you've already failed. Stability is always more important than any short-term plan.

Address the organizational pressures to choose schedule over system safety before a change starts; before it is planned, discussed or requested; before you take the job; and before you interview with the company.

Excuses you should shoot down right away:

And the hardest one: "We made a similar change last week, this change is routine." That is normalization of deviance spot-on. The circumstances may be truly similar, but that's when we miss a step. That's when our guard is down. That's when planes crash.

Lack of coordination when changing common configuration inputs produces the largest failures I've ever seen. And changing configuration inputs happens all the time: add a host, remove a switch, change the name of an application, move an individual to another group, change a password, add/delete group membership, move a home directory, update sudoers... the list is much longer than the list here.

Any of these can ruin your week.

To go faster we need better vision and knowledge. One way to get that is to use the computer itself to point out facts that contradict our mental model of the system. These take the form of loops: each loop samples a resource's state, then compares that with an automated check to see if that is the state we believe it has. Examples of such checks:

Connect to a running service with a null transaction.
If we depend on the service, don't start an update without it.
Check for a known revision control tag in a file
Old software may cause known failures.
Verify the existence of a required application login
Why install an application without the supporting login?
Compare recorded identification strings with recorded values
For example a host's MAC address, serial number, and RAM size against a manifest.
This is called a loop because while we "don't get the expected result" we fix, check, and loop again (until we are clear to proceed).
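
A minimal sketch of that fix-check-loop, assuming a checkout target like the one described earlier:

while ! make checkout
do
	echo "fix the items reported above, then press return to re-check"
	read junk
done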

Master source specific checks and loop closures

Each master source directory usually has a TODO file kept under revision control. The file explains what the long term plan is for the directory. It helps new team members learn what is dead code and what needs work. Before we make a change to a master directory we review this file and update it before we start (as needed) and after we finish. Plans change as we get new requirements from adjacent layers, so we need to remove older ideas and explain new contexts. Thus bringing distant changes to local context and removing any out-dated ideas from that same context. It would be a fine idea to update local TODO files as a team periodically.

I couple RCS to my implementation (other local site policy implementations vary), so my first check is rcsdiff in the master source for every part of the change. After I change the source I verify that the test label is in-sync with either head or another expected revision, and that the current files match that label. If I need to hold a lock on the files I assure that each changed file is locked (by me). Another valid local site policy allows holding a lock on the recipe file to represent a lock on the whole directory.
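
A hedged example of those first checks (the directory and the symbolic label are illustrative):

$ cd /usr/msrc/local/bin/rcsvg
$ rcsdiff -q -rTEST RCS/*,v	# any output means the working files drifted from the TEST label
$ rlog -R -L RCS/*,v		# lists any file someone still holds locked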

The opposite side of that loop is assuring that files that are not changing are not different or locked. A locked file impacting my change may be distant from the directory where I am making my change, so we require another process. I upgraded tickle(8l) (which Dave Stevens wrote for our UNIX™ group at Purdue). A tickle job, run from cron, sends e-mail notifications to negligent source committers whose locks extend beyond some maximum number of days. Tickle is also run from the command-line to check for instability as part of some plans. Let's assume that good engineers will finish their work or release their locks when prompted and/or shamed.

At layer 2 we need to assure that all the files share the same symbolic label and are in-sync with a labeled branch. The msync(8l) program encodes the local site policy to do that. We do not force a recurring task to msync each master source directory, because it generated too much noise. The level2s program has a subcommand to run msync over every layer 2 directory, which may be run ad hoc to detail open source changes.

We need to package up the code as we are going to deploy it. This is where level2s really comes in: the build subcommand creates a stable archive of the source at the current symbolic label. That archive may then be converted to an rpm, a Solaris package, or HP depot files, used as a native msrc directory, or packaged up in any other installation/update structure.

For layer 3 we do the parallel operation on packages of products. This allows larger-scope changes to be committed atomically (with a single rpm or the like). The same assurances for the state of symbolic labels are also supported.

Next we need to trace the change at layer 4. On the running computers we have some really good loop closures. First we have the system logger, syslogd, which we run to the serial console. I know this sounds funny, but that really does prevent a Bad Guy from removing the log -- because it goes over a serial cable to a console server that is connected to the data-center operations network, not the common data networks. That means our logs from op, install, and possibly from msrc's entry script (aka local.defs) are more likely to actually make it to operations, because they don't travel the common network.

The enemy we are fighting

We are not fighting people, we are not breaking bad habits, we are not telling workers they are lazy. We are not blaming people for being careless. Hindsight makes every mistake look silly, but when you are in the dark tunnel and pressed for an answer, you'll do what you can with the first idea that looks at all sane.

We are mitigating the cognitive consequences of computerization. The properties that make computers useful also make them hard for people to manage well.

Computers increase demands on people's memory
We didn't evolve to remember long strings of digits; our memory skill is more relational and topological. Computers ask us to recall passwords, commands, options, and addresses, and we end up writing them down or forgetting them.
The speed of our machines increases the risk of people falling behind
Who can keep up with a computer's speed?
Computers undermine people's formation of accurate mental models
In the simplified presentation of complex data we lose the "feel" for how the underlying process works. This is what causes operators to make terrible decisions on the fly.
Mistaking the simplified view for the system itself
A derivative of the last point: there is a knowledge calibration problem when an operator, engineer, or customer believes they understand processes that are represented by the automation, but in reality are much more complex. An "Exit" sign is not the exit, the door under it is; but on a computer display the exit button is the exit. Following that model in other contexts is a mental trap.
Compartmentalization limits the reach of relevant information
Without the holistic view people make mistakes. What is easy for one group may produce an unbearable burden on another team. Tiny changes in an API may mean a huge change for a client -- so when the two groups don't talk, the bad news is usually for everyone.

Winning means maintaining an acceptable speed

This is more of an on-going struggle, but there are tactics you should apply every day to keep your group (and yourself) on the winning side.

Sometimes we slow to maintain a safe following distance, or speed up when we see clear road. This only works when we heed the metrics we've learned to trust.

Postmortem error assessments rule out old issues first
Focus attention on the basis of earlier error assessments, then look for new causes. When the same issue comes up often, you need to find a close-the-loop structure that works better than the current one. That's not going to be another check-box on a list; it is going to be automation which checks the work as soon as possible, and a cross-check that knows whether that check was executed or not (like hostlint and the reporter which calls out hosts that didn't run it).
Limit scope of new learning
Good processes evolve from learning, they don't usually come from a revolutionary discontinuity. Add a check-point, delete an error-prone manual step for some automation, or get a third eye on a check-out before the point-of-no-return step. Ramp up training one tool at a time. For example, don't mix training on the msrc tools; take them in order (sh, make, m4, ssh, rdist, xapply's markup, hxmd, msrc, and lastly mmsrc). By seeing some examples of what comes next in the early lessons we build a context to help staff engineers connect the dots for themselves. (Then introduce tickle, rcsvg, mk, msync, level2s, level3s and such as they need them.)
Keep similar circumstances truly similar.
If a few step-by-steps look really similar, then you may have an issue with cow path behavior. Make something in the process remind the driver which task they are doing, and assure that something in the processes is different enough to ring a bell if they are doing the wrong one.
Make status changes visible
Not noticing changes (in status, metrics, paths, names, values) is way too easy (see the last point). Change something that matters in a command argument, so that a cut+paste of a command from the wrong process fails early. This also helps with mode errors, when a person doesn't register that the computer has changed modes. This happens in GUIs, so we counter by changing colors and highlighting.
Adapt this to the way you do your work, or at least meet in the middle.
Set local site policy that requires processes to be noticeably different in their early steps. Mandate display architectures that highlight modes and even change the screen format when the mode is error recovery or creation versus update. A good example of this is the shell's use of an octothorp (#) for a superuser shell, versus a dollar-sign ($) for a mortal shell.

Complacent engineers make mistakes

People notice that some changes are usually low risk. For example, when changing common configuration inputs to add or delete an instance, application, or some-such we usually do not break production. But there are real-time services (like DNS) that break whole sites when they go away or awry.

Lack of coordination in changes happens because of the feeling that nobody ever notices, so the change/approval process must be a waste of time. The push-back creates a longer, more detailed, harder process, which has a long-term effect of data overload -- too many change trackers open, too many people involved, too much effort to make an approved change. Often expressed as, "I just want to get my job done." More often resulting in sub rosa changes, which create wreckage as the running system is no longer in alignment with site policy.

That sets an incorrect task priority: dodging the structure rather than using it. Make the change process as real-time as possible. Change review should be mostly a peer-review, not a management review. What does a manager really know about the risk of any given change? What new information can management offer that they didn't hear from the engineers? If the managers or the technicians are withholding important information, then we have a whole new problem.

Trust your people to use the structure because the structure helps them do better work, avoid overtime, and prevents loss to their organization. Point out that other groups depend on notifications so they can engineer compensating changes, have your team estimate the impact to every other team, and make them learn from their estimation errors.

Summary: use automation to match the state of systems to your internal model, proceed faster when they match and stop when they don't.

See also

There are some great source works for understanding human errors out there. And I intend to make this document better in the next release. --ksb

$Id: error.html,v 1.19 2013/12/07 00:31:17 ksb Exp $