If you've ever made a mistake, you're good to go for this document. I also like the descriptions in the Field Guide to Human Error; reading that first might help you understand this document.
What I've tried to provide is enough feedback and cross-checks in the operational processes and development work-flows to enable workers to compare their intent to the actual status of the automation they are driving. This is a balance that requires a large measure of diligence and a trust-but-verify attitude at all times.
No complex system is inherently safe, and the master source structure is just as dangerous as any other. The intent is to balance the power to make broad changes with the assurance that your actions are changing what you meant to change.
The goal of any configuration management structure is to "build what you promised from what you have, without errors, and in a timely manner"; see the master source top-level document. Mistakes in execution are more likely to break ongoing production than errors in content. Since our goal is always to make the structure better, we should take steps to avoid either type of failure.
To some extent, every configuration management structure is clumsy and complex. How people balance and mitigate those issues differentiates success from failure. This document explains the reasoning and tactics I use to train people, maintain site policy, and justify my salary.
In the new version of the master source I've tried to make all of the data proximal and available to the driver of each structure. The local recipe file (Makefile or Msrc.mk) and the platform recipe file (Makefile.host or Makefile) are both kept in the current working directory. No data is stored in a non-text format (we strongly prefer text files to database tables). There are command-line options to display the derived values of each step in the process, and options to dry-run most every operation.
These give the driver feedback. That feedback must be taken seriously by every driver. A push from the wrong directory, or with the wrong version of a key file, is just about the worst thing you can do to any production operation. I also include the current working directory in my shell prompt, PS1.
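A minimal sketch of such a prompt, for ksh or bash (other shells differ):
$ PS1='$PWD$ '
Even with that cue in the prompt, the difference between a test push and a production push can be a few characters: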
$ msrc -Ctest.cf -P10 -E ....
vs.
$ msrc -Cprod.cf -P10 -E ....
To mitigate that we added a step to the procedure to stress that running a "do nothing update" before any push that might damage production is mandatory:
$ msrc -C.... : test
The : test command is selected because forgetting the colon runs an empty test command (which fails silently), and missing the test word doesn't hurt anything either. (Omitting the space fails to find the command :test, which is also harmless.)
The output from that command includes the list of instances updated, which gives the driver two items that might trigger an abort reaction: an unexpected set of hostnames, or the length of the host list (either too large or too small) for the expected change targets.
The intent here is to offer feedback before the actual commit; with shell history editing, replacing the : test with make install is trivial.
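For example (with the options elided as in the commands above), a careful session might look like:
$ msrc -Cprod.cf -P10 -E .... : test
(review the instance list: are the hostnames the ones you expected, and is the count right?)
$ msrc -Cprod.cf -P10 -E .... make install
The second command is the first one recalled from history with only the tail edited, so the configuration file, parallel factor, and other options cannot drift between the check and the commit.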
The fix for the aborted update is also clear: if you got the wrong set of hosts, then you should use efmd to produce the correct list. That gives you the updated options for msrc, since they take identical options.
Use the recipe file to record all the update commands you intend to run. Testing a recipe file's install target which updates 20 files on a test host is great. Keying in 20 install commands in a change window is insane. I don't know how anyone can justify the odds of a mistake in the latter case.
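A sketch of what I mean, with hypothetical file names (the indented lines must begin with a tab): the platform recipe's install target carries every file update, so the whole change is one tested command.
install:
	install -c -m 0644 app.conf /etc/app.conf
	install -c -m 0755 app.rc /etc/init.d/app.rc
	install -c -m 0444 app.motd /etc/motd.app
Test that target on a non-production host, and the change window becomes a single make install.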
For similar reasons I avoid punned recipe files. When a single make recipe named Makefile serves as both the master and platform recipe file, one might activate an update in the wrong context.
If you don't want a make recipe, embed the command in a comment using mk markup. Never type a utility command of more than a few words.
I try to avoid quoting shell meta-characters for the remote command as well. If you need a pipe, put it in the recipe. There may be an occasional need for a remote shell meta-character (usually &&), which is why msrc passes them quoted to each target host.
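For example, rather than quoting a pipeline on the msrc command line, a hypothetical target in the platform recipe might hold it:
count-users:
	grep ':/home/' /etc/passwd | wc -l
Then the remote command is just make count-users, with no shell meta-characters to quote.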
On the other end of the typing errors: recipe names should be long enough to avoid single letter mistakes. Steps named n and m are easy to mistype and mistake for each other.
People rely on their experience to recognize key patterns that indicate whether things are going according to plan, or not. The idea is that the two people making a change share a common mental model of what is supposed to happen, and are constantly checking their mental model against what is actually happening and against each other. This situational awareness is core to preventing mistakes. (This is also a core concept in pair programming, for the same reasons.)
I am assuming that all scripts used to make production change were reviewed and tested on a non-production set of hosts, well before the change window. If that is not the case, fix that first. There is little-to-no excuse to run any change without prior testing.
This gives the person running the change a close-the-loop metric which enables them to close the change ticket with a positive assertion ("checkout complete") rather than an observation that they didn't see any obvious errors.
Note that the checkout recipe should never be a step in the update recipe. It might be run before the update to verify that the update has not been done, and it may well be run multiple times (by the third eye requirement) as a post-update step.
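A hedged sketch of such a checkout target (the program path, option, and expected setting are placeholders):
checkout:
	test -x /usr/local/bin/app
	/usr/local/bin/app -V
	grep -q 'expected-setting' /etc/app.conf
Each command either exits zero or stops the recipe with a distinct error, which is exactly the positive assertion the change ticket needs.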
Similarly a structure that changes modes without a clear request from the operator is really bad. The old complaint from emacs users that vi's modes were bad is ironic, in that emacs has even more modes, and it can change modes without keyboard input.
Just as bad are processes that offer no feedback. This is why file transfer programs (like scp) offer an updated progress metric as they copy data over the network. Show status for long processes. Status is more important than behavior: don't tell the operator about details they didn't already know about. People do not deal well with extra information they do not understand.
Knowledge of the current situation prevents mistakes. This is true for errors in execution. It is not true for errors caused by distant events.
The biological term for "the part of an organism that is distant from the point of inspection (or connection)" is distal. Failures that result from external forces, or from actions taken by parties outside of the driver's work-group (or span-of-control), are therefore distal sources of error.
So I don't subscribe to that doctrine: the weak link is almost always a process that provided little feedback or visibility (e.g. a GUI) or a procedure that had no useful cross-checks before the commit action was taken. The cross-check in this case was the cost of on-going changes and the added risk to those changes, versus the small savings in capital costs for the slightly larger disks.
Distant sources of data need to be observable: just as the list of hosts we are about to update needs to be visible to the driver (as above). But the reasons for each step in the process need to be just as clear to the driver. Steps which add no certainty to the process are of little value to the driver. What gives each step in the process value? Here is a list I would start with:
The output from the process is organized and fairly easy to read (possibly with some training). Failure messages are distinct from success. For example, the UNIX™ norm is to say nothing for success and something (on stderr) only for a failure. These are quite distinct.
The basic UNIX™ shell commands have a common pattern for error messages:
command: operation: noun: error-message
For example, I'll spell the null device wrong:
$ mk /dev/mull
mk: stat: /dev/mull: No such file or directory
Less informative applications elide the operation (or verb):
$ cat /dev/mull
cat: /dev/mull: No such file or directory
Such error messages don't explain to the driver which component of the path is wrong, but they give her a finite number of places to inspect.
If a key step fails, then any automation should stop as soon as
possible. Never depend on the driver to interrupt the process from
the keyboard.
The failure should be as close to the last line of output as you can make it, and include a key phrase like "noun failed to verb".
The best thing about these errors is that they are common across many tools, and the error messages are available in most locales. They are also clearly spelled out for each system call in the manual pages for sections 2 and 3. That is not to say they are clear to a novice, but they are consistent and can be learned.
And nearly every base tool exits with a non-zero exit code when it fails. So check the status of any command that matters, and don't run commands that don't matter. When coding applications, include sysexits.h or require sysexits.ph, as sketched below.
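A minimal sketch of both points in sh (not a script from the structure; the recipe target and message are placeholders): stop at the first failed step and hand back a meaningful exit code.
#!/bin/sh
set -e                      # stop as soon as any unchecked step fails
make -n install             # dry-run the recipe before the commit
if ! make install
then
	echo "install: failed to update this host" 1>&2
	exit 70             # EX_SOFTWARE, from sysexits.h
fi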
Investigation of failures should include cross-checks from the point-of-view of any distal inputs. Any distal part that has a way to cross-check our work should have an interface to test it ad hoc. Use these to recover from failures, and to prevent them. Provide data sources for all client services that supply the data needed to check their sanity.
When there is a possible termination-point in the process, there should be a clear on-ramp to resume the process after the issue is resolved. This may require a return to a previous step. This may even require a wholesale back-out of the work already done. Live with that, and learn to accept temporary failure as long-term success.
This is also called "learning". Note that the opposite (long-term failure) is the result of temporary success.
In the physical world we are bombarded by our senses with input data, so much so that we have to ignore most of it. In the digital world one must request data to see it.
Actions to prevent mistakes require not assuming that others have a similar understanding of the situation. Verification steps assure that the driver and their secondary agree on the status of the change.
Ignoring a chance to check a verification reduces the driver's situational awareness. This awareness is key to avoiding "normalization of deviance". If any output in the process looks funny, then stop to confirm that the output was (in some way) expected. Viewing all available data before taking actions (unplanned or planned) is the key to stable operations.
Configuration engineering is data-driven; ignoring data is always going to break your processes.
When we plan a change we predict a set of pathways the change will take: some ordered, time-boxed windows to achieve milestones. We also anticipate the resources available before the change. The success of the change depends on these plans and factors.
So avoid dealing with newly emerging requirements or lack of resources in an event-driven or uncoordinated way. Discovering "new knowledge" as part of a planned change takes you out of the envelope of pathways to a safe outcome.
When events are not flowing as you expected, you must find out why; stop, plan again, fix the metrics, and retry only if you are still in the safe window. Temporary failure is an option, this is not a short-term game.
When a command runs on a remote instance, the error-message pattern gains the instance name as a prefix:
instance: command: operation: noun: error-message
We should also carry exit codes back from remote commands, and we should build a structure to examine the exit code from each update and take action on unexpected results.
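A hedged sketch of such a structure in sh (the host list file and remote command are placeholders, not part of the msrc tools):
#!/bin/sh
# run the update on each instance and record any unexpected exit code
while read host
do
	ssh -n "$host" make -f Makefile.host install
	code=$?
	if [ "$code" -ne 0 ]
	then
		echo "$host: make: install: exited $code" 1>&2
	fi
done < hosts.list
In the real structure xapply, hxmd, or msrc would drive the instances in parallel; the point is only that every exit code is examined, not discarded.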
People running operations, development, and change management are working under rules that make sense given the context, indicators, operational status, and their organizational norms. Always look for the outcomes and messages that will cause them to take the best action after each step. Make them aware of something that is not "normal" and they may take action to avoid making it worse. Hiding failures, cross-checks, or other related data from them gives them no context to take compensatory actions.
The common GNU build is a great example of this. A README file in each product source directory is visible in the output of ls. This is being offered to the builder in a culturally normal way, because the most common action of an operator after unpacking a source directory is to cd to it and then run ls. In fact a source directory without one of these is quite rare.
Along the same lines, the file configure in the directory is usually a script built by autoconf. If the README instructs the operator to run that script, then they will usually do that, the expectation being that the operation of that script does no harm.
Moreover, that configure script reports the information it finds as part of the process of execution, and fails if it finds a missing facility.
After the product is built (installed) the operator may request the version of the application under a common command-line option, usually one of -V, --version, or less often some other option. This is compared to the expected version to assure that the update did the right thing.
This canonical chain (README to configure to -V) has changed very little in the last 20 years.
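A minimal sketch of that chain, with a placeholder program name:
$ ./configure
$ make
$ make install
$ app --version
Compare the last line of output to the version the change ticket promised.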
Your local site policy should call out which style of information each application should provide. At some point the volume of information offered creates only meaningless mismatch failures; these unexpected outputs indicate expected (unavoidable) changes in irrelevant facts such as compile-times or OS-level version details.
Close-the-loop checks take advantage of the version output. See hostlint, which may poll hundreds of tools for their installed version information. It uses a very simple table which lists the version, program path, and other details. Here are some example lines:
2.17:bin/rcsvg -V:opt:local/bin/rcsvg/rcsvg.m
...
2.26:man1/rcsvg.1:opt:local/bin/rcsvg/rcsvg.man
This can actually do a 4-way check: the installed program, the master source version, and the platform source, all against the table itself. The last line reports out-of-date revisions for the rcsvg manual page. Others could be added for the HTML documents, or any configuration files.
In fact very few files allow no comments or embedded markup for revision or version information. Having the same rcsvg version identifier is clearly not the same thing on an i386 as it is on a SPARC. A text configuration file may have been processed through a macro processor (m4, cpp, or the like) before installation.
Changes happen because we need them. I have run production machines with uptimes of more than 2,800 days. There was no compelling reason to update the operating system, so no need to reboot them.
Someone must request each change. That's not to say that the same group issues every change request. Some changes are triggered by different stakeholders than others. Local site policy should state as clearly as possible who requests different changes.
The first time a global update removes 100% of your production services will be the last automatic update ever performed. So we are assuming your updates are nearly perfectly tested and carry virtually no risk of failure; only in that case could one apply updates to large instance populations automatically.
Auto-updates are the key to speedy changes for many free software services,
as well as a few buggy OS vendors.
Take the case of a consumer application program: we can accept that the provider has checked the update with great care. For example, a browser (viz. Chrome) allows the fall-back to a different browser, which is not terrible, or the reinstallation of the original package.
The common case in a data-center is less clear: mission critical services may create down-stream impacts that are harder to recover from. Machines that don't boot make for very long recovery times. I don't apply configuration updates without people in the loop: the automation checks the updates -- it doesn't autonomously apply them.
This might be as easy as setting up a pair of nntp news servers and publishing a log of each request and the log of each change supporting that request. I've done that for more than 25 years, with absolutely no regrets. This solution also allows operators to Cc: an e-mail gateway to the news service in customer correspondence.
We can sum up the last section:
Only observation protects us from the mistakes distal events trigger.
Address the organizational pressures to choose schedule over system safety before a change starts; before it is planned, discussed or requested; before you take the job; and before you interview with the company.
Excuses you should shoot down right away:
And the hardest one: "We made a similar change last week, this change is routine." That is normalization of deviance spot-on. Circumstances may truly be similar, but that's when we miss a step. That's when our guard is down. That's when planes crash.
Lack of coordination when changing common configuration inputs produces the largest failures I've ever seen. And changing configuration inputs happens all the time: add a host, remove a switch, change the name of an application, move an individual to another group, change a password, add/delete group membership, move a home directory, update sudoers... the full list is much longer than the one here. Any of these can ruin your week.
To go faster we need better vision and knowledge. One way to get that is to use the computer itself to point out facts that contradict our mental model of the system. These take the form of loops: each loop samples a resource's state, then compares that with an automated check to see if it is the state we believe it has. Some example checks:
Each master source directory usually has a TODO file kept under revision control. The file explains what the long-term plan is for the directory. It helps new team members learn what is dead code and what needs work. Before we make a change to a master directory we review this file and update it before we start (as needed) and after we finish. Plans change as we get new requirements from adjacent layers, so we need to remove older ideas and explain new contexts. This brings distant changes into the local context and removes any out-dated ideas from that same context. It would be a fine idea to update local TODO files as a team periodically.
I couple RCS to my implementation (other local site policy implementations vary), so my first check is rcsdiff in the master source for every part of the change. After I change the source I verify that the test label is in-sync with either head or another expected revision, and that the current files match that label. If I need to hold a lock on the files I assure that each changed file is locked (by me). Another valid local site policy allows holding a lock on the recipe file to represent a lock on the whole directory.
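A hedged sketch of that first check (the directory is hypothetical; rcsdiff exits zero only when the working files match their checked-in revisions):
$ cd ~/msrc/local/bin/rcsvg
$ rcsdiff -q RCS/*,v && echo "working files match their checked-in revisions"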
The opposite side of that loop is assuring that files that are not changing
are not different or locked.
Because a locked file impacting my change may be distant from the directory where I am making my change, we require another process. I upgraded tickle(8l) (which Dave Stevens wrote for our UNIX™ group at Purdue). A tickle job, run from cron, sends e-mail notifications to negligent source committers whose locks extend beyond some maximum number of days. Tickle is also run from the command-line to check for instability as part of some plans. Let's assume that good engineers will finish their work or release their locks when prompted and/or shamed.
At layer 2 we need to assure that all the files share the same symbolic label and are in-sync with a labeled branch. The msync(8l) program encodes the local site policy to do that. We do not force a recurring task to msync each master source directory, because it generated too much noise. The level2s program has a subcommand to run msync over every layer 2 directory, which may be run ad hoc to detail open source changes.
We need to package up the code as we are going to deploy it. This is where level2s really comes in: the build subcommand creates a stable archive of the source at the current symbolic label. That archive may then be converted to an rpm, a Solaris package, or HP depot files, used as a native msrc directory, or packaged up in any other installation/update structure.
For layer 3 we do the parallel operation on packages of products. This allows larger-scope changes to be committed atomically (with a single rpm or the like). The same assurances for the state of symbolic labels are also supported.
Next we need to trace the change at layer 4. On the running computers we have some really good loop closures. First we have the system logger, syslogd, which we run to the serial console. I know this sounds funny, but that really does prevent a Bad Guy from removing the log -- because it goes over a serial cable to a console server that is connected to the data-center operations network, not the common data networks.
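A minimal sketch of the idea in syslog.conf terms (the selector is a local choice, not a prescription):
# copy important messages to the console, which is a serial line
# wired to the console server on the operations network
*.notice	/dev/console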
That means our logs from op, install, and possibly from msrc's entry script (aka local.defs) are more likely to actually make it to operations, because they don't travel the common network.
We are mitigating the cognitive consequences of computerization. The properties that make computers useful also make them hard for people to manage well.
Sometimes we slow down to maintain a safe following distance, or speed up when we see clear road. This only works when we heed the metrics we've learned to trust (for example, hostlint and the reporter which calls out hosts that didn't run it).
Lessons on the msrc tools take them in order (sh, make, m4, ssh, rdist, xapply's markup, hxmd, msrc, and lastly mmsrc).
By seeing some examples of what comes next in the early lessons we build a context to help staff engineers connect the dots for themselves. (This introduces tickle, rcsvg, mk, msync, level2s, level3s and such as they need them.)
Another culturally normal cue is the prompt's pound-sign (#) for a superuser shell, versus a dollar-sign ($) for a mortal shell.
Lack of coordination in changes happens because of the feeling that nobody ever notices, so the change/approval process must be a waste of time. The push-back creates a longer, more detailed, harder process, which has a long-term effect of data overload -- too many change trackers open, too many people involved, too much effort to make an approved change. Often this is expressed as, "I just want to get my job done." More often it results in sub rosa changes, which create wreckage as the running system is no longer in alignment with site policy.
That sets an incorrect task priority: dodging the structure rather than using it. Make the change process as real-time as possible. Change review should be mostly a peer-review, not a management review. What does a manager really know about the risk of any given change? What new information can management offer that they didn't hear from the engineers? If the managers or the technicians are withholding important information, then we have a whole new problem.
Trust your people to use the structure because the structure helps them do better work, avoid overtime, and prevents loss to their organization. Point out that other groups depend on notifications so they can engineer compensating changes, have your team estimate the impact to every other team, and make them learn from their estimation errors.
Summary: use automation to match the state of systems to your internal model, proceed faster when they match and stop when they don't.
$Id: error.html,v 1.19 2013/12/07 00:31:17 ksb Exp $ by ksb.