When writing mine I've always devolved into writing three sets: the get-er-done checklist; a list of probable problems and their resolutions; and a MUCH LONGER section about the architecture of the system, covering how it works, why things are done the way they are, probable sticking points when coming online, abstract design assumptions, and other information useful for pointing people in the right direction should something unique happen.
At my last job we were required to write doc so that even level-1 helpdesk people could bring things back up. This required checklists, which generally became out of date within 3 months of the writing. We were strongly urged to write troubleshooting guides whenever possible, but when the contingency tree gets more than three branches in it, you just can't write that doc without going abstract.
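One way to catch checklists before they quietly rot is to flag any that haven't been touched in a while. The following is only a minimal sketch: it assumes the checklists live as `*.md` files under a single directory (the layout and the extension are my assumptions, not something described above), and it uses 90 days as a stand-in for the "3 months" figure.

```python
from datetime import datetime, timedelta
from pathlib import Path

# Assumption: roughly the "3 months" after which checklists tend to go stale.
MAX_AGE = timedelta(days=90)


def stale_checklists(doc_root, now=None, max_age=MAX_AGE):
    """Return checklist files under doc_root not modified within max_age, oldest first."""
    now = now or datetime.now()
    stale = []
    # Assumption: checklists are stored as *.md files somewhere below doc_root.
    for path in Path(doc_root).rglob("*.md"):
        modified = datetime.fromtimestamp(path.stat().st_mtime)
        if now - modified > max_age:
            stale.append((modified, path))
    return [path for _modified, path in sorted(stale)]
```

Modification time is a crude proxy for "reviewed", but it is enough to produce a list of documents someone should re-walk against the live system.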
When leaving my last job, I turned in a 100 page 'how to do my job' manual before I left. It had the abstract stuff in it, design philosophy, as well as integration points. Since I was presumably writing for another sysadmin who was going to replace me, I aimed it at someone who could take abstract notions and turn them into concrete actions.
Five years have passed and I find my opinion on this has shifted somewhat. Both Document as Manual and Document as Checklist have very valuable places in the hierarchy of documentation and both need to be produced. They target very different audiences, though.
Document as Checklist
The target market for this kind of documentation is coworkers who want to know how to do a thing. It serves two cases:
- Coworkers who just want to know how to do a thing and don't have time to thumb through a fifteen page manual and figure out the steps for themselves.
- Procedures that are fairly complex in steps, but only need to be run once in a while.
Impatience is the driver for the first kind. Maybe your coworker doesn't actually want to know why the output has to be piped through a 90 character perl regex, just that it has to be in order to close the ticket. Definitely include a statement like, "For a detailed explanation for why this workflow looks like this, follow this link," in the checklist for those that do want to know why.
The second point is for procedures that aren't run often but contain pitfalls. The checklist acts as a map to avoid the Certain Doom of just winging it. If the checklist is kept in a documentation repo, it saves having to search email for the time the old admin sent out a HOWTO.
In my opinion good checklist-documentation also includes sections on possible failure points, and responses to those failures. This can make the document rather large and trigger TL;DR responses in coworkers, so I find that making the failure-modes and their responses a link from the checklist rather than on the page itself makes for an unscary checklist. Embrace hypertextuality.
Document as Manual
The target market for this kind of documentation is people who want to learn more about how a system works. The how-to-do-a-thing style documentation should be able to be derived from this documentation, but more commonly I see it as a supplement to checklist-style documentation to back up the decisions made in the workflow.
This is the documentation where we include such chewy pieces as:
- Explaining why it's configured this way.
  - This section may include such non-technical issues as the politics surrounding how the whole thing was purchased and installed.
- Explaining common failure modes and their responses.
- Explaining any service-level-agreements, both written and de facto.
  - De facto: "if this fails during finals week it's a drop-everything problem. If during summer break, go back to sleep and deal with it in the morning."
- Setting out upgrade and refactoring goals.
  - The politics may be different later, so why don't we fix some of the bad ideas that were introduced in the beginning?
These are all very useful for obtaining a comprehensive understanding of the whole system. You don't need a comprehensive understanding to run simple human-automation tasks; you need it to figure out why something broke the way it did and to have an idea of where to change it so it doesn't do that again.
You also mentioned Disaster Recovery documentation that has to be a checklist.
I understand, you have my sympathies.
Yes, DR documentation does need to be as checklist-like as possible.
Yes, DR documentation is the most resistant to checklisting due to how many ways things can break.
If your DR checklist looks like:
- Call Dustin or Karen.
- Explain the problem.
- Stand back.
You have a problem. That is not a checklist, that is an admission that the recovery of this system is so complex it takes an architect to figure out. Sometimes that's all you can do, but try to avoid it if at all possible.
Ideally DR documentation contains procedure checklists for a few different things:
- Triage procedures to figure out what went wrong, which will help identify...
- Recovery procedures for certain failure-cases. Which is supported by...
- Recovery scripts written well beforehand to help minimize human error during recovery.
- Manual-style documentation about the failure cases, why they occur and what they mean.
Triage procedures are sometimes all the DR documentation you can make for some systems. But having it means the 4am call-out will be more intelligible and the senior engineer doing the recovery will be able to get at the actual problem faster.
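A triage procedure can be thought of as an ordered list of checks, each of which points at a recovery procedure when it fails. Here is a minimal sketch of that idea; the `TriageStep` structure, the check functions and the procedure names are all hypothetical placeholders, and a real one would probe actual services.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class TriageStep:
    question: str               # what the on-call person is checking
    check: Callable[[], bool]   # returns True when this part is healthy
    procedure: str              # recovery procedure to follow if the check fails


def triage(steps: List[TriageStep]) -> Optional[TriageStep]:
    """Run the checks in order and return the first failing step, or None if all pass."""
    for step in steps:
        if not step.check():
            return step
    return None


def example_steps() -> List[TriageStep]:
    # Hypothetical checks: real ones would ping hosts, query databases, etc.
    return [
        TriageStep("Is the database host reachable?", lambda: True, "DR-01 host down"),
        TriageStep("Is the database accepting logins?", lambda: False, "DR-02 service down"),
        TriageStep("Is replication current?", lambda: True, "DR-03 replica lag"),
    ]
```

Calling `triage(example_steps())` returns the second step, so the 4am call-out can open with "DR-02 service down" rather than "it's broken".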
Some failure cases have straightforward recovery procedures. Document them. While documenting them you may find cases where lists of commands are being entered in a specific order, which is a great use-case for scripting; it can turn a 96 point recovery procedure into a 20 point one. You'll never figure out if you can script something until you map the recovery procedure action by action.
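As a rough illustration of turning a run of ordered commands into a single scripted step, the sketch below runs a list of commands in sequence and stops at the first failure so nothing later runs against a half-recovered system. The `run_recovery` helper and the shape of its `steps` argument are hypothetical; the actual commands would come from your own mapped procedure.

```python
import subprocess


def run_recovery(steps, runner=subprocess.run):
    """Run each (description, argv) step in order, stopping at the first failure.

    Returns the number of steps that completed successfully.
    """
    for number, (description, argv) in enumerate(steps, start=1):
        print(f"[{number}/{len(steps)}] {description}")
        result = runner(argv)
        if result.returncode != 0:
            print(f"Step {number} failed with exit code {result.returncode}; stopping.")
            return number - 1
    return len(steps)
```

The return value tells the person on call exactly which numbered point in the written procedure to pick up from by hand.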
Manual-style documentation for failure cases is the last-ditch backstop to be used when there ARE no recovery procedures or the recovery procedures failed. It provides the google-hints needed to maybe find someone else who had that problem and what they did to fix it.
Unless there's going to be a lot of content switching tiers, I'd recommend separate wikis, as MediaWiki (MW) was never built for solid access control. Read http://www.mediawiki.org/wiki/Security_issues_with_authorization_extensions first and decide whether it's worth the effort. There are a lot of warnings there, and exploits that can circumvent the protection methods.
If you do go for it: have a look at the Namespace Lockdown extension. It lets you set group access control based on the namespace that the pages are in, then you can have one namespace for each tier. I have used this in the past (not sure how well it is supported on the current MW version, though). It works, but it's fiddly to configure and manage, especially if you've got lots of users.
If you go for two instances: You can certainly run more than one MW install on a single host, so long as you maintain good separation. Set them up as separate virtual hosts, with their own hostname, separate databases (and DB credentials) and you're away.
However, if you then want SSL, you'll need to generate a certificate for each (or use an internal wildcard one) and, unless your web server and clients support SNI, give each instance its own IP address as well as hostname.
The look+feel (skin) can easily be copied between the two instances, as it's just a PHP file with a subfolder. Get it how you like it on one, then copy it across and add it to your new config.
Since 2003 I've been documenting everything in our in-house wiki:
- Servers
- Workflows
  - e.g. how to add or delete a user and give him/her access to all relevant services
- Important links
- Emergency instructions
  - what to do if the intranet server/internet/web server/etc. is down
Important:
Choose a wiki engine with easy export to PDF!
It's no use if you are on holiday, the server running your wiki is down, and no one knows what to do because your documentation is offline.
Have a look at TWiki, DokuWiki or MediaWiki.
BTW:
There is an OpenOffice.org plugin to write directly to MediaWiki, which is very convenient.
EDIT:
It's also nice to write down some notes in /home/adminuser/maintenance. This is quick to do and can be very helpful if several admins work on a server, e.g.: