ECC RAM can recover from small bit errors by utilizing parity bits. Since servers are a shared resource where up-time and reliability are important, ECC RAM is generally used, with only a modest difference in price. ECC RAM is also used in CAD/CAM workstations, where small bit errors could cause calculation mistakes which become much more significant problems when a design goes to manufacturing.
It's really, really, really hard. It requires a very complete audit. If you're very sure the departed admin left something behind that'll go boom, or that will require their re-hire because they're the only one who can put out a fire, then it's time to assume you've been rooted by a hostile party. Treat it like a group of hackers came in and stole stuff, and you have to clean up after their mess. Because that's what it is.
- Audit every account on every system to ensure it is associated with a specific entity.
  - Accounts that appear to be associated with systems but that no one can account for should be mistrusted.
  - Accounts that aren't associated with anything need to be purged (this needs to be done anyway, but it is especially important in this case).
- Change any and all passwords they might conceivably have come into contact with.
  - This can be a real problem for utility accounts, as those passwords tend to get hard-coded into things.
  - If they were a helpdesk type responding to end-user calls, assume they have the password of anyone they assisted.
  - If they had Enterprise Admin or Domain Admin rights in Active Directory, assume they grabbed a copy of the password hashes before they left. These can be cracked so fast now that a company-wide password change will need to be forced within days.
  - If they had root access to any *nix boxes, assume they walked off with the password hashes.
  - Review all public-key SSH key usage to ensure their keys are purged, and audit whether any private keys were exposed while you're at it.
  - If they had access to any telecom gear, change any router/switch/gateway/PBX passwords. This can be a royal pain, as it can involve significant outages.
- Fully audit your perimeter security arrangements.
  - Ensure all firewall holes trace to known, authorized devices and ports.
  - Ensure all remote access methods (VPN, SSH, BlackBerry, ActiveSync, Citrix, SMTP, IMAP, WebMail, whatever) have no extra authentication tacked on, and fully vet them for unauthorized access methods.
  - Ensure remote WAN links trace to fully employed people, and verify it. Especially wireless connections. You don't want them walking off with a company-paid cell-modem or smartphone. Contact all such users to ensure they have the right device.
- Fully audit internal privileged-access arrangements. These are things like SSH/VNC/RDP/DRAC/iLO/IPMI access to servers that general users don't have, or any access to sensitive systems like payroll.
- Work with all external vendors and service providers to ensure contacts are correct.
  - Ensure they are eliminated from all contact and service lists. This should be done anyway after any departure, but is extra-important now.
  - Validate that all contacts are legitimate and have correct contact information; this is to find ghosts that can be impersonated.
- Start hunting for logic bombs.
  - Check all automation (task schedulers, cron jobs, UPS call-out lists, or anything that runs on a schedule or is event-triggered) for signs of evil. By "all" I mean all. Check every single crontab. Check every single automated action in your monitoring system, including the probes themselves. Check every single Windows Task Scheduler, even on workstations. Unless you work for the government in a highly sensitive area you won't be able to afford "all"; do as much as you can.
  - Validate key system binaries on every server to ensure they are what they should be. This is tricky, especially on Windows, and nearly impossible to do retroactively on one-off systems.
  - Start hunting for rootkits. By definition they're hard to find, but there are scanners for this.
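The SSH-key sweep above lends itself to simple scripting. Below is a minimal, hypothetical Python sketch: it flags `authorized_keys` entries whose comment field isn't on an approved list. The function name and the approved-list approach are my own illustration, not part of any standard tool, and matching on the comment is deliberately coarse (comments are trivially forged) — a real audit should compare key fingerprints against an inventory.

```python
def unexpected_keys(authorized_keys_text, approved_comments):
    """Return authorized_keys lines whose trailing comment isn't approved.

    A coarse first pass for the SSH-key audit: real audits should match
    key fingerprints, since the comment field can be set to anything.
    """
    suspects = []
    for line in authorized_keys_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments in the file itself
        fields = line.split()
        # key lines are: <type> <base64-key> [comment]
        comment = fields[-1] if len(fields) >= 3 else "<no comment>"
        if comment not in approved_comments:
            suspects.append(line)
    return suspects
```

Feed it the contents of each user's `~/.ssh/authorized_keys` (and the system-wide file, if configured) along with the set of comments you expect, and review everything it returns by hand.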
The decision to kick off an audit of this incredible scope needs to be made at a very high level. The decision to treat this as a potential criminal case will be made by your Legal team. If they elect to do some preliminary investigation first, go for it. Start looking.
If you find any evidence, stop immediately.
- Notify your legal team as soon as you find something likely.
- The decision to treat it as a criminal case will be made at that time.
- Further action by untrained hands (you) can spoil evidence and you don't want that, not unless you want the perp to walk free.
- If outside security experts are retained, you are their local expert. Work with them, to their direction. They understand the legal requirements for evidence, you do not.
- There will be a lot of negotiation between the security experts, your management, and legal counsel. That's expected, work with them.
But really, how far do you have to go? This is where risk management comes into play. Simplistically, it is the practice of balancing the expected cost of a loss against the cost of mitigating it. Sysadmins do this when we decide where to put off-site backups: a bank safety-deposit box vs. an out-of-region datacenter. Figuring out how much of this list needs following is an exercise in risk management.
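To make that balancing act concrete, it reduces to a back-of-the-envelope expected-loss calculation. All numbers below are hypothetical, chosen only to illustrate the comparison:

```python
def expected_loss(probability, damage):
    """Expected cost of an incident: P(incident) times the cost if it occurs."""
    return probability * damage

# Hypothetical figures for illustration only.
audit_cost = 50_000       # full forensic audit by outside consultants
p_evil     = 0.02         # estimated chance the departed left a logic bomb
damage     = 5_000_000    # estimated cost if the logic bomb goes off

# Audit when the expected loss exceeds what the audit costs.
decision = "audit" if expected_loss(p_evil, damage) > audit_cost else "accept the risk"
```

With these made-up numbers, the expected loss (0.02 × $5M = $100k) exceeds the $50k audit, so the full circus pays for itself; drop the probability to 0.002 and it no longer does.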
In this case the assessment will start with a few things:
- The expected skill level of the departed
- The access of the departed
- The expectation that evil was done
- The potential damage of any evil
- Regulatory requirements for reporting perpetrated evil vs preemptively found evil. Generally you have to report the former, but not the latter.
The decision of how far down the above rabbit-hole to dive will depend on the answers to these questions. For routine admin departures where expectation of evil is very slight, the full circus is not required; changing admin-level passwords and re-keying any external-facing SSH hosts is probably sufficient. Again, corporate risk-management security posture determines this.
For admins who were terminated for cause, or where evil cropped up after their otherwise normal departure, the circus becomes more necessary. The worst-case scenario is a paranoid BOFH-type who has been notified that their position will be made redundant in 2 weeks, as that gives them plenty of time to get ready; in circumstances like these Kyle's idea of a generous severance package can mitigate all kinds of problems. Even paranoids can forgive a lot of sins after a check containing four months' pay arrives. That check will probably cost less than the security consultants needed to ferret out their evil.
But ultimately, it comes down to the cost of determining if evil was done versus the potential cost of any evil actually being done.
The CMU-Intel paper you cited shows (on page 5) that the error rate depends heavily on the part number / manufacturing date of the DRAM module and varies by a factor of 10-1000. There are also some indications that the problem is much less pronounced in recently (2014) manufactured chips.
The number '9.4x10^-14' that you cited was used in the context of a proposed theoretical mitigation mechanism called "PARA" (that might be similar to an existing mitigation mechanism pTRR (pseudo Target Row Refresh)) and is irrelevant to your question, because PARA has nothing to do with ECC.
A second CMU-Intel paper (page 10) mentions the effects of different ECC algorithms on error reduction (factor 10^2 to 10^5, possibly much more with sophisticated memory tests and "guardbanding").
ECC effectively turns the Row Hammer exploit into a DoS attack. 1-bit errors will be corrected by ECC, and as soon as a non-correctable 2-bit error is detected the system will halt (assuming SECDED ECC).
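That correct-one/detect-two behaviour is easy to see in a toy model. The sketch below is an extended Hamming(8,4) code in Python — the same SECDED idea real ECC DIMMs apply at (72,64) scale, not any vendor's actual implementation:

```python
def encode(nibble):
    """Encode 4 data bits as an 8-bit extended Hamming codeword (SECDED)."""
    d = [(nibble >> i) & 1 for i in range(4)]
    p1 = d[0] ^ d[1] ^ d[3]          # covers codeword positions 1,3,5,7
    p2 = d[0] ^ d[2] ^ d[3]          # covers positions 2,3,6,7
    p3 = d[1] ^ d[2] ^ d[3]          # covers positions 4,5,6,7
    code = [p1, p2, d[0], p3, d[1], d[2], d[3]]
    overall = 0
    for bit in code:
        overall ^= bit               # extra parity bit enables double-error *detection*
    return code + [overall]

def decode(codeword):
    """Return (status, nibble): status is 'ok', 'corrected', or 'uncorrectable'."""
    c = list(codeword[:7])
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s3 = c[3] ^ c[4] ^ c[5] ^ c[6]
    syndrome = s1 + 2 * s2 + 4 * s3  # 1-based position of a single-bit error
    overall = 0
    for bit in codeword:
        overall ^= bit
    if syndrome == 0 and overall == 0:
        status = "ok"
    elif overall == 1:               # odd parity: a single, correctable flip
        if syndrome:
            c[syndrome - 1] ^= 1
        status = "corrected"
    else:                            # even parity but nonzero syndrome: two flips
        status = "uncorrectable"     # this is where a real SECDED system halts
    nibble = c[2] | (c[4] << 1) | (c[5] << 2) | (c[6] << 3)
    return status, nibble
```

Flip one bit of a codeword and decoding reports `"corrected"` with the original data intact; flip two and it reports `"uncorrectable"` — on a real server, that is the halt described above, hence the DoS.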
A solution is to buy hardware that supports pTRR or TRR. See a recent Cisco blog post about Row Hammer. At least some manufacturers seem to have one of these mitigation mechanisms built into their DRAM modules, but keep it deeply hidden in their specs. To answer your question: ask the vendor.
Faster refresh rates (32ms instead of 64ms) and aggressive Patrol Scrub intervals help, too, but would have a performance impact. But I don't know of any server hardware that actually allows fine-tuning these parameters.
I guess there's not much you can do on the operating-system side except terminating suspicious processes with constantly high CPU usage and high cache-miss rates.