Infrastructure:Incident 2015-04-23
In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Problems bringing hardware back up were compounded by:
- Maintenance deferred for far too long, which prevented some machines from booting on their own.
- Failure to notice that the clocks on VMs were wrong due to a dead CMOS battery on wood, which prevented Kerberos from working properly.
- Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
- This is a club of Linux users; someone should step up!
- Lack of a disaster recovery plan.
- Undocumented systems with several different iterations of init scripts, none of which had been removed, and many broken, unused packages.
Some steps that should be taken to reduce the length and impact of future outages include:
- Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
- Install NTP on all servers.
- Find a dedicated sysadmin.
- Create a disaster recovery plan.
- Reduce services to match maintenance capabilities.
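
Until NTP is installed everywhere, the clock problem that broke Kerberos can at least be detected early. The following is a minimal sketch, not a replacement for a real NTP daemon: it sends a single SNTP query and compares the answer with the local clock. The 300-second tolerance matches the default `clockskew` in MIT Kerberos; the server name `pool.ntp.org` and all function names are assumptions chosen for illustration, and the sketch ignores network round-trip delay, which is negligible next to a five-minute tolerance.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800
# Default maximum clock skew tolerated by MIT Kerberos, in seconds.
KERBEROS_MAX_SKEW = 300


def build_sntp_request() -> bytes:
    """Return a 48-byte SNTP client request (LI=0, version 3, mode 3)."""
    return b"\x1b" + 47 * b"\0"


def parse_transmit_time(packet: bytes) -> float:
    """Extract the server's transmit timestamp as Unix time."""
    if len(packet) < 48:
        raise ValueError("SNTP reply is shorter than 48 bytes")
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def skew_too_large(local_time: float, server_time: float,
                   tolerance: float = KERBEROS_MAX_SKEW) -> bool:
    """Return True if the two clocks differ by more than the tolerance."""
    return abs(local_time - server_time) > tolerance


def query_server_time(host: str = "pool.ntp.org", timeout: float = 5.0) -> float:
    """Ask an NTP server (hypothetical default host) for the current time."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(build_sntp_request(), (host, 123))
        packet, _ = sock.recvfrom(512)
    return parse_transmit_time(packet)
```

A periodic job on each VM could call `skew_too_large(time.time(), query_server_time())` and alert when it returns `True`, which would have flagged the dead CMOS battery before Kerberos started failing.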