Linux and Unix Users Group at Virginia Tech Wiki

Infrastructure:Incident 2015-04-23

Revision as of 20:55, 3 January 2019 by Pew (talk | contribs)

In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Problems bringing the hardware back up were compounded by:

  • Maintenance deferred for far too long, which prevented some machines from booting on their own.
  • Failure to notice that the clocks on VMs were wrong because of a dead CMOS battery on wood. Kerberos rejects authentication when clocks disagree by more than its allowed skew, so it could not work properly.
  • Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
    • This is a club of Linux users; someone should step up!
  • Lack of a disaster recovery plan.
  • Undocumented systems carrying several different iterations of init scripts, none of which had been removed, and many broken, unused packages.

Some steps that should be taken to reduce the length and impact of future outages include:

  • Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
  • Install NTP on all servers.
  • Find a dedicated sysadmin.
  • Create a disaster recovery plan.
  • Reduce services to match maintenance capabilities.
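
Until NTP is running everywhere, the clock problem that broke Kerberos could be caught with a simple skew check. The following is a minimal Python sketch, not a description of anything VTLUUG actually runs: the function names are hypothetical, and the 300-second limit is an assumption based on the usual Kerberos default clock skew. It reads the transmit timestamp out of a raw 48-byte (S)NTP reply and compares it with a local clock reading.

```python
import struct
from datetime import datetime, timedelta, timezone

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_EPOCH_DELTA = 2208988800

# Assumption: 300 seconds, the usual Kerberos default for allowed clock skew.
MAX_SKEW = timedelta(seconds=300)


def parse_sntp_transmit_time(packet: bytes) -> datetime:
    """Return the server transmit timestamp from a 48-byte (S)NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    # Bytes 40-47 hold the transmit timestamp: 32-bit seconds, 32-bit fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    unix_seconds = seconds - NTP_UNIX_EPOCH_DELTA + fraction / 2**32
    return datetime.fromtimestamp(unix_seconds, tz=timezone.utc)


def skew_too_large(local: datetime, reference: datetime,
                   limit: timedelta = MAX_SKEW) -> bool:
    """Return True when the two clocks differ by more than the limit."""
    return abs(local - reference) > limit


if __name__ == "__main__":
    # Hand-built reply whose transmit time is 2015-04-23 00:00:00 UTC.
    reply = bytes(40) + struct.pack("!II", NTP_UNIX_EPOCH_DELTA + 1429747200, 0)
    reference = parse_sntp_transmit_time(reply)
    # A VM whose clock runs ten minutes fast would fail the check.
    vm_clock = reference + timedelta(minutes=10)
    print(skew_too_large(vm_clock, reference))
```

A check like this only detects drift; it does not correct it, so it would complement rather than replace installing NTP on all servers.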