Difference between revisions of "Infrastructure:Incident 2015-04-23"

From the Linux and Unix Users Group at Virginia Tech Wiki
imported>Mutantmonkey
imported>Telnoratti
m (Telnoratti moved page Incident:2015-04-23 to Infrastructure:2015-04-23)

Revision as of 03:13, 24 April 2015

In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Bringing hardware back up was made harder by:

  • Maintenance deferred for far too long, which prevented some machines from booting on their own.
  • Failure to notice that the clocks on VMs were wrong because of a dead CMOS battery on wood, which prevented Kerberos from working properly.
  • Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
    • This is a club of Linux users; someone should step up!
  • Lack of a disaster recovery plan.
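
The clock problem above can be illustrated with a minimal sketch. Kerberos rejects authentication when the clocks of the two parties differ by more than a configured tolerance. The 300-second value below is the MIT Kerberos default and is an assumption about this site's configuration. The function names and timestamps are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: 300 seconds is the MIT Kerberos default clock-skew tolerance;
# the actual value configured on VTLUUG hosts is not stated in this report.
MAX_SKEW = timedelta(seconds=300)


def clock_skew(local: datetime, reference: datetime) -> timedelta:
    """Return the absolute difference between two timezone-aware timestamps."""
    return abs(local - reference)


def kerberos_would_reject(local: datetime, reference: datetime,
                          max_skew: timedelta = MAX_SKEW) -> bool:
    """Return True if the skew exceeds the tolerance (hypothetical helper)."""
    return clock_skew(local, reference) > max_skew


if __name__ == "__main__":
    # Illustrative timestamps only: a VM whose clock is two hours behind.
    reference = datetime(2015, 4, 23, 12, 0, 0, tzinfo=timezone.utc)
    drifted = reference - timedelta(hours=2)
    print(kerberos_would_reject(drifted, reference))
```

In practice the reference time would come from a trusted time source rather than a hard-coded value; the sketch only shows why a clock that is off by hours breaks authentication while one that is off by a few seconds does not.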

Some steps that should be taken to reduce the length and impact of future outages include:

  • Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
  • Install NTP on all servers so that clocks stay synchronized and Kerberos keeps working.
  • Find a dedicated sysadmin.
  • Create a disaster recovery plan.