Difference between revisions of "Infrastructure:Incident 2015-04-23"

From the Linux and Unix Users Group at Virginia Tech Wiki

Revision as of 03:07, 24 April 2015

In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Problems bringing hardware back up were compounded by:

  • Maintenance deferred for far too long, which prevented some machines from booting on their own.
  • Failure to notice that the clocks on VMs were wrong due to a dead CMOS battery on wood, which prevented Kerberos from working properly.
  • Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
    • This is a club of Linux users; someone should step up!
  • Lack of a disaster recovery plan.

Some steps that should be taken to reduce the length and impact of future outages include:

  • Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
  • Install NTP on all servers.
  • Find a dedicated sysadmin.
  • Create a disaster recovery plan.
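
Until NTP is installed everywhere, a quick check could flag a clock that has drifted far enough to break Kerberos. The following is only a minimal sketch, not part of VTLUUG's actual tooling: the function names are hypothetical, and the 300-second limit assumes the MIT Kerberos default clock-skew tolerance, which a site may have configured differently. It parses the transmit timestamp from a raw NTP reply and compares it with a local clock reading; fetching the reply over the network is left out.

```python
import struct

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_UNIX_OFFSET = 2208988800

# Assumption: 300 s is the default clock-skew tolerance in MIT Kerberos;
# a site may configure a different value.
MAX_SKEW_SECONDS = 300


def ntp_transmit_time(packet: bytes) -> float:
    """Return the server transmit timestamp of an NTP reply as Unix time."""
    if len(packet) < 48:
        raise ValueError("NTP packet must be at least 48 bytes")
    # The transmit timestamp occupies bytes 40-47: 32-bit seconds since 1900
    # followed by a 32-bit binary fraction of a second, both big-endian.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_UNIX_OFFSET + fraction / 2**32


def skew_too_large(local_time: float, reference_time: float,
                   limit: float = MAX_SKEW_SECONDS) -> bool:
    """Return True when the two clocks differ by more than the allowed limit."""
    return abs(local_time - reference_time) > limit
```

A monitoring script could pass `time.time()` as `local_time` and the parsed server time as `reference_time`, then alert when the result is true. Such a check only detects drift; a running NTP daemon is still what corrects it.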