Difference between revisions of "Infrastructure:Incident 2015-04-23"

From the Linux and Unix Users Group at Virginia Tech Wiki
imported>Mutantmonkey

Revision as of 03:04, 24 April 2015

In the early morning of April 23, 2015, Whittemore Hall lost power and brought down VTLUUG infrastructure. Issues bringing hardware back up were compounded by:

  • Maintenance deferred for far too long, which prevented some machines from booting on their own.
  • Failure to notice that the clocks on VMs were wrong because of a dead CMOS battery on wood. Kerberos rejects authentication when clock skew exceeds its tolerance (five minutes by default), so it stopped working properly.
  • Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
  • Lack of a disaster recovery plan.

Some steps that should be taken to reduce the length and impact of future outages include:

  • Rebuild VMs such as milton, whose maintenance has been deferred for too long, and move them to cyberdelia.
  • Install and run an NTP client on all servers so that clocks stay synchronized.
  • Find a dedicated sysadmin.
  • Create a disaster recovery plan.
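
Until NTP is in place everywhere, the clock problem described above could be caught with a simple skew check. The following is a minimal sketch, not part of the VTLUUG tooling: the function names and the sample clock readings are hypothetical, and the 300-second limit assumes the common Kerberos default tolerance, which a site may have changed.

```python
from datetime import datetime, timedelta, timezone

# Assumption: the Kerberos clock skew tolerance is the common default of
# 300 seconds. A site may configure a different value.
MAX_SKEW = timedelta(seconds=300)


def skew_exceeds_limit(local_time, reference_time, max_skew=MAX_SKEW):
    """Return True when the two clocks differ by more than max_skew."""
    return abs(local_time - reference_time) > max_skew


def hosts_outside_tolerance(readings, reference_time, max_skew=MAX_SKEW):
    """Return the sorted names of hosts whose clock is outside the tolerance.

    readings maps a host name to the time that host reported.
    """
    return sorted(
        host
        for host, reported in readings.items()
        if skew_exceeds_limit(reported, reference_time, max_skew)
    )


if __name__ == "__main__":
    # Hypothetical readings for illustration only; these are not measurements
    # taken during the incident.
    reference = datetime(2015, 4, 23, 12, 0, 0, tzinfo=timezone.utc)
    readings = {
        "wood": datetime(2010, 1, 1, 0, 0, 0, tzinfo=timezone.utc),
        "milton": reference + timedelta(seconds=2),
        "cyberdelia": reference - timedelta(seconds=1),
    }
    print(hosts_outside_tolerance(readings, reference))  # prints ['wood']
```

How the readings are gathered is left open here. Running a check like this after every power event would flag a host whose clock has reset before Kerberos starts failing.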