Difference between revisions of "Infrastructure:Incident 2015-04-23"
imported>Mutantmonkey (Created page with "In the early morning of April 23, 2015, Whittemore lost power and brought down VTLUUG infrastructure. Issues bringing hardware back up were compounded by: * Maintenance deferr...") |
Revision as of 03:04, 24 April 2015
In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Bringing the hardware back up was complicated by:
- Maintenance deferred for far too long, which prevented some machines from booting on their own.
- Failure to notice that the clocks on VMs were wrong because of a dead CMOS battery on wood. The resulting clock skew prevented Kerberos from working properly.
- Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
- Lack of a disaster recovery plan.
Some steps that should be taken to reduce the length and impact of future outages include:
- Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
- Install NTP on all servers.
- Find a dedicated sysadmin.
- Create a disaster recovery plan.
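The clock problem above is easy to check for once someone knows to look. As a rough sketch only (not part of VTLUUG's tooling), the following compares the local clock against an NTP server with a single SNTP query and flags an offset larger than the tolerance Kerberos allows. The function names are hypothetical, and the 300-second limit assumes the common Kerberos default for clock skew; the actual limit depends on the realm's configuration.

```python
import socket
import struct
import time

# Seconds between the NTP epoch (1900-01-01) and the Unix epoch (1970-01-01).
NTP_EPOCH_OFFSET = 2208988800

# Assumption: the common Kerberos default tolerance of five minutes.
KERBEROS_MAX_SKEW = 300


def ntp_to_unix(seconds, fraction):
    """Convert an NTP timestamp (32-bit seconds, 32-bit fraction) to Unix time."""
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def parse_transmit_time(packet):
    """Extract the server's transmit timestamp (bytes 40-47) from an NTP reply."""
    if len(packet) < 48:
        raise ValueError("NTP packet is shorter than 48 bytes")
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return ntp_to_unix(seconds, fraction)


def query_offset(server, timeout=5.0):
    """Return an approximate offset (server time minus local time) in seconds."""
    # First byte 0x23: leap indicator 0, version 4, mode 3 (client).
    request = b"\x23" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sent = time.time()
        sock.sendto(request, (server, 123))
        packet, _ = sock.recvfrom(512)
        received = time.time()
    # Crude estimate: compare the server's time to the round-trip midpoint.
    return parse_transmit_time(packet) - (sent + received) / 2


def skew_ok(offset, limit=KERBEROS_MAX_SKEW):
    """True if the absolute clock offset is within the allowed skew."""
    return abs(offset) <= limit
```

For example, `skew_ok(query_offset("pool.ntp.org"))` would report whether the machine it runs on is close enough to the queried server; the hostname is only an illustration. A check like this complements running an NTP daemon and does not replace it, since it only detects drift and does nothing to correct it.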