Difference between revisions of "Infrastructure:Incident 2015-04-23"

Revision as of 03:07, 24 April 2015

In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Problems bringing the hardware back up were compounded by:

* Maintenance deferred for far too long, which prevented some machines from booting on their own.
* Failure to notice the clocks on VMs were wrong due to a dead CMOS battery on wood, which prevented Kerberos from working properly.
* '''Lack of a dedicated sysadmin''' with the time and experience necessary to correct these problems.
** This is a club of Linux users, someone should step up!
* Lack of a disaster recovery plan.

Some steps that should be taken to reduce the length and impact of future outages include:

* Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
* Install NTP on all servers.
* Find a dedicated sysadmin.
* Create a disaster recovery plan.
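
Until NTP is running everywhere, a quick check like the one below might catch the kind of clock drift that broke Kerberos during this outage. This is only a rough sketch, not something that was run during the incident: it sends a bare SNTP query and compares the answer with the local clock. The server name <code>pool.ntp.org</code> and all function names are assumptions for illustration, and the 300-second tolerance is the MIT Kerberos default <code>clockskew</code>, which may not match what our KDC is configured with.

```python
import socket
import struct
import time

NTP_EPOCH_OFFSET = 2208988800  # seconds between 1900-01-01 (NTP) and 1970-01-01 (Unix)
KRB5_DEFAULT_CLOCKSKEW = 300   # MIT Kerberos default tolerance in seconds (assumed here)


def ntp_transmit_time(packet: bytes) -> float:
    """Return the server transmit timestamp (Unix seconds) from an SNTP reply."""
    if len(packet) < 48:
        raise ValueError("SNTP reply is shorter than 48 bytes")
    # The transmit timestamp is the last 8 bytes of the 48-byte header:
    # 32 bits of whole seconds followed by 32 bits of fraction.
    seconds, fraction = struct.unpack("!II", packet[40:48])
    return seconds - NTP_EPOCH_OFFSET + fraction / 2**32


def query_ntp(server: str, timeout: float = 5.0) -> float:
    """Ask an NTP server for the time; returns Unix seconds."""
    # First byte 0x1b: leap indicator 0, version 3, mode 3 (client).
    request = b"\x1b" + 47 * b"\0"
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.settimeout(timeout)
        sock.sendto(request, (server, 123))
        reply, _ = sock.recvfrom(512)
    return ntp_transmit_time(reply)


def skew_is_safe(local_time: float, reference_time: float,
                 tolerance: float = KRB5_DEFAULT_CLOCKSKEW) -> bool:
    """True if the local clock is within the tolerance of the reference."""
    return abs(local_time - reference_time) <= tolerance


if __name__ == "__main__":
    # Hypothetical server choice; substitute whatever the club settles on.
    reference = query_ntp("pool.ntp.org")
    local = time.time()
    status = "OK" if skew_is_safe(local, reference) else "TOO FAR OFF FOR KERBEROS"
    print(f"local clock differs by {local - reference:+.1f} s: {status}")
```

This ignores network round-trip delay, which is fine when the question is whether a clock is off by minutes or years. It is a stopgap for spotting a bad clock on a host like wood, not a replacement for actually running an NTP daemon.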

@@ Line 2: / Line 2: @@
 * Maintenance deferred for far too long, which prevented some machines from booting on their own.
 * Failure to notice the clocks on VMs were wrong due to a dead CMOS battery on wood, which prevented Kerberos from working properly.
-* Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
+* '''Lack of a dedicated sysadmin''' with the time and experience necessary to correct these problems.
+** This is a club of Linux users, someone should step up!
 * Lack of a disaster recovery plan.
