Difference between revisions of "Infrastructure:Incident 2015-04-23"
imported>Telnoratti m (Telnoratti moved page Incident:2015-04-23 to Infrastructure:2015-04-23)
Latest revision as of 20:57, 3 January 2019
In the early morning of April 23, 2015, Whittemore Hall lost power, which brought down VTLUUG infrastructure. Bringing the hardware back up was made harder by:
- Maintenance deferred for far too long, which prevented some machines from booting on their own.
- Failure to notice that the clocks on the VMs were wrong because of a dead CMOS battery on wood; the resulting clock skew prevented Kerberos from working properly.
- Lack of a dedicated sysadmin with the time and experience necessary to correct these problems.
  - This is a club of Linux users; someone should step up!
- Lack of a disaster recovery plan.
- Undocumented systems with several different iterations of init scripts, none of which had been removed, and many broken, unused packages.
Some steps that should be taken to reduce the length and impact of future outages include:
- Rebuild VMs like milton, which have had too much maintenance deferred, and move them to cyberdelia.
- Install NTP on all servers.
- Find a dedicated sysadmin.
- Create a disaster recovery plan.
- Reduce services to match maintenance capabilities.
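
The clock problem above is the kind of fault a simple check can catch before Kerberos starts failing. The following is a minimal sketch, not something the club is known to run: it assumes the MIT Kerberos default `clockskew` tolerance of 300 seconds, and the function names and the timestamps in the usage example are hypothetical. How the per-host times are collected is left out.

```python
from datetime import datetime, timedelta, timezone

# Assumption: MIT Kerberos default tolerance ("clockskew" in krb5.conf) of 300 seconds.
DEFAULT_MAX_SKEW = timedelta(seconds=300)


def clock_skew(host_time, reference_time):
    """Return the absolute difference between two timezone-aware datetimes."""
    return abs(host_time - reference_time)


def hosts_with_bad_clocks(host_times, reference_time, max_skew=DEFAULT_MAX_SKEW):
    """Return the sorted names of hosts whose clock is off by more than max_skew.

    host_times maps a host name to the time that host reported (hypothetical input).
    """
    return sorted(
        name
        for name, reported in host_times.items()
        if clock_skew(reported, reference_time) > max_skew
    )


if __name__ == "__main__":
    # Invented readings for illustration only; not data from the incident.
    reference = datetime(2015, 4, 23, 12, 0, 0, tzinfo=timezone.utc)
    readings = {
        "milton": reference + timedelta(seconds=10),
        "wood": reference - timedelta(days=400),
    }
    for host in hosts_with_bad_clocks(readings, reference):
        print(f"{host}: clock skew exceeds {DEFAULT_MAX_SKEW}")
```

A check like this only reports skew; installing NTP on all servers, as listed above, is what keeps the clocks correct.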