Data Center Outage on 3/14/2014
Published on: March 20, 2014
On Friday morning, March 14, IT Services experienced an interruption in the services provided by the Campus Data Center. This outage affected a number of key campus applications, such as UCLA Logon (Shibboleth), URSA, and portions of our student and financial applications. Also impacted were data center customer servers hosted in our virtual environment.
UCLA Logon was shifted to our backup site at UC Berkeley, and access to related applications was generally restored by 11:00 a.m. We understand that some users continued to experience connectivity issues with their UCLA Logon-enabled applications. Clearing the cache may have solved this issue, but if this did not work for your application, please contact us at the number below.
Most applications operating in our virtual environment were fully operational by 6:30 p.m. We believe that all applications were functioning normally by 9:25 p.m.
We sincerely apologize for the difficulties this outage may have caused students, faculty and staff. IT Services is working hard with our engineers and vendors to prevent similar incidents from happening in the future.
Should you have questions or need additional information, please contact the IT Support Center at 310-825-8000.
For those interested in more technical details, the incident was triggered by the unexpected reboot of a data center network switch during routine maintenance. As designed, a standby switch took over and established connectivity within 15-20 seconds. All systems recovered from the network interruption with the exception of our production virtual server environment.
Several attempts to synchronize our distributed virtual switch with the network did not fully restore operations. Because synchronization is a lengthy process, these attempts greatly extended the recovery time. Once the virtual system environment was stabilized, restarting our more complex multi-tier applications also took several hours.
Problems related to the network switch appear to have been caused by a Cisco software bug. Difficulties with UCLA Logon after failover are likely related to stale DNS information cached by the application. The IAMUCLA team will work with the appropriate teams to improve the failover procedures for those applications. We are working with our hardware and software vendors to fully understand and improve the resiliency of our virtual environment and will share additional information as it becomes known.
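For those who maintain applications that cache DNS lookups, the following Python sketch illustrates one general way to limit how long a stale address can survive a failover: cache each lookup for a bounded time and look it up again once that time has passed. This is an illustration only, not the UCLA Logon or Shibboleth implementation. The class name TtlResolver, the 30-second default, and the injectable lookup and clock parameters are all assumptions made for the example.

```python
import socket
import time


class TtlResolver:
    """Cache hostname lookups, but only for a bounded number of seconds.

    Hypothetical helper for illustration; not part of any UCLA system.
    """

    def __init__(self, ttl_seconds=30.0, lookup=socket.gethostbyname,
                 clock=time.monotonic):
        self.ttl_seconds = ttl_seconds
        self._lookup = lookup   # injectable so the sketch can be exercised offline
        self._clock = clock
        self._cache = {}        # hostname -> (address, expiry time)

    def resolve(self, hostname):
        now = self._clock()
        entry = self._cache.get(hostname)
        if entry is not None and now < entry[1]:
            return entry[0]     # cached answer is still fresh
        # Entry is missing or expired: ask DNS again and refresh the cache.
        address = self._lookup(hostname)
        self._cache[hostname] = (address, now + self.ttl_seconds)
        return address

    def invalidate(self, hostname):
        """Drop a cached entry, for example after a connection failure."""
        self._cache.pop(hostname, None)
```

With a cache like this, an application would pick up a changed address within roughly ttl_seconds of a failover, or immediately if it calls invalidate() after a failed connection.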
Associate Vice Chancellor
Information Technology Services
University of California, Los Angeles
March 20, 2014