Smile Internet has prided itself on its high level of uptime over the years. We have worked with our data center partners to provide you with high-quality Internet services at affordable prices.
Our data center is a telco-grade facility, with full battery backup and a natural gas generator, redundant Internet connections, FM200 fire suppression, and a wealth of other technologies that keep things humming along 24 hours a day. Over the past week, our data center experienced a series of problems, all of them unrelated and unforeseen. While they take precautions to plan for outages of any type, the problems that occurred this week were severe enough to take Smile and a number of other customers offline for significant periods of time. Below is a letter from the data center:
Over the past week, our Bellingham customers have seen many issues with our stability and performance. Our redundancies and staff were tested to their limits. I hope that by explaining this week's events, you will understand that these were not normal conditions.
CSS's Bellingham data center is connected to Seattle via a redundant fiber loop, with redundant upstream connectivity through both Bellingham and Seattle.
On the Friday of Memorial Day weekend, one of our upstream providers had a hardware failure that caused routing loops and slowdowns on our Seattle upstream connection. We were able to divert all traffic through our Bellingham data center and correct the performance issue with minimal customer impact.
On Memorial Day, our redundant Gigabit link failed, with no customer impact. The equipment was replaced and redundancy was restored.
On Wednesday at approximately 4:15 pm, a high-density distribution switch had a processor failure (Supervisor III module) and became completely unresponsive, disabling a large majority of our fiber connectivity. Ironically, an upgraded processor and a spare had been ordered the day before and were on their way to our facilities, with arrival expected Thursday. Using available resources, we were able to return most customers to operation by 7:00 pm, and we picked up additional equipment in Seattle to restore the final few customers. All customers were returned to operation at approximately 1:00 am. Diverting around this failure required placing multiple standby devices in operation, flattening the VLAN, and dropping our redundant link to prevent routing loops.
On Thursday morning at 9:30, the additional switching devices overloaded a circuit breaker and lost power. The load was split across multiple power sources and power was restored. As the switches returned to operation, the combination of multiple vendors' products and customer devices started a packet storm, causing random outages and stability issues. At 11:30 am our replacement processor arrived, and customers were returned to the primary equipment. At that time we opted not to reconnect our backup link, scheduling that work for a maintenance window on Sunday at 11:00 pm so we could make additional changes and prevent further circuit drops or routing loops.
On Sunday at 2:30 pm, our primary link carrier had a failure between Bellingham and Seattle. Multiple carrier circuits went down without triggering alarms or port notifications, which prevented our backup Internet link from taking over full connectivity; with our redundant link still down, traffic flow was encumbered again. Technicians from CSS and the carrier were on-site within 20 minutes and began diagnostics. Once the issue was identified, we rebuilt the redundant link and re-established routing, then worked with our carrier to restore the primary link. Staff remained on-site testing and verifying connectivity and reorganizing to increase stability. Performance could not be tested under load, but unloaded performance was at expected speeds.
On Monday morning, CSS monitored the network and noticed issues with our primary link's performance. Working with our carriers and transitioning between our backup and primary links, we continued to tune our connections to return them to optimum performance. As of 10:00 am Monday, network performance, redundancy, and stability had returned to 100%.
Thank you for your patience and support through this process. We hope your confidence in the capabilities and reliability of CSS remains intact as we re-establish the level of service so many of our customers have become accustomed to.
Regards,
Ray Poorman
President & CTO – CSS Integration & Communications