Memorial Day Weekend Issues

June 1, 2007 in Smile News by admin

Smile Internet has prided itself on its high level of uptime over the years. We have worked with our data center partners to provide you with high-quality Internet services at affordable prices.

Our data center is a telco-grade facility, with full battery backup and a natural gas generator, redundant Internet connections, FM200 fire suppression, and a wealth of other technologies that keep things humming along 24 hours a day. Over the past week, our data center has experienced a series of problems, all of which were unrelated and unforeseen. While the data center takes precautions to plan for any type of outage, the problems that occurred this week were significant enough to take Smile and a number of other customers offline for significant periods of time. Below is a letter from the data center:


Over the past week our Bellingham customers have seen many issues with our stability and performance. Our redundancies and staff were tested to the limits. I hope to explain this week's events so that you will understand these were not normal conditions.

CSS’s Bellingham data center is connected via a redundant loop of fiber to Seattle, and has redundant upstream connectivity through both Bellingham and Seattle.

On Friday of Memorial Day weekend, one of our upstream providers had a hardware failure, causing routing loops and slowdowns in our Seattle upstream connection. We were able to divert all traffic through our Bellingham data center and correct this performance issue with minimal customer impact.

On Memorial Day, our Gigabit redundant link failed with no customer impact. Equipment was replaced and redundancy was returned.

On Wednesday at approximately 4:15 pm, a high-density distribution switch had a processor failure (Supervisor III module) and became completely unresponsive, disabling a large majority of our fiber connectivity. Ironically, an upgraded processor and a spare had been ordered the day before and were on their way to our facility, due to arrive on Thursday. Using available resources, we were able to return most customers to operation by 7:00 pm, and picked up additional equipment in Seattle to repair the final few customers. All customers were returned to operation at approximately 1:00 am. Diverting around this failure required placing multiple standby devices in operation, flattening the VLAN, and dropping our redundant link to prevent routing loops.

On Thursday morning at 9:30, the additional switching devices overloaded the circuit breaker and power to these devices failed. Power was split across multiple sources and restored. As the switches returned to operation, the combination of multiple vendor products and customer devices started a packet storm, with random outages and stability issues. At 11:30 am our replacement processor arrived and customers were returned to the primary equipment. At that time we opted not to reconnect our backup link, and scheduled this for a maintenance window in order to perform additional changes and prevent further circuit drops or routing loops. This was scheduled for Sunday at 11:00 pm.

On Sunday at 2:30 pm, our primary link carrier had a failure between Bellingham and Seattle. The carrier had multiple circuits go down with no alarms or port notifications, which prevented our backup Internet link from taking over full connectivity, and with our redundant link down, traffic flow was encumbered again. Technicians from CSS and the carrier were on-site in 20 minutes and diagnostics began. Once the issue was identified, we rebuilt the redundant link and re-established routing. We remained on-site and worked with our carrier to re-establish our primary link, testing and verifying connectivity and re-organizing to increase stability. We were unable to test performance under load, but unloaded performance was at expected speeds.

On Monday morning, CSS monitored performance and noticed issues with our primary link. Working with our carriers and transitioning between our backup and primary links, we continued to tune our connections to return to optimum performance. As of 10:00 am Monday, network performance, redundancy, and stability have returned to 100%.

Thank you for your patience and support through this process. We hope your confidence in the capabilities and reliability of CSS remains intact as we re-establish the level of service to which so many of our customers have become accustomed.

Regards,
Ray Poorman
President & CTO – CSS Integration & Communications

