Not a Great Week

Staff Blogs

Somehow the “bad week” of last week has extended into a 10-day marathon of problems. Hopefully this morning’s DoS attack will bring to a close the extraordinary run of unrelated and unforeseen problems that we’ve gone through. In a nutshell, we’ve seen equipment failures at our data center, fiber outages, routing problems, and service attacks, with all the normal day-to-day problems thrown in for good measure.

While it may appear that we “point fingers” at our data center partners, they have been outstanding in managing issues that at first glance appear to be completely unmanageable. See their letter to customers for more on what went on last week, and just a bit of what they did to resolve it. We have seen shipments of additional equipment, revised emergency plans, additional routing, and plans for a better information and alert system for everyone using the center.

We realize that all the explanations in the world don’t keep your business running. We’re sorry to see a couple of our customers go, and we understand their decisions. We honestly feel that we are seeing real results, and will continue to see improvements to the systems we rely on to bring you the service we provide.

I’ve had more than a couple of people ask why we don’t just move. It took 2 months of planning and a month and a half of execution to get the servers into the facility without noticeable downtime. To be honest, by the time we had planned the move, we’d be leaving one of the most prepared centers around: once they’ve fixed all the problems and put solutions, upgrades, and backups to the backups in place, things tend to run smoothly for a long time.

The Smile team hasn’t been as available for support as we normally are, and we apologize. The friendly emails and encouragement are particularly helpful when the chips are down. Thank you to those who sent them along. We take your service very seriously, and many of us take it very personally. We have all been working with our providers to make sure problems are resolved as quickly as possible.

If you have any questions, please send them to us at our support or administration email addresses.

– Scott

Denial of Service Attack

Smile News

Early this morning, we were alerted to some slowing on our network. We were able to access all of the servers, but access from some of our testing sites was slow. We contacted our data center and began working with them to troubleshoot the issue. We found that a link in Seattle appeared to be where the bottleneck was and contacted that provider. They looked into it and told us that they would have it fixed shortly. As time wore on, we found it was a Denial of Service attack on some of the routers that our traffic travels through in Seattle.

The data center routed us around the issue, and Smile service returned to normal. The reroute came later than we would have liked because of the provider’s assurances that the problem was about to be resolved.

Later in the day, we received this notice from the upstream provider:



At 05:00PDT a DDoS with a high volume of small UDP packets targeted at one
customer’s host began ramping up. By 05:30 the attack was in full force
but impact to our network lagged behind as our normal daily traffic cycle
began its increase.

The attack caused packet buffer overflows on our interfaces (router
interfaces were throwing away good and bad packets). At 08:30 we applied
filters on our border which helped stabilize our core and decreased the
impact of the attack but the interfaces to our transit providers and
peerings were still discarding packets. Customers would have seen latency
and loss on many of our connections to transit providers and peers.

At 10:15PDT we contacted all 5 of our providers and had them null route
the target network thus keeping traffic from reaching our border routers.

Traffic has returned to normal levels and balanced as we’d expect.

This was not the result of a failure within our network but rather a
resource starvation issue on our interfaces due to the overwhelming
number of small packets in this distributed denial of service attack.

Thank you for your patience as we worked to isolate and neutralize the
impact to your service.

Tuesday Evening

Smile News

This evening we brought up a new DNS server outside our Bellingham network, which will allow Smile to continue serving DNS if there is a problem at the data center. We’re working on additional redundancies.

We had a few minutes of downtime while we attempted to move to a faster circuit. There was a problem with the cutover, and we moved back. Our engineers are working with the data center to figure out exactly what is failing on the connection.

We are also doing some fiber work this evening, but it should have no impact on Smile Service.

We’re happy to answer questions at www.smileglobal.com/support

Thanks

Scheduled Maintenance

Server Status
Smile has scheduled a maintenance period for Tuesday, June 5, from 11:00 pm to 1:00 am Pacific Time US (-0700). We will be moving back to our primary internet circuit. We anticipate less than 15 minutes of downtime during this period.

Memorial Day Weekend Issues

Smile News

Smile Internet has prided itself on its high level of uptime over the years. We have worked with our data center partners to provide you with high-quality Internet services at affordable prices.

Our data center is a telco-grade facility, with full battery backup and a natural gas generator, redundant Internet connections, FM200 fire suppression, and a wealth of other technologies that keep things humming along 24 hours a day. Over the past week, our data center has experienced a series of problems, all of which were unrelated and unforeseen. While they take precautions to plan for any type of outage, the problems that occurred this week were serious enough to take Smile and a number of other customers offline for significant periods of time. Below is a letter from the data center:


Over the past week our Bellingham customers have seen many issues with our stability and performance. Our redundancies and staff were tested to the limits. I hope to explain this week’s events so that you will understand that these were not normal conditions.

CSS’s Bellingham data center is connected via a redundant loop of fiber to Seattle, and redundant upstream connectivity through both Bellingham and Seattle.

On Friday of Memorial Day weekend one of our upstream providers had a hardware failure causing routing loops and slowdowns in our Seattle upstream connection. We were able to divert all traffic through our Bellingham data center and correct this performance issue with minimal customer impact.

On Memorial Day, our Gigabit redundant link failed with no customer impact. Equipment was replaced and redundancy was returned.

On Wednesday at approximately 4:15 pm a high density distribution switch had a processor failure (Supervisor III module) and became completely unresponsive, disabling a large majority of our fiber connectivity. Ironically, an upgrade processor and a spare had been ordered the day before and were on their way to our facilities, with arrival on Thursday. Using available resources we were able to return most customers to operation by 7:00 pm, and picked up additional equipment in Seattle to repair the final few customers. All customers were returned to operation at approximately 1:00 am. Diverting around this failure required multiple standby devices to be placed in operation, flattening the VLAN and dropping our redundant link to prevent routing loops.

On Thursday morning at 9:30 the additional switching devices overloaded the circuit breaker and these devices lost power. The load was split across multiple power sources and power was restored. As the switches returned to operation, the combination of multiple vendors’ products and customer devices started a packet storm, with random outages and stability issues. At 11:30 am our replacement processor arrived and customers were returned to the primary equipment. At that time we opted not to reconnect our backup link, and instead scheduled this for a maintenance window so that we could perform additional changes and prevent further circuit drops or routing loops. This was scheduled for Sunday at 11:00 pm.

On Sunday at 2:30 pm our primary link carrier had a failure between Bellingham and Seattle. The carrier had multiple circuits go down with no alarms or port notifications, which prevented our backup Internet link from taking over full connectivity, and with our redundant link down, traffic flow was encumbered again. Technicians from CSS and the carrier were on-site in 20 minutes and diagnostics began. Once the issue was identified we rebuilt the redundant link and re-established routing. We remained on-site and worked with our carrier to reestablish our primary link, testing and verifying connectivity and re-organizing to increase stability. We were unable to test performance under load, but non-loaded performance was at expected speeds.

On Monday morning CSS monitored performance and noticed issues with our primary link. Working with our carriers and transitioning between our backup and primary links, we continued to tune our connections to return to optimum performance. As of 10:00 am Monday, network performance, redundancy and stability have returned to 100%.

Thank you for your patience and support through this process. We hope your confidence in the capabilities and reliability of CSS remains intact as we reestablish the level of service to which so many of our customers have become accustomed.

Regards,
Ray Poorman
President & CTO – CSS Integration & Communications



Smile Internet Networks