Yesterday one of Smile's web servers slowed down considerably and was occasionally unable to answer at all. We started troubleshooting and found the problem was a WordPress blog that appeared to be under attack from a site in China. We disabled the site and brought the server back to 100%, eventually finding that some comment spam was causing the blog to try to contact itself every second or two. That comment spam has been deleted and everything appears to be back to normal.
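For anyone wrestling with the same kind of comment spam on their own blog, the cleanup itself is simple once the spam has been flagged. Here's a rough sketch, not the exact procedure we used, assuming direct database access with pymysql, the default wp_ table prefix, and placeholder credentials:

```python
# Rough sketch: purge flagged spam comments straight from a WordPress database.
# Assumes the default "wp_" table prefix; host, user, and password are placeholders.
import pymysql

conn = pymysql.connect(host="localhost", user="wp_user",
                       password="secret", database="wordpress")
try:
    with conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM wp_comments WHERE comment_approved = 'spam'")
        print("spam comments found:", cur.fetchone()[0])
        cur.execute("DELETE FROM wp_comments WHERE comment_approved = 'spam'")
    conn.commit()
finally:
    conn.close()
```

Back up the database before running anything like this; a DELETE of that sort is not reversible.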
Not a Great Week
Somehow the “bad week” of last week has extended into a 10-day marathon of problems. Hopefully the DoS attack of this morning will close out an extraordinary run of unrelated and unforeseen problems. In a nutshell, we've seen equipment failures at our data center, fiber outages, routing problems, service attacks, and all the normal day-to-day problems thrown in for good measure.
While it may appear that we “point fingers” at our data center partners, they have been outstanding in managing issues that at first glance appear to be completely unmanageable. See their letter to customers for more on what went on last week, and just a bit of what they did to resolve it. We have seen shipments of additional equipment, revised emergency plans, additional routing, and plans for a better information and alert system for everyone using the center.
We realize that all the explanations in the world don’t keep your business running. We’re sorry to see a couple of our customers go, and we understand their decisions. We honestly feel that we are seeing real results, and will continue to see improvements to the systems we rely on to bring you the service we provide.
I've had more than a couple of people ask why we don't just move. It took two months of planning and a month and a half of execution to get the servers into this facility without noticeable downtime. And to be honest, by the time we finished planning another move, we'd be leaving one of the most prepared centers around: once a center has fixed all its problems and put solutions, upgrades, and backups to the backups in place, things tend to stay smooth for a long time.
The Smile team hasn't been as available for support as we normally are, and we apologize. Friendly emails and encouragement are particularly helpful when the chips are down; thank you to those who sent them along. We take your service very seriously, and many of us take it very personally. We have all been working with our providers to make sure problems are resolved as quickly as possible.
If you have any questions, please send them to us at our support or administration email addresses.
— Scott
Denial of Service Attack
Early this morning, we were alerted to some slowing on our network. We were able to access all of the servers, but access from some of our testing sites was slow. We contacted our data center and began working with them to troubleshoot the issue. We found that the bottleneck appeared to be a link in Seattle and contacted that provider. They looked into it and told us it would be fixed shortly. As time wore on, we learned it was a Denial of Service attack on some of the routers our traffic travels through in Seattle.
The data center eventually routed us around the issue, later than we would have liked because of assurances that it was about to be resolved, and Smile service returned to normal.
Later in the day, we received this notice from the upstream provider:
At 05:00 PDT a DDoS with a high volume of small UDP packets targeted at one customer's host began ramping up. By 05:30 the attack was in full force, but the impact to our network lagged behind as our normal daily traffic cycle began its increase.

The attack caused packet buffer overflows on our interfaces (router interfaces were throwing away good and bad packets). At 08:30 we applied filters on our border, which helped stabilize our core and decreased the impact of the attack, but the interfaces to our transit providers and peerings were still discarding packets. Customers would have seen latency and loss on many of our connections to transit providers and peers.

At 10:15 PDT we contacted all 5 of our providers and had them null route the target network, thus keeping traffic from reaching our border routers. Traffic has returned to normal levels and is balanced as we'd expect.

This was not the result of a failure within our network but rather a resource starvation issue on our interfaces due to the overwhelming number of small packets in this distributed denial of service attack.

Thank you for your patience as we worked to isolate and neutralize the impact to your service.
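The pattern the provider describes, a flood of small UDP packets aimed at a single host, is the kind of thing that shows up clearly in a packet capture. As a rough illustration only, and not anything Smile or the provider actually ran, here is a sketch using Python with scapy and a made-up capture file:

```python
# Rough sketch: tally small UDP packets by destination in a saved capture.
# Requires scapy; "flood.pcap" and the 100-byte cutoff are arbitrary examples.
from collections import Counter
from scapy.all import rdpcap, IP, UDP

packets = rdpcap("flood.pcap")  # hypothetical capture file
small_udp = Counter()
for p in packets:
    if IP in p and UDP in p and len(p) < 100:  # count only "small" UDP packets
        small_udp[p[IP].dst] += 1

# One destination towering over the rest is the usual signature of the
# kind of attack described in the notice above.
for dst, count in small_udp.most_common(5):
    print(dst, count)
```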
Tuesday Evening
This evening we brought up a new DNS server outside our Bellingham network, which will allow Smile to continue serving DNS if there is a problem at the data center. We’re working on additional redundancies.
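As a quick illustration of what that buys us: you can query the new server directly from anywhere on the Internet, and it should answer even if the data center is unreachable. A minimal sketch with dnspython; the zone and server address below are placeholders rather than our actual setup:

```python
# Rough sketch: query an off-site DNS server directly for a zone's SOA record.
# Requires dnspython; the zone and server IP below are placeholders.
import dns.message
import dns.query

query = dns.message.make_query("example.com", "SOA")
response = dns.query.udp(query, "203.0.113.53", timeout=5)
print(response.answer)
```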
We had a few minutes of downtime while we attempted to move to a faster circuit. There was a problem with the cutover, and we moved back. Our engineers are working with the data center to figure out exactly what is failing on the connection.
We are also doing some fiber work this evening, but it should have no impact on Smile service.
We’re happy to answer questions at www.smileglobal.com/support
Thanks