MediaTemple meltdown

October 2, 2007

We’ve been using MediaTemple’s grid server for several months and until recently had been generally pleased with them. Our pages took roughly 0.5 to 1 second to load, which is not great, but not awful. Additionally, we hoped that by being on the grid server, we would be protected from spikes in traffic that could bring a single server down. Last Friday load times started getting slower (~10 seconds), then over the weekend they climbed to anywhere from 45 seconds to over a minute. I called MediaTemple and asked them what the issue was. The rep stated that we were probably just making too many database calls. I then directed him to a page we had set up that made zero database calls. The rep responded with a panicked “OK we’ll call you back!” and hung up. This was Monday morning. It’s now Tuesday night and they are still a mess. Our pages have been alternating between taking over a minute to load and serving up errors. That’s over 36 hours of a professional hosting service being totally jacked. Unbelievable. We’re now on M5 Hosting and won’t ever be on MediaTemple again.
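For anyone who wants to run the same kind of check (timing a page with zero database calls against a database-backed one), here is a minimal Python sketch. It is an illustration under stated assumptions, not a record of how the test above was actually done: the function names `time_fetch` and `summarize` are made up for this example, and it only measures how long the raw HTML takes to download, not a full browser page load.

```python
import time
import urllib.request
from statistics import median


def time_fetch(url, fetch=None, timeout=120.0):
    """Return the seconds taken to fetch `url`, or None if the fetch failed.

    `fetch` is an optional callable taking the URL (hypothetical hook, handy
    for testing); by default the page is downloaded with urllib.
    """
    if fetch is None:
        def fetch(u):
            with urllib.request.urlopen(u, timeout=timeout) as resp:
                resp.read()  # read the whole body so the timing covers it

    start = time.perf_counter()
    try:
        fetch(url)
    except Exception:
        # Timeouts, connection failures and HTTP errors all count as a failure.
        return None
    return time.perf_counter() - start


def summarize(samples):
    """Summarize a list of timings, where None marks a failed request."""
    ok = [s for s in samples if s is not None]
    return {
        "requests": len(samples),
        "errors": len(samples) - len(ok),
        "median_s": median(ok) if ok else None,
        "max_s": max(ok) if ok else None,
    }
```

To use it, call `time_fetch` a handful of times against each page (for example a static page and a database-backed one on the same host; both URLs are yours to supply) and pass each list of results to `summarize`. If the static page is just as slow as the dynamic one, the database is unlikely to be the bottleneck.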

If they truly have had a massive “spike in demand” recently, I wonder if some Facebook apps that were hosted on their grid took off over the weekend. These are the updates that they have been posting on their site:

Web and email latency on Grid-Service Cluster.2
Incident Tracker status: HIGH

Monitoring system updated, AccountCenter maintenance
Wednesday, October 3rd, 2007 at 1:26 pm
By internal metrics, Grid Cluster.2 is performing much better than before this incident began. No latency issues were detected today; given that we doubled the amount of RAM and increased the number of servers used as cluster nodes by 25 percent, this was not unexpected. Our teams will continue to re-distribute storage load across the new resources to further reduce I/O related latency. (mt) Engineers have come up with new ways to measure Grid performance and will be adding them to our monitoring systems over the next few days, increasing the likelihood that we will detect symptoms before they affect customers. We have also changed our growth projection formulas so they will better predict when we need to add hardware to the clusters, avoiding issues like this in the future. In the next few days we will be scheduling a maintenance window for the AccountCenter so we can eliminate the main cause of slow page loads in the customer interface. We are leaving this incident open for the next 24 hours while we continue to work on improving performance and monitoring for Cluster.2.

 

Additional tuning
Tuesday, October 2nd, 2007 at 4:40 pm
After mitigating most of today’s latency issues, our engineering teams are continuing to work on tuning Cluster.2. Areas where we’ve made changes include major hardware additions, firewall rules, load balancers, networking, service configuration, storage tweaks and AccountCenter speed enhancements. The symptoms primarily manifested as latency issues in web page load times, SMTP and FTP. Unsatisfactory performance was also reported in MySQL-enabled applications and AccountCenter management features. We consider the service level of Cluster.2 over the last two days to be unacceptable and are doing our utmost to correct the situation.

Performance Improvements
Tuesday, October 2nd, 2007 at 1:46 pm
After making several more tweaks to Grid Cluster.2, including firewall configuration changes and filesystem tuning, performance has improved dramatically. (mt) Engineers have seen vast improvement in basic PHP page load times compared to this morning. All other services including MySQL, SMTP and FTP should see corresponding latency decreases. All nodes have had RAM upgrades and are performing well; even so, the load across Cluster.2 is still higher than we’d like. Our teams are still hard at work on this issue, and we’ll keep updating our customers with our progress. Thank you for your patience.

More new hardware
Tuesday, October 2nd, 2007 at 11:55 am
In order to combat this lingering issue, (mt) system engineers have doubled the amount of available RAM in every Grid Cluster.2 node. Combined with the 25% increase in total nodes, we are seeing major performance gains for the cluster and latency times are plunging. We are still working furiously on this issue and will update this thread as soon as we have news.

Progress made, some latency returns
Tuesday, October 2nd, 2007 at 7:50 am
As of 6:30AM PDT we detected latency increasing across some nodes of Grid Cluster.2; services like FTP, SMTP and web pages (HTTP) are affected. We have engaged several teams of engineers, data center personnel and third party vendors to bring a resolution to these issues as soon as possible. Again, we thank you for your patience in this matter.

Latency times back to normal
Monday, October 1st, 2007 at 5:47 pm
(mt) Engineers made several changes throughout the day to improve performance of Grid Cluster.2. These changes include the addition of more available nodes, reconfiguring services and various networking tweaks. Our team is closely monitoring Cluster.2 and AccountCenter performance to ensure that the latency issues do not recur.

Continued work
Monday, October 1st, 2007 at 2:27 pm
We are still receiving reports of sporadic latency across Grid Cluster.2. Our engineers are currently working to eliminate any remaining issues that may be causing slow response times. We have also implemented several fixes to the AccountCenter that will help eliminate slow page loads. We will update this thread as soon as we have more information. Thank you for your patience in this matter.

Nodes added, services coming back online
Monday, October 1st, 2007 at 12:32 pm
(mt) Engineers have determined that the latency issue affecting Cluster.2 was due to unexpected growth causing a general lack of computational resources. To resolve the issue, (mt) data center personnel have added more machines to Cluster.2, increasing the number of available nodes by 25 percent. All services including FTP, Email and HTTP are coming back online. Recent demands for computational resources have jumped unexpectedly, which caused degraded performance of Cluster.2. Our engineers are re-evaluating the projected growth formula used to determine when Grid resources need to be added. We apologize for any inconvenience this may have caused.

Web and email latency on Grid-Service Cluster.2
Monday, October 1st, 2007 at 9:48 am
Some customers on Grid-Service Cluster.2 may be experiencing latency to web and email. There may also be some latency for all customers accessing the Account Center. (mt) Media Temple’s Systems Engineers are currently investigating and working to resolve the issue as quickly as possible. Thank you for your patience and understanding.
