MediaTemple meltdown
October 2, 2007 § 15 Comments
We’ve been using MediaTemple’s grid server for several months and until recently had been generally pleased with them. Our pages took roughly 0.5 to 1 second to load, which is not great, but not awful. Additionally, we hoped that by being on the grid server, we would be protected from spikes in traffic that could bring a single server down.
Last Friday load times started getting slower (~10 seconds), then over the weekend they climbed from 45 seconds to over a minute. I called MediaTemple and asked them what the issue was. The rep stated that we were probably just making too many database calls. I then directed him to a page we set up that had zero database calls. The rep responded with a panicked “OK we’ll call you back!” and hung up. This was Monday morning. It’s now Tuesday night and they are still a mess. Our pages are alternating between taking over a minute to load and serving up errors. That’s over 36 hours of a professional hosting service being totally jacked. Unbelievable. We’re now on M5 Hosting and won’t ever be on MediaTemple again.
If they truly have had a massive “spike in demand” recently, I wonder if some Facebook apps that were hosted on their grid took off over the weekend. These are the updates that they have been posting on their site:
Web and email latency on Grid-Service Cluster.2
Incident Tracker status: HIGH
Monitoring system updated, AccountCenter maintenance
Wednesday, October 3rd, 2007 at 1:26 pm
By internal metrics Grid Cluster.2 is performing much better than before this incident began. No latency issues were detected today; given that we doubled the amount of RAM and increased the number of servers used as cluster nodes by 25 percent, this was not unexpected. Our teams will continue to re-distribute storage load across the new resources to further reduce I/O related latency. (mt) Engineers have come up with new ways to measure Grid performance and will be adding them to our monitoring systems over the next few days, increasing the likelihood that we will detect symptoms before they affect customers. We have also changed our growth projection formulas so they will better predict when we need to add hardware to the clusters, avoiding issues like this in the future. In the next few days we will be scheduling a maintenance window for the AccountCenter so we can eliminate the main cause of slow page loads in the customer interface. We are leaving this incident open for the next 24 hours while we continue to work on improving performance and monitoring for Cluster.2.
Additional tuning
Tuesday, October 2nd, 2007 at 4:40 pm
After mitigating most of today’s latency issues, our engineering teams are continuing to work on tuning Cluster.2. Areas where we’ve made changes include major hardware additions, firewall rules, load balancers, networking, service configuration, storage tweaks and AccountCenter speed enhancements. The symptoms primarily manifested as latency issues in web page load times, SMTP and FTP. Unsatisfactory performance was also reported in MySQL enabled applications and AccountCenter management features. We consider the service level of Cluster.2 over the last two days to be unacceptable and are doing our utmost to correct the situation.
Performance Improvements
Tuesday, October 2nd, 2007 at 1:46 pm
After making several more tweaks to Grid Cluster.2, including firewall configuration changes and filesystem tuning, performance has improved dramatically. (mt) Engineers have seen vast improvement in basic PHP page load times compared to this morning. All other services including MySQL, SMTP and FTP should see corresponding latency decreases. All nodes have had RAM upgrades and are performing well; even so, the load across Cluster.2 is still higher than we’d like. Our teams are still hard at work on this issue, and we’ll keep updating our customers with our progress. Thank you for your patience.
More new hardware
Tuesday, October 2nd, 2007 at 11:55 am
In order to combat this lingering issue, (mt) system engineers have doubled the amount of available RAM in every Grid Cluster.2 node. Combined with the 25% increase in total nodes, we are seeing major performance gains for the cluster and latency times are plunging. We are still working furiously on this issue and will update this thread as soon as we have news.
Progress made, some latency returns
Tuesday, October 2nd, 2007 at 7:50 am
As of 6:30AM PDT we detected latency increasing across some nodes of Grid Cluster.2; services like FTP, SMTP and web pages (HTTP) are affected. We have engaged several teams of engineers, data center personnel and third party vendors to bring a resolution to these issues as soon as possible. Again, we thank you for your patience in this matter.
Latency times back to normal
Monday, October 1st, 2007 at 5:47 pm
(mt) Engineers made several changes throughout the day to improve performance of Grid Cluster.2. These changes include the addition of more available nodes, reconfiguring services and various networking tweaks. Our team is closely monitoring Cluster.2 and AccountCenter performance to ensure that the latency issues do not recur.
Continued work
Monday, October 1st, 2007 at 2:27 pm
We are still receiving reports of sporadic latency across Grid Cluster.2. Our engineers are currently working to eliminate any remaining issues that may be causing slow response times. We have also implemented several fixes to the AccountCenter that will help eliminate slow page loads. We will update this thread as soon as we have more information. Thank you for your patience in this matter.
Nodes added, services coming back online
Monday, October 1st, 2007 at 12:32 pm
(mt) Engineers have determined that the latency issue affecting Cluster.2 was due to unexpected growth causing a general lack of computational resources. To resolve the issue, (mt) data center personnel have added more machines to Cluster.2, increasing the number of available nodes by 25 percent. All services including FTP, Email and HTTP are coming back online. Recent demands for computational resources have jumped unexpectedly, which caused degraded performance of Cluster.2. Our engineers are re-evaluating the projected growth formula used to determine when Grid resources need to be added. We apologize for any inconvenience this may have caused.
Web and email latency on Grid-Service Cluster 2
Monday, October 1st, 2007 at 9:48 am
Some customers on Grid-Service Cluster.2 may be experiencing latency to web and email. There may also be some latency for all customers accessing the Account Center. (mt) Media Temple’s Systems Engineers are currently investigating and working to resolve the issue as quickly as possible. Thank you for your patience and understanding.
Rob,
Cheers, and many apologies for any inconvenience. I would elaborate on the issues, but we’ve pretty much documented them as openly as possible on our site.
More or less this was a complete anomaly and clearly this level of performance is not up to our standards. Feel free to contact me with your account info and any other questions you may have.
Best,
Jason Mcvearry
jason@mediatemple.net
On another note, several clients on the Grid have frequently been dugg and linked on Reddit at the same time and their sites did not go down (or hiccup). The Grid works, but as with any new technology, upgrades and unusual usage patterns can create issues. Thankfully we’ve got a staff of engineers unafraid to miss a few days of sleep.
Cheers again,
Jason
[…] Looks like I am not the only one pondering a switch away from MediaTemple. Technorati tags: mediatemple — Related Posts […]
You should use a VPS, just like dedicated but cheaper and more flexible. Try Slicehost.
Hmm, yes, I called them and they told me they would solve it in one hour. Then one day passed..
The “maintenance” was supposed to be 1.5 hours, then they changed it to 6 and now “they don’t know”. I am seriously pissed at these guys. I wasted over a hundred dollars in PPC clicks going to a site that isn’t there.
This massive outage is unacceptable. I am definitely moving my servers elsewhere!
This was really a deplorable performance by mediatemple – they already have really bad customer service and this brings into question their technical competence too:
http://pakistaniat.com/2007/12/01/atps-disappearance-no-we-were-not-blocked-or-hacked-not-yet/
[…] with that innovation came some hiccups. Intermittently, over the past year, the grid simply hasn’t been able to handle the loads placed upon it by all the new users. The major problems have been latency with database calls and page load times […]
I suppose this is a no-brainer, obvious point, but, if you are running a web site you care about, you should have a way to monitor it remotely and have that remote site send you email when there is a problem.
E.g. send email if the site is down, or if the page load for a set page takes longer than, say, 2 seconds.
What is great is that as you add more tests and variables to check (I use argus.tcp4me.com as the code that monitors various sites) you can quickly use it as a “dashboard” to figure out what might be the issue.
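The check described above (alert if the site is down, or if a set page takes longer than about 2 seconds) can be sketched roughly as follows. This is only a minimal illustration using the Python standard library, not how argus or any particular monitoring tool works. The names `classify`, `fetch`, `check_site` and `SLOW_THRESHOLD_SECONDS` are made up for this example, and the `alert` callback is a stand-in for whatever actually sends the email.

```python
import time
import urllib.request

# Assumption: the "say 2 seconds" figure from the comment above.
SLOW_THRESHOLD_SECONDS = 2.0


def classify(reachable, elapsed_seconds, threshold=SLOW_THRESHOLD_SECONDS):
    """Return 'down', 'slow', or 'ok' for a single probe result."""
    if not reachable:
        return "down"
    if elapsed_seconds > threshold:
        return "slow"
    return "ok"


def fetch(url, timeout=30.0):
    """Load the page once and return (reachable, elapsed_seconds)."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            response.read()  # include the body download in the timing
        reachable = True
    except OSError:
        # URLError, HTTPError (4xx/5xx) and socket timeouts are all OSError.
        reachable = False
    return reachable, time.monotonic() - start


def check_site(url, fetcher=fetch, alert=print):
    """Probe url and call alert(message) unless the result is 'ok'."""
    reachable, elapsed = fetcher(url)
    state = classify(reachable, elapsed)
    if state != "ok":
        alert(f"{url} is {state} (took {elapsed:.1f}s)")
    return state
```

Run from a machine outside the hosting provider (for example from cron every few minutes), `check_site("http://example.com/")` would print a warning whenever the page is down or slow. Swapping `alert` for a function that sends mail, for instance via the standard `smtplib` module, would give the email behaviour the comment describes.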
Issues with Media Temple’s grid server are still very much ongoing:
http://www.jimgoings.com/2008/04/media-temple-kills-my-inner-child/
I’ve been with Media Temple on the grid service for about 8 months and it’s been fairly unstable the entire time. My site went down 3 different times on Saturday for example. I run an external monitor to ensure availability and unfortunately, I’m only seeing about 98% uptime right now.
The worst part is that for about 10% of the time, the site loads very slowly. I wrote more with some details on my blog:
http://www.jimgoings.com/2008/04/media-temple-kills-my-inner-child/
[…] Hosting Service: M5 Hosting Previous mention here. These guys were referred to us by a friend and they have done a great job so far. Stay the hell […]
Well here we are in May 2009, and the Grid server at MediaTemple has been down for hours tonight. Thousands of sites are down.
Congratulations on your Epic FAIL (mt)
How many customers can you lose in one day? You’re about to find out.
My website, which is hosted at Media Temple, is now down for the second day!
This is the worst downtime I have seen since I first bought a computer or heard about the internet!
Yeah, I did a very similar write up on the sadness that is cluster 5. http://bit.ly/9Su6V3