Hello from MityLytics!
We have some information from the dark recesses of AWS and High Performance Big Data clusters that we’d like to share with you.
Having been EC2 and AWS cluster users ourselves for a while, we have come to love the convenience of the Big Data set-up, whether through EMR or an EC2 cluster, notwithstanding the annual outages 😉 What we don’t love so much about AWS is the pricing, the performance of the high-end instances, and the overall diminishing returns with higher-end clusters, and this goes not just for AWS but for other cloud operators as well.
An example: all the 10-Gigabit instances at AWS deliver only 4 Gb/s of storage bandwidth, and about the same peak network bandwidth within the cluster. So, essentially, for 2.5 times the cost of a 1 Gb/s instance, the peak sustained performance is 4 Gb/s, with a minimum latency of 200 microseconds, a maximum latency of 500 microseconds, and 90% of packets experiencing a latency of 400 microseconds. Yes, there are more vCPUs, and either increased memory or increased local storage, but the effects of limited network throughput and latency add up due to the scale of the distributed transactions.

What does this mean? Data-driven distributed systems, be it Hadoop MapReduce, Spark or Hive, become network-bound in this environment, so they will not see linear speedup with more vCPUs, with more memory for Spark, or with increased SSD capacity for MapReduce. In the end, spending $120,000 per year on a 4-node high-performance cluster with multiple terabytes of storage, terabytes of RAM and tens of vCPUs is not very attractive if there is no linear speedup. What makes it worse is that if additional capacity is required on-demand because your dataset grows, the cost per instance almost doubles, since the prices referred to above are reserved-instance rates. So, in essence, if you needed to increase your capacity by 25% over the year, you would end up paying 50% more, or closer to $180,000/year.
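The cost arithmetic above can be sketched in a few lines. This is a back-of-the-envelope model, not an AWS price sheet: the $120,000/year base figure and the "on-demand is roughly 2× reserved" multiplier are the assumptions stated in this post.

```python
# Back-of-the-envelope cluster cost model.
# Assumptions (taken from this post, not from AWS pricing pages):
#   - reserved 4-node cluster costs $120,000/year
#   - on-demand capacity costs roughly 2x the reserved rate

RESERVED_BASE = 120_000        # $/year for the reserved 4-node cluster
ON_DEMAND_MULTIPLIER = 2.0     # on-demand ~= 2x reserved, per the post

def annual_cost(extra_capacity_fraction: float) -> float:
    """Annual cost when extra capacity is added on-demand.

    extra_capacity_fraction: fraction of the base cluster added
    on-demand (e.g. 0.25 for a 25% capacity increase).
    """
    extra = RESERVED_BASE * extra_capacity_fraction * ON_DEMAND_MULTIPLIER
    return RESERVED_BASE + extra

print(annual_cost(0.25))  # 25% more capacity -> 180000.0, i.e. 50% more cost
```

Because the on-demand premium doubles the marginal rate, every 1% of extra capacity adds about 2% to the bill, which is how a 25% growth in dataset size turns into a 50% jump in spend.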
So there are a couple of options, each with associated trade-offs:
- Move your big data cluster in-house, or lease space in a colo with your choice of hardware and software, and have somebody manage it. What is the cost here? I would say more than AWS for a 4-node cluster, given the operational and admin costs, but less than AWS for a 16-node cluster. This, however, leads to increased dependence on the colo operator when it comes to scaling and troubleshooting infrastructure in a unified, end-to-end manner, and you will have to hire more folks to run your infrastructure.
- Move to a different provider – SoftLayer, GCE, Azure, GoGrid, etc. – but I would think AWS has better reliability and is much more mature.
- Tune and optimize your AWS deployment.
There’s no perfect solution for everyone, so here comes the shameless pitch 🙂
Engage with MityLytics – here’s what we can do for you:
- We can tune and optimize your cluster to realize higher analytics and storage throughput, whether it is in AWS, at any other cloud infrastructure provider, in-house, or at a third-party colo. We will suggest the most appropriate solutions for your workloads and the size of your platform.
- We will help you lower your costs by right-sizing your clusters while maintaining application SLAs.
- We can help with capacity planning by running simulated workloads that model your future growth and identify any potential choke points along the way as you scale up.
- We will provide workarounds for the performance-limiting parts of Big Data platform codebases as they interact with your infrastructure.
- We will help you monitor, troubleshoot and debug your clusters end-to-end, so that your IT team can instantly pinpoint problems in infrastructure or managed-service components, lowering operating costs.