Running a 2400 Akka Nodes Cluster on Google Compute Engine
- November 20, 2013
We’re so excited that Google gave us 3000 cores to test a large Akka cluster on Google Compute Engine (GCE), which is an infrastructure-as-a-service that lets you run your large-scale computing workloads on Linux virtual machines hosted on Google's infrastructure. Google views Akka as a powerful tool for exploring the limits of GCE to run Akka/Java applications, so we were honored to participate with their performance tests.
Delightfully, we were able to operate a cluster of 2400 nodes. We are extremely satisfied with this result, which validates that Akka can be used for large clusters on GCE.
Our overall impression of Google Compute Engine is that everything just works, which is more than you normally expect from an infrastructure as a service. It is easy to operate and understand the tools and APIs. The stability is great, and the speed of starting new instances is outstanding, allowing us to SSH into them only 10 seconds after spawning them.
The test was performed by first starting 1500 instances in batches of 20 every 3 minutes. Each instance was hosting one JVM running a small test application (see below for details), which joined the cluster with the normal seed nodes process. In the tests 3 seed nodes were used. Akka Cluster is a decentralized peer-to-peer cluster that is using a gossip protocol to spread membership changes. Joining a node involves two consecutive state changes that are spread to all members in the cluster one after the other. We measured the time it took from initiating the join until all nodes have seen too many new nodes as a full member, i.e. the cluster event MemberUp was published on all nodes.
As can be seen in the above chart it typically takes 15 to 30 seconds to add 20 nodes. Note that this is the duration until the cluster has a consistent view of the new member, without any central coordination. In an earlier test we found that adding a single node takes around 15 seconds, it is therefore not substantially faster than the best case of adding nodes en bloc. The reason for this is that we optimized the gossip protocol for this case since bulk additions are a typical use case when starting up a cluster. You will see that taken to the max in another blog post.
The time to join increases with the size of the cluster, but not drastically; the theoretical expectation would be logarithmic behavior. For our implementation this holds only up to 500 nodes because we gradually reduce the bias of gossip peer selection when the cluster grows beyond that size.
Nodes were added slowly—stretching the process over a total period of four hours—to also verify the cluster’s stability over time with repeated membership changes. There was one problem, due to a known issue, when the cluster consisted of 940 nodes. You might see the gap in the chart above. One node had to be downed manually before continuing the test.
During periods without cluster membership changes the total network traffic of 1500 nodes amounted to 100 Mbits/s, which means around 8 kB/s aggregated average input and output on each node. The average CPU utilization across all nodes was around 10%.
We did not only want to see that we could form a large cluster, but also that it could detect and handle failures. We simulated a long garbage collection pause of one JVM with kill-STOP followed by kill-CONT. The failure was detected and the node was marked as unreachable, since it stopped sending out heartbeat messages, as expected. When it was resumed the monitoring peers marked it as reachable again. It took around 23 seconds from the stop command until all nodes had seen it back as reachable again. Repetition of this test 3 times confirmed similar results. This was done with a cluster of 1500 nodes.
We tried manual downing of one node in a 1500 nodes cluster with the cluster command line script; it took 42 seconds until all nodes had seen it being removed. Setting a node’s status to “down” that way is in essence what happens when the machine dies—it will stop responding and not come back. We also tried the other path, which is to tell a node to gracefully leave the cluster, as you would normally do for maintenance, and this took 61 seconds.
1500 nodes is a fantastic result, and to be honest far beyond our expectations. Our instance quota was 1500 instances, but why not continue and add more JVMs on the running instances until it breaks?
We had to add 900 more JVMs before it resisted to serve more. In this part of the test we added 100 nodes in each batch. During the later join steps several unreachable nodes were detected, but the cluster healed itself in the same way as during the STOP/CONT experiment mentioned above. This worked up to 2400 nodes, but after that it finally broke down: when adding more nodes we observed long garbage collection pauses and many nodes were marked as unreachable and did not come back. To our current knowledge this is not a hard limit, which means that we will eventually overcome it—as soon as someone can motivate why they need an even larger cluster.
Two notable tuning adjustments of the configuration were done compared to default values.
akka.remote.transport-failure-detector.heartbeat-interval = 30 s
Heartbeating of the remoting layer’s transport failure detector consumed network bandwidth unnecessarily, since TCP itself performs essentially the same duty. We reduced this overhead by configuring the heartbeat interval to 30 seconds. This is recommended for large (>200 nodes) clusters. It does not have any significant drawback as long as TCP is used as the transport protocol (which is currently the only supported option).
akka.cluster.failure-detector.acceptable-heartbeat-pause = 5 s akka.cluster.failure-detector.threshold = 12.0
This makes the cluster failure detector slightly more tolerant to variations in heartbeat inter-arrival times. Failures are still detected within 7 seconds.
The full configuration can be found in application.conf.
We used the Oracle JVM, Java version 1.7.0_40 with the following JVM settings:
-Xms1538M -Xmx1538M -XX:+UseParallelGC -XX:+UseCompressedOops -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=$LOG_DIR -XX:+PrintGCDetails -XX:+PrintGCTimeStamps -XX:+UnlockCommercialFeatures -XX:+FlightRecorder -XX:FlightRecorderOptions=defaultrecording=true
The GCE instances used in this test were of type n1-standard-2 in zone europe-west1-a. It is priced at $0.253/h, has 7.5 GB memory, and 5.5 GCEU. GCEU (Google Compute Engine Unit), pronounced GQ, is a unit of CPU capacity that is used to describe the compute power of the instance types. 2.75 GCEUs represent the minimum power of one logical core (a hardware hyper-thread) on the Sandy Bridge platform.
You can do a lot of interesting things with a cluster consisting of 3000 cores and 11 TB of memory.
The instances were installed with GCE Debian Linux (image: debian-7-wheezy-v20130816, kernel: gce-v20130813).
How to Run Akka on Google Compute Engine
The GCE command line tool, gcutil, is excellent for automating the startup of a cluster with ordinary bash scripts. In a follow-up blog post we will share our experience with operating GCE and give some advice on how to deploy Akka applications on GCE.
We would like to send a big Thank You to Google for making it possible to conduct these tests. It made Akka a better product, as we found several issues and areas that needed optimization. Google Compute Engine is an impressive infrastructure as a service, with the stability, performance and elasticity that you need for high-end Akka systems. Stay tuned for the next post about the tests!