Angle of Illumination: Stop the World

The machine
I have an 8 CPU VM server with 32GB RAM running Solr. My JVM is 1.6.0_37 with the following JVM settings:

-Xms28g
-Xmx28g
-XX:NewSize=6g
-XX:MaxNewSize=6g
-XX:SurvivorRatio=4
-XX:PermSize=512m
-XX:MaxPermSize=512m
-XX:SoftRefLRUPolicyMSPerMB=500
-XX:+PrintCommandLineFlags
-XX:+HeapDumpOnOutOfMemoryError
-XX:+DumpGCHistoryOnOutOfMemory
-XX:+DumpDetailedClassStatisticOnOutOfMemory
-XX:HeapDumpPath=/opt/alfresco/tomcat/dumps
-verbose:gc
-Xloggc:/opt/alfresco/tomcat/dumps/gc-logs/gc-2014-03-13-10-00-07.log
-XX:+GCHistory
-XX:+CMSClassUnloadingEnabled
-XX:+DisableExplicitGC
-XX:+PrintGCDateStamps
-XX:+PrintGCDetails
-XX:+PrintTenuringDistribution
-XX:+UseCompressedOops
-XX:+UseConcMarkSweepGC
-XX:+UseParNewGC

The reason for such huge heap is that the Solr data are about 130 GB and Sorl is heavily utilized from around 100 concurrent threads performing text search on documents.
I notice that Sorl application pauses for some time without responding. I discovered on the GC logs the following problem:

2014-03-17T14:29:23.438+0100: 7991.661: [GC2014-03-17T14:29:23.438+0100: 7991.661: [ParNew (promotion failed)
Desired survivor size 268435456 bytes, new threshold 15 (max 15)
- age   1:  125233576 bytes,  125233576 total
: 2621440K->2228808K(2621440K), 5.9399450 secs]2014-03-17T14:29:29.378+0100: 7997.601: [CMS2014-03-17T14:29:32.715+0100: 8000.938: [CMS-concurrent-sweep: 21.573/33.920 secs] [Times: user=118.80 sys=3.29, real=33.91 secs]
 (concurrent mode failure): 20465547K->11174034K(26214400K), 33.5774490 secs] 22674213K->11174034K(28835840K), [CMS Perm : 47873K->47648K(524288K)], 39.5176760 secs] [Times: user=46.19 sys=2.36, real=39.51 secs]

This issue is summarized in the official ORACLE documentation for JVM v6 as follows:

..a concurrent collection needs to be started at a time such that the collection can finish before the tenured generation becomes full; otherwise the application would observe longer pauses due to concurrent mode failure. There are several ways a concurrent collection can be started.

See: Concurrent Mode Failure

The message "concurrent mode failure" signifies that the concurrent collection of the tenured generation did not finish before the tenured generation became full. In other words, the new generation is filling up too fast, it is overflowing to tenured generation but the CMS could not clear out the tenured generation in the background. When a concurrent mode failure happens, the low pause collector does a stop-the-world (STW) collection. All the application threads are stopped, a different algorithm is used to collect the tenured generation (our particular flavor of a mark-sweep-compact), the applications threads are started again, and life goes on....

Seems that a concurrent mode failure is responsible for a "Stop the World" JVM pausing.
See also another wonderful Blog about the same issue here :

In order to treat problem we tune the following JVM flags:

-XX:CMSInitiatingOccupancyFraction=10

Indicating that a concurrent collection will start if the occupancy of the tenured generation exceeds 10% instead of 92% that is the default threshold.

-XX:CMSIncrementalSafetyFactor=100

Indicating to the JVM GC to start a concurrent collection at the next opportunity without any delay.
See also Oracles GC tunning instructions here

Labels

Friday, 14 March 2014

Concurrent mode failure: Tuning JVM GC for Solr