我如何调试为什么在GC期间停止线程这么长时间？ [重复]-Java 学习之路

这个问题在这里已有答案：

How do I get G1 to print more log details? 2个答案

我正在运行一个Cassandra实例

$ java -version
java version "1.8.0_121"
Java(TM) SE Runtime Environment (build 1.8.0_121-b34)
Java HotSpot(TM) 64-Bit Server VM (build 25.121-b34, mixed mode)

以及以下垃圾收集选项

-Xms20000m
-Xmx20000m
-Xmn10000m
-Xss256k
-XX:+UseG1GC
-XX:MaxGCPauseMillis=1000
-XX:GCTimeRatio=97
-XX:MaxTenuringThreshold=2
-XX:+UseTLAB
-XX:+ResizeTLAB
-XX:+PerfDisableSharedMem
-XX:+PrintGCDetails
-XX:+PrintGCDateStamps
-XX:+PrintHeapAtGC
-XX:+PrintGCApplicationStoppedTime
-XX:+PrintAdaptiveSizePolicy
-Xloggc:/dev/shm/cassandra-gc.log

在垃圾收集周期中，JVM记录以下内容

2017-08-15T17:09:08.506+0000: 71712.670: [GC pause (G1 Evacuation Pause) (young)
Desired survivor size 658505728 bytes, new threshold 2 (max 2)
- age   1:   26818496 bytes,   26818496 total
- age   2:   14940744 bytes,   41759240 total
 71712.670: [G1Ergonomics (CSet Construction) start choosing CSet, _pending_cards: 222556, predicted base time: 50.67 ms, remaining time: 949.33 ms, target pause time: 1000.00 ms]
 71712.670: [G1Ergonomics (CSet Construction) add young regions to CSet, eden: 1242 regions, survivors: 8 regions, predicted young region time: 12.00 ms]
 71712.670: [G1Ergonomics (CSet Construction) finish choosing CSet, eden: 1242 regions, survivors: 8 regions, old: 0 regions, predicted pause time: 62.68 ms, target pause time: 1000.00 ms]
 71712.736: [G1Ergonomics (Heap Sizing) attempt heap expansion, reason: recent GC overhead higher than threshold after GC, recent GC overhead: 1.39 %, threshold: 1.02 %, uncommitted: 0 bytes, calculated expansion amount: 0 bytes (20.00 %)]
, 0.0662411 secs]
   [Parallel Time: 57.5 ms, GC Workers: 33]
      [GC Worker Start (ms): Min: 71712670.9, Avg: 71712671.2, Max: 71712671.4, Diff: 0.4]
      [Ext Root Scanning (ms): Min: 6.4, Avg: 7.0, Max: 10.2, Diff: 3.8, Sum: 232.3]
      [Update RS (ms): Min: 34.0, Avg: 37.0, Max: 38.9, Diff: 4.9, Sum: 1221.2]
         [Processed Buffers: Min: 27, Avg: 48.5, Max: 66, Diff: 39, Sum: 1599]
      [Scan RS (ms): Min: 0.0, Avg: 0.2, Max: 0.5, Diff: 0.4, Sum: 5.1]
      [Code Root Scanning (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.1]
      [Object Copy (ms): Min: 11.3, Avg: 12.6, Max: 12.9, Diff: 1.7, Sum: 415.0]
      [Termination (ms): Min: 0.0, Avg: 0.0, Max: 0.0, Diff: 0.0, Sum: 0.4]
         [Termination Attempts: Min: 1, Avg: 2.1, Max: 4, Diff: 3, Sum: 68]
      [GC Worker Other (ms): Min: 0.0, Avg: 0.1, Max: 0.3, Diff: 0.3, Sum: 4.5]
      [GC Worker Total (ms): Min: 56.7, Avg: 56.9, Max: 57.3, Diff: 0.7, Sum: 1878.5]
      [GC Worker End (ms): Min: 71712728.0, Avg: 71712728.1, Max: 71712728.3, Diff: 0.3]
   [Code Root Fixup: 0.5 ms]
   [Code Root Purge: 0.0 ms]
   [Clear CT: 1.1 ms]
   [Other: 7.1 ms]
      [Choose CSet: 0.0 ms]
      [Ref Proc: 2.7 ms]
      [Ref Enq: 0.0 ms]
      [Redirty Cards: 0.8 ms]
      [Humongous Register: 0.1 ms]
      [Humongous Reclaim: 0.0 ms]
      [Free CSet: 1.5 ms]
   [Eden: 9936.0M(9936.0M)->0.0B(9920.0M) Survivors: 64.0M->80.0M Heap: 18.2G(19.5G)->8718.4M(19.5G)]
Heap after GC invocations=8876 (full 0):
 garbage-first heap   total 20480000K, used 8927594K [0x00000002de000000, 0x00000002de804e20, 0x00000007c0000000)
  region size 8192K, 10 young (81920K), 10 survivors (81920K)
 Metaspace       used 44399K, capacity 44978K, committed 45184K, reserved 1089536K
  class space    used 4591K, capacity 4729K, committed 4760K, reserved 1048576K
}
 [Times: user=0.88 sys=0.24, real=0.07 secs]
2017-08-15T17:09:08.573+0000: 71712.737: Total time for which application threads were stopped: 4.1542217 seconds, Stopping threads took: 4.0814458 seconds

我会提请注意最后一行

应用程序线程停止的总时间：4.1542217秒，停止线程花费：4.0814458秒

我无法从任何打印的GC细节中看出为什么停止线程需要这么长时间 .

其他（long-ish）暂停具有平凡的低线程停止时间

2017-08-15T17:09:26.646+0000: 71730.810: Total time for which application threads were stopped: 0.1165019 seconds, Stopping threads took: 0.0005363 seconds
2017-08-15T17:10:36.982+0000: 71801.146: Total time for which application threads were stopped: 0.1444364 seconds, Stopping threads took: 0.0006902 seconds
2017-08-15T17:10:42.091+0000: 71806.255: Total time for which application threads were stopped: 0.2288663 seconds, Stopping threads took: 0.0064196 seconds
2017-08-15T17:10:52.601+0000: 71816.765: Total time for which application threads were stopped: 0.2781895 seconds, Stopping threads took: 0.0005429 seconds
2017-08-15T17:10:56.992+0000: 71821.156: Total time for which application threads were stopped: 0.1380046 seconds, Stopping threads took: 0.0006240 seconds

该应用程序至少有1000个线程，但对我来说停止它们将占用GC事件持续时间的98％是没有意义的 .

我首先假设这是一个线程达到安全点所花费的时间，所以我尝试添加

-XX:+SafepointTimeout -XX:SafepointTimeoutDelay=500

但记录的事件与长GC事件不一致（除非线程都需要499ms来达到他们的安全点...）

还有什么可以延迟线程停止时间？

请注意，无论我使用CMS还是G1以及是否write the GC logs to disk (SSD) or to memory，都会出现这些暂停 .

1 回答

1

您可以添加 -XX:+PrintSafepointStatistics –XX:PrintSafepointStatisticsCount=1 以在安全点上启用详细信息诊断 .

-XX:+PrintJNIGCStalls 是另一种诊断方法option可能会对问题有所了解 .

JNIGCStalls 不计入 SafepointTimeout .

其他原因一般可能是系统过载 . 我建议你检查"load average"中的尖峰是否与GC异常相关 . 虽然"load average"指标有自己的quirks .

如果您处于虚拟化环境中，零星的系统冻结可能是一种常态 . 这种冻结也可以被认为是“负载平均”OS度量的峰值 .

回复于 2024-05-03T16:45:38+08:00

我如何调试为什么在GC期间停止线程这么长时间？ [重复]

1 回答

相关问题