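I have SolrCloud running on a machine with 16 GB of RAM, with 2 Solr nodes (same IP) for sharding and embedded ZooKeeper. I run Solr with the default configuration. Although that configuration comes with -Xms5g -Xmx5g, the Solr dashboard shows memory usage sometimes reaching 15 GB of the 16 GB maximum. It ran smoothly for months. It has 300-900 collections, and the number of documents per collection ranges from 1 to 8,000,000 (collections with more than 1 million documents are rare).

But currently the Solr instance goes down almost every day at around 7-8 AM. You can see the log below: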

ClientCnxn | Client session timed out, have not heard from server in 11856ms for sessionid 0x16784ac54710000
12/7/2018, 7:19:52 AM | WARN | false | NIOServerCnxn | caught end of stream exception
12/7/2018, 7:19:53 AM | WARN | false | NIOServerCnxn | caught end of stream exception
12/7/2018, 7:19:53 AM | WARN | false | ConnectionManager | Watcher org.apache.solr.common.cloud.ConnectionManager@422f5928 name: ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent state:Disconnected type:None path:null path: null type: None
12/7/2018, 7:19:53 AM | WARN | false | ConnectionManager | zkClient has disconnected
12/7/2018, 7:19:55 AM | WARN | false | ClientCnxn | Unable to reconnect to ZooKeeper service, session 0x16784ac54710000 has expired
12/7/2018, 7:19:55 AM | WARN | false | ConnectionManager | Watcher org.apache.solr.common.cloud.ConnectionManager@422f5928 name: ZooKeeperConnection Watcher:localhost:9983 got event WatchedEvent state:Expired type:None path:null path: null type: None
12/7/2018, 7:19:55 AM | WARN | false | ConnectionManager | Our previous ZooKeeper session was expired. Attempting to reconnect to recover relationship with ZooKeeper...
12/7/2018, 7:19:55 AM | ERROR | false | RequestHandlerBase | org.apache.solr.common.SolrException: Cannot talk to ZooKeeper - Updates are disabled.
12/7/2018, 7:19:55 AM | WARN | false | OverseerTriggerThread | OverseerTriggerThread woken up but we are closed, exiting.
12/7/2018, 7:19:55 AM | ERROR | false | SolrCmdDistributor | org.apache.solr.client.solrj.impl.HttpSolrClient$RemoteSolrException: Error from server at http://10.250.200.217:8983/solr/BL_indomaret_oct_update.csv_shard1_replica_n1: Cannot talk to ZooKeeper - Updates are disabled.
12/7/2018, 7:19:55 AM | WARN | false | DistributedUpdateProcessor | Error sending update to http://10.250.200.217:8983/solr
12/7/2018, 7:19:55 AM | ERROR | false | Overseer | could not read the data
12/7/2018, 7:19:55 AM | WARN | false | DefaultConnectionStrategy | Connection expired - starting a new one...
12/7/2018, 7:20:04 AM | ERROR | false | RequestHandlerBase | org.apache.solr.common.SolrException: no servers hosting shard: shard1

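I am thinking of tuning the GC with a G1 configuration like the one described [here] [1], but first I want to confirm whether GC pauses are the root cause or whether it is something else, if that can be seen from the logs. The nodes currently use the default CMS configuration.

This is the output from the first Solr node (using jstat -gcutil):

[bin] $ ./jstat -gcutil 31543 1000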

S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
 14.39   0.00  84.22  42.04  93.99  89.40   2548  300.876    18    3.981  304.858
 14.39   0.00  84.22  42.04  93.99  89.40   2548  300.876    18    3.981  304.858
 14.39   0.00  84.22  42.04  93.99  89.40   2548  300.876    18    3.981  304.858

And this is from the second Solr node:

./jstat -gcutil 32223 1000

S0     S1     E      O      M     CCS    YGC     YGCT    FGC    FGCT     GCT
  0.00  11.95   8.29  38.66  94.10  88.57   2121  206.174     8    2.076  208.251
  0.00  11.95   8.29  38.66  94.10  88.57   2121  206.174     8    2.076  208.251
  0.00  11.95   8.47  38.66  94.10  88.57   2121  206.174     8    2.076  208.251
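
For reference, the cumulative jstat counters can be turned into an average time per collection by dividing YGCT by YGC and FGCT by FGC. The sketch below does that for the two rows above; the function name `average_pauses` and the constants are made up for illustration. For the first node this gives roughly 0.12 s per young collection. These are only averages over the whole JVM uptime, so they cannot rule out a single long pause.

```python
def average_pauses(header: str, row: str) -> dict:
    """Compute average seconds per GC event from one `jstat -gcutil` row.

    YGC/FGC are event counts and YGCT/FGCT are cumulative seconds,
    so the ratio is a lifetime average, not a worst case.
    """
    cols = dict(zip(header.split(), (float(v) for v in row.split())))
    return {
        "avg_young_gc_s": cols["YGCT"] / cols["YGC"] if cols["YGC"] else 0.0,
        "avg_full_gc_s": cols["FGCT"] / cols["FGC"] if cols["FGC"] else 0.0,
    }


# Values copied from the jstat output of the two nodes.
HEADER = "S0 S1 E O M CCS YGC YGCT FGC FGCT GCT"
NODE1 = "14.39 0.00 84.22 42.04 93.99 89.40 2548 300.876 18 3.981 304.858"
NODE2 = "0.00 11.95 8.29 38.66 94.10 88.57 2121 206.174 8 2.076 208.251"

if __name__ == "__main__":
    for name, row in (("node1", NODE1), ("node2", NODE2)):
        print(name, average_pauses(HEADER, row))
```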

Below is solr_gc_current.log:

2018-12-09T02:15:02.443+0700: 177309.558: Total time for which application threads were stopped: 0.1199759 seconds, Stopping threads took: 0.0046451 seconds
2018-12-09T02:15:25.680+0700: 177332.795: Total time for which application threads were stopped: 0.0309449 seconds, Stopping threads took: 0.0035637 seconds
2018-12-09T02:16:07.542+0700: 177374.657: Total time for which application threads were stopped: 0.0332466 seconds, Stopping threads took: 0.0036185 seconds
2018-12-09T02:16:07.576+0700: 177374.691: Total time for which application threads were stopped: 0.0306116 seconds, Stopping threads took: 0.0034811 seconds
2018-12-09T02:16:16.697+0700: 177383.812: Total time for which application threads were stopped: 0.0295741 seconds, Stopping threads took: 0.0035389 seconds
2018-12-09T02:16:31.868+0700: 177398.983: Total time for which application threads were stopped: 0.0390703 seconds, Stopping threads took: 0.0049162 seconds
2018-12-09T02:18:27.006+0700: 177514.121: Total time for which application threads were stopped: 0.0310958 seconds, Stopping threads took: 0.0037218 seconds
2018-12-09T02:18:27.964+0700: 177515.080: Total time for which application threads were stopped: 0.0360488 seconds, Stopping threads took: 0.0047906 seconds
{Heap before GC invocations=2120 (full 4):
 par new generation   total 1092288K, used 898004K [0x0000000680000000, 0x00000006d0000000, 0x00000006d0000000)
  eden space 873856K,  99% used [0x0000000680000000, 0x00000006b555fee0, 0x00000006b5560000)
  from space 218432K,  11% used [0x00000006b5560000, 0x00000006b6cf5470, 0x00000006c2ab0000)
  to   space 218432K,   0% used [0x00000006c2ab0000, 0x00000006c2ab0000, 0x00000006d0000000)
 concurrent mark-sweep generation total 3932160K, used 1519752K [0x00000006d0000000, 0x00000007c0000000, 0x00000007c0000000)
 Metaspace       used 46887K, capacity 48033K, committed 49828K, reserved 1093632K
  class space    used 4960K, capacity 5217K, committed 5600K, reserved 1048576K
2018-12-09T02:18:28.456+0700: 177515.571: [GC (Allocation Failure) 2018-12-09T02:18:28.460+0700: 177515.575: [ParNew
Desired survivor size 201306928 bytes, new threshold 8 (max 8)
- age   1:    8579280 bytes,    8579280 total
- age   2:    6635784 bytes,   15215064 total
- age   3:     746072 bytes,   15961136 total
- age   4:    1137888 bytes,   17099024 total
- age   5:     273208 bytes,   17372232 total
- age   6:    1769872 bytes,   19142104 total
- age   7:    1744032 bytes,   20886136 total
- age   8:     277464 bytes,   21163600 total
: 898004K->26092K(1092288K), 0.0716839 secs] 2417757K->1546202K(5024448K), 0.0797908 secs] [Times: user=0.24 sys=0.00, real=0.08 secs]
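
One way to check whether a stop-the-world pause lines up with the ZooKeeper timeout is to scan the GC log for "Total time for which application threads were stopped" entries above some threshold and compare their timestamps with the time of the session expiry. The following is a minimal sketch of that idea. The function name `long_pauses` and the threshold value are assumptions chosen for illustration; a meaningful threshold would be something close to the ZooKeeper session timeout configured for the cluster.

```python
import re
from typing import Iterable, List, Tuple

# Matches lines such as:
# 2018-12-09T02:15:02.443+0700: 177309.558: Total time for which
# application threads were stopped: 0.1199759 seconds, ...
STOPPED = re.compile(
    r"^(?P<ts>\S+): [\d.]+: Total time for which application threads "
    r"were stopped: (?P<secs>[\d.]+) seconds"
)


def long_pauses(lines: Iterable[str], threshold_s: float) -> List[Tuple[str, float]]:
    """Return (timestamp, seconds) for every pause of at least threshold_s."""
    hits = []
    for line in lines:
        m = STOPPED.match(line)
        if m and float(m.group("secs")) >= threshold_s:
            hits.append((m.group("ts"), float(m.group("secs"))))
    return hits


# Two lines copied from the GC log above, used as sample input.
SAMPLE = [
    "2018-12-09T02:15:02.443+0700: 177309.558: Total time for which application threads were stopped: 0.1199759 seconds, Stopping threads took: 0.0046451 seconds",
    "2018-12-09T02:15:25.680+0700: 177332.795: Total time for which application threads were stopped: 0.0309449 seconds, Stopping threads took: 0.0035637 seconds",
]

if __name__ == "__main__":
    # threshold_s=0.1 is an arbitrary value so the sample produces a hit.
    for ts, secs in long_pauses(SAMPLE, threshold_s=0.1):
        print(ts, secs)
```

To run it against the real file, pass an open file handle for solr_gc_current.log as `lines`. If no pause near the failure time approaches the timeout, GC is less likely to be the only cause.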

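FYI, my system has been running without problems for the last two days. The Solr dashboard shows it using 67% (10 GB) of the 16 GB maximum. The first log snippet is from when the error/shutdown occurred. The GC log snippets, however, are from the last few days, while the system was running smoothly. I still want to be prepared in case it happens again. Thanks, I appreciate your help and time.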