PartitionLostEvent when starting a Coherence cluster

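I have a Coherence cluster of 5 physical machines with 3 JVMs on each. On startup, I get a PartitionLostEvent from one of the nodes. Here is the output from the node that reported the event: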

18 Feb 2014 14:05:02,570 [Logger@9232314 3.5.3/465p5] INFO  Coherence  - 3.5.3/465p5 (thread=DefaultCacheServer, member=2): Started DefaultCacheServer...    

SafeCluster: Name=tmsngCluster
WellKnownAddressList(Size=5,
  WKA{Address=172.17.0.205, Port=4044}
  WKA{Address=172.17.0.202, Port=4044}
  WKA{Address=172.17.0.203, Port=4044}
  WKA{Address=172.17.0.201, Port=4044}
  WKA{Address=172.17.0.204, Port=4044}
  )

MasterMemberSet
  (
  ThisMember=Member(Id=2, Timestamp=2014-02-18 14:05:00.396, Address=172.17.0.205:4044, MachineId=51999, Location=machine:nec05,process:28959,member:INPUT1, Role=Input)
  OldestMember=Member(Id=1, Timestamp=2014-02-18 14:04:30.049, Address=172.17.0.201:4044, MachineId=3867, Location=machine:nec01,process:11301,member:PROCESS1, Role=Process)
  ActualMemberSet=MemberSet(Size=11, BitSetCount=2
    Member(Id=1, Timestamp=2014-02-18 14:04:30.049, Address=172.17.0.201:4044, MachineId=3867, Location=machine:nec01,process:11301,member:PROCESS1, Role=Process)
    Member(Id=2, Timestamp=2014-02-18 14:05:00.396, Address=172.17.0.205:4044, MachineId=51999, Location=machine:nec05,process:28959,member:INPUT1, Role=Input)
    Member(Id=3, Timestamp=2014-02-18 14:05:00.415, Address=172.17.0.204:4045, MachineId=23582, Location=machine:nec04,process:32568,member:PROCESS1, Role=Process)
    Member(Id=4, Timestamp=2014-02-18 14:05:00.417, Address=172.17.0.204:4046, MachineId=23582, Location=machine:nec04,process:32648,member:CLIENT1, Role=Output)
    Member(Id=5, Timestamp=2014-02-18 14:05:00.414, Address=172.17.0.203:4044, MachineId=60701, Location=machine:nec03,process:32077,member:INPUT1, Role=Input)
    Member(Id=6, Timestamp=2014-02-18 14:05:00.441, Address=172.17.0.203:4045, MachineId=60701, Location=machine:nec03,process:32040,member:PROCESS1, Role=Process)
    Member(Id=7, Timestamp=2014-02-18 14:05:00.451, Address=172.17.0.205:4045, MachineId=51999, Location=machine:nec05,process:28928,member:PROCESS1, Role=Process)
    Member(Id=8, Timestamp=2014-02-18 14:05:00.47, Address=172.17.0.204:4044, MachineId=23582, Location=machine:nec04,process:32605,member:INPUT1, Role=Input)
    Member(Id=9, Timestamp=2014-02-18 14:05:00.53, Address=172.17.0.202:4044, MachineId=32284, Location=machine:nec02,process:1785,member:PROCESS1, Role=Process)
    Member(Id=10, Timestamp=2014-02-18 14:05:00.551, Address=172.17.0.203:4046, MachineId=60701, Location=machine:nec03,process:32120,member:CLIENT1, Role=Output)
    Member(Id=11, Timestamp=2014-02-18 14:05:00.568, Address=172.17.0.205:4046, MachineId=51999, Location=machine:nec05,process:28996,member:CLIENT1, Role=Output)
    )
  RecycleMillis=240000
  RecycleSet=MemberSet(Size=0, BitSetCount=0
    )
  )

Services
  (
  TcpRing{TcpSocketAccepter{State=STATE_OPEN, ServerSocket=172.17.0.205:4044}, Connections=[1]}
  ClusterService{Name=Cluster, State=(SERVICE_STARTED, STATE_JOINED), Id=0, Version=3.5, OldestMemberId=1}
  InvocationService{Name=Management, State=(SERVICE_STARTED), Id=1, Version=3.1, OldestMemberId=1}
  DistributedCache{Name=BackService, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=1021, BackupCount=1, AssignedPartitions=0, BackupPartitions=0}
  DistributedCache{Name=InputBackService, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=1021, BackupCount=1, AssignedPartitions=770, BackupPartitions=0}
  ReplicatedCache{Name=ReplicatedCache, State=(SERVICE_STARTED), Id=7, Version=3.0, OldestMemberId=1}
  InvocationService{Name=DefaultInvocationService, State=(SERVICE_STARTED), Id=15, Version=3.1, OldestMemberId=2}
  )

18 Feb 2014 14:05:02,604 [Logger@9232314 3.5.3/465p5] WARN  Coherence  - 3.5.3/465p5 (thread=DistributedCache:InputBackService, member=2): Assigned 1021 orphaned primary partitions
18 Feb 2014 14:05:02,654 [DistributedCache:BackService:EventDispatcher] INFO  server.common.watchdog.MemberLeftListener  - Member joined: Member(Id=9, Timestamp=2014-02-18 14:05:00.53, Address=172.17.0.202:4044, MachineId=32284, Location=machine:nec02,process:1785,member:PROCESS1, Role=Process)
18 Feb 2014 14:05:02,655 [DistributedCache:BackService:EventDispatcher] INFO  server.common.master.MasterServiceImpl  - Member joined: Member(Id=9, Timestamp=2014-02-18 14:05:00.53, Address=172.17.0.202:4044, MachineId=32284, Location=machine:nec02,process:1785,member:PROCESS1, Role=Process)
18 Feb 2014 14:05:02,657 [Environment.Background.Executor:Thread-2] ERROR server.common.watchdog.LostPartitionsEventProcessor  - Partitions are lost for DistributedCache{Name=InputBackService, State=(SERVICE_STARTED), LocalStorage=enabled, PartitionCount=1021, BackupCount=1, AssignedPartitions=1021, BackupPartitions=0}

I am using Coherence 3.5.3 (patch 5).

Here is the configuration of the service:

<distributed-scheme>
        <scheme-name>inputDistributedScheme</scheme-name>
        <service-name>InputBackService</service-name>
        <backing-map-scheme>
            <local-scheme>
                <service-name>InputBackLocalService</service-name>
            </local-scheme>
        </backing-map-scheme>
        <partition-listener>
            <class-name>com.company.tmsng.server.common.watchdog.LostPartitionListener</class-name>
        </partition-listener>
        <local-storage system-property="tangosol.coherence.inputNode">false</local-storage>
        <autostart system-property="tangosol.coherence.inputNode">false</autostart>
    </distributed-scheme>

Why do I receive this event while the cluster is still starting up?

2 Answers

  • 0

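    I remember a similar problem. In our case, the cause was mismatched jar versions between the (physical) machines, because we load a custom jar containing our own CachePersistency and entry processor implementations. If you load custom jars like this, check that the versions match on every machine, or even compare their MD5 or other hash values.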
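
To compare the jars as suggested above, one option is to print an MD5 digest for each jar on every machine and compare the results. The following is a minimal sketch using only the Java standard library; the class name `JarChecksum` and its methods are hypothetical helpers for illustration and are not part of Coherence.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical helper (not part of Coherence): prints the MD5 digest of each
// file named on the command line so the output can be compared across machines.
public class JarChecksum {

    // Returns the MD5 digest of the given bytes as lowercase hex.
    static String md5Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            // MD5 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Reads the whole file into memory and returns its MD5 digest as hex.
    static String md5Hex(Path file) {
        try {
            return md5Hex(Files.readAllBytes(file));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        for (String name : args) {
            System.out.println(md5Hex(Paths.get(name)) + "  " + name);
        }
    }
}
```

Run it on each machine with the paths of the custom jars as arguments; any jar whose digest differs between machines is a candidate for the mismatch described above.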

  • 0

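    First, I would suggest upgrading to a newer version if at all possible; 3.5 is getting rather old at this point! (The current version is 12.1.2.)

    Second, judging from the messages shown, the partitions are not actually being lost; rather, this server is claiming them (because no other server owns them). That is the thing to figure out. I have never seen this behavior, but it may indicate some kind of communication failure; I would check the logs on all servers to verify.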
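
When checking the logs on all servers, it can help to pull out just the lines in which a member claims orphaned partitions. The sketch below assumes that the other members log this warning in the same format as the line in the question; the class name `OrphanLogScan` and its `describe` method are hypothetical.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical log filter: reads log lines from standard input and prints a
// summary for each "Assigned N orphaned primary partitions" warning.
public class OrphanLogScan {

    // Assumes the message format shown in the question's log output.
    private static final Pattern ORPHANED = Pattern.compile(
        "thread=DistributedCache:([^,]+), member=(\\d+)\\): "
        + "Assigned (\\d+) orphaned primary partitions");

    // Returns a one-line summary, or null if the line is not an orphan warning.
    static String describe(String line) {
        Matcher m = ORPHANED.matcher(line);
        if (!m.find()) {
            return null;
        }
        return "service=" + m.group(1)
            + " member=" + m.group(2)
            + " orphaned=" + m.group(3);
    }

    public static void main(String[] args) throws IOException {
        BufferedReader in = new BufferedReader(new InputStreamReader(System.in));
        String line;
        while ((line = in.readLine()) != null) {
            String summary = describe(line);
            if (summary != null) {
                System.out.println(summary);
            }
        }
    }
}
```

Feeding each server's log through this filter shows which members claimed partitions for which service; comparing those results with the member join timestamps may help narrow down where the communication problem, if any, occurred.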