My Hadoop NodeManager processes crash for no apparent reason, and nothing is written to the logs.

1. server info

Red Hat Enterprise Linux Server release 7.2 (Maipo)

hadoop 2.8.0

namenode: Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz, 32 cores, 64 GB RAM; 2 nodes for the NameNode, 1 for ResourceManager / Hive

datanode: Intel(R) Xeon(R) CPU E5-2620 v2 @ 2.10GHz, 24 cores, 128 GB RAM; 5 datanodes in total

nodemanager conf:

hadoop 233967 5.7 1.9 6424200 2600128 ? Sl 17:00 1:22 /opt/jdk/bin/java -Dproc_nodemanager -Xmx4096m -agentpath:/usr/lib/abrt-java-connector/libabrt-java-connector.so -agentlib:abrt-java-connector=output=/home/hadoop/hadoop/logs/abrt-agent.log -Xms4g -Xmx4g -Xmn3g -server -XX:SurvivorRatio=5 -XX:MetaspaceSize=256M -XX:+UseConcMarkSweepGC -XX:+DisableExplicitGC -verbose:gc -XX:+PrintGCDateStamps -XX:+PrintGCDetails -XX:-OmitStackTraceInFastThrow -Xloggc:/home/hadoop/hadoop/logs/yarn_gc.log -XX:ErrorFile=/home/hadoop/hadoop/logs/error/yarn_hs_err.log -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dyarn.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.home.dir= -Dyarn.id.str=hadoop -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop/lib/native -Dyarn.policy.file=hadoop-policy.xml -server -Dhadoop.log.dir=/home/hadoop/hadoop/logs -Dyarn.log.dir=/home/hadoop/hadoop/logs -Dhadoop.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.log.file=yarn-hadoop-nodemanager-hdp08.hp.sp.prd.bmsre.com.log -Dyarn.home.dir=/home/hadoop/hadoop -Dhadoop.home.dir=/home/hadoop/hadoop -Dhadoop.root.logger=INFO,RFA -Dyarn.root.logger=INFO,RFA -Djava.library.path=/home/hadoop/hadoop/lib/native -classpath /home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/etc/hadoop:/home/hadoop/hadoop/share/hadoop/common/lib/:/home/hadoop/hadoop/share/hadoop/common/:/home/hadoop/hadoop/share/hadoop/hdfs:/home/hadoop/hadoop/share/hadoop/hdfs/lib/:/home/hadoop/hadoop/share/hadoop/hdfs/:/home/hadoop/hadoop/share/hadoop/yarn/lib/:/home/hadoop/hadoop/share/hadoop/yarn/:/home/hadoop/hadoop/share/hadoop/mapreduce/lib/:/home/hadoop/hadoop/share/hadoop/mapreduce/:/home/hadoop/hadoop/contrib/capacity-scheduler/.jar:/home/hadoop/hadoop/contrib/capacity-scheduler/.jar:/home/hadoop/hadoop/share/hadoop/yarn/:/home/hadoop/hadoop/share/hadoop/yarn/lib/:/home/hadoop/hadoop/etc/hadoop/nm-config/log4j.properties org.apache.hadoop.yarn.server.nodemanager.NodeManager

2. when does it crash

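When I run the following SQL statements in Hive, some datanodes may crash. I don't know whether they are the cause, but the crashes seem to happen at about the same time.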

select max(ts) from beacon where year = 2017 and month = 6 and day >= 21

delete from beacon where ts = 1498629599829

insert into table beacon partition(year,month,day) select type,page,name,ts,year,month,day from beacon_txt where ts >= 1498629599829

3. what symptoms show when it crashes

1) The nodemanager process disappears.

2) There are no exceptions in the YARN log; the last log snippet from the datanode is:

2017-06-28 17:11:20,189 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://avatarcluster/user/hadoop/.hiveJars/hive-exec-2.1.1-5f4a7e952d29bb8013edd30bbc39476ec56bc381b96b0530a6b2fbbf28e309d3.jar(->/home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/filecache/10/hive-exec-2.1.1-5f4a7e952d29bb8013edd30bbc39476ec56bc381b96b0530a6b2fbbf28e309d3.jar) transitioned from DOWNLOADING to LOCALIZED

2017-06-28 17:11:20,701 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.localizer.LocalizedResource: Resource hdfs://avatarcluster/apps/tez-0.8.5.tar.gz(->/home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/filecache/10/tez-0.8.5.tar.gz) transitioned from DOWNLOADING to LOCALIZED

2017-06-28 17:11:20,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000006 transitioned from LOCALIZING to LOCALIZED

2017-06-28 17:11:20,703 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000011 transitioned from LOCALIZING to LOCALIZED

2017-06-28 17:11:20,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000006 transitioned from LOCALIZED to RUNNING

2017-06-28 17:11:20,740 INFO org.apache.hadoop.yarn.server.nodemanager.containermanager.container.ContainerImpl: Container container_1496722904961_18405_01_000011 transitioned from LOCALIZED to RUNNING

2017-06-28 17:11:20,744 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/appcache/application_1496722904961_18405/container_1496722904961_18405_01_000011/default_container_executor.sh]

2017-06-28 17:11:20,744 INFO org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor: launchContainer: [bash, /home/hadoop/hadoop-data/hadoop-tmp-data/nm-local-dir/usercache/hadoop/appcache/application_1496722904961_18405/container_1496722904961_18405_01_000006/default_container_executor.sh]

Some log entries show up in /var/log/messages:

Jun 28 17:11:11 hdp06 systemd: Removed slice user-0.slice.
Jun 28 17:11:11 hdp06 systemd: Stopping user-0.slice.
Jun 28 17:11:33 hdp06 abrt-server: Executable '/opt/jdk1.8.0_71/bin/java' doesn't belong to any package and ProcessUnpackaged is set to 'no'
Jun 28 17:11:33 hdp06 abrt-server: 'post-create' on '/var/spool/abrt/ccpp-2017-06-28-17:11:20-375612' exited with 1
Jun 28 17:11:33 hdp06 abrt-server: Deleting problem directory '/var/spool/abrt/ccpp-2017-06-28-17:11:20-375612'
Jun 28 17:11:43 hdp06 systemd-logind: Removed session 206241.
Jun 28 17:12:01 hdp06 systemd: Created slice user-0.slice.
Jun 28 17:12:01 hdp06 systemd: Starting user-0.slice.

It seems that java either crashed or exited normally, but it left no log.
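To line up crash times with the last NodeManager log entries, the abrt-server lines can be pulled out of /var/log/messages. The following is only a minimal sketch, assuming the problem directories are named like the one in the syslog excerpt (type, date, time, and a trailing number that appears to be the PID). `find_abrt_crashes` is a hypothetical helper name, not part of ABRT.

```python
import re

# Matches ABRT problem-directory paths such as
# /var/spool/abrt/ccpp-2017-06-28-17:11:20-375612
# (assumption: the trailing number is the PID of the crashed process)
PROBLEM_DIR = re.compile(
    r"/var/spool/abrt/(?P<type>\w+)-(?P<date>\d{4}-\d{2}-\d{2})-"
    r"(?P<time>\d{2}:\d{2}:\d{2})-(?P<pid>\d+)"
)


def find_abrt_crashes(lines):
    """Return unique (type, date, time, pid) tuples from abrt-server syslog lines."""
    seen = []
    for line in lines:
        # Only abrt-server messages are of interest here.
        if "abrt-server" not in line:
            continue
        for m in PROBLEM_DIR.finditer(line):
            entry = (m.group("type"), m.group("date"),
                     m.group("time"), int(m.group("pid")))
            if entry not in seen:  # the same directory is logged several times
                seen.append(entry)
    return seen
```

Passing an open handle on /var/log/messages to `find_abrt_crashes` would return one tuple per crash, which can then be compared with the timestamp of the last line in the NodeManager log.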

4. coredump

I added an abrt-java-connector option in yarn-env.sh:

YARN_OPTS="$YARN_OPTS -agentpath:/usr/lib/abrt-java-connector/libabrt-java-connector.so -agentlib:abrt-java-connector=output=$YARN_LOG_DIR/abrt-agent.log"

After that, some crash files were created in /var/spool/abrt/ccpp-2017-07-03-14:24:28-63796:

-rw-r----- 1 root abrt          6 Jul  3 14:24 abrt_version
-rw-r----- 1 root abrt          4 Jul  3 14:24 analyzer
-rw-r----- 1 root abrt          6 Jul  3 14:24 architecture
-rw-r----- 1 root abrt        178 Jul  3 14:24 cgroup
-rw-r----- 1 root abrt       1974 Jul  3 14:24 cmdline
-rw-r----- 1 root abrt     380795 Jul  3 14:25 core_backtrace
-rw-r----- 1 root abrt 4887654400 Jul  3 14:24 coredump
-rw-r----- 1 root abrt          1 Jul  3 14:25 count
-rw-r----- 1 root abrt       1072 Jul  3 14:25 dso_list
-rw-r----- 1 root abrt       3318 Jul  3 14:24 environ
-rw-r----- 1 root abrt          0 Jul  3 14:25 event_log
-rw-r----- 1 root abrt         26 Jul  3 14:24 executable
-rw-r----- 1 root abrt         82 Jul  3 14:25 exploitable
-rw-r----- 1 root abrt          5 Jul  3 14:24 global_pid
-rw-r----- 1 root abrt         25 Jul  3 14:24 hostname
-rw-r----- 1 root abrt         21 Jul  3 14:24 kernel
-rw-r----- 1 root abrt         10 Jul  3 14:24 last_occurrence
-rw-r----- 1 root abrt       1323 Jul  3 14:24 limits
-rw-r----- 1 root abrt        135 Jul  3 14:25 machineid
-rw-r----- 1 root abrt      60706 Jul  3 14:24 maps
-rw-r----- 1 root abrt        243 Jul  3 14:24 open_fds
-rw-r----- 1 root abrt        495 Jul  3 14:24 os_info
-rw-r----- 1 root abrt         51 Jul  3 14:24 os_release
-rw-r----- 1 root abrt          5 Jul  3 14:24 pid
-rw-r----- 1 root abrt       1137 Jul  3 14:24 proc_pid_status
-rw-r----- 1 root abrt        149 Jul  3 14:24 pwd
-rw-r----- 1 root abrt         22 Jul  3 14:24 reason
-rw-r----- 1 root abrt          4 Jul  3 14:24 runlevel
-rw-r----- 1 root abrt   10746600 Jul  3 14:25 sosreport.tar.xz
-rw-r----- 1 root abrt         10 Jul  3 14:24 time
-rw-r----- 1 root abrt          4 Jul  3 14:24 type
-rw-r----- 1 root abrt          9 Jul  3 14:24 uid
-rw-r----- 1 root abrt          7 Jul  3 14:24 username
-rw-r----- 1 root abrt         40 Jul  3 14:25 uuid
-rw-r----- 1 root abrt     185370 Jul  3 14:25 var_log_messages

The "reason" file says "java killed by SIGSEGV".
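The other small text files in the problem directory can be collected the same way. This is a minimal sketch, assuming each of these files holds a short single-line value, as their sizes in the listing suggest. `summarize_problem_dir` and the chosen field list are illustrative only.

```python
from pathlib import Path

# Small one-line files seen in the directory listing (illustrative selection).
FIELDS = ("reason", "executable", "pid", "time", "type")


def summarize_problem_dir(problem_dir):
    """Read selected one-line text files from an ABRT problem directory."""
    summary = {}
    for name in FIELDS:
        path = Path(problem_dir) / name
        if path.is_file():  # skip fields that are missing in this directory
            summary[name] = path.read_text(errors="replace").strip()
    return summary
```

The files are readable only by root and the abrt group, so this would have to run with matching permissions.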

If I run "gdb /opt/jdk/bin/java coredump", it shows:

GNU gdb (GDB) Red Hat Enterprise Linux 7.6.1-80.el7
Copyright (C) 2013 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
This is free software: you are free to change and redistribute it.
There is NO WARRANTY, to the extent permitted by law.  Type "show copying"
and "show warranty" for details.
This GDB was configured as "x86_64-redhat-linux-gnu".
For bug reporting instructions, please see:
<http://www.gnu.org/software/gdb/bugs/>...
Reading symbols from /opt/jdk1.8.0_131/bin/java...Missing separate debuginfo for /opt/jdk1.8.0_131/bin/java
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/c9/0f19ee0af98c47ccaa7181853cfd14867bc931.debug
(no debugging symbols found)...done.
[New LWP 63796]
[New LWP 62554]
[New LWP 62605]
[New LWP 62604]
[New LWP 62606]
[New LWP 62575]
[New LWP 62603]
[New LWP 62607]
[New LWP 62576]
[New LWP 62610]
[New LWP 62668]
[New LWP 62708]
[New LWP 62574]
[New LWP 63790]
[New LWP 63717]
[New LWP 63739]
[New LWP 62580]
[New LWP 62579]
[New LWP 63738]
[New LWP 62592]
[New LWP 62577]
[New LWP 62583]
[New LWP 63740]
[New LWP 62581]
[New LWP 62688]
[New LWP 62587]
[New LWP 62597]
[New LWP 63744]
[New LWP 63737]
[New LWP 62591]
[New LWP 62578]
[New LWP 62582]
[New LWP 62758]
[New LWP 62573]
[New LWP 62626]
[New LWP 63720]
[New LWP 63719]
[New LWP 63726]
[New LWP 63727]
[New LWP 62598]
[New LWP 62774]
[New LWP 62705]
[New LWP 62614]
[New LWP 62703]
[New LWP 62593]
[New LWP 62720]
[New LWP 62590]
[New LWP 62690]
[New LWP 63731]
[New LWP 63810]
[New LWP 63724]
[New LWP 62585]
[New LWP 62753]
[New LWP 62682]
[New LWP 62709]
[New LWP 62684]
[New LWP 62773]
[New LWP 62588]
[New LWP 63722]
[New LWP 62595]
[New LWP 62734]
[New LWP 62616]
[New LWP 62728]
[New LWP 62721]
[New LWP 62689]
[New LWP 62769]
[New LWP 62659]
[New LWP 63743]
[New LWP 62726]
[New LWP 62680]
[New LWP 62704]
[New LWP 62750]
[New LWP 63759]
[New LWP 62594]
[New LWP 63791]
[New LWP 62768]
[New LWP 62600]
[New LWP 63741]
[New LWP 62613]
[New LWP 63718]
[New LWP 62710]
[New LWP 62589]
[New LWP 62731]
[New LWP 63735]
[New LWP 62683]
[New LWP 62760]
[New LWP 63801]
[New LWP 62776]
[New LWP 62678]
[New LWP 62615]
[New LWP 62685]
[New LWP 62737]
[New LWP 62599]
[New LWP 63742]
[New LWP 63808]
[New LWP 62755]
[New LWP 62707]
[New LWP 62694]
[New LWP 63729]
[New LWP 63755]
[New LWP 62711]
[New LWP 63725]
[New LWP 63732]
[New LWP 62745]
[New LWP 62596]
[New LWP 62608]
[New LWP 62735]
[New LWP 63721]
[New LWP 62748]
[New LWP 62736]
[New LWP 62712]
[New LWP 63756]
[New LWP 63793]
[New LWP 63787]
[New LWP 63803]
[New LWP 62602]
[New LWP 62743]
[New LWP 62733]
[New LWP 62742]
[New LWP 63710]
[New LWP 62744]
[New LWP 62677]
[New LWP 62739]
[New LWP 62713]
[New LWP 63789]
[New LWP 62601]
[New LWP 63812]
[New LWP 62725]
[New LWP 62724]
[New LWP 63709]
[New LWP 62718]
[New LWP 62759]
[New LWP 62686]
[New LWP 62715]
[New LWP 62740]
[New LWP 62655]
[New LWP 62749]
[New LWP 62722]
[New LWP 63708]
[New LWP 62716]
[New LWP 63800]
[New LWP 62687]
[New LWP 62723]
[New LWP 63733]
[New LWP 62609]
[New LWP 62738]
[New LWP 63707]
[New LWP 62719]
[New LWP 62714]
[New LWP 62691]
[New LWP 62780]
[New LWP 62625]
[New LWP 62778]
[New LWP 63788]
[New LWP 62717]
[New LWP 63802]
[New LWP 62681]
[New LWP 62692]
[New LWP 62730]
[New LWP 63736]
[New LWP 62679]
[New LWP 62693]
[New LWP 63728]
[New LWP 62697]
[New LWP 62729]
[New LWP 62746]
[New LWP 62698]
[New LWP 62747]
[New LWP 63734]
[New LWP 62727]
[New LWP 62695]
[New LWP 62675]
[New LWP 62676]
[New LWP 63711]
[New LWP 63713]
[New LWP 62699]
[New LWP 62752]
[New LWP 62700]
[New LWP 63723]
[New LWP 62706]
[New LWP 62756]
[New LWP 63706]
[New LWP 62702]
[New LWP 63751]
[New LWP 62658]
[New LWP 62779]
[New LWP 62754]
[New LWP 62771]
[New LWP 62701]
[New LWP 62751]
[New LWP 63730]
[New LWP 62612]
[New LWP 62696]
[New LWP 62611]
[New LWP 62757]
[New LWP 62761]
[New LWP 62732]
[New LWP 62772]
[New LWP 62741]
[New LWP 62777]
[New LWP 62775]
[New LWP 62770]
[New LWP 63792]
[New LWP 62586]
[New LWP 62584]
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/78/b091327a0bf6d146f8881f285955b4f7f2b712.debug
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/libverify.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/89/8659e6261ad966b4b638afc8e3dd214896253d.debug
Missing separate debuginfo for /opt/jdk1.8.0_131/jre/lib/amd64/libmanagement.so
Try: yum --enablerepo='*debug*' install /usr/lib/debug/.build-id/88/c0c64eb685c329ad849281135cfe113f3812e8.debug
Core was generated by `/opt/jdk/bin/java -Dproc_nodemanager -Xmx4096m -Xms4g -Xmx4g -Xmn3g -server -XX'.
Program terminated with signal 11, Segmentation fault.
#0  0x00007f2f4460bfd7 in VMError::report_and_die() ()
   from /opt/jdk1.8.0_131/jre/lib/amd64/server/libjvm.so
Missing separate debuginfos, use: debuginfo-install glibc-2.17-105.el7.x86_64 libgcc-4.8.5-4.el7.x86_64 sssd-client-1.13.0-40.el7.x86_64
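The gdb session above only prints frame #0, which is in VMError::report_and_die(), HotSpot's fatal-error reporting routine. A backtrace of every thread would say more. The following is a minimal sketch of collecting it non-interactively, assuming gdb is on the PATH. `gdb_backtrace_cmd` and `dump_backtraces` are hypothetical helper names.

```python
import subprocess


def gdb_backtrace_cmd(executable, core):
    """Build a gdb command line that prints a backtrace for every thread."""
    return [
        "gdb", "-batch",               # run the given commands, then exit
        "-ex", "thread apply all bt",  # backtrace of all threads (LWPs)
        executable, core,
    ]


def dump_backtraces(executable, core, out_path):
    """Run gdb on the core file and save its combined output to out_path."""
    with open(out_path, "w") as out:
        subprocess.run(gdb_backtrace_cmd(executable, core),
                       stdout=out, stderr=subprocess.STDOUT, check=False)
```

For example, `dump_backtraces("/opt/jdk/bin/java", "coredump", "bt.txt")` would write the backtraces to bt.txt. Without the debuginfo packages that gdb lists, most native frames will still show without source detail.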

What are the possible causes?