I started playing with Hadoop 2.6.0 and set up a pseudo-distributed single-node system following the official documentation.
When I run the simple MapReduce (MR1) example (see "Pseudo-Distributed Operation -> Execution"), the total execution time is about 7 seconds. More precisely, bash's time gives:
real 0m6.769s
user 0m7.375s
sys 0m0.400s
When I run the same example via YARN (MR2) (see "Pseudo-Distributed Operation -> YARN on a Single Node"), the total execution time is about 100 seconds, so it is very slow. bash's time gives:
real 1m38.422s
user 0m4.798s
sys 0m0.319s
So, for some reason, there is a large amount of overhead outside of user space. But why?
Both examples are run by executing
bin/hadoop jar share/hadoop/mapreduce/hadoop-mapreduce-examples-2.6.0.jar grep input output 'dfs[a-z.]+'
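For reference: per the official pseudo-distributed guide, the only configuration difference between the two runs should be the execution framework. The sketch below shows the YARN variant as I understand the docs; for the plain MapReduce run, mapreduce.framework.name is simply left at its default value of "local".

etc/hadoop/mapred-site.xml:
<configuration>
  <property>
    <name>mapreduce.framework.name</name>
    <value>yarn</value> <!-- omit this property (default "local") for the MR1-style run -->
  </property>
</configuration>

etc/hadoop/yarn-site.xml:
<configuration>
  <property>
    <name>yarn.nodemanager.aux-services</name>
    <value>mapreduce_shuffle</value>
  </property>
</configuration>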
Here are more details for the plain MapReduce (MR1) run:
(...)
15/04/10 21:12:17 INFO mapreduce.Job: Counters: 38
File System Counters
FILE: Number of bytes read=125642
FILE: Number of bytes written=1009217
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=154548
HDFS: Number of bytes written=1071
HDFS: Number of read operations=157
HDFS: Number of large read operations=0
HDFS: Number of write operations=16
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=129
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=0
CPU time spent (ms)=0
Physical memory (bytes) snapshot=0
Virtual memory (bytes) snapshot=0
Total committed heap usage (bytes)=1062207488
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
real 0m6.769s
user 0m7.375s
sys 0m0.400s
And more details for the YARN (MR2) run:
(...)
15/04/10 21:20:31 INFO mapreduce.Job: Counters: 49
File System Counters
FILE: Number of bytes read=291
FILE: Number of bytes written=211001
FILE: Number of read operations=0
FILE: Number of large read operations=0
FILE: Number of write operations=0
HDFS: Number of bytes read=566
HDFS: Number of bytes written=197
HDFS: Number of read operations=7
HDFS: Number of large read operations=0
HDFS: Number of write operations=2
Job Counters
Launched map tasks=1
Launched reduce tasks=1
Data-local map tasks=1
Total time spent by all maps in occupied slots (ms)=2411
Total time spent by all reduces in occupied slots (ms)=2717
Total time spent by all map tasks (ms)=2411
Total time spent by all reduce tasks (ms)=2717
Total vcore-seconds taken by all map tasks=2411
Total vcore-seconds taken by all reduce tasks=2717
Total megabyte-seconds taken by all map tasks=2468864
Total megabyte-seconds taken by all reduce tasks=2782208
Map-Reduce Framework
Map input records=11
Map output records=11
Map output bytes=263
Map output materialized bytes=291
Input split bytes=129
Combine input records=0
Combine output records=0
Reduce input groups=5
Reduce shuffle bytes=291
Reduce input records=11
Reduce output records=11
Spilled Records=22
Shuffled Maps =1
Failed Shuffles=0
Merged Map outputs=1
GC time elapsed (ms)=68
CPU time spent (ms)=1160
Physical memory (bytes) snapshot=432250880
Virtual memory (bytes) snapshot=1719066624
Total committed heap usage (bytes)=353370112
Shuffle Errors
BAD_ID=0
CONNECTION=0
IO_ERROR=0
WRONG_LENGTH=0
WRONG_MAP=0
WRONG_REDUCE=0
File Input Format Counters
Bytes Read=437
File Output Format Counters
Bytes Written=197
real 1m38.422s
user 0m4.798s
sys 0m0.319s
Can anyone explain this performance gap and how to fix it?
1 Answer
YARN comes in handy when you have a very large cluster and want to run different kinds of applications on the same cluster, e.g. Hadoop, Spark, Kafka, etc. It is designed to support many platforms. I think the time difference you are seeing comes from the default configuration; tuning the cluster should give better performance.
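As a rough starting point, this is the kind of tuning I mean. The property names below are standard MapReduce/YARN settings, but the values are only illustrative for a single-node box, not something I have benchmarked on your setup. Letting small jobs run "uberized" inside the ApplicationMaster JVM avoids launching separate containers for the map and reduce tasks, which accounts for a good chunk of the per-job overhead for a tiny job like the grep example.

etc/hadoop/mapred-site.xml:
<configuration>
  <!-- Run small jobs inside the ApplicationMaster JVM instead of
       allocating a separate container per map/reduce task. -->
  <property>
    <name>mapreduce.job.ubertask.enable</name>
    <value>true</value>
  </property>
  <!-- Smaller containers so a single-node box does not sit waiting
       on memory allocations (illustrative values). -->
  <property>
    <name>yarn.app.mapreduce.am.resource.mb</name>
    <value>512</value>
  </property>
  <property>
    <name>mapreduce.map.memory.mb</name>
    <value>256</value>
  </property>
  <property>
    <name>mapreduce.reduce.memory.mb</name>
    <value>256</value>
  </property>
</configuration>

Depending on your machine, yarn.scheduler.minimum-allocation-mb in yarn-site.xml (1024 MB by default) may also be worth lowering.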