为什么Spark会因java.lang.OutOfMemoryError而失败：超出GC开销限制？-Java 学习之路

我正在尝试实现一个在Spark之前工作正常的Hadoop Map / Reduce作业 . Spark应用程序定义如下：

val data = spark.textFile(file, 2).cache()
val result = data
  .map(//some pre-processing)
  .map(docWeightPar => (docWeightPar(0),docWeightPar(1))))
  .flatMap(line => MyFunctions.combine(line))
  .reduceByKey( _ + _)

MyFunctions.combine 的位置

def combine(tuples: Array[(String, String)]): IndexedSeq[(String,Double)] =
  for (i <- 0 to tuples.length - 2;
       j <- 1 to tuples.length - 1
  ) yield (toKey(tuples(i)._1,tuples(j)._1),tuples(i)._2.toDouble * tuples(j)._2.toDouble)

如果用于输入的列表很大，那么 combine 函数会产生大量的映射键，这就是抛出异常的地方 .

在Hadoop Map Reduce设置中，我没有遇到任何问题，因为这是 combine 函数产生的点是Hadoop将映射对写入磁盘的点 . Spark似乎会将所有内容保留在内存中，直到它被 java.lang.OutOfMemoryError: GC overhead limit exceeded 爆炸 .

我可能正在做一些非常基本的错误，但我找不到任何关于如何从这个方面挺身而出的指示，我想知道如何避免这种情况 . 由于我是Scala和Spark的总菜鸟，我不确定问题是来自一个还是来自另一个，或两者兼而有之 . 我目前正在尝试在我自己的笔记本电脑上运行这个程序，它适用于 tuples 数组的长度不是很长的输入 . 提前致谢 .

4 回答

13
正如已经建议的那样，调整内存可能是一个很好的方法，因为这是一种昂贵的操作，可以以丑陋的方式扩展 . 但也许一些代码更改将有所帮助 .

您可以在组合函数中采用不同的方法，通过使用 combinations 函数来避免 if 语句 . 在组合操作之前，我还将元组的第二个元素转换为双精度：
```
tuples.

    // Convert to doubles only once
    map{ x=>
        (x._1, x._2.toDouble)
    }.

    // Take all pairwise combinations. Though this function
    // will not give self-pairs, which it looks like you might need
    combinations(2).

    // Your operation
    map{ x=>
        (toKey(x{0}._1, x{1}._1), x{0}._2*x{1}._2)
    }
```
这将给出一个迭代器，您可以使用下游，或者，如果需要，可以使用 toList 转换为列表（或其他内容） .
回复于 2024-04-29T16:59:08+08:00
1
启动 spark-shell 或 spark-submit 时添加以下JVM arg：
```
-Dspark.executor.memory=6g
```
您还可以考虑在创建 SparkContext 实例时明确设置工作器数：

分布式群集

在 conf/slaves 中设置从站名称：
```
val sc = new SparkContext("master", "MyApp")
```
回复于 2024-04-29T16:59:08+08:00
9
在文档（http://spark.apache.org/docs/latest/running-on-yarn.html）中，您可以阅读如何配置执行程序和内存限制 . 例如：
```
--master yarn-cluster --num-executors 10 --executor-cores 3 --executor-memory 4g --driver-memory 5g  --conf spark.yarn.executor.memoryOverhead=409
```
memoryOverhead应该是执行程序内存的10％ .

编辑：修正了4096到409（以下评论指的是这个）
回复于 2024-04-29T16:59:08+08:00
12

当我将 spark.memory.fraction 增加到大于0.6的值时，此JVM垃圾回收错误在我的情况下可重复发生 . 因此最好将值保留为默认值以避免JVM垃圾回收错误 . https://forums.databricks.com/questions/2202/javalangoutofmemoryerror-gc-overhead-limit-exceede.html也建议这样做 .

有关 0.6 为 spark.memory.fraction 的最佳值的更多信息，请参见https://issues.apache.org/jira/browse/SPARK-15796 .

回复于 2024-04-29T16:59:08+08:00

为什么Spark会因java.lang.OutOfMemoryError而失败：超出GC开销限制？

4 回答

分布式群集

相关问题