Spark使用哪个内存部分来计算不会持久化的RDD-Java 学习之路

我是Spark的新手，我理解Spark将执行程序内存划分为以下几部分：

RDD Storage: 哪个Spark使用.persist（）或.cache（）来存储持久化的RDD，可以通过设置spark.storage.memoryFraction（默认为0.6）来定义

Shuffle and aggregation buffers: Spark用于存储随机输出 . 它可以使用spark.shuffle.memoryFraction定义 . 如果shuffle输出超过这个分数，那么Spark会将数据溢出到磁盘（默认为0.2）

User code: Spark使用此分数执行任意用户代码（默认为0.2）

为简单起见，我没有提到存储和洗牌安全分数 .

我的问题是，Spark使用哪个内存部分来计算和转换不会持久存在的RDD？例如：

lines = sc.textFile("i am a big file.txt")
count = lines.flatMap(lambda x: x.split(' ')).map(lambda x: (x, 1)).reduceByKey(add)

这里Spark不会立即加载整个文件，并会对输入文件进行分区，并在一个阶段中对每个分区进行所有这些转换 . 但是，Spark将用于加载分区行的内存部分，计算flatMap（）和map（）？

谢谢

Update:
上面显示的代码只是实际应用程序的一个子集，因为 count 使用 saveAsTextFile 保存，这将触发RDD计算 . 此外，我的问题是Spark的行为通用，而不是发布的示例

2 回答

0

这是我在Spark的邮件列表中从Andrew Or得到的答案：

这将是JVM中剩下的任何东西 . 这不是由存储或随机播放等部分明确控制的 . 但是，计算通常不需要使用那么多空间 . 根据我的经验，在洗牌过程中几乎总是缓存或聚合，这是内存最密集的 .

回复于 2024-04-20T11:39:33+08:00

来自spark官方指南：http://spark.apache.org/docs/latest/tuning.html#memory-management-overview

从上面的链接：

- spark.memory.fraction expresses the size of M as a fraction of the (JVM heap space - 300MB) (default 0.75). The rest of the space (25%) is reserved for user data structures, internal metadata in Spark, and safeguarding against OOM errors in the case of sparse and unusually large records.
- spark.memory.storageFraction expresses the size of R as a fraction of M (default 0.5). R is the storage space within M where cached blocks immune to being evicted by execution.

回复于 2024-04-20T11:39:33+08:00

Spark使用哪个内存部分来计算不会持久化的RDD

2 回答

相关问题