How does the DataFrame API depend on RDDs in Spark?

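Some sources, such as Matei Zaharia's Keynote: Spark 2.0 talk, mention that Spark DataFrames are built on top of RDDs. I found a few mentions of RDD in the DataFrame class (in Spark 2.0, I had to look at Dataset instead), but I still have only a limited understanding of how the two APIs are tied together behind the scenes.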

Can someone explain how DataFrames extend RDDs?

1 Answer

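According to the Databricks article Deep Dive into Spark SQL's Catalyst Optimizer (see the section Using Catalyst in Spark SQL), RDDs are elements of the physical plan that Catalyst builds. So we describe queries in terms of DataFrames, but in the end Spark executes them on RDDs.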

Catalyst workflow

In addition, you can use the EXPLAIN command (explain() on a DataFrame) to view the physical plan of a query.

// Prints the physical plan to the console for debugging purposes
auction.select("auctionid").distinct.explain()

// == Physical Plan ==
// Distinct false
// Exchange (HashPartitioning [auctionid#0], 200)
//  Distinct true
//   Project [auctionid#0]
//    PhysicalRDD [auctionid#0,bid#1,bidtime#2,bidder#3,bidderrate#4,openbid#5,price#6,item#7,daystolive#8], MapPartitionsRDD[11] at mapPartitions at ExistingRDD.scala:37
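
To make the link between the two APIs more concrete, here is a minimal sketch for Spark 2.x or later that goes from a DataFrame query to the RDD underneath it. The object name, the helper distinctAuctionIds, and the tiny in-memory dataset are hypothetical stand-ins for the auction DataFrame above; the calls explain, queryExecution.toRdd, and rdd are part of the Dataset API. The operator names in the printed plan vary across Spark versions and will likely differ from the output above.

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.{Row, SparkSession}
import org.apache.spark.sql.catalyst.InternalRow

object DataFrameToRdd {

  // Hypothetical helper: builds a small stand-in for the `auction` DataFrame,
  // inspects the plan and the RDDs behind the query, and returns the result.
  def distinctAuctionIds(spark: SparkSession): Array[Long] = {
    import spark.implicits._

    // Assumed sample data; the real `auction` DataFrame has more columns.
    val auction = Seq((1L, 10.0), (1L, 12.5), (2L, 7.0)).toDF("auctionid", "bid")
    val query = auction.select("auctionid").distinct()

    // Prints the logical plans and the physical plan chosen by Catalyst.
    query.explain(true)

    // The RDD that the physical plan executes; rows use Spark's internal format.
    val internal: RDD[InternalRow] = query.queryExecution.toRdd
    println(internal.toDebugString) // lineage of the underlying RDDs

    // The same result exposed as an RDD of external Row objects.
    val rows: RDD[Row] = query.rdd
    rows.map(_.getLong(0)).collect().sorted
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("df-to-rdd")
      .getOrCreate()
    try println(distinctAuctionIds(spark).mkString(", "))
    finally spark.stop()
  }
}
```

In this sketch a DataFrame does not subclass RDD. It holds a logical plan, and an RDD only appears once Catalyst has produced the physical plan and that plan is executed, which is what queryExecution.toRdd and rdd expose.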

Answered on 2024-04-20T16:59:20+08:00
