
Find a specific element in a nested XML file with Spark Scala [duplicate]

This question already has an answer here:

I want to select a specific element: select("File.columns.column._name")

root
 |-- File: array (nullable = true)
 |    |-- element: struct (containsNull = true)
 |    |    |-- _Description: string (nullable = true)
 |    |    |-- _RowTag: string (nullable = true)
 |    |    |-- _name: string (nullable = true)
 |    |    |-- _type: string (nullable = true)
 |    |    |-- columns: struct (nullable = true)
 |    |    |    |-- column: array (nullable = true)
 |    |    |    |    |-- element: struct (containsNull = true)
 |    |    |    |    |    |-- _Hive_Final_Table: string (nullable = true)
 |    |    |    |    |    |-- _Hive_Final_column: string (nullable = true)
 |    |    |    |    |    |-- _Hive_Table1: string (nullable = true)
 |    |    |    |    |    |-- _Hive_column1: string (nullable = true)
 |    |    |    |    |    |-- _Path: string (nullable = true)
 |    |    |    |    |    |-- _Type: string (nullable = true)
 |    |    |    |    |    |-- _VALUE: string (nullable = true)
 |    |    |    |    |    |-- _name: string (nullable = true)

I got this error:

Exception in thread "main" org.apache.spark.sql.AnalysisException: cannot resolve 'File.columns.column[_name]' due to data type mismatch: argument 2 requires integral type, however, '_name' is of string type.
  at org.apache.spark.sql.catalyst.analysis.package$AnalysisErrorAt.failAnalysis(package.scala:42)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:65)
  at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1$$anonfun$apply$2.applyOrElse(CheckAnalysis.scala:57)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$transformUp$1.apply(TreeNode.scala:335)
  at org.apache.spark.sql.catalyst.trees.CurrentOrigin$.withOrigin(TreeNode.scala:69)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:334)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$5.apply(TreeNode.scala:332)
  at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:281)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
  at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
  at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
  at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
  at scala.collection.AbstractIterator.to(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
  at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
  at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
  at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
  at org.apache.spark.sql.catalyst.trees.TreeNode.transformUp(TreeNode.scala:332)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.transformExpressionUp$1(QueryPlan.scala:108)
  at org.apache.spark.sql.catalyst.plans.QueryPlan.org$apache$spark$sql$catalyst$plans$QueryPlan$$recursiveTransform$2(QueryPlan.scala:118)

Can you help me?

2 Answers

  • 0

    You need the explode function to get the column you want. Your select fails because File and column are both arrays, so File.columns.column resolves to an array, and the trailing [_name] accessor is then interpreted as an integer index into that array.

    explode(Column e) creates a new row for each element in the given array or map column.
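
    For intuition, here is a minimal, self-contained sketch of that behavior (the demo DataFrame and its column names are invented for this illustration):

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().master("local[*]").appName("explode-demo").getOrCreate()
    import spark.implicits._

    // One row whose "letters" column holds an array of three elements
    val demo = Seq((1, Seq("a", "b", "c"))).toDF("id", "letters")

    // explode yields one output row per array element: (1,a), (1,b), (1,c)
    demo.select($"id", explode($"letters").as("letter")).show()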

    val df1 = df.select(explode($"File").as("File")).select($"File.columns.column".as("column"))
    

    The first explode gives you the column field.

    val finalDF = df1.select(explode($"column").as("column")).select($"column._name".as("_name"))
    

    The second explode gives you _name.
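
    For completeness, a self-contained sketch of the whole pipeline; the sample data and the FileEntry/Columns/Col case classes below are invented to mirror your schema, keeping only the fields the example needs:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.explode

    val spark = SparkSession.builder().master("local[*]").getOrCreate()
    import spark.implicits._

    // Shaped like File: array<struct<_name, columns: struct<column: array<struct<_name>>>>>
    case class Col(_name: String)
    case class Columns(column: Seq[Col])
    case class FileEntry(_name: String, columns: Columns)

    val df = Seq(Tuple1(Seq(FileEntry("f1", Columns(Seq(Col("c1"), Col("c2"))))))).toDF("File")

    val df1 = df.select(explode($"File").as("File")).select($"File.columns.column".as("column"))
    val finalDF = df1.select(explode($"column").as("column")).select($"column._name".as("_name"))

    finalDF.show()  // two rows: c1 and c2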

    Hope this helps!

  • 0

    Looking at your schema, you can do the following to select _name from the nested structure of the dataframe:

    import org.apache.spark.sql.functions._
    // index into the outer File array and the inner column array, then take the _name field of the first entry
    df.select(col("File.columns.column")(0)(0)("_name").as("_name"))
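
    Note that the double index (0)(0) picks only the first element of the outer File array and of the inner column array, so it returns a single _name rather than all of them. If every _name is needed, the nested arrays can be flattened with explode instead; a sketch against the same df as in the question:

    import org.apache.spark.sql.functions.{col, explode}

    df.select(explode(col("File.columns.column")).as("cols"))  // one row per File entry
      .select(explode(col("cols")).as("col"))                  // one row per column entry
      .select(col("col._name").as("_name"))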
    
