
java.lang.NumberFormatException caused by Spark JDBC reading table headers


I am trying to access a table (stored in ORC format) on a remote cluster using Spark's JDBC source:

val jdbcDF = spark.read
      .format("jdbc")
      .option("url", url)
      .option("dbtable", "metrics")
      .option("user", user)
      .option("password", password)
      .load()
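
Here url, user and password are defined elsewhere; hypothetically (host, port, database and credentials below are placeholders, not values from the question), the setup could look like:

// Placeholder connection values for illustration only -- adjust to your cluster.
val url = "jdbc:hive2://example-host:10000/default"
val user = "hive_user"
val password = "secret"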

However, no matter what I do, I keep getting this error:

Caused by: java.sql.SQLException: Cannot convert column 2 to long: java.lang.NumberFormatException: For input string: "metrics.t"
    at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:372)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
    at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
    at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
    at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "metrics.t"
    at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
    at java.lang.Long.parseLong(Long.java:589)
    at java.lang.Long.parseLong(Long.java:631)
    at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:368)
    ... 22 more

The input string "metrics.t" corresponds to the table name and the name of the second column, "t", which holds timestamps as long values.

How can I skip the header with the JDBC format?

The CSV option ("header", true) has no effect in my case.

PS: Spark version 2.1.0

2 Answers

  • 1

    The code doesn't throw any exceptions with the following implementation:

    val jdbcUrl = s"jdbc:hive2://$jdbcHostname:$jdbcPort/$jdbcDatabase"
    
    val connectionProperties = new java.util.Properties()
    connectionProperties.setProperty("user", jdbcUsername)
    connectionProperties.setProperty("password", jdbcPassword)
    
    val jdbcDF = spark.read.jdbc(jdbcUrl, "metrics", Array(), connectionProperties)
    

    Strangely enough, if I remove the empty predicates Array(), the exception comes back.
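
    A quick sanity check (a sketch, reusing jdbcDF from the snippet above) shows what the empty predicates array really buys you: the schema resolves, but no rows are ever fetched.

    // With predicates = Array(), Spark builds zero partitions, so no
    // row-fetching query is ever sent to Hive over JDBC.
    jdbcDF.printSchema()    // schema is still resolved from JDBC metadata
    println(jdbcDF.count()) // prints 0 -- the DataFrame is empty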

  • 0

    This happens because Spark's JdbcDialect uses double quotes as its quoteIdentifier, and Spark does not ship a HiveDialect (unlike, say, MySQL).

    As a result, Spark sends SQL like select "some_column_name" from table to Hive over JDBC, and Hive evaluates "some_column_name" as a string literal rather than a column name. This is consistent with the error above, where the string "metrics.t" (instead of a numeric value) is being parsed as a long.

    With val jdbcDF = spark.read.jdbc(jdbcUrl, "metrics", Array(), connectionProperties) you are telling Spark to generate a JDBC DataFrame without any partitions. So no actual data-fetching SQL is ever sent to Hive, and Spark simply gives you an empty DataFrame.

    The only correct way is to implement the corresponding dialect: How to specify sql dialect when creating spark dataframe from JDBC?
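
    For reference, a minimal sketch of such a dialect (assuming the Spark 2.x org.apache.spark.sql.jdbc.JdbcDialect API; the object name is arbitrary):

    import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

    // Minimal Hive dialect sketch -- not shipped with Spark itself.
    object HiveDialect extends JdbcDialect {
      // Claim JDBC URLs that use the Hive2 scheme.
      override def canHandle(url: String): Boolean =
        url.startsWith("jdbc:hive2")

      // Quote identifiers with backticks so Hive treats them as column
      // names instead of string literals.
      override def quoteIdentifier(colName: String): String =
        s"`$colName`"
    }

    // Register the dialect before creating the JDBC DataFrame.
    JdbcDialects.registerDialect(HiveDialect)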
