I am trying to use Spark's JDBC source to access a table (ORC format) stored on a remote cluster:
val jdbcDF = spark.read
.format("jdbc")
.option("url", url)
.option("dbtable", "metrics")
.option("user", user)
.option("password", password)
.load()
However, no matter what I do, I keep getting this error:
Caused by: java.sql.SQLException: Cannot convert column 2 to long: java.lang.NumberFormatException: For input string: "metrics.t"
	at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:372)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:365)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anonfun$org$apache$spark$sql$execution$datasources$jdbc$JdbcUtils$$makeGetter$8.apply(JdbcUtils.scala:364)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:286)
	at org.apache.spark.sql.execution.datasources.jdbc.JdbcUtils$$anon$1.getNext(JdbcUtils.scala:268)
	at org.apache.spark.util.NextIterator.hasNext(NextIterator.scala:73)
	at org.apache.spark.util.CompletionIterator.hasNext(CompletionIterator.scala:32)
	at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
	at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
	at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
	at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:826)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:99)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:282)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
	at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.NumberFormatException: For input string: "metrics.t"
	at java.lang.NumberFormatException.forInputString(NumberFormatException.java:65)
	at java.lang.Long.parseLong(Long.java:589)
	at java.lang.Long.parseLong(Long.java:631)
	at org.apache.hive.jdbc.HiveBaseResultSet.getLong(HiveBaseResultSet.java:368)
	... 22 more
The input string "metrics.t" is the table name combined with the name of the second column, "t", which holds timestamps as long values.
How can I skip this header row with the JDBC format?
The CSV option("header", true) has no effect in my case.
PS: Spark version 2.1.0
2 Answers
The code does not throw any exception with the following implementation:

val jdbcDF = spark.read.jdbc(jdbcUrl, "metrics", Array(), connectionProperties)

But strangely, if I remove the empty predicates Array(), the exception comes back.

That happens because Spark's JdbcDialect uses double quotes as its quoteIdentifier, and no HiveDialect is provided out of the box (unlike MySQL). Spark therefore sends SQL like this to Hive over JDBC:

select "some_column_name" from table

and "some_column_name" is evaluated as a string literal rather than a column name.

As for val jdbcDF = spark.read.jdbc(jdbcUrl, "metrics", Array(), connectionProperties): with this line of code you tell Spark to generate a JDBC DataFrame without any partitions. So no actual data-fetching SQL is sent to Hive, and Spark simply gives you an empty DataFrame. The only correct way is to implement the corresponding dialect: How to specify sql dialect when creating spark dataframe from JDBC?
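The dialect fix described above can be sketched as follows. This is a minimal illustration, not the exact code from the linked answer: the object name HiveDialect and the jdbc:hive2 URL-prefix check are assumptions for the example. The idea is to quote identifiers with backticks, which Hive parses as column names rather than string literals:

```scala
import org.apache.spark.sql.jdbc.{JdbcDialect, JdbcDialects}

// Hypothetical Hive dialect: override quoteIdentifier so Spark wraps
// column names in backticks instead of its default double quotes.
object HiveDialect extends JdbcDialect {
  // Claim JDBC URLs that point at HiveServer2 (assumed prefix).
  override def canHandle(url: String): Boolean = url.startsWith("jdbc:hive2")

  // Hive treats `t` as a column reference, but "t" as a string literal.
  override def quoteIdentifier(colName: String): String = s"`$colName`"
}

// Register the dialect before building the DataFrame.
JdbcDialects.registerDialect(HiveDialect)

val jdbcDF = spark.read
  .format("jdbc")
  .option("url", url)
  .option("dbtable", "metrics")
  .option("user", user)
  .option("password", password)
  .load()
```

With the dialect registered, Spark should emit select `t` from metrics instead of select "t" from metrics, so the long conversion no longer sees the literal string "metrics.t". (Running this requires a live Spark session and a reachable HiveServer2, so it cannot be verified standalone.)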