我有一个类似的Spark SQL查询:
SELECT
CAST(UNIX_TIMESTAMP('2015-03-29T01:11:23Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") AS TIMESTAMP) as column1,
CAST(UNIX_TIMESTAMP('2015-03-29T02:11:23Z', "yyyy-MM-dd'T'HH:mm:ss'Z'") AS TIMESTAMP) as column2
FROM ...
第一列返回好的结果 . 但是第二列始终为NULL .
我错过了什么?
UPDATE :
完整代码(我删除了FROM子句以简化):
val spark = SparkSession.builder.appName("My Application").getOrCreate()
spark.sql("SELECT CAST(UNIX_TIMESTAMP('2015-03-29T01:11:23Z', \"yyyy-MM-dd'T'HH:mm:ss'Z'\") AS TIMESTAMP) as column1, CAST(UNIX_TIMESTAMP('2015-03-29T02:11:23Z', \"yyyy-MM-dd'T'HH:mm:ss'Z'\") AS TIMESTAMP) as column2").show(false)
完整的痕迹:
2018-07-20 10:26:52 WARN NativeCodeLoader:62 - Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
2018-07-20 10:26:52 INFO SparkContext:54 - Running Spark version 2.3.1
2018-07-20 10:26:52 INFO SparkContext:54 - Submitted application: My Application
2018-07-20 10:26:52 INFO SecurityManager:54 - Changing view acls to: XXX
2018-07-20 10:26:52 INFO SecurityManager:54 - Changing modify acls to: XXX
2018-07-20 10:26:52 INFO SecurityManager:54 - Changing view acls groups to:
2018-07-20 10:26:52 INFO SecurityManager:54 - Changing modify acls groups to:
2018-07-20 10:26:52 INFO SecurityManager:54 - SecurityManager: authentication disabled; ui acls disabled; users with view permissions: Set(XXX); groups with view permissions: Set(); users with modify permissions: Set(XXX); groups with modify permissions: Set()
2018-07-20 10:26:52 INFO Utils:54 - Successfully started service 'sparkDriver' on port 36615.
2018-07-20 10:26:52 INFO SparkEnv:54 - Registering MapOutputTracker
2018-07-20 10:26:52 INFO SparkEnv:54 - Registering BlockManagerMaster
2018-07-20 10:26:52 INFO BlockManagerMasterEndpoint:54 - Using org.apache.spark.storage.DefaultTopologyMapper for getting topology information
2018-07-20 10:26:52 INFO BlockManagerMasterEndpoint:54 - BlockManagerMasterEndpoint up
2018-07-20 10:26:52 INFO DiskBlockManager:54 - Created local directory at /tmp/blockmgr-d8b77ac8-53a3-4923-810e-d81539b30369
2018-07-20 10:26:52 INFO MemoryStore:54 - MemoryStore started with capacity 366.3 MB
2018-07-20 10:26:52 INFO SparkEnv:54 - Registering OutputCommitCoordinator
2018-07-20 10:26:52 INFO log:192 - Logging initialized @1074ms
2018-07-20 10:26:52 INFO Server:346 - jetty-9.3.z-SNAPSHOT
2018-07-20 10:26:52 INFO Server:414 - Started @1129ms
2018-07-20 10:26:52 INFO AbstractConnector:278 - Started ServerConnector@7544a1e4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-20 10:26:52 INFO Utils:54 - Successfully started service 'SparkUI' on port 4040.
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1a6f5124{/jobs,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@781e7326{/jobs/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@22680f52{/jobs/job,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@39c11e6c{/jobs/job/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@324dcd31{/stages,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@503d56b5{/stages/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@72bca894{/stages/stage,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2575f671{/stages/stage/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@329a1243{/stages/pool,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@ecf9fb3{/stages/pool/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2d35442b{/storage,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@27f9e982{/storage/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@4593ff34{/storage/rdd,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@37d3d232{/storage/rdd/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@30c0ccff{/environment,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@581d969c{/environment/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@22db8f4{/executors,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2b46a8c1{/executors/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@1d572e62{/executors/threadDump,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@29caf222{/executors/threadDump/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@46cf05f7{/static,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7048f722{/,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@c074c0c{/api,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@58dea0a5{/jobs/job/kill,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2a2bb0eb{/stages/stage/kill,null,AVAILABLE,@Spark}
2018-07-20 10:26:52 INFO SparkUI:54 - Bound SparkUI to 0.0.0.0, and started at http://XXX:4040
2018-07-20 10:26:52 INFO SparkContext:54 - Added JAR file:XXX.jar at spark://XXX with timestamp 1532075212997
2018-07-20 10:26:53 INFO Executor:54 - Starting executor ID driver on host localhost
2018-07-20 10:26:53 INFO Utils:54 - Successfully started service 'org.apache.spark.network.netty.NettyBlockTransferService' on port 42981.
2018-07-20 10:26:53 INFO NettyBlockTransferService:54 - Server created on XXX:42981
2018-07-20 10:26:53 INFO BlockManager:54 - Using org.apache.spark.storage.RandomBlockReplicationPolicy for block replication policy
2018-07-20 10:26:53 INFO BlockManagerMaster:54 - Registering BlockManager BlockManagerId(driver, XXX, 42981, None)
2018-07-20 10:26:53 INFO BlockManagerMasterEndpoint:54 - Registering block manager XXX:42981 with 366.3 MB RAM, BlockManagerId(driver, XXX, 42981, None)
2018-07-20 10:26:53 INFO BlockManagerMaster:54 - Registered BlockManager BlockManagerId(driver, XXX, 42981, None)
2018-07-20 10:26:53 INFO BlockManager:54 - Initialized BlockManager: BlockManagerId(driver, XXX, 42981, None)
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@66bfd864{/metrics/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO SharedState:54 - Setting hive.metastore.warehouse.dir ('null') to the value of spark.sql.warehouse.dir ('file:XXX').
2018-07-20 10:26:53 INFO SharedState:54 - Warehouse path is 'file:XXX'.
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@2dbd803f{/SQL,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@3e48e859{/SQL/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@22d1886d{/SQL/execution,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@7df60067{/SQL/execution/json,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO ContextHandler:781 - Started o.s.j.s.ServletContextHandler@748fe51d{/static/sql,null,AVAILABLE,@Spark}
2018-07-20 10:26:53 INFO StateStoreCoordinatorRef:54 - Registered StateStoreCoordinator endpoint
2018-07-20 10:26:55 INFO CodeGenerator:54 - Code generated in 238.939456 ms
2018-07-20 10:26:55 INFO CodeGenerator:54 - Code generated in 14.895801 ms
2018-07-20 10:26:55 INFO SparkContext:54 - Starting job: show at XXXTransform.scala:21
2018-07-20 10:26:55 INFO DAGScheduler:54 - Got job 0 (show at XXXTransform.scala:21) with 1 output partitions
2018-07-20 10:26:55 INFO DAGScheduler:54 - Final stage: ResultStage 0 (show at XXXTransform.scala:21)
2018-07-20 10:26:55 INFO DAGScheduler:54 - Parents of final stage: List()
2018-07-20 10:26:55 INFO DAGScheduler:54 - Missing parents: List()
2018-07-20 10:26:55 INFO DAGScheduler:54 - Submitting ResultStage 0 (MapPartitionsRDD[4] at show at XXXTransform.scala:21), which has no missing parents
2018-07-20 10:26:55 INFO MemoryStore:54 - Block broadcast_0 stored as values in memory (estimated size 7.3 KB, free 366.3 MB)
2018-07-20 10:26:55 INFO MemoryStore:54 - Block broadcast_0_piece0 stored as bytes in memory (estimated size 3.5 KB, free 366.3 MB)
2018-07-20 10:26:55 INFO BlockManagerInfo:54 - Added broadcast_0_piece0 in memory on XXX:42981 (size: 3.5 KB, free: 366.3 MB)
2018-07-20 10:26:55 INFO SparkContext:54 - Created broadcast 0 from broadcast at DAGScheduler.scala:1039
2018-07-20 10:26:55 INFO DAGScheduler:54 - Submitting 1 missing tasks from ResultStage 0 (MapPartitionsRDD[4] at show at XXXTransform.scala:21) (first 15 tasks are for partitions Vector(0))
2018-07-20 10:26:55 INFO TaskSchedulerImpl:54 - Adding task set 0.0 with 1 tasks
2018-07-20 10:26:55 INFO TaskSetManager:54 - Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 8067 bytes)
2018-07-20 10:26:55 INFO Executor:54 - Running task 0.0 in stage 0.0 (TID 0)
2018-07-20 10:26:55 INFO Executor:54 - Fetching spark://XXX.jar with timestamp 1532075212997
2018-07-20 10:26:55 INFO TransportClientFactory:267 - Successfully created connection to XXX/10.1.97.135:36615 after 23 ms (0 ms spent in bootstraps)
2018-07-20 10:26:55 INFO Utils:54 - Fetching spark://XXX.jar to /tmp/spark-853e3e5d-e228-473d-be50-c641d2b678a4/userFiles-7f3d8d8b-e9d4-4d9c-ac52-38e7fe59de7d/fetchFileTemp7750137601738248500.tmp
2018-07-20 10:26:55 INFO Executor:54 - Adding file:/tmp/spark-853e3e5d-e228-473d-be50-c641d2b678a4/userFiles-7f3d8d8b-e9d4-4d9c-ac52-38e7fe59de7d/XXX.jar to class loader
2018-07-20 10:26:55 INFO CodeGenerator:54 - Code generated in 5.345898 ms
2018-07-20 10:26:55 INFO Executor:54 - Finished task 0.0 in stage 0.0 (TID 0). 1139 bytes result sent to driver
2018-07-20 10:26:55 INFO TaskSetManager:54 - Finished task 0.0 in stage 0.0 (TID 0) in 178 ms on localhost (executor driver) (1/1)
2018-07-20 10:26:55 INFO TaskSchedulerImpl:54 - Removed TaskSet 0.0, whose tasks have all completed, from pool
2018-07-20 10:26:55 INFO DAGScheduler:54 - ResultStage 0 (show at XXXTransform.scala:21) finished in 0,321 s
2018-07-20 10:26:55 INFO DAGScheduler:54 - Job 0 finished: show at XXXTransform.scala:21, took 0,356248 s
+-------------------+-------+
|column1> > > |column2|
+-------------------+-------+
|2015-03-29 01:11:23|null |
+-------------------+-------+
2018-07-20 10:26:55 INFO SparkContext:54 - Invoking stop() from shutdown hook
2018-07-20 10:26:55 INFO AbstractConnector:318 - Stopped Spark@7544a1e4{HTTP/1.1,[http/1.1]}{0.0.0.0:4040}
2018-07-20 10:26:55 INFO SparkUI:54 - Stopped Spark web UI at http://XXX:4040
2018-07-20 10:26:55 INFO MapOutputTrackerMasterEndpoint:54 - MapOutputTrackerMasterEndpoint stopped!
2018-07-20 10:26:55 INFO MemoryStore:54 - MemoryStore cleared
2018-07-20 10:26:55 INFO BlockManager:54 - BlockManager stopped
2018-07-20 10:26:55 INFO BlockManagerMaster:54 - BlockManagerMaster stopped
2018-07-20 10:26:55 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint:54 - OutputCommitCoordinator stopped!
2018-07-20 10:26:55 INFO SparkContext:54 - Successfully stopped SparkContext
2018-07-20 10:26:55 INFO ShutdownHookManager:54 - Shutdown hook called
2018-07-20 10:26:55 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-1b68790c-9242-480e-a0b4-1b9d7068b27a
2018-07-20 10:26:55 INFO ShutdownHookManager:54 - Deleting directory /tmp/spark-853e3e5d-e228-473d-be50-c641d2b678a4
2 回答
我试图重现错误,我成功了 . 我通过发现什么日期有问题开始我的调查,并在执行一个小的for循环后,我结束了:
它返回:
因此,只有2 o 'clock was impacted. From there I dug into Spark'的datetimeExpressions.scala文件,其中进行日期时间内容的转换 . 更确切地说,在添加一些断点之后,我发现问题是由org.apache.spark.sql.catalyst.expressions.UnixTime#eval方法中的静默解析错误引起的,更确切地说:
并使用时区创建格式化程序:
接下来,我隔离了引入问题的代码:
实际上,我得到了一个ParseException:
从那里我检查了Javadoc https://docs.oracle.com/javase/8/docs/api/java/text/SimpleDateFormat.html#timezone的时区管理 . 似乎当时区表示为'Z'(ISO 8601时区)时,使用的正确模式选项是X.当我在隔离的测试用例中更改它时,日期格式正确:
它也适用于Spark的SQL(转换为欧洲/巴黎时区,你在
spark.sql.session.timeZone
选项中更改它):回顾一下,如果失败则返回null,我们可以从以下方面学习:
JIRA任务https://issues.apache.org/jira/browse/SPARK-8174 https://issues.apache.org/jira/browse/SPARK-8175
Hive https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF
和代码评论:
/ ** 将unix epoch(1970-01-01 00:00:00 UTC)的秒数转换为字符串,表示给定*格式的当前系统时区中该时刻的时间戳 . 如果缺少格式,请使用“1970-01-01 00:00:00”之类的格式 . *请注意,hive语言手册表示如果失败则返回0,但实际上它返回null . * /
我们可以使用类似ISO的时区选项(X)来解决它 .
这有点傻但是2015-03-29 02:00:00和2015-03-29 02:59:59之间的小时不存在!它与2015年3月29日发生的DST时间变化有关 .
例如,请参见:Timestamp 2015-03-29 02:00:00 can not be stored
如果在您的上下文中将所有时间戳都视为没有发生DST的UTC,则设置为:
spark.conf.set(“spark.sql.session.timeZone”,“UTC”),它会工作 .
Spark Strutured Streaming automatically converts timestamp to local time