我创建了一个hive外部表,存储为由event_date Date分区的文本文件 .

在从Hive表中读取spark时,我们如何指定特定格式的csv?

环境是

1. 1.Spark 1.5.0 - cdh5.5.1 Using Scala version 2.10.4(Java HotSpot(TM) 64 - Bit Server VM, Java 1.7.0_67)   
 2. Hive 1.1, CDH 5.5.1

斯卡拉脚本

sqlContext.setConf("hive.exec.dynamic.partition", "true")
    sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")

    val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
    val distData_1 = distData.withColumn("event_date", current_date())
    distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]

    scala > distData_1.show 
    + ---+---+---+----------+ 
    |_1 |_2 |_3 | event_date | 
    | 1 | 1 | 1 | 2016-03-25 | 
    | 2 | 2 | 2 | 2016-03-25 | 
    | 3 | 3 | 3 | 2016-03-25 | 

    distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")

    scala > sqlContext.sql("select * from part_table").show 
    | a    | b    | c    | event_date | 
    |1,1,1 | null | null | 2016-03-25 | 
    |2,2,2 | null | null | 2016-03-25 | 
    |3,3,3 | null | null | 2016-03-25 |

蜂巢表

create external table part_table (a String, b int, c bigint) 
    partitioned by (event_date Date)
    row format delimited fields terminated by ',' 
    stored as textfile  LOCATION "/user/hdfs/hive/part_table";

    select * from part_table shows
    |part_table.a | part_table.b | part_table.c | part_table.event_date |
    |1            |1             |1             |2016-03-25
    |2            |2             |2             |2016-03-25
    |3            |3             |3             |2016-03-25

看着hdfs

The path has 2 part files /user/hdfs/hive/part_table/event_date=2016-03-25
    part-00000
    part-00001

      part-00000 content
        1,1,1
      part-00001 content
        2,2,2
        3,3,3

附:如果我们将表存储为orc,它会按预期写入和读取数据 .

如果'字段终止'是默认值,那么Spark可以按预期读取数据,因此我猜这将是一个错误 .