Reading Parquet files from multiple directories in PySpark

I need to read Parquet files from multiple paths that are not parent or child directories of one another.

For example,

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

sqlContext.read.parquet(dir1) reads the Parquet files from dir1_1 and dir1_2.

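Right now I read each directory separately and merge the DataFrames with "unionAll". Is there a way to read the Parquet files from dir1_2 and dir2_1 without using unionAll, or is there a cleverer way to use unionAll?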

Thanks

3 Answers

6
A bit late, but I found this while searching and it may help someone else...

You can also try unpacking the argument list into spark.read.parquet()
```
paths=['foo','bar']
df=spark.read.parquet(*paths)
```
This is convenient if you want to pass a few globs into the path argument:
```
basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
       's3://bucket/partition_value1=*/partition_value2=2017-05-*'
      ]
df=spark.read.option("basePath",basePath).parquet(*paths)
```
This is cool because you don't need to list all the files under basePath, and you still get partition inference.
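As a rough sketch of how such globs could be generated rather than typed out, the hypothetical helper `month_globs` below builds one glob per month under a base path. The helper names, the `months` argument, and the bucket layout are assumptions for illustration; only `option("basePath", ...)` and `parquet(*paths)` come from the answer above.

```python
def month_globs(base_path, months):
    """Build one glob per month under base_path.

    Hypothetical helper: the partition column names mirror the example above.
    """
    base = base_path.rstrip('/')
    return [
        f"{base}/partition_value1=*/partition_value2={month}-*"
        for month in months
    ]


def read_months(spark, base_path, months):
    """Read the matching partitions, passing basePath as in the answer above.

    `spark` is assumed to be an existing SparkSession.
    """
    paths = month_globs(base_path, months)
    return spark.read.option("basePath", base_path).parquet(*paths)


paths = month_globs('s3://bucket/', ['2017-04', '2017-05'])
print(paths)
```

With a live session, `read_months(spark, 's3://bucket/', ['2017-04', '2017-05'])` would issue the same read as the snippet above.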
Answered on 2024-04-27T12:20:24+08:00
2
Both the parquetFile method of SQLContext and the parquet method of DataFrameReader accept multiple paths. So either of these works:
```
df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')
```
or
```
df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
```
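Since `parquetFile` was deprecated in Spark 1.4 in favor of `read.parquet`, the second form is the safer one to build on. A minimal sketch of wrapping it is shown below; the function name `read_parquet_dirs` is hypothetical, and `spark` is assumed to be an existing `SparkSession` (or a `sqlContext`, which exposes the same `.read.parquet`).

```python
def read_parquet_dirs(spark, dirs):
    """Read several unrelated directories into one DataFrame.

    `spark` is assumed to be a SparkSession or SQLContext, i.e. anything
    that exposes `.read.parquet(...)`.
    """
    dirs = list(dirs)
    if not dirs:
        raise ValueError("at least one directory is required")
    # DataFrameReader.parquet accepts any number of paths as positional arguments.
    return spark.read.parquet(*dirs)


# Usage, assuming a running session:
#   from pyspark.sql import SparkSession
#   spark = SparkSession.builder.getOrCreate()
#   df = read_parquet_dirs(spark, ['/dir1/dir1_2', '/dir2/dir2_1'])
```

Taking the session as a parameter keeps the helper free of a module-level Spark import, so it can be exercised with a stub in tests.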
Answered on 2024-04-27T12:20:24+08:00

Just taking John Conley's answer, embellishing it a bit, and providing the complete code (used in Jupyter PySpark), since I found his answer very useful.

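```
import posixpath as psp

import pandas
from hdfs import InsecureClient

# WebHDFS client pointed at the NameNode's HTTP port
client = InsecureClient('http://localhost:50070')

# Walk the HDFS tree and build a fully qualified URI for every file found
fpaths = [
  psp.join("hdfs://localhost:9000" + dpath, fname)
  for dpath, _, fnames in client.walk('/eta/myHdfsPath')
  for fname in fnames
]
# At this point fpaths contains all HDFS files under /eta/myHdfsPath

# Unpack the list so each file is passed as a separate path argument
parquetFile = sqlContext.read.parquet(*fpaths)

# Collect to the driver as a pandas DataFrame (requires pandas)
pdf = parquetFile.toPandas()
# Display the contents nicely formatted.
pdf
```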
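The same walk-and-collect pattern can be sketched for a local filesystem with `os.walk`. This is only an illustration, not part of the answer above: the function name `find_parquet_files` is hypothetical, and filtering on the `.parquet` suffix is an added assumption meant to skip marker files such as `_SUCCESS`.

```python
import os
import tempfile


def find_parquet_files(root):
    """Walk `root` recursively and return sorted paths of files ending in '.parquet'."""
    found = []
    for dirpath, _, filenames in os.walk(root):
        for name in filenames:
            if name.endswith('.parquet'):
                found.append(os.path.join(dirpath, name))
    return sorted(found)


# Small demonstration on a throwaway directory tree with empty placeholder files.
with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, 'dir1', 'dir1_2'))
    os.makedirs(os.path.join(root, 'dir2', 'dir2_1'))
    for rel in ('dir1/dir1_2/a.parquet', 'dir2/dir2_1/b.parquet', 'dir2/dir2_1/_SUCCESS'):
        open(os.path.join(root, rel), 'w').close()
    demo = [os.path.relpath(p, root) for p in find_parquet_files(root)]

print(demo)
```

With real Parquet files in place, the resulting list could be unpacked into the reader in the same way, for example `sqlContext.read.parquet(*find_parquet_files(root))`.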

Answered on 2024-04-27T12:20:24+08:00
