Question

I need to read parquet files from multiple paths that are not parent or child directories of one another.

For example,

dir1 ---
       |
       ------- dir1_1
       |
       ------- dir1_2
dir2 ---
       |
       ------- dir2_1
       |
       ------- dir2_2

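sqlContext.read.parquet(dir1) reads parquet files from dir1_1 and dir1_2.

Right now I'm reading each directory separately and merging the DataFrames with "unionAll". Is there a way to read parquet files from dir1_2 and dir2_1 without using unionAll, or is there a neater way of doing it with unionAll?

Thanks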

Answer 1

A little late, but I found this while searching and it may help someone else...

You can also try unpacking an argument list into spark.read.parquet():

paths=['foo','bar']
df=spark.read.parquet(*paths)

This is convenient if you want to pass a few globs as the path arguments:

basePath='s3://bucket/'
paths=['s3://bucket/partition_value1=*/partition_value2=2017-04-*',
       's3://bucket/partition_value1=*/partition_value2=2017-05-*'
      ]
df=spark.read.option("basePath",basePath).parquet(*paths)

This is nice because you don't need to list every file under basePath, and you still get partition inference.
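If the list of months changes often, the glob list can be built rather than typed out. This is only a sketch: `month_globs` and `read_months` are hypothetical helper names, the bucket layout just mirrors the example paths above, and `spark` is assumed to be an existing SparkSession.

```python
def month_globs(base_path, months):
    """Build one glob per month under the partitioned layout (hypothetical helper)."""
    base = base_path.rstrip("/")
    return [f"{base}/partition_value1=*/partition_value2={m}-*" for m in months]


def read_months(spark, base_path, months):
    """Read only the selected months and keep the partition columns.

    `spark` is assumed to be an existing SparkSession. The basePath option
    tells Spark where partition discovery should start.
    """
    paths = month_globs(base_path, months)
    return spark.read.option("basePath", base_path).parquet(*paths)


# Same two globs as in the answer above.
paths = month_globs("s3://bucket/", ["2017-04", "2017-05"])
```

With a live session this would be called as read_months(spark, "s3://bucket/", ["2017-04", "2017-05"]).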

Answer 2

Both the parquetFile method of SQLContext and the parquet method of DataFrameReader take multiple paths. So either of these works:

df = sqlContext.parquetFile('/dir1/dir1_2', '/dir2/dir2_1')

or

df = sqlContext.read.parquet('/dir1/dir1_2', '/dir2/dir2_1')
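As far as I can tell, parquetFile was deprecated in Spark 1.4 and is no longer available in 2.x, so the DataFrameReader form is the one to rely on. A small wrapper along these lines works with either sqlContext or a SparkSession; `read_dirs` is a made-up name and the empty-argument check is my own addition, not Spark behaviour.

```python
def read_dirs(session, *dirs):
    """Read several unrelated directories into a single DataFrame.

    `session` is assumed to be an existing SparkSession or SQLContext;
    both expose `.read.parquet(*paths)`.
    """
    if not dirs:
        # Guard added for this sketch; fail early with a clear message.
        raise ValueError("at least one path is required")
    return session.read.parquet(*dirs)
```

Usage would be df = read_dirs(sqlContext, '/dir1/dir1_2', '/dir2/dir2_1').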

Answer 3

If you have a list of files, you can do:

files = ['file1', 'file2',...]
df = spark.read.parquet(*files)
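If the list is not at hand yet, one way to build it on a local filesystem is to walk the directory tree with pathlib. This is a sketch under that assumption only (it will not see HDFS or S3 paths), and `list_parquet_files` is a hypothetical helper name.

```python
from pathlib import Path


def list_parquet_files(root):
    """Return the sorted paths of all *.parquet files under `root` (local filesystem only)."""
    return sorted(str(p) for p in Path(root).rglob("*.parquet"))
```

The result can then be unpacked the same way: df = spark.read.parquet(*list_parquet_files("/some/local/dir")), where the directory is a placeholder.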

Answer 4

For ORC:

spark.read.orc("/dir1/*","/dir2/*")

Spark goes into the dir1/ and dir2/ folders and loads all the ORC files.

For Parquet:

spark.read.parquet("/dir1/*","/dir2/*")
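Since the two calls differ only in the format, they can be folded into one helper using DataFrameReader.format(...).load(...), which in PySpark accepts a list of paths. A minimal sketch, assuming `spark` is an existing SparkSession; `read_many` is a hypothetical name.

```python
def read_many(spark, fmt, paths):
    """Load several paths of the same format into one DataFrame.

    `fmt` is a data source name such as "parquet" or "orc".
    `paths` is any iterable of path strings and may contain glob patterns.
    """
    return spark.read.format(fmt).load(list(paths))
```

For example, read_many(spark, "orc", ["/dir1/*", "/dir2/*"]) corresponds to the ORC call above.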

Answer 5

I'm just taking John Conley's answer, embellishing it a bit, and providing the full code (used in Jupyter with PySpark), since I found his answer extremely useful.

from hdfs import InsecureClient
client = InsecureClient('http://localhost:50070')

import posixpath as psp
fpaths = [
  psp.join("hdfs://localhost:9000" + dpath, fname)
  for dpath, _, fnames in client.walk('/eta/myHdfsPath')
  for fname in fnames
]
# At this point fpaths contains all hdfs files 

parquetFile = sqlContext.read.parquet(*fpaths)


import pandas
pdf = parquetFile.toPandas()
# display the contents nicely formatted.
pdf
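One possible refinement: the walk returns every file in the tree, so marker files such as _SUCCESS end up in fpaths as well. The sketch below keeps only names ending in ".parquet"; that suffix is an assumption about how the files were written, and `parquet_paths` is a hypothetical helper. It takes any iterable of (dirpath, dirnames, filenames) tuples, such as the output of client.walk(...).

```python
import posixpath as psp


def parquet_paths(walk_entries, prefix="hdfs://localhost:9000"):
    """Turn (dirpath, dirnames, filenames) tuples into full URIs, keeping only *.parquet files."""
    return [
        psp.join(prefix + dpath, fname)
        for dpath, _, fnames in walk_entries
        for fname in fnames
        if fname.endswith(".parquet")  # assumed naming convention
    ]
```

With the client from the code above this would be fpaths = parquet_paths(client.walk('/eta/myHdfsPath')). Also keep in mind that toPandas() collects the whole DataFrame onto the driver, so it only suits data that fits in memory there.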

Reading parquet files from multiple directories in Pyspark

5 answers: