Question

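I have looked for similar examples, but all of them have a fixed string in the path with a number at the end, so a for loop can iterate over them. My case is as follows: I have multiple Parquet files in multiple partitions, with paths like:

s3a://path/idate=2019-09-16/part-{some random hex key1}.snappy.parquet
s3a://path/idate=2019-09-16/part-{some random hex key2}.snappy.parquet
etc...

The {some random hex key} is obviously unpredictable, so I cannot build a rule into the loop definition. I would like a for loop such as: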

files="s3a://path/idate=2019-09-16/" 
for i in files
block{i}=spark.read.parquet(i)

where block{i} is block1, block2, etc., each one a DataFrame created in turn from s3a://path/idate=2019-09-16/part-{some random hex key1,2, etc..}.snappy.parquet

Is this possible?

Answer 1

You can read all the files under files="s3a://path/idate=2019-09-16/" in a single call with df = spark.read.parquet(files).
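If you really do need one DataFrame per file, as in the question, a minimal sketch follows. It assumes an existing SparkSession passed in as `spark` and PySpark 3.1 or later (for `DataFrame.inputFiles()`). The helper names `load_partition` and `load_each_file` are hypothetical, and a dict is used in place of the dynamically named block1, block2, ... variables.

```python
def load_partition(spark, path):
    """Read every Parquet file under `path` into a single DataFrame."""
    return spark.read.parquet(path)


def load_each_file(spark, path):
    """Return {"block1": df1, "block2": df2, ...}, one DataFrame per file."""
    # inputFiles() lists the files backing the DataFrame (a best-effort snapshot),
    # so the random hex keys never have to be guessed.
    files = sorted(load_partition(spark, path).inputFiles())
    return {f"block{i}": spark.read.parquet(f) for i, f in enumerate(files, start=1)}


# Usage, assuming an existing SparkSession named `spark`:
# df = load_partition(spark, "s3a://path/idate=2019-09-16/")
# blocks = load_each_file(spark, "s3a://path/idate=2019-09-16/")
# blocks["block1"].show()
```

The numbering follows the sorted file names, so block1 is simply the first file alphabetically. In most cases the single DataFrame from the one-line read is enough and the per-file split is unnecessary.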

Iteratively loading multiple Parquet files with PySpark
