Question

When I use sc.textFile('*.txt'), it reads in every matching file.

I would like to be able to filter out a few files.

E.g., how can I process all files except ['bar.txt', 'foo.txt']?

Answer 1

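Here is a workaround.

Get the list of files:

import os
# Each entry line of `hadoop fs -ls` ends with the full path; keep only the file name.
lines = os.popen('hadoop fs -ls <your dir>').readlines()
file_list = [os.path.basename(line.split()[-1]) for line in lines if line.strip()]

Filter it:

excluded = ['bar.txt', 'foo.txt']
file_list = [x for x in file_list
             if x not in excluded and x.endswith('.txt')]

Read it:

# sc.textFile takes a single string; multiple paths are joined with commas.
rdd = sc.textFile(','.join('<your dir>/' + x for x in file_list))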
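
The list, filter, and read steps above can also be sketched as a small pure-Python helper, which makes the filtering easy to check without a cluster. This is only a sketch under assumptions: `select_paths` is a hypothetical helper name, the sample listing and its `/data/...` paths are made up, and it assumes each entry line of `hadoop fs -ls` has eight whitespace-separated fields ending with the path (so file names containing spaces would not be handled).

```python
import posixpath

def select_paths(ls_lines, excluded, suffix='.txt'):
    """Return full paths from `hadoop fs -ls` output, minus excluded file names."""
    paths = []
    for line in ls_lines:
        fields = line.split()
        # Entry lines have 8 fields and end with the path;
        # this skips the "Found N items" header and blank lines.
        if len(fields) < 8:
            continue
        path = fields[-1]
        name = posixpath.basename(path)
        if name.endswith(suffix) and name not in excluded:
            paths.append(path)
    return paths

# Illustrative listing (made-up paths), shaped like `hadoop fs -ls` output.
sample = [
    "Found 5 items\n",
    "-rw-r--r--   3 user group   120 2020-01-01 10:00 /data/a.txt\n",
    "-rw-r--r--   3 user group   340 2020-01-01 10:01 /data/bar.txt\n",
    "-rw-r--r--   3 user group   210 2020-01-01 10:02 /data/c.txt\n",
    "-rw-r--r--   3 user group    75 2020-01-01 10:03 /data/foo.txt\n",
    "-rw-r--r--   3 user group   990 2020-01-01 10:04 /data/notes.csv\n",
]

paths = select_paths(sample, excluded={'bar.txt', 'foo.txt'})
joined = ','.join(paths)  # comma-separated string, the form sc.textFile accepts
print(joined)
```

In a real job the `sample` list would be replaced by the lines read from `hadoop fs -ls <your dir>`, and `joined` would be passed to `sc.textFile`.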

Answer 2

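PySpark will skip empty Parquet files when reading multiple files from S3. Use S3A when reading the files and it will skip the empty ones. The only condition is that there must be some non-empty files; they cannot all be empty.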

files_path = 's3a://my-buckket/obj1/obj2/data'
df = spark.read.parquet(files_path)