Read multiple CSV files and skip the ones that do not exist

Time: 2021-02-24 09:45:32

Tags: pandas apache-spark pyspark

I am trying to read CSV files that are saved every hour. Sometimes a file is missing and this code raises an error. How can I skip the files that do not exist?

import pandas as pd

df_list = []
for day in range(1, int(getArgument("NUMBER_OF_DAYS")) + 1):

  for hour in range(0, 24):

    file_location = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"

    batch_df = spark.read.format("csv").option("header", "true").load(file_location)

    pandas_df = batch_df.toPandas()

    df_list.append(pandas_df)

final_pandas_df = pd.concat(df_list)

print(final_pandas_df.shape)

2 Answers:

Answer 0 (score: 0)

Perhaps you can catch the exception that is raised when a file is not found:

from pyspark.sql.utils import AnalysisException
import pandas as pd

df_list = []
for day in range(1, int(getArgument("NUMBER_OF_DAYS")) + 1):
  for hour in range(0, 24):

    file_location = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"

    print(file_location)

    try:
        batch_df = spark.read.format("csv").option("header", "true").load(file_location)

        pandas_df = batch_df.toPandas()

        print(pandas_df.shape)

        df_list.append(pandas_df)

    except AnalysisException as e:
        # the path does not exist; log the error and move on to the next hour
        print(e)

final_pandas_df = pd.concat(df_list)

print(final_pandas_df.shape)
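
If you would rather not rely on the exception, you can check for the hour directory before reading. Below is a minimal sketch of that idea, assuming a Databricks-style notebook where spark, sc and getArgument are available and with xxxxx standing in for the real base path; it uses the Hadoop FileSystem exists() call through the JVM gateway:

import pandas as pd

jvm = sc._gateway.jvm
# default Hadoop filesystem configured for the cluster
fs = jvm.org.apache.hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())

df_list = []
for day in range(1, int(getArgument("NUMBER_OF_DAYS")) + 1):
    for hour in range(0, 24):
        dir_location = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") + "/dayofmonth=" + str(day) + "/hour=" + str(hour)
        # skip hours whose directory was never written
        if not fs.exists(jvm.org.apache.hadoop.fs.Path(dir_location)):
            continue
        batch_df = spark.read.format("csv").option("header", "true").load(dir_location + "/*.csv")
        df_list.append(batch_df.toPandas())

final_pandas_df = pd.concat(df_list)
print(final_pandas_df.shape)

Note that this only checks for the directory; an hour directory that exists but contains no CSV files would still raise, so the try/except above remains the more defensive option.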

Answer 1 (score: 0)

You can avoid the for loops by listing the files under the location xxxxx/year="+getArgument("YEAR")+"/month="+getArgument("MONTH") and keeping only those whose dayofmonth is between 1 and NUMBER_OF_DAYS. Then pass the filtered file list to spark.read.csv.

Here is one way to do it using the Hadoop FileSystem API:

import re

data_path = sc._gateway.jvm.org.apache.hadoop.fs.Path(
    "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH")
)
# recursively list every file under the month directory
files = data_path.getFileSystem(sc._jsc.hadoopConfiguration()).listFiles(data_path, True)

filtered_files = []

# filter files that have dayofmonth in [1, NUMBER_OF_DAYS]
while files.hasNext():
    file_path = files.next().getPath().toString()
    dayofmonth = int(re.search(r".*/dayofmonth=(\d+)/.*", file_path).group(1))
    if dayofmonth <= int(getArgument("NUMBER_OF_DAYS")):
        filtered_files.append(file_path)

batch_df = spark.read.format("csv").option("header", "true").load(filtered_files)
final_pandas_df = batch_df.toPandas()
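
Passing the whole filtered list to a single spark.read call, as above, also avoids one toPandas round trip per hour. Since the layout uses Hive-style key=value directories, a further option is to load the whole month once and filter on the partition column; this is only a sketch, under the assumption that Spark's partition discovery exposes dayofmonth (and hour) as columns when the month directory is loaded, so missing hours simply contribute no rows and the filter prunes the unwanted days:

from pyspark.sql.functions import col

month_path = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH")

# dayofmonth becomes a column via partition discovery; the filter is pushed
# down so only the requested days are actually read
batch_df = (spark.read.format("csv").option("header", "true")
            .load(month_path)
            .where(col("dayofmonth") <= int(getArgument("NUMBER_OF_DAYS"))))

final_pandas_df = batch_df.toPandas()
print(final_pandas_df.shape)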