I am trying to read CSV files that are saved every hour. Sometimes a file is missing and this code throws an error. How can I skip files that do not exist?
import pandas as pd

df_list = []
for day in range(1, int(getArgument("NUMBER_OF_DAYS")) + 1):
    for hour in range(0, 24):
        file_location = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"
        batch_df = spark.read.format("csv").option("header", "true").load(file_location)
        pandas_df = batch_df.toPandas()
        df_list.append(pandas_df)

final_pandas_df = pd.concat(df_list)
print(final_pandas_df.shape)
Answer 0 (score 0)
Perhaps you can catch the exception that is raised when a file is not found:
from pyspark.sql.utils import AnalysisException
import pandas as pd

df_list = []
for day in range(1, int(getArgument("NUMBER_OF_DAYS")) + 1):
    for hour in range(0, 24):
        file_location = "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") + "/dayofmonth=" + str(day) + "/hour=" + str(hour) + "/*.csv"
        print(file_location)
        try:
            batch_df = spark.read.format("csv").option("header", "true").load(file_location)
            pandas_df = batch_df.toPandas()
            print(pandas_df.shape)
            df_list.append(pandas_df)
        except AnalysisException as e:
            # load() raises AnalysisException when the path does not exist
            print(e)

final_batch_pandas_df = pd.concat(df_list)
print(final_batch_pandas_df.shape)
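If you would rather skip a missing hour without relying on the exception, a minimal sketch of a pre-check could look like the following. It reuses the same Hadoop FileSystem API shown in the next answer; the helper name is hypothetical, and it checks the hour directory (without the *.csv glob) for the same path layout as above.

def hour_dir_exists(day, hour):
    # Build the hour directory path (no "/*.csv" glob, since exists() takes a plain path)
    dir_str = ("xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH")
               + "/dayofmonth=" + str(day) + "/hour=" + str(hour))
    path = sc._gateway.jvm.org.apache.hadoop.fs.Path(dir_str)
    fs = path.getFileSystem(sc._jsc.hadoopConfiguration())
    return fs.exists(path)  # Hadoop FileSystem.exists(Path) returns a boolean

You could then read the file_location only when hour_dir_exists(day, hour) is true.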
Answer 1 (score 0)
You can avoid the for loops by listing the files under the location "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH") and keeping only those whose dayofmonth is between 1 and NUMBER_OF_DAYS. Then pass the filtered list of files to spark.read.csv.
Here is one way of doing this using the Hadoop FileSystem API:
import re

data_path = sc._gateway.jvm.org.apache.hadoop.fs.Path(
    "xxxxx/year=" + getArgument("YEAR") + "/month=" + getArgument("MONTH")
)
files = data_path.getFileSystem(sc._jsc.hadoopConfiguration()).listFiles(data_path, True)

# keep only the files whose dayofmonth is in [1, NUMBER_OF_DAYS]
filtered_files = []
while files.hasNext():
    file_path = files.next().getPath().toString()
    dayofmonth = int(re.search(r".*/dayofmonth=(\d+)/.*", file_path).group(1))
    if dayofmonth <= int(getArgument("NUMBER_OF_DAYS")):
        filtered_files.append(file_path)

batch_df = spark.read.format("csv").option("header", "true").load(filtered_files)
final_pandas_df = batch_df.toPandas()
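For illustration, here is a small standalone example of the dayofmonth extraction used above; the path is hypothetical, with "xxxxx" kept as the placeholder prefix from the question.

import re

example_path = "xxxxx/year=2021/month=5/dayofmonth=3/hour=7/part-00000.csv"
day = int(re.search(r".*/dayofmonth=(\d+)/.*", example_path).group(1))
print(day)  # prints 3

Passing the whole filtered list to a single spark.read call also means Spark reads everything in one job, instead of converting each hourly file to pandas separately as in the loop-based approach.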