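I want to collect text from thousands of PDF files with Python. Extracting text from the PDFs works fine, but my code stops at random during execution with the following error (it does not always stop on the same PDF):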
http.client.RemoteDisconnected: Remote end closed connection without response
I am using urllib. I would like to know how to avoid this error or, failing that, how to catch it (even a bare except: does not work).
The code I use:
    df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)
    for i, row in df.iterrows():
        print(row['year'], "- adding ", row['title'])
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
        try:
            row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
        except TypeError:
            row['fullarticle'] = ""
            pass
        os.remove("_tmp.pdf")
    print("Done. Saving csv...")
    df.to_csv("my_structured_articles.csv")
    print("Done. Head(10) : ")
    print(df.head(10))
    return df
Answer 1 (score: 0)
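First, you should put request.urlretrieve(row['pdfarticle'],"_tmp.pdf") inside a try/except block. In your code the download happens before the try, so no except clause can ever see the exception it raises.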
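To illustrate the first point, here is a minimal sketch of a wrapper that keeps the download inside the try. The name try_download and the fetch parameter are made up for this example (fetch only exists so the function can be exercised without a network); the set of exceptions caught is my assumption about what a flaky server can raise, not an exhaustive list.

```python
import http.client
import urllib.error
from urllib import request


def try_download(url, dest, fetch=request.urlretrieve):
    """Return True if the download succeeded, False on a network error.

    `fetch` is a hypothetical hook for testing; by default it is
    urllib.request.urlretrieve, as in the question.
    """
    try:
        fetch(url, dest)
        return True
    except (http.client.HTTPException, urllib.error.URLError, OSError) as exc:
        # RemoteDisconnected derives from both HTTPException and OSError,
        # so it is caught here instead of ending the whole loop.
        print("download failed for", url, "-", exc)
        return False
```

Inside your loop you could then write `if not try_download(row['pdfarticle'], "_tmp.pdf"): continue` to skip a row whose PDF could not be fetched.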
Second, if the problem is caused only by the network, you should retry the download several times. Something like this:
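    import http.client

    MAX_TRIES = 3  # assumed value, adjust as needed

    retry = MAX_TRIES
    while retry != 0:
        try:
            request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
            break  # download succeeded, stop retrying
        except http.client.RemoteDisconnected:
            retry -= 1  # connection dropped, try again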
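Note that the loop above carries on silently once the retries run out, and it retries immediately. A slightly fuller sketch is below: it waits a bit longer after each failure and reports whether the download finally worked. The function name, the default of 3 tries, the 1-second base delay, and the fetch/sleep parameters are all assumptions made for illustration (the last two exist only so the function can be tested without a network or real waiting).

```python
import http.client
import time
import urllib.error
from urllib import request


def download_with_retry(url, dest, max_tries=3, base_delay=1.0,
                        fetch=request.urlretrieve, sleep=time.sleep):
    """Try the download up to max_tries times; return True on success."""
    for attempt in range(max_tries):
        try:
            fetch(url, dest)
            return True
        except (http.client.HTTPException, urllib.error.URLError, OSError):
            if attempt < max_tries - 1:
                # Back off: wait base_delay, then 2x, then 4x, ...
                sleep(base_delay * 2 ** attempt)
    # Every attempt failed; let the caller decide what to do.
    return False
```

In the loop from the question, a False result could be handled by storing an empty string in row['fullarticle'] and moving on to the next row, so that one unreachable PDF does not stop the whole run.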