Question

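I want to collect text from thousands of PDF files with Python. Extracting text from the PDFs works fine, but my code stops at random during execution (not always on the same PDF) with the following error: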

http.client.RemoteDisconnected: Remote end closed connection without response

I am using urllib. I would like to know how to avoid this error or, failing that, how to catch it (even a bare except: does not work).

The code I am using:

df = pd.read_csv(csv_path, sep=";", error_bad_lines=False)

for i,row in df.iterrows():
    print(row['year'], "- adding ",row['title'])
    request.urlretrieve(row['pdfarticle'],"_tmp.pdf")
    try:
        row['fullarticle'] = convert_pdf_to_txt("_tmp.pdf")
    except TypeError:
        row['fullarticle'] = ""
        pass

os.remove("_tmp.pdf")
print("Done. Saving csv...")
df.to_csv("my_structured_articles.csv")
print("Done. Head(10) : ")
print(df.head(10))
return df

Answer 1

You need to put a try/except block here:

insert

You can find the documentation on exceptions here.
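As a minimal sketch of what such a block might look like: in the question's code the call to request.urlretrieve sits before the try:, which would explain why no except clause ever sees the error. The helper name try_download and its fetch parameter are hypothetical, added here only so the function can be exercised without a network connection; catching urllib.error.URLError as well is an assumption about which other failures you may want to skip.

```python
import http.client
import urllib.error
from urllib import request


def try_download(url, dest="_tmp.pdf", fetch=request.urlretrieve):
    """Download url to dest; return True on success, False on a network error.

    try_download and fetch are hypothetical names used for illustration.
    """
    try:
        # The download itself must be inside the try block,
        # otherwise the except clause below never runs.
        fetch(url, dest)
        return True
    except (http.client.RemoteDisconnected, urllib.error.URLError) as exc:
        # The server closed the connection or the request failed; report and move on.
        print("download failed for", url, ":", exc)
        return False
```

Inside the loop you could then skip a row when try_download(row['pdfarticle']) returns False instead of letting the whole script stop.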

Answer 2

First, you should put request.urlretrieve(row['pdfarticle'],"_tmp.pdf") inside a try/except block.

Second, if the problem is caused only by the network, you should retry several times, like this:

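import http.client  # needed to name the exception below

MAX_TRIES = 3  # example value, not from the original answer; adjust as needed
retry = MAX_TRIES
while retry > 0:
    try:
        request.urlretrieve(row['pdfarticle'], "_tmp.pdf")
        break  # download succeeded, stop retrying
    except http.client.RemoteDisconnected:
        retry -= 1  # server dropped the connection; try again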
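The same retry idea can be wrapped in a function that also waits a little between attempts and reports whether it finally succeeded. This is only a sketch: download_with_retry, max_tries, delay and fetch are hypothetical names, and the linearly growing pause is an assumption rather than part of the answer above.

```python
import http.client
import time
from urllib import request


def download_with_retry(url, dest, max_tries=3, delay=1.0, fetch=request.urlretrieve):
    """Try the download up to max_tries times; return True on success.

    All names and default values here are illustrative assumptions.
    """
    for attempt in range(1, max_tries + 1):
        try:
            fetch(url, dest)
            return True
        except http.client.RemoteDisconnected:
            # Wait a bit longer after each failure, except after the last one.
            if attempt < max_tries:
                time.sleep(delay * attempt)
    return False
```

In the question's loop, a False return value could be used to store an empty string in the row and continue with the next PDF.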

Downloading PDF: Remote end closed connection without response
