Question

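I am trying to use BeautifulSoup4 to extract a specific table, the one containing the directors' signatures, from several filings by several companies. My program finds a heading above the section that holds the table, then counts two tables down from that position to reach the correct one (the documents are government filings, so this format holds in nearly all cases). Currently, I am doing it like this: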

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(theDocument)

try:
   tables = soup.find(text=re.compile("Pursuant to the requirements of Section 13")).findNext('table').findNext('table').strings
except AttributeError as e:
   # deal with error, output failed URL to file
   pass

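With this code I find the table in roughly 70% of searches, but some just throw an error. For example, this document is one of those where the table is not found (you can locate the section in the document by pressing CTRL+F with the re.compile string), yet this document, from the same company and with seemingly identical HTML formatting, yields a positive result.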

Any ideas?

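EDIT: The &nbsp; may be one problem, but there is another as well. Shortening the search string so that it does not include the &nbsp; still results in failure.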

EDIT2: Sometimes there is an underlying error. I tried printing out the HTML held in the data variable and got the following:

<HTML><HEAD>
<TITLE>Access Denied</TITLE>
</HEAD><BODY>
<H1>Access Denied</H1>

You don't have permission to access "http&#58;&#47;&#47;www&#46;sec&#46;gov&#47;Archives&#47;edgar&#47;data&#47;1800&#47;000110465907013496&#47;a07&#45;1583&#95;110k&#46;htm" on this server.<P>
Reference&#32;&#35;18&#46;ee9a1645&#46;1466687980&#46;5cc0b4f
</BODY>
</HTML>

Is there any way to work around this while still removing the &nbsp;?
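Since the "Access Denied" page contains no tables at all, it produces the same AttributeError as a missed text match, so it can help to tell the two failures apart before parsing. Here is a minimal sketch using only the standard library; `is_access_denied` is a hypothetical helper name, and it assumes the block page always carries the title shown in the output above.

```python
import re

# Assumption: the block page always has this title, as in the output above.
ACCESS_DENIED_TITLE = re.compile(r"<title>\s*Access Denied\s*</title>", re.IGNORECASE)


def is_access_denied(html):
    """Return True if the body looks like the block page rather than a filing."""
    return bool(ACCESS_DENIED_TITLE.search(html))


# Two small stand-in documents (not real filings) to exercise the check.
blocked = (
    "<HTML><HEAD>\n<TITLE>Access Denied</TITLE>\n</HEAD><BODY>\n"
    "<H1>Access Denied</H1>\n</BODY>\n</HTML>"
)
filing = "<html><head><title>10-K</title></head><body><p>Pursuant to</p></body></html>"

print(is_access_denied(blocked), is_access_denied(filing))
```

A URL flagged this way could be logged separately from the ones where the heading text genuinely was not found.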

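EDIT 2: The answer below did solve the problem I was having, so I have marked it as accepted. That said, there was another underlying issue: random line breaks inside the string. I therefore modified my regular expression to check for '\s+' between all words rather than just a single space. If you run into a similar problem, be sure to check the HTML source for this.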
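The '\s+' change described above can be sketched roughly as follows. `phrase_pattern` is a hypothetical helper, not the exact expression I ended up with. In Python 3, `\s` in a str pattern also matches the non-breaking space (U+00A0), so one pattern covers both the line breaks and the &nbsp;.

```python
import re


def phrase_pattern(phrase):
    """Compile a regex matching the words of `phrase` separated by any whitespace."""
    words = phrase.split()
    # re.escape guards against regex metacharacters inside the words themselves.
    return re.compile(r"\s+".join(re.escape(word) for word in words))


pattern = phrase_pattern("Pursuant to the requirements of Section 13")

# Stand-in text with a line break and a non-breaking space between words.
sample = "Pursuant to the\nrequirements of Section\xa013 or 15(d)"
print(bool(pattern.search(sample)))
```

The compiled pattern can be passed to soup.find(text=...) in place of the literal re.compile string.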

Answer 1

The problem is the &nbsp; between Section and 13:

<font size="2">&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pursuant to the requirements of Section&nbsp;13 or 15(d) of the Securities Exchange Act of 1934, Abbott Laboratories has duly caused
this report to be signed on its behalf by the undersigned, thereunto duly authorized. </font>
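To see why the literal search string misses this text, here is a small standard-library sketch. It uses `html.unescape` as a stand-in for the entity decoding an HTML parser performs, and the `raw` string is a shortened copy of the snippet above.

```python
import html

raw = "Pursuant to the requirements of Section&nbsp;13 or 15(d)"
text = html.unescape(raw)  # &nbsp; decodes to U+00A0, not an ordinary space
needle = "Pursuant to the requirements of Section 13"

# The plain substring test fails because of the U+00A0 before "13".
print(needle in text)
# After swapping U+00A0 for a regular space, the same test succeeds.
print(needle in text.replace("\xa0", " "))
```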

I would use a searching function and replace the &nbsp; with a regular space when checking the .text attribute:

import requests
from bs4 import BeautifulSoup


# url = "https://www.sec.gov/Archives/edgar/data/1800/000110465907013496/a07-1583_110k.htm"
url = "https://www.sec.gov/Archives/edgar/data/1800/000104746916010246/a2227279z10-k.htm"
response = requests.get(url, headers={
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
})

data = response.text
soup = BeautifulSoup(data, "lxml")

text_to_search = "Pursuant to the requirements of Section 13"
p = soup.find(lambda elm: elm.name == "p" and elm.text and text_to_search in elm.text.replace(u'\xa0', ' '))
tables = p.findNext('table').findNext('table').strings
