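This is my first post on this forum, and I trust the forum can answer my basic question.
My requirement has two steps. First, from HTML data like the following, I need to extract the value "Paid Death Notice" based on the "DOCUMENT-TYPE" label: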
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Death Notice</SPAN></P>
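Similarly, from HTML data like the following, I need to extract the value "Newspaper" based on the "PUBLICATION-TYPE" label, where the spans have the classes c8 and c2: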
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
The solution I tried:
from bs4 import BeautifulSoup
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""
soup = BeautifulSoup(data,'lxml')
doc=soup.find('span',class_='c8')
doctext=re.compile('<SPAN(.*DOCUMENT-TYPE: </SPAN><SPAN.*?)</SPAN>')
print(doctext.match(doc.text))
Result:
None
I should get only Paid Death Notice as the result. The value can also look like this:
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
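Please help me solve this problem.
Note: I have searched online and tried many approaches but could not find the right solution, so I am finally posting here in the hope of finding one.
Answer 0 (score: 0)
Code: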
from bs4 import BeautifulSoup
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>"""
soup = BeautifulSoup(data,'lxml')
doc = soup.find('span',class_='c8')
print(doc.text)
Result:
DOCUMENT-TYPE:
Answer 1 (score: 0)
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""
# A raw string avoids invalid-escape warnings; (.*?) stops at the first closing tag
pattern = r'<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">(.*?)</SPAN>'
print([a.strip("*") for a in re.findall(pattern, data)])
Output:
['Paid Death Notice', 'Paid Notice: Deaths THORNTON, ROBERT']
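The same pattern can be parameterized by label so that it also covers the PUBLICATION-TYPE step from the question. This is only a sketch: the helper name `values_for` is hypothetical, and it assumes each label/value pair sits on a single line with exactly the `CLASS="c8"` / `CLASS="c2"` markup shown in the sample data.

```python
import re

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""


def values_for(label, html):
    """Return every c2 value that directly follows a c8 span holding `label`.

    Hypothetical helper: assumes the exact markup of the sample data.
    """
    pattern = (
        r'<SPAN CLASS="c8">' + re.escape(label) + r': </SPAN>'
        r'<SPAN CLASS="c2">(.*?)</SPAN>'
    )
    # strip("*") drops the ** markers that surround the first sample value
    return [value.strip("*") for value in re.findall(pattern, html)]


print(values_for("DOCUMENT-TYPE", data))
print(values_for("PUBLICATION-TYPE", data))
```

Because the pattern is tied to the literal markup, any change in attribute order, quoting, or whitespace would make it miss matches; a parser-based approach is more tolerant of such variation.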
Answer 2 (score: 0)
You can use the findall method from the re module together with a regular expression.
Example:
import re
data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""
# Join the lines so that a pair split across a line break still matches
data = data.replace('\n', ' ')
# Group 1 captures the label (the text before the colon), group 2 the value
res = re.findall(
    r'<SPAN *CLASS="c8"> *([^:<]+): *</SPAN> *<SPAN *CLASS="c2">([^<]*)</SPAN>',
    data,
    re.IGNORECASE,
)
print(res)
print("\n".join("%s: %s" % (key, value) for key, value in res))
Output:
[('DOCUMENT-TYPE', '**Paid Death Notice**'), ('PUBLICATION-TYPE', 'Newspaper'), ('DOCUMENT-TYPE', 'Paid Notice: Deaths THORNTON, ROBERT')]
DOCUMENT-TYPE: **Paid Death Notice**
PUBLICATION-TYPE: Newspaper
DOCUMENT-TYPE: Paid Notice: Deaths THORNTON, ROBERT
You can simply take the res variable and read all the keys and values from it. If you want to convert the result into a dictionary, you can build one from these pairs. In that case, however, the first 'DOCUMENT-TYPE' entry will be overwritten by its last occurrence.
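The dictionary conversion mentioned in this answer is not shown, so here is a sketch of what it might look like. It assumes `res` is the list of `(label, value)` tuples printed by the example; `dict(res)` keeps only the last value per label, while the `defaultdict` variant (an addition for illustration, not part of the answer) keeps all of them.

```python
from collections import defaultdict

# Assumed to be the list that re.findall returned in the example
res = [
    ("DOCUMENT-TYPE", "**Paid Death Notice**"),
    ("PUBLICATION-TYPE", "Newspaper"),
    ("DOCUMENT-TYPE", "Paid Notice: Deaths THORNTON, ROBERT"),
]

# Plain conversion: a repeated key is overwritten by its last occurrence
as_dict = dict(res)
print(as_dict["DOCUMENT-TYPE"])

# Grouping variant: collect every value seen for each key
grouped = defaultdict(list)
for key, value in res:
    grouped[key].append(value)
print(dict(grouped))
```

If every DOCUMENT-TYPE value matters, the grouped form is the safer choice; the plain `dict` is enough when each label is known to occur once.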
Answer 3 (score: 0)
Don't mix regular expressions with BeautifulSoup; BS has plenty of methods for navigating the DOM tree:
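if doc.text.startswith('DOCUMENT-TYPE'):
    # The value span is the next sibling of the label span
    print(doc.find_next_sibling().text)
    # prints **Paid Death Notice**
You can also iterate over all tags that have a specific attribute: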
for tag in soup.find_all('span', class_='c8'):
    print(tag.text)
# DOCUMENT-TYPE:
# PUBLICATION-TYPE:
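The two snippets in this answer can be combined into one loop that pairs each label with its value. What follows is a minimal sketch under a few assumptions: it uses the built-in `html.parser` backend instead of `lxml` so that nothing beyond bs4 is required, the variable names are illustrative, and it assumes every c8 label span is followed by a c2 value span at the same level.

```python
from bs4 import BeautifulSoup

data = """<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">**Paid Death Notice**</SPAN>
<SPAN CLASS="c8">PUBLICATION-TYPE: </SPAN><SPAN CLASS="c2">Newspaper</SPAN>
<SPAN CLASS="c8">DOCUMENT-TYPE: </SPAN><SPAN CLASS="c2">Paid Notice: Deaths THORNTON, ROBERT</SPAN>
"""

# html.parser ships with Python; 'lxml' should work the same way if it is installed
soup = BeautifulSoup(data, "html.parser")

pairs = []
for label_tag in soup.find_all("span", class_="c8"):
    # Look for the next sibling span that carries the value class
    value_tag = label_tag.find_next_sibling("span", class_="c2")
    if value_tag is None:
        continue  # label without a value; skip it
    label = label_tag.get_text().strip().rstrip(":")
    # strip("*") drops the ** markers around the first sample value
    pairs.append((label, value_tag.get_text().strip("*")))

print(pairs)
```

Note that `find_next_sibling` with a filter skips over non-matching siblings, so a label that has no value of its own would be paired with the next c2 span further along; if that can occur in the real data, checking `label_tag.next_sibling` directly would be stricter.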