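Below is an HTML example, but my use case involves different kinds of unstructured text. What is a generic approach to tie (label) each of the two text paragraphs below to its parent header (SUMMARY1)? The header here is not actually a header tag; it is just bold text. I am trying to extract text paragraphs and identify their corresponding header section, regardless of whether the header is a real standard heading or something like the following: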
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Europe Test - Some stats</title>
<meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
<b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
</p>
<p>
This is a region in <location>Europe</location>
where the climate is good.
</p>
<p>
Total <location>Europe</location> population estimate was used back then.
</p>
<div class="aspNetHidden"></div>
</body>
</html>
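I would like to end up with JSON like this: {"SUMMARY1": ["This is a region in Europe where the climate is good", "Total Europe population estimate was used back then"]}
Please advise. Thanks.
Answer 0 (score: 3)
With BeautifulSoup it should look something like this: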
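from bs4 import BeautifulSoup
html = 'your html'  # replace with the HTML document to parse
# Name the parser explicitly; omitting it makes bs4 emit a warning
soup = BeautifulSoup(html, 'html.parser')
# The bold text acts as the header
header = soup.find('b')
print(header.text)
# find_next() is the current name for the legacy findNext() alias
first_paragraph = header.find_next('p')
print(first_paragraph.text)
second_paragraph = first_paragraph.find_next('p')
print(second_paragraph.text)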
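Answer 1 (score: 3)
I initially considered using the newspaper module, but could not find a way to get SUMMARY1 as a distinct part of the "summary" or "description", or anywhere else on the resulting Article object. In any case, check out this module; it might really help you parse HTML articles.
With BeautifulSoup, however, you could first locate the header and then use find_all_next() to get the p elements that follow it: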
from bs4 import BeautifulSoup
html = """
<!doctype html>
<html lang="en">
<head>
<meta charset="utf-8">
<title>Europe Test - Some stats</title>
<meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
<b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
</p>
<p>
This is a region in <location>Europe</location>
where the climate is good.
</p>
<p>
Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
</p>
<div class="aspNetHidden"></div>
</body>
</html>"""
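# "lxml" requires the lxml package; "html.parser" is the built-in alternative
soup = BeautifulSoup(html, "lxml")
header = soup.find("b")
# find_all_next() returns every later <p> in the document, not only those under this header
parts = [p.get_text(strip=True, separator=" ") for p in header.find_all_next("p")]
print({header.get_text(strip=True): parts})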
Prints:
{'SUMMARY1': [
'This is a region in Europe where the climate is good.',
'Total Europe population estimate was used back then.']}
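The question asks for something that works whether the header is a real heading or just bold text, and the snippet above attaches every later paragraph to a single header. One possible way to generalize it is sketched below. The names `is_header` and `group_by_header`, the rule that a `<p>` counts as a header when all of its text is bold, and the extra SUMMARY2 section in the sample are assumptions made for illustration, not part of the original answer. The built-in "html.parser" is used here only to avoid the lxml dependency.

```python
from bs4 import BeautifulSoup

HEADING_TAGS = ["h1", "h2", "h3", "h4", "h5", "h6"]


def clean(tag):
    """Return the tag's text with all runs of whitespace collapsed."""
    return " ".join(tag.get_text(" ").split())


def is_header(tag):
    """Assumed heuristic: a real heading, or a <p> whose whole text is bold."""
    if tag.name in HEADING_TAGS:
        return True
    bold = tag.find(["b", "strong"])
    return bold is not None and clean(bold) == clean(tag)


def group_by_header(html):
    """Map each header-like element to the paragraphs that follow it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None
    # find_all() walks the tree in document order
    for tag in soup.find_all(["p"] + HEADING_TAGS):
        text = clean(tag)
        if not text:
            continue
        if is_header(tag):
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
    return sections


# Hypothetical sample: one bold pseudo-header and one real heading
sample = """
<body>
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location>
where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<h2>SUMMARY2</h2>
<p>Another paragraph.</p>
</body>
"""

sections = group_by_header(sample)
print(sections)
```

Paragraphs that appear before any header are dropped in this sketch; whether that is right depends on the documents being processed. For non-HTML text the same loop could be reused with a different `is_header` rule.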
Answer 2 (score: 1)
You can also do it like this:
from bs4 import BeautifulSoup
content = """
<html>
<div>
<p>
<b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
</p>
<p>
This is a region in <location>Europe</location>
where the climate is good.
</p>
<p>
Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
</p>
</div>
</html>
"""
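soup = BeautifulSoup(content, "lxml")
# The first <b> element is the header
items = soup.select("b")[0]
# Take every sibling after the header's parent <p>, normalize whitespace, and join into one string
paragraphs = ' '.join([' '.join(data.text.split()) for data in items.find_parent().find_next_siblings()])
print({items.text: paragraphs})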
Output:
{'SUMMARY1': 'This is a region in Europe where the climate is good. Total Europe population estimate was used back then.'}
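The output above is a single joined string, while the question asks for a list of paragraphs. A minimal variation on the same sibling-based idea that keeps each paragraph separate might look like the sketch below. It assumes the header's `<b>` sits inside a `<p>` and that the paragraphs are `<p>` siblings of it. The trimmed sample markup and the choice of "html.parser" (to avoid the lxml dependency) are also assumptions made for illustration.

```python
import json

from bs4 import BeautifulSoup

# Trimmed-down version of the sample markup
content = """
<div>
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location>
where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
</div>
"""

soup = BeautifulSoup(content, "html.parser")
header = soup.find("b")

# Keep only the <p> siblings that come after the header's own <p>,
# collapsing whitespace inside each paragraph
paragraphs = [
    " ".join(p.get_text(" ").split())
    for p in header.find_parent("p").find_next_siblings("p")
]

result = {header.get_text(strip=True): paragraphs}
print(json.dumps(result, indent=2))
```

Passing "p" to find_next_siblings() also skips non-paragraph siblings such as the empty aspNetHidden div in the question's markup.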