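How do I associate HTML text with its corresponding heading?

Asked: 2017-12-13 17:10:38

Tags: html python-3.x beautifulsoup

Below is a sample HTML document, but my use case involves many kinds of unstructured text. What is a generic way to associate (label) each of the two text paragraphs below with its parent heading (SUMMARY1)? The heading here is not an actual heading tag; it is just bold text. I am trying to extract the text paragraphs and identify the heading each one belongs to, whether the heading is a standard heading tag or something like the following: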

<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location>SUMMARY1</location></b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location>Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>

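I would like to end up with JSON along these lines: {"SUMMARY1": ["This is a region in Europe where the climate is good.", "Total Europe population estimate was used back then."]}

Please advise. Thanks.

3 answers:

Answer 0 (score: 3)

With BeautifulSoup it would look something like this: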

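from bs4 import BeautifulSoup

html = 'your html'  # placeholder: substitute the HTML document from the question
soup = BeautifulSoup(html, 'html.parser')  # name the parser explicitly to avoid a warning
header = soup.find('b')  # the bold text that acts as the heading
print(header.text)
first_paragraph = header.find_next('p')  # first <p> after the heading
print(first_paragraph.text)
second_paragraph = first_paragraph.find_next('p')  # the <p> after that one
print(second_paragraph.text)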

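Answer 1 (score: 3)

I initially considered the newspaper module, but could not find a way to get SUMMARY1 as a distinct part of the "summary", the "description", or anywhere else on the resulting Article object. In any case, take a look at that module; it may well help you parse HTML articles.

With BeautifulSoup, however, you can first locate the heading and then use find_all_next() to get the p elements that follow it: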

from bs4 import BeautifulSoup


html = """
<!doctype html>
    <html lang="en">
        <head>
            <meta charset="utf-8">

            <title>Europe Test  - Some stats</title>
            <meta name="description" content="Watch videos and find the latest information.">
<body>
<p>
    <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
    </p>
    <p>
      This is a region in <location>Europe</location>
      where the climate is good.
    </p>
    <p>
      Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
    </p>

<div class="aspNetHidden"></div>
        </body>
    </html>"""

soup = BeautifulSoup(html, "lxml")
header = soup.find("b")
parts = [p.get_text(strip=True, separator=" ") for p in header.find_all_next("p")]
print({header.get_text(strip=True): parts})

Prints:

{'SUMMARY1': [
     'This is a region in Europe where the climate is good.', 
     'Total Europe population estimate was used back then.']}
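
Note that find_all_next("p") returns every p element after the heading, so in a document with several bold headings the first group would also collect the paragraphs of later sections. The following is only a minimal sketch of one way around that, assuming that a paragraph whose entire text is bold acts as a heading. The function name group_by_bold_heading and the SUMMARY2 section are made up for illustration and do not come from the question.

```python
from bs4 import BeautifulSoup


def group_by_bold_heading(html):
    """Map each bold-only paragraph (treated as a heading) to the paragraphs after it."""
    soup = BeautifulSoup(html, "html.parser")
    sections = {}
    current = None  # heading that the next plain paragraphs belong to
    for p in soup.find_all("p"):
        # Collapse runs of whitespace so text split across lines compares cleanly.
        text = " ".join(p.get_text(" ", strip=True).split())
        if not text:
            continue
        bold = p.find("b")
        bold_text = " ".join(bold.get_text(" ", strip=True).split()) if bold else None
        if bold_text == text:
            # Assumption: a paragraph that is bold from start to end is a heading.
            current = text
            sections.setdefault(current, [])
        elif current is not None:
            sections[current].append(text)
    return sections


# Hypothetical input with two bold "headings".
sample = """
<p><b>SUMMARY1</b></p>
<p>This is a region in <location>Europe</location> where the climate is good.</p>
<p>Total <location>Europe</location> population estimate was used back then.</p>
<p><b>SUMMARY2</b></p>
<p><b>Note:</b> this paragraph only starts with bold text.</p>
"""

sections = group_by_bold_heading(sample)
print(sections)
```

Paragraphs that appear before the first heading are skipped in this sketch, and a paragraph that merely begins with bold text is treated as body text. Real documents may need a looser heading test, for example also accepting strong or h1 to h6 tags.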

Answer 2 (score: 1)

You could also do it like this:

from bs4 import BeautifulSoup

content = """
<html>
    <div>
        <p>
            <b><location value="LS/us.de" idsrc="xmltag.org">SUMMARY1</b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location value="LS/us.de" idsrc="xmltag.org">Europe</location> population estimate was used back then.
        </p>
    </div>
</html>
"""
soup = BeautifulSoup(content, "lxml")
items = soup.select("b")[0]
paragraphs = ' '.join([' '.join(data.text.split()) for data in items.find_parent().find_next_siblings()])
print({items.text : paragraphs})

Output:

{'SUMMARY1': 'This is a region in Europe where the climate is good. Total Europe population estimate was used back then.'}
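
This version joins both paragraphs into a single string, whereas the question asks for a list of paragraphs serialized as JSON. A possible variation is sketched below, assuming the heading's parent is a p element and the paragraphs are its following p siblings, as in the sample. It uses the built-in "html.parser" backend and well-formed sample markup for simplicity, and the variable names are illustrative.

```python
import json

from bs4 import BeautifulSoup

content = """
<html>
    <div>
        <p>
            <b><location>SUMMARY1</location></b>
        </p>
        <p>
            This is a region in <location>Europe</location>
            where the climate is good.
        </p>
        <p>
            Total <location>Europe</location> population estimate was used back then.
        </p>
    </div>
</html>
"""

soup = BeautifulSoup(content, "html.parser")
heading = soup.find("b")
# Paragraphs that follow the heading's own <p> at the same nesting level.
siblings = heading.find_parent("p").find_next_siblings("p")
# Keep one list entry per paragraph, with whitespace collapsed.
paragraphs = [" ".join(p.get_text().split()) for p in siblings]
result = json.dumps({heading.get_text(strip=True): paragraphs})
print(result)
```

Because find_next_siblings("p") only looks at siblings of the heading's paragraph, text nested elsewhere in the document is not picked up, which may or may not be what you want for less regular pages.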