How to exclude tags that have a child tag with a specific tag name

Time: 2019-06-05 20:36:32

Tags: python html web-scraping beautifulsoup

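I am trying to use BeautifulSoup to get all the paragraphs of an article while excluding the paragraph tags that contain other tags, such as an <a> tag. Or, if a paragraph does have child tags, I want to get only the text of that paragraph.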

Here is part of the HTML:

<div class="entry-content clearfix">
  <div class="entry-thumbnail"></div>
  <p> In as name to here them deny wise this. As rapid woody my he me which. </p>
  <p> <a href="https://blabla"/> </p> 
  <p> Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. </p>
  <p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.</p>
</div>

Here is what I have done so far:

# Assumes soup, dic and i are defined earlier in the script.
contents = []
content = soup.find('div', {"class": "entry-content clearfix"}).find_all("p")
for p in content:
    # Keep only the paragraphs that do not contain an <a> tag.
    if not p.find("a"):
        contents.append(p)
if contents:
    dic['content'] = contents
else:
    print("ARTICLE:", i, "HAS NO content")
    dic['body'] = "No content"

1 answer:

Answer 0: (score: 0)

Use the get_text() function. It extracts the text from a paragraph. Reference: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python

Result:

Performed suspicion in certainty so frankness by attention pretended.
      Newspaper or in tolerably education enjoyment. 
  When be draw drew ye. Defective in do recommend
      suffering. House it seven in spoil tiled court. Sister others marked 
      fat missed did out use.
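
The get_text() suggestion above could be applied roughly as follows. This is only a minimal sketch, not the answerer's actual code. The HTML string is a shortened stand-in for the markup in the question, with the links written as <a ...></a>. The helper name extract_paragraph_texts is hypothetical and not part of BeautifulSoup. It assumes that a paragraph holding nothing but a link should be dropped, and that for any other paragraph only the text is wanted.

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the question's markup (an assumption for illustration).
HTML = """
<div class="entry-content clearfix">
  <div class="entry-thumbnail"></div>
  <p> In as name to here them deny wise this. </p>
  <p> <a href="https://blabla"></a> </p>
  <p> Performed suspicion in certainty. </p>
  <p> <a href="https://blabla"></a> When be draw drew ye. </p>
</div>
"""


def extract_paragraph_texts(html):
    """Hypothetical helper: return the non-empty text of each <p> in the entry-content div."""
    soup = BeautifulSoup(html, "html.parser")
    # class_ matches any single class of a multi-class attribute.
    container = soup.find("div", class_="entry-content")
    if container is None:
        return []
    texts = []
    for p in container.find_all("p"):
        # get_text() discards the tags and keeps only the text nodes;
        # strip=True trims whitespace around each piece of text.
        text = p.get_text(" ", strip=True)
        if text:  # a paragraph containing only an empty link yields ""
            texts.append(text)
    return texts


paragraph_texts = extract_paragraph_texts(HTML)
for text in paragraph_texts:
    print(text)
```

Because the filter checks the extracted text rather than the presence of an <a> child, the last paragraph (a link followed by text) is kept, while the link-only paragraph is skipped. If a link itself contains visible text, that text would also be included in the result.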