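I am trying to use BeautifulSoup to get all the paragraphs of an article. I want to skip any paragraph tag that contains nothing but other tags (for example, only an <a> tag), and for paragraphs that do have child tags, I want only the text of the paragraph.
Here is part of the HTML: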
<div class="entry-content clearfix">
<div class="entry-thumbnail">
<p> In as name to here them deny wise this. As rapid woody my he me which. </p>
<p> <a href="https://blabla"/> </p>
<p> Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment. </p>
<p> <a href="https://blabla"/> When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.</p>
</div>
Here is what I have done so far:
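# Assumes soup (a BeautifulSoup object), dic (a dict) and i (the article index) are defined earlier.
contents = []
paragraphs = soup.find('div', {"class": "entry-content clearfix"}).find_all("p")
for p in paragraphs:
    # get_text() keeps only the text and drops child tags such as <a>
    text = p.get_text(" ", strip=True)
    if text:  # skip paragraphs that contain nothing but tags
        contents.append(text)
if contents:
    dic['content'] = contents
else:
    print("ARTICLE:", i, "HAS NO content")
    dic['body'] = "No content"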
Answer 0 (score: 0)
Use the get_text() function. It extracts the text from a paragraph. Reference: https://www.pythonforbeginners.com/beautifulsoup/beautifulsoup-4-python
Result:
Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment.
When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.
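For illustration, here is a minimal sketch of how get_text() could be combined with a filter for empty paragraphs. The function name paragraph_texts and the HTML constant are hypothetical, and the sample markup is adapted from the question (the attribute quote is fixed and the tags are closed so that it parses predictably). It uses the built-in "html.parser" backend; other parsers may treat malformed markup differently.

```python
from bs4 import BeautifulSoup

# Sample markup modelled on the question (adapted so that it is well-formed).
HTML = """
<div class="entry-content clearfix">
<div class="entry-thumbnail"></div>
<p> In as name to here them deny wise this. As rapid woody my he me which. </p>
<p> <a href="https://blabla"></a> </p>
<p> Performed suspicion in certainty so frankness by attention pretended.
Newspaper or in tolerably education enjoyment. </p>
<p> <a href="https://blabla"></a> When be draw drew ye. Defective in do recommend
suffering. House it seven in spoil tiled court. Sister others marked
fat missed did out use.</p>
</div>
"""


def paragraph_texts(html):
    """Return the text of every <p> in the article div, skipping empty ones."""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_="entry-content")
    if container is None:
        return []
    texts = []
    for p in container.find_all("p"):
        # get_text() drops the tags and keeps only the text nodes;
        # split/join collapses newlines and repeated spaces.
        text = " ".join(p.get_text().split())
        if text:  # a paragraph holding only an empty <a> yields ""
            texts.append(text)
    return texts


if __name__ == "__main__":
    for text in paragraph_texts(HTML):
        print(text)
```

Note that get_text() also returns the text inside child tags, so if an <a> tag had link text, that text would be included. If that is not wanted, one option is to call decompose() on those child tags before extracting the text.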