Question

I wrote this test code that uses BeautifulSoup.

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"
html = urllib.request.urlopen(url).read()
soup = BeautifulSoup(html, "lxml")
# Print the text of every <p> tag on the page
for n in soup.find_all('p'):
    print(n.get_text())

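It works fine, but it also retrieves text that is not part of the news article, such as the publication time, the number of comments, the copyright notice, and so on.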

I want it to retrieve only the text of the news article itself. How can I do that?

Answer 1

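You need to target something more specific than just the p tag. Try looking for a div class="article" or something similar, and then grab only the paragraphs inside it.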
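A minimal sketch of that idea, run against a small inline HTML string rather than the live page. The class name "article" and the sample markup are assumptions for illustration only; inspect the real page to find the actual container. It uses the built-in "html.parser" so it runs without lxml installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: only the div with class "article" holds the story.
html = """
<html><body>
  <p>Published: 10:00</p>
  <div class="article">
    <p>First paragraph of the story.</p>
    <p>Second paragraph of the story.</p>
  </div>
  <p>Copyright notice</p>
</body></html>
"""

soup = BeautifulSoup(html, "html.parser")

# Narrow the search to the container first, then collect only its paragraphs.
container = soup.find("div", class_="article")
paragraphs = [p.get_text() for p in container.find_all("p")]

for text in paragraphs:
    print(text)
```

The publication time and copyright paragraphs sit outside the container, so they are never visited.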

Answer 2

You might have better luck with the newspaper library, which focuses on scraping articles.

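If we are talking about BeautifulSoup only, one option that gets closer to the desired result and returns more relevant paragraphs is to look for them inside the div element that has the itemprop="articleBody" attribute: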

article_body = soup.find(itemprop="articleBody")
for p in article_body.find_all("p"):
    print(p.get_text())
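One caveat: soup.find returns None when nothing matches, so the loop above would raise an AttributeError on a page without that attribute. A hedged sketch of a guarded version follows; the function name article_paragraphs and the sample strings are made up for illustration, and "html.parser" is used only so the example runs without lxml.

```python
from bs4 import BeautifulSoup

def article_paragraphs(html):
    """Return the text of each <p> inside the itemprop="articleBody" element."""
    soup = BeautifulSoup(html, "html.parser")
    article_body = soup.find(itemprop="articleBody")
    if article_body is None:
        # No article container on this page: return nothing instead of crashing.
        return []
    return [p.get_text(strip=True) for p in article_body.find_all("p")]

# Hypothetical inputs, one with an article body and one without.
sample = '<div itemprop="articleBody"><p> Story text. </p></div><p>42 comments</p>'
with_body = article_paragraphs(sample)
without_body = article_paragraphs("<p>No article here</p>")

print(with_body)
print(without_body)
```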

Answer 3

To be more specific, you need to grab the div marked as articleBody (through its itemprop attribute), so:

import urllib.request
from bs4 import BeautifulSoup

url = "http://www.dailymail.co.uk/news/article-3795511/Harry-Potter-sale-half-million-pound-house-Iconic-Privet-Drive-market-suburban-Berkshire-complete-cupboard-stairs-one-magical-boy.html"    
html = urllib.request.urlopen(url).read()  
soup = BeautifulSoup(html,"lxml")
for n in soup.find_all('div', attrs={'itemprop':"articleBody"}):
    print(n.get_text())

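Answers on SO are not only for you, but also for people who arrive from Google searches and the like. As you can see, attrs is a dictionary, so you can pass more attribute/value pairs as needed.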
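To illustrate passing more than one attribute/value pair in attrs, here is a small sketch on made-up markup. The class values "teaser" and "full" are assumptions, not something taken from the Daily Mail page.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: two divs share the itemprop but differ in class.
html = """
<div itemprop="articleBody" class="teaser">Short teaser</div>
<div itemprop="articleBody" class="full">Full article text</div>
"""

soup = BeautifulSoup(html, "html.parser")

# Every key in attrs must match, so only the second div is returned.
matches = soup.find_all("div", attrs={"itemprop": "articleBody", "class": "full"})
texts = [div.get_text() for div in matches]

print(texts)
```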

Python (BeautifulSoup): how to limit the text extracted from an HTML news article to just the news article.
