I want to retrieve the paragraphs of an article when the user supplies a URL as input. The problem is that most sites, such as nytimes or Google News, also list other articles below the one being read, or put unrelated paragraphs on the same page. How can I work around this so that only the article the person actually clicked on to read gets scraped?
import re
import urllib
from bs4 import BeautifulSoup
from prettyprint import prettyprint as pp
html = urllib.urlopen(raw_input("Paste the URL of the website here:\n")).read()
soup = BeautifulSoup(html, 'lxml')
# Every <p> tag on the page that directly contains text.
texts = soup.find_all("p", text=True)
# This works for the Google article, because the story body lives in a div
# with class 'storyBody', so only the relevant article is scraped.
paragraphs = soup.find('article').find("div", {'class': 'storyBody'}).find_all('p')
def visible(element):
    # Drop strings that sit inside tags the reader never sees.
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    # Drop HTML comments.
    elif re.match('<!--.*-->', str(element)):
        return False
    return True
visible_texts = filter(visible, texts)
print(visible_texts)
result = pp.pp_str(visible_texts)
Example sites:
http://www.zdnet.be/nieuws/188483/binnenkort-eindelijk-4g-internet-op-alle-vliegtuigen/
http://www.nu.nl/binnenland/4357771/in-hoger-beroep-in-zaak-nicole-van-hurk.html
These two use different formats, so I am looking for something generic that works for both.
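For reference, here is a rough sketch of the kind of generic heuristic I have been considering (my own guess, not a known-good solution): assume the article body is whichever container holds the most paragraph text, and ignore everything else. extract_main_article is just a name I made up for the helper.

import urllib
from bs4 import BeautifulSoup

def extract_main_article(url):
    html = urllib.urlopen(url).read()
    soup = BeautifulSoup(html, 'lxml')
    # Strip elements that never belong to the story body.
    for tag in soup(['script', 'style', 'nav', 'header', 'footer', 'aside']):
        tag.decompose()
    best, best_len = None, 0
    # Score every candidate container by the total length of its direct
    # <p> children; the main story container usually wins, while sidebars
    # and "related articles" blocks only contribute short paragraphs.
    for container in soup.find_all(['article', 'div', 'section']):
        text_len = sum(len(p.get_text(strip=True))
                       for p in container.find_all('p', recursive=False))
        if text_len > best_len:
            best, best_len = container, text_len
    if best is None:
        return ''
    return '\n\n'.join(p.get_text(strip=True)
                       for p in best.find_all('p', recursive=False))

print(extract_main_article(raw_input("Paste the URL of the website here:\n")))

Is a text-density heuristic like this reasonable, or is there a more established way to isolate the main article across differently structured sites?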