Question

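I want to retrieve the paragraphs of an article when a user supplies its URL. However, most sites, such as nytimes or Google News, also list other articles below the one being read, or have unrelated paragraphs on the same page. How can I work around this so that only the relevant article, the one someone clicked to read, gets scraped?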

import urllib,sys
from bs4 import BeautifulSoup
import re
from prettyprint import prettyprint as pp 

html = urllib.urlopen(raw_input("Paste the URL of the website here:\n")).read()
soup = BeautifulSoup(html, 'lxml')
texts = soup.findAll("p", text=True)
paragraphs = soup.find('article').find("div", {'class': 'storyBody'}).find_all('p') 
# This works for the Google article because its div class is storyBody,
# so only the relevant article is scraped. Other sites use different markup.

def visible(element):
    if element.parent.name in ['style', 'script', '[document]', 'head', 'title']:
        return False
    elif re.match('<!--.*-->', str(element)):
        return False
    return True

visible_texts = filter(visible, texts)
print(visible_texts)

result = pp.pp_str(visible_texts)

Example sites:

http://www.zdnet.be/nieuws/188483/binnenkort-eindelijk-4g-internet-op-alle-vliegtuigen/

http://www.nu.nl/binnenland/4357771/in-hoger-beroep-in-zaak-nicole-van-hurk.html

These two pages use different formats, so I am looking for something generic to implement.
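
To make "something generic" concrete, here is a minimal sketch of one possible heuristic, assuming the article body is the element whose direct <p> children carry the most text. This is an assumption about typical page layouts, not something verified against the two example sites. The names ParagraphCollector and extract_article_paragraphs are hypothetical, and the sketch uses Python 3's standard-library html.parser instead of BeautifulSoup so that it runs without extra dependencies. With BeautifulSoup, the same grouping could probably be done by bucketing each <p> under its .parent.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not go on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class ParagraphCollector(HTMLParser):
    """Group the text of each <p> under the element that contains it."""

    def __init__(self):
        super().__init__()
        self.stack = []             # (tag, id) for every open non-<p> element
        self.next_id = 0            # running id used to tell containers apart
        self.groups = {}            # container id -> list of paragraph texts
        self.current = None         # text pieces of the open <p>, else None
        self.current_parent = None  # container id of the open <p>

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if tag == "p":
            self.close_paragraph()  # a new <p> implicitly ends an open one
            self.current = []
            self.current_parent = self.stack[-1][1] if self.stack else -1
            return
        self.stack.append((tag, self.next_id))
        self.next_id += 1

    def handle_endtag(self, tag):
        if tag == "p":
            self.close_paragraph()
            return
        # Pop back to the matching open tag, tolerating unclosed elements.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.current is not None:
            self.current.append(data)

    def close_paragraph(self):
        if self.current is not None:
            text = " ".join("".join(self.current).split())
            if text:
                self.groups.setdefault(self.current_parent, []).append(text)
            self.current = None


def extract_article_paragraphs(html):
    """Return the paragraphs of the container holding the most <p> text."""
    collector = ParagraphCollector()
    collector.feed(html)
    collector.close()
    collector.close_paragraph()
    if not collector.groups:
        return []
    return max(collector.groups.values(),
               key=lambda paras: sum(len(p) for p in paras))


# Made-up sample page: one story block plus unrelated teaser and footer text.
sample = """
<html><body>
  <div class="storyBody">
    <p>First paragraph of the main story, long enough to dominate.</p>
    <p>Second paragraph of the main story with <a href="#">a link</a> inside.</p>
  </div>
  <div class="related"><p>Other article</p></div>
  <footer><p>Footer</p></footer>
</body></html>
"""
paragraphs = extract_article_paragraphs(sample)
for paragraph in paragraphs:
    print(paragraph)
```

The heuristic ignores class names entirely, which is what would let it cope with differently structured sites. It would likely fail on pages where the article's paragraphs are split across several sibling containers or where a long comment section outweighs the story.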

How can I use BeautifulSoup to scrape the relevant article text from different websites?

0 answers: