Getting headlines from the Web Archive

Time: 2016-08-06 16:07:12

Tags: python python-2.7 web-scraping beautifulsoup

I am trying to get the headlines from www.bbc.co.uk/news. My code works fine and is shown below:

from bs4 import BeautifulSoup
import urllib2

opener = urllib2.build_opener()

url = 'http://www.bbc.co.uk/news'
# Fetch the page and parse it with the lxml parser
soup = BeautifulSoup(opener.open(url), "lxml")

# Page title taken from <head>
titleTag = soup.html.head.title

print(titleTag.string)

# Headline text sits in <span class="title-link__title-text"> elements
titles = soup.find_all('span', {'class' : 'title-link__title-text'})

headlines = [t.text for t in titles]

print(headlines)

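But I want to build a dataset for a specific date, say 1 April 2016. The headlines keep changing throughout the day, though, and the BBC does not keep a history.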

So I thought of getting them from the Web Archive. For example, I want to get the headlines for the timestamp 20160203074646 from this URL (http://web.archive.org/web/20160203074646/http://www.bbc.co.uk/news).

When I paste that URL into my code, the output contains the headlines.

Edit

But how do I automate this process for all the timestamps?

1 Answer:

Answer 0: (score: 1)

To see all snapshots of a given URL, replace the timestamp with an asterisk:

  

http://web.archive.org/web/*/http://www.bbc.co.uk

Then screen-scrape that page.
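
The scraping step might look roughly like the sketch below. It assumes the listing HTML contains links of the form /web/<14-digit timestamp>/<url>, which is how the snapshot URL in the question is built; if the listing page fills in its calendar with JavaScript, the raw HTML may not contain those links. The function names and the sample HTML are made up for illustration, and only the standard library is used.

```python
import re

WAYBACK_PREFIX = 'http://web.archive.org/web/'

# Matches Wayback links such as /web/20160203074646/http://www.bbc.co.uk/news
SNAPSHOT_LINK = re.compile(r'/web/(\d{14})/')


def extract_timestamps(listing_html):
    """Return the sorted, de-duplicated 14-digit timestamps found in the HTML."""
    return sorted(set(SNAPSHOT_LINK.findall(listing_html)))


def snapshots_for_day(timestamps, day):
    """Keep only timestamps whose date part (YYYYMMDD) equals day."""
    return [ts for ts in timestamps if ts.startswith(day)]


def snapshot_url(timestamp, page_url):
    """Build the archive URL for one snapshot of page_url."""
    return WAYBACK_PREFIX + timestamp + '/' + page_url


# Hypothetical listing fragment; the timestamps are invented for the example.
sample_html = (
    '<a href="/web/20160401063012/http://www.bbc.co.uk/news">06:30</a>'
    '<a href="/web/20160401181500/http://www.bbc.co.uk/news">18:15</a>'
    '<a href="/web/20160402090000/http://www.bbc.co.uk/news">09:00</a>'
)

for ts in snapshots_for_day(extract_timestamps(sample_html), '20160401'):
    print(snapshot_url(ts, 'http://www.bbc.co.uk/news'))
```

Each printed URL can then be passed to the BeautifulSoup code from the question in place of the live BBC address.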

A few things to consider:

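  • The Wayback API gives you the single snapshot closest to a given timestamp. You seem to want all available snapshots, which is why I suggested screen scraping.
  • The BBC may change its headlines faster than the Wayback Machine can snapshot them.
  • The BBC provides RSS feeds, which can be parsed more reliably. There is a list under "Choose a feed".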
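
For comparison, the single-snapshot lookup mentioned in the first point could be sketched as follows. This assumes the Wayback availability endpoint at http://archive.org/wayback/available with its url and timestamp parameters, and a JSON reply that nests the result under archived_snapshots and closest. The helper names are made up, and the sample reply is hand-written to show the expected shape rather than captured from the service.

```python
import json

try:  # Python 2, as used in the question
    from urllib import urlencode
except ImportError:  # Python 3
    from urllib.parse import urlencode

AVAILABILITY_ENDPOINT = 'http://archive.org/wayback/available'


def availability_url(page_url, timestamp):
    """Build a query asking for the snapshot closest to timestamp."""
    query = urlencode([('url', page_url), ('timestamp', timestamp)])
    return AVAILABILITY_ENDPOINT + '?' + query


def closest_snapshot(response_text):
    """Return (timestamp, url) of the closest snapshot, or None if absent."""
    data = json.loads(response_text)
    closest = data.get('archived_snapshots', {}).get('closest')
    if not closest or not closest.get('available'):
        return None
    return closest['timestamp'], closest['url']


# Hand-written reply in the assumed format, reusing the URL from the question.
sample_response = '''{"archived_snapshots": {"closest": {"available": true,
  "url": "http://web.archive.org/web/20160203074646/http://www.bbc.co.uk/news",
  "timestamp": "20160203074646", "status": "200"}}}'''

print(availability_url('bbc.co.uk/news', '20160203'))
print(closest_snapshot(sample_response))
```

Because each query yields at most one snapshot, collecting every capture for a day this way would take many requests with different timestamps, which is the limitation the first point describes.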

Edit: take a look at the feedparser docs

import feedparser
d = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml?edition=uk')
d.entries[0]

Output

{'guidislink': False,
 'href': u'',
 'id': u'http://www.bbc.co.uk/news/world-europe-37003819',
 'link': u'http://www.bbc.co.uk/news/world-europe-37003819',
 'links': [{'href': u'http://www.bbc.co.uk/news/world-europe-37003819',
            'rel': u'alternate',
            'type': u'text/html'}],
 'media_thumbnail': [{'height': u'432',
                      'url': u'http://c.files.bbci.co.uk/12A34/production/_90704367_mediaitem90704366.jpg',
                      'width': u'768'}],
 'published': u'Sun, 07 Aug 2016 21:24:36 GMT',
 'published_parsed': time.struct_time(tm_year=2016, tm_mon=8, tm_mday=7, tm_hour=21, tm_min=24, tm_sec=36, tm_wday=6, tm_yday=220, tm_isdst=0),
 'summary': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public.",
 'summary_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
                    'language': None,
                    'type': u'text/html',
                    'value': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public."},
 'title': u'Turkey death penalty: Erdogan backs return at Istanbul rally',
 'title_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
                  'language': None,
                  'type': u'text/plain',
                  'value': u'Turkey death penalty: Erdogan backs return at Istanbul rally'}}
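
To turn entries like this one into a headline dataset, something along these lines may be enough. It is only a sketch: headline_rows and merge_rows are made-up helper names, and the sample entry is a plain dict standing in for d.entries, reduced to the three fields used and filled with values copied from the output above. Since feedparser entries support the same key access, d.entries could be passed in directly.

```python
def headline_rows(entries):
    """Turn feed entries into (published, title, link) tuples."""
    return [(e.get('published', ''), e['title'], e['link']) for e in entries]


def merge_rows(known, new_rows):
    """Append rows whose link has not been seen yet; return the merged list."""
    seen = set(link for _, _, link in known)
    merged = list(known)
    for row in new_rows:
        if row[2] not in seen:
            seen.add(row[2])
            merged.append(row)
    return merged


# Stand-in for d.entries (values copied from the output above).
sample_entries = [{
    'published': 'Sun, 07 Aug 2016 21:24:36 GMT',
    'title': 'Turkey death penalty: Erdogan backs return at Istanbul rally',
    'link': 'http://www.bbc.co.uk/news/world-europe-37003819',
}]

dataset = merge_rows([], headline_rows(sample_entries))
# Polling the feed again and getting the same entry adds nothing new.
dataset = merge_rows(dataset, headline_rows(sample_entries))
print(len(dataset))
```

Merging on the link means the feed can be polled repeatedly during the day without storing the same headline twice.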