I am trying to get the headlines from www.bbc.co.uk/news. My code works fine and looks like this:
from bs4 import BeautifulSoup
import urllib2
# Fetch the BBC News front page and parse it with the lxml parser
opener = urllib2.build_opener()
url = 'http://www.bbc.co.uk/news'
soup = BeautifulSoup(opener.open(url), "lxml")
# Print the page title
titleTag = soup.html.head.title
print(titleTag.string)
# Collect the text of every headline span on the page
titles = soup.find_all('span', {'class' : 'title-link__title-text'})
headlines = [t.text for t in titles]
print(headlines)
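However, I want to build a dataset for a specific date, say April 1, 2016. The headlines keep changing throughout the day, and the BBC does not keep a history of them.
So I thought of getting them from the Web Archive instead. For example, I want to get the headlines for timestamp 20160203074646 from this URL: http://web.archive.org/web/20160203074646/http://www.bbc.co.uk/news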
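When I paste that URL into my code, the output contains the headlines from that snapshot.
Edit:
But how can I automate this process for all timestamps?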
Answer 0 (score: 1)
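To see all snapshots for a given URL, replace the timestamp with an asterisk:
http://web.archive.org/web/*/http://www.bbc.co.uk/news
Then screen-scrape that page to collect the individual snapshot links.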
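As an alternative to screen-scraping the listing page, the Wayback Machine also exposes a CDX query endpoint that returns the capture list as JSON. The following is only a minimal sketch, assuming Python 3 (the question uses Python 2's urllib2). The helper names are hypothetical, and the assumption that an 8-digit `from`/`to` value covers the whole day should be checked against the CDX documentation.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"
PAGE = "http://www.bbc.co.uk/news"


def cdx_query_url(page, day):
    """Build a CDX query for the captures of `page` on `day` (YYYYMMDD)."""
    params = [
        ("url", page),
        ("from", day),                    # assumed to mean start of that day
        ("to", day),                      # assumed to mean end of that day
        ("output", "json"),
        ("fl", "timestamp"),              # only return the timestamp column
        ("filter", "statuscode:200"),     # skip redirects and errors
    ]
    return CDX_ENDPOINT + "?" + urlencode(params)


def timestamps_from_cdx(rows):
    """Drop the header row of a CDX JSON response and return the timestamps."""
    return [row[0] for row in rows[1:]]


def snapshot_url(timestamp, page):
    """Build a URL of the same shape as the snapshot quoted in the question."""
    return "http://web.archive.org/web/{}/{}".format(timestamp, page)


def list_snapshots(page, day):
    """Fetch the capture list over the network and return snapshot URLs."""
    with urlopen(cdx_query_url(page, day)) as response:
        rows = json.loads(response.read().decode("utf-8"))
    return [snapshot_url(ts, page) for ts in timestamps_from_cdx(rows)]
```

Each URL returned by `list_snapshots(PAGE, "20160401")` can then be fetched and parsed exactly like the single snapshot URL in the question. Pausing between requests is advisable so the archive is not hammered.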
A few things to consider:
Edit: take a look at the feedparser docs.
import feedparser
d = feedparser.parse('http://feeds.bbci.co.uk/news/rss.xml?edition=uk')
d.entries[0]
Output:
{'guidislink': False,
'href': u'',
'id': u'http://www.bbc.co.uk/news/world-europe-37003819',
'link': u'http://www.bbc.co.uk/news/world-europe-37003819',
'links': [{'href': u'http://www.bbc.co.uk/news/world-europe-37003819',
'rel': u'alternate',
'type': u'text/html'}],
'media_thumbnail': [{'height': u'432',
'url': u'http://c.files.bbci.co.uk/12A34/production/_90704367_mediaitem90704366.jpg',
'width': u'768'}],
'published': u'Sun, 07 Aug 2016 21:24:36 GMT',
'published_parsed': time.struct_time(tm_year=2016, tm_mon=8, tm_mday=7, tm_hour=21, tm_min=24, tm_sec=36, tm_wday=6, tm_yday=220, tm_isdst=0),
'summary': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public.",
'summary_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
'language': None,
'type': u'text/html',
'value': u"Turkey's President Erdogan tells a huge rally in Istanbul that he would approve the return of the death penalty if it was backed by parliament and the public."},
'title': u'Turkey death penalty: Erdogan backs return at Istanbul rally',
'title_detail': {'base': u'http://feeds.bbci.co.uk/news/rss.xml?edition=uk',
'language': None,
'type': u'text/plain',
'value': u'Turkey death penalty: Erdogan backs return at Istanbul rally'}}
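To turn entries like the one above into a headline dataset, something along these lines might work. It is a sketch only: the function names are hypothetical, and `feedparser` is imported inside the fetching function so that the extraction helper can be tried on plain dictionaries without the package installed. Note that the feed only lists current items, so this collects headlines from the moment polling starts and does not recover past dates.

```python
def headlines_from_entries(entries):
    """Return (published, title) pairs from feedparser-style entries."""
    return [(e.get("published", ""), e.get("title", "")) for e in entries]


def fetch_headlines(feed_url):
    """Download the feed and return its (published, title) pairs."""
    import feedparser  # third-party package, imported lazily on purpose
    return headlines_from_entries(feedparser.parse(feed_url).entries)


# A plain dict shaped like the entry shown above, for an offline check
sample = [{
    "published": "Sun, 07 Aug 2016 21:24:36 GMT",
    "title": "Turkey death penalty: Erdogan backs return at Istanbul rally",
}]
print(headlines_from_entries(sample))
```

Calling `fetch_headlines('http://feeds.bbci.co.uk/news/rss.xml?edition=uk')` at regular intervals and appending the results to a file would build up the dataset over time.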