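I am trying to use the Python newspaper library (import newspaper ...) together with the Internet Archive (http://www.archive.org), which stores old versions of websites. In theory, this should make it possible to download very old news articles.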
For example:
import newspaper
url = 'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'
paper = newspaper.build(url, memoize_articles=False)
Although the archived page itself contains links to the actual news articles from 2016-12-01, the newspaper module does not seem to pick them up. Instead, you get URLs like
https://blog.archive.org/2016/10/23/defining-web-pages-web-sites-and-web-captures/
which are not actual articles from the archived version of CNBC. However, newspaper works well with today's version of http://cnbc.com.
I suspect it gets confused by the format of the URL (which contains two "http"s). Any suggestions on how to untangle this?
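To make the "two URLs in one" structure concrete, here is a minimal sketch that splits a Wayback Machine capture URL into its timestamp and the original address. The helper name `split_wayback_url` is hypothetical, and the pattern assumes the plain capture form seen in this question (a 14-digit timestamp directly followed by the original URL); it is not part of newspaper.

```python
import re

# Assumed shape of a capture URL, based on the URLs in this thread:
#   http(s)://web.archive.org/web/<14-digit timestamp>/<original URL>
WAYBACK_RE = re.compile(
    r'^https?://web\.archive\.org/web/(?P<timestamp>\d{14})/(?P<original>.+)$'
)


def split_wayback_url(url):
    """Return (timestamp, original_url), or None if url is not a capture URL."""
    match = WAYBACK_RE.match(url)
    if match is None:
        return None
    return match.group('timestamp'), match.group('original')


if __name__ == '__main__':
    print(split_wayback_url(
        'http://web.archive.org/web/20161201123529/http://www.cnbc.com/'))
    # -> ('20161201123529', 'http://www.cnbc.com/')
    print(split_wayback_url(
        'https://blog.archive.org/2016/10/23/'
        'defining-web-pages-web-sites-and-web-captures/'))
    # -> None
```

A check like this can tell archived article links apart from links to the Internet Archive's own pages, such as the blog URL above.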
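Answer 0 (score: 1)
This is an interesting question, and I will add it to the Newspaper Usage Overview document on GitHub.
I tried using newspaper.build but could not get it to work, so I used newspaper's Source instead.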
from time import sleep
from random import randint
from newspaper import Config
from newspaper import Source

# Send a browser-like user agent and cap each request at 10 seconds
USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()
config.browser_user_agent = USER_AGENT
config.request_timeout = 10

# Build a Source directly from the archived snapshot of the CNBC front page
wayback_cnbc = Source(url='https://web.archive.org/web/20180301012621/https://www.cnbc.com/', config=config,
                      memoize_articles=False, language='en', number_threads=20, thread_timeout_seconds=2)
wayback_cnbc.build()

# Download and parse each article found in the snapshot
for article_extract in wayback_cnbc.articles:
    article_extract.download()
    article_extract.parse()
    print(article_extract.publish_date)
    print(article_extract.title)
    print(article_extract.url)
    print('')
    # this sleep timer is helping with some timeout issues
    # that were happening when querying
    sleep(randint(1,3))
The example above outputs:
None
Media
https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/

None
CNBC Video
https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/

2017-11-08 00:00:00
CNBC Healthy Returns
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2017/11/08/healthy-returns.html

2018-02-28 00:00:00
Markets in Asia decline as dollar steadies; Nikkei falls 307 points
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/asia-markets-stocks-dollar-and-china-caixin-pmi-in-focus.html

2018-02-28 00:00:00
S&P 500 rises, but on track to snap longest monthly win streak since 1959
https://web.archive.org/web/20180301012621/https://www.cnbc.com/2018/02/28/us-stocks-interest-rates-fed-markets.html
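The first two entries in that output are section pages (no publish date), not articles. If only dated articles are wanted, one option is to filter on the URL before downloading. The sketch below is an assumption based purely on the URLs shown above, where article paths start with /YYYY/MM/DD/; the helper name `looks_like_dated_article` is hypothetical and other sites or snapshots may use a different layout.

```python
import re

# Assumed URL shape, taken from the sample output: a Wayback capture prefix
# followed by a cnbc.com path that begins with /YYYY/MM/DD/.
DATED_ARTICLE_RE = re.compile(
    r'^https?://web\.archive\.org/web/\d{14}/'
    r'https?://www\.cnbc\.com/\d{4}/\d{2}/\d{2}/.+'
)


def looks_like_dated_article(url):
    """Return True if url matches the assumed dated-article pattern."""
    return DATED_ARTICLE_RE.match(url) is not None


if __name__ == '__main__':
    sample_urls = [
        'https://web.archive.org/web/20180301012621/https://www.cnbc.com/media/',
        'https://web.archive.org/web/20180301012621/https://www.cnbc.com/video/',
        'https://web.archive.org/web/20180301012621/https://www.cnbc.com/'
        '2017/11/08/healthy-returns.html',
    ]
    for url in sample_urls:
        print(looks_like_dated_article(url), url)
```

In the loop above, the same check could be applied to article_extract.url before calling download(), which would also skip the request for pages that are not articles.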
I hope this answer helps you query articles through the Wayback Machine. Let me know if you have any questions.