Newspaper3k API: Article download() failed with HTTPSConnectionPool port=443 Read timed out. (read timeout=7) on URL

Time: 2020-07-23 18:49:01

Tags: python python-3.x https timeout newspaper3k

I can view http://www.chicagotribune.com/ct-florida-school-shooter-nikolas-cruz-20180217-story.html when browsing in Firefox. However, newspaper3k gives me this error:

Article download() failed with HTTPSConnectionPool(host='www.chicagotribune.com', port=443): Read timed out. (read timeout=7) on URL http://www.chicagotribune.com/ct-florida-school-shooter-nikolas-cruz-20180217-story.html

My code is:

from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'
config = Config()

config.browser_user_agent = user_agent

url = "https://www.chicagotribune.com/nation-world/ct-florida-school-shooter-nikolas-cruz-20180217-story.html"

page = Article(url, config=config)


page.download()
page.parse()
print(page.text)

I think a method like 'renewIPAddress()' might help, but I'm not sure how to fit it into this code correctly. https://stackoverflow.com/a/50496768/2414957

1 answer:

Answer 0: (score: 1)

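You may have already solved this. Your code works, but the "read timeout" occurs at certain points in time. I have noticed that Newspaper connections time out from time to time, because it fetches pages with the Python module requests and gives up when the server does not respond within the configured timeout (7 seconds in your error message). These timeouts are usually tied to the source you are querying. Newspaper3k does support a timeout parameter in Config(), request_timeout, and raising it helps prevent future "read timeout" issues.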

from newspaper import Article
from newspaper import Config

user_agent = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.15; rv:78.0) Gecko/20100101 Firefox/78.0'

config = Config()
config.browser_user_agent = user_agent
# Allow up to 10 seconds for the server to respond (the error above shows a 7-second timeout)
config.request_timeout = 10

url = "https://www.chicagotribune.com/nation-world/ct-florida-school-shooter-nikolas-cruz-20180217-story.html"

page = Article(url, config=config)

page.download()
page.parse()
print(page.text)
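
Since these timeouts tend to be intermittent, a longer timeout can be combined with a few retries. The following is only a minimal sketch, not part of newspaper3k: fetch_with_retries and fetch_article_text are hypothetical helper names, and the retry count and delays are arbitrary assumptions. It relies on parse() raising an exception when the download failed, which is what produces the "Article download() failed with ..." message in the question.

```python
import time


def fetch_with_retries(fetch, attempts=3, base_delay=1.0, sleep=time.sleep):
    """Call fetch() until it succeeds or the attempts run out.

    Hypothetical helper: waits base_delay, then 2 * base_delay, and so on
    between failed attempts, and re-raises the last error if all attempts fail.
    """
    last_error = None
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception as exc:  # broad on purpose; narrow this in real code
            last_error = exc
            if attempt < attempts - 1:
                sleep(base_delay * 2 ** attempt)
    raise last_error


def fetch_article_text(url, config):
    """Download and parse one article, returning its text (hypothetical helper)."""
    # Imported here so the retry helper above works without newspaper installed.
    from newspaper import Article

    page = Article(url, config=config)
    page.download()
    page.parse()  # raises if the download did not succeed
    return page.text
```

With the config and url from the code above, this could be called as text = fetch_with_retries(lambda: fetch_article_text(url, config)). Retrying only helps when the source is temporarily slow; if the site consistently blocks or throttles the request, the last error is still raised.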