Question

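I am trying to build a corpus of text documents in Python from the New York Times API (articles about terrorist attacks).

I know the NYT API does not provide the full body text, but it does provide a URL from which I can scrape each article. So the idea is to extract the "web_url" parameter from the API response and then scrape the full article from it.

I am trying to use the NYT API library for Python: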

from nytimesarticle import articleAPI

api = articleAPI("*Your Key*")
articles = api.search( q = 'terrorist attack')

print(articles['response'],['docs'],['web_url'])

But I cannot extract the "web_url" or the articles. All I get is this output:

{'meta': {'time': 19, 'offset': 10, 'hits': 0}, 'docs': []} ['docs'] ['web_url']

Answer 1

There seems to be a problem with the nytimesarticle module itself. For example, see the following:

>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
>>> print(articles)
{'response': {'docs': [], 'meta': {'offset': 0, 'hits': 0, 'time': 21}}, 'status': 'OK', 'copyright': 'Copyright (c) 2013 The New York Times Company.  All Rights Reserved.'}

But if I use requests (as the module itself does) to access the API directly, I get the results I am looking for:

>>> import requests
>>> r = requests.get("http://api.nytimes.com/svc/search/v2/articlesearch.json?q=trump+women+accuse&begin_date=20161001&api-key=XXXXX")
>>> data = r.json()
>>> len(data["response"]["docs"])
10

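This means 10 articles were returned (the full value of data is 16 kB, so I won't include it all here). Contrast that with the response from api.search(), where articles["response"]["docs"] is an empty list.

nytimesarticle.py is only 115 lines long, so it is easy to debug. Printing the URL that is sent to the API shows: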

>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
https://api.nytimes.com/svc/search/v2/articlesearch.json?q=b'trump+women+accuse'&begin_date=20161001&api-key=XXXXX
#                                                          ^^ THIS

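The offending code encodes every string parameter as UTF-8, which turns it into a bytes object. This is unnecessary, and it breaks the constructed URL as shown above. Fortunately, there is a pull request that fixes this: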

>>> articles = api.search(q="trump+women+accuse", begin_date=20161001)
http://api.nytimes.com/svc/search/v2/articlesearch.json?begin_date=20161001&q=trump+women+accuse&api-key=XXXXX
>>> len(articles["response"]["docs"])
10

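This also makes it possible to use other string parameters, such as sort="newest", which previously caused an error because of the bytes formatting.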
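The failure mode can be reproduced without the module. The following is a minimal sketch under Python 3, not the module's actual code; the variable names are illustrative, and urlencode from the standard library is shown only as one possible way to build the query string.

```python
from urllib.parse import urlencode

query = "trump+women+accuse"

# Formatting a bytes object into a str embeds its repr, b'...' prefix included.
broken = "q={}".format(query.encode("utf-8"))

# Formatting the plain str keeps the query intact.
fixed = "q={}".format(query)

# Alternative: let urlencode build the query string; it turns spaces into '+'.
encoded = urlencode({"q": "trump women accuse", "begin_date": 20161001})

print(broken)   # q=b'trump+women+accuse'
print(fixed)    # q=trump+women+accuse
print(encoded)  # q=trump+women+accuse&begin_date=20161001
```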

Answer 2

The commas in the print statement separate the things being printed.
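To illustrate, here is a small sketch with a hypothetical dictionary shaped like the output in the question. With commas, print receives three separate objects; without them, the subscripts chain into one lookup.

```python
# Hypothetical data shaped like the question's output.
articles = {"response": {"meta": {"hits": 0}, "docs": []}}

# Three separate arguments: a dict, the list ['docs'], and the list ['web_url'].
args = (articles["response"], ["docs"], ["web_url"])
print(*args)  # {'meta': {'hits': 0}, 'docs': []} ['docs'] ['web_url']

# Chained subscripts with no commas form a single expression.
docs = articles["response"]["docs"]
print(docs)  # []
```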

You would want something like this

articles['response']['docs']['web_url']

But 'docs': [] is a list, and an empty one at that, so the line above will not work. Instead you can try

articles = articles['response']['docs']
for article in articles:
    print(article['web_url'])
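The same loop can be wrapped in a small helper that tolerates an empty or incomplete result. This is only a sketch: extract_web_urls is a hypothetical name, and the sample dictionaries and example URLs are made up to mirror the response shape shown above.

```python
def extract_web_urls(result):
    """Return the web_url of every doc in a search result dict."""
    docs = result.get("response", {}).get("docs", [])
    return [doc["web_url"] for doc in docs if "web_url" in doc]


# Hypothetical results, shaped like the responses shown above.
sample = {
    "response": {
        "docs": [
            {"web_url": "https://www.nytimes.com/example-1.html"},
            {"web_url": "https://www.nytimes.com/example-2.html"},
        ],
        "meta": {"hits": 2, "offset": 0, "time": 21},
    },
    "status": "OK",
}
empty = {"response": {"docs": [], "meta": {"hits": 0, "offset": 0, "time": 19}}}

urls = extract_web_urls(sample)
no_urls = extract_web_urls(empty)
print(urls)
print(no_urls)
```

An empty 'docs' list simply yields an empty list of URLs, so the caller can check for that case before attempting to scrape anything.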
