I am trying to make an API request to the Washington Post and extract all articles that match my search query.
import requests
import json
import pandas as pd

#---------Define Parameters for API access
params = {
    "count": "100",
    "datefilter": "displaydatetime:[NOW/DAY-1YEAR TO NOW/DAY+1DAY]",
    "facets.fields": "{!ex=include}contenttype,{!ex=include}name",
    "highlight.fields": "headline,body",
    "highlight.on": "true",
    "highlight.snippets": "1",
    "query": "coronavirus",
    "sort": "displaydatetime desc",
    "startat": "0",
    "callback": "angular.callbacks._0"}

#----------Define Function
def WP_Scraper(url):
    #-------------Define empty lists to be scraped
    WP_title = []
    WP_date = []
    WP_article = []
    WP_link = []
    with requests.Session() as req:
        for item in range(0, 9527, 100):
            print(f"Extracting Article# {item +1}")
            params["startat"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                WP_title.append(loop['headline'])
                WP_date.append(loop['pubdatetime'])
                WP_link.append(loop['contenturl'])
                WP_article.append(loop['blurb'])
    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = WP_title
    df['date'] = WP_date
    df['article'] = WP_article
    df['link'] = WP_link
    return df

WP_data = WP_Scraper("https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json")
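Does anyone know what is causing the error, or whether there is a more efficient way to do this?
I searched Stack Overflow for an answer. If this is a duplicate, please point me in the right direction. Thanks in advance.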
Answer 0 (score: 1)
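Looking at the response, the JSON is wrapped in /**/angular.callbacks._0(); (a JSONP callback, because of the callback parameter). You need to strip that wrapper before parsing the JSON, so in your request loop you could do something like

r = json.loads(req.get(url, params=params).content.decode('utf-8').strip('/**/angular.callbacks._0();'))

Note that str.strip treats its argument as a set of characters rather than a literal prefix; it works here because the payload itself starts with { and ends with }.

Also, your nested loop does not match the JSON structure as I understand it: the articles are contained under the documents key, and blurb is only present some of the time, so try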
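for loop in r['results']['documents']:
    WP_title.append(loop['headline'])
    WP_date.append(loop['pubdatetime'])
    WP_link.append(loop['contenturl'])
    try:
        WP_article.append(loop['blurb'])
    except KeyError:
        # No blurb for this article: append a placeholder so all four
        # lists stay the same length when the DataFrame is built.
        WP_article.append(None)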
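
If you would rather not depend on the exact callback name, here is a minimal sketch of a more general way to unwrap the response and collect the fields. The helper names unwrap_jsonp and extract_rows are my own, not part of any library, and the sample string only imitates the shape described above (results, then documents); I have not checked it against every response the API can return.

```python
import json
import re

# Optional leading /* ... */ comment, a callback name, then the payload in parentheses.
_JSONP_RE = re.compile(r'^\s*(?:/\*.*?\*/)?\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL)


def unwrap_jsonp(text):
    """Parse a JSONP response body; fall back to plain JSON if there is no wrapper."""
    match = _JSONP_RE.match(text)
    if match is None:
        return json.loads(text)
    return json.loads(match.group(1))


def extract_rows(payload):
    """Build one dict per article; missing keys (e.g. blurb) become None."""
    rows = []
    for doc in payload["results"]["documents"]:
        rows.append({
            "title": doc.get("headline"),
            "date": doc.get("pubdatetime"),
            "link": doc.get("contenturl"),
            "article": doc.get("blurb"),
        })
    return rows


# Made-up sample in the assumed response shape, with no blurb key.
sample = ('/**/angular.callbacks._0({"results": {"documents": '
          '[{"headline": "A", "pubdatetime": 1, "contenturl": "u"}]}});')
rows = extract_rows(unwrap_jsonp(sample))
```

Inside the request loop this would be used as rows.extend(extract_rows(unwrap_jsonp(req.get(url, params=params).text))), and pd.DataFrame(rows) at the end gives the four columns without having to keep separate lists in sync.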