JSONDecodeError when making an API request

Asked: 2020-04-14 09:45:12

Tags: python json api web-scraping

I am trying to make an API request to The Washington Post and extract all articles that match my search query.

import requests
import json
import pandas as pd

#---------Define Parameters for API access
params = {
    "count": "100",
    "datefilter":"displaydatetime:[NOW/DAY-1YEAR TO NOW/DAY+1DAY]",
    "facets.fields":"{!ex=include}contenttype,{!ex=include}name",
    "highlight.fields":"headline,body",
    "highlight.on":"true",
    "highlight.snippets":"1",
    "query":"coronavirus",
    "sort":"displaydatetime desc",
    "startat": "0",
    "callback":"angular.callbacks._0"}

#----------Define function
def WP_Scraper(url):
    #-------------Define empty lists to be scraped
    WP_title   = []
    WP_date   = []
    WP_article   = []
    WP_link = []
    
    with requests.Session() as req:
        for item in range(0, 9527, 100):
            print(f"Extracting Article# {item +1}")
            params["startat"] = item
            r = req.get(url, params=params).json()
            for loop in r['results']:
                WP_title.append(loop['headline'])
                WP_date.append(loop['pubdatetime'])
                WP_link.append(loop['contenturl'])
                WP_article.append(loop['blurb'])
                
    #-------------Save in DF
    df = pd.DataFrame()
    df['title'] = WP_title
    df['date'] = WP_date      
    df['article'] = WP_article 
    df['link']=WP_link
    return df  

WP_data = WP_Scraper("https://sitesearchapp.washingtonpost.com/sitesearch-api/v2/search.json")

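Calling the function raises a JSONDecodeError.

Does anyone know what causes this error, or is there a more efficient way to do this?

I searched Stack Overflow for an answer. If this is a duplicate, please point me in the right direction. Thanks in advance.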

1 Answer:

Answer 0 (score: 1)

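Looking at the response, the JSON is wrapped in /**/angular.callbacks._0();. That wrapper is not valid JSON, so you need to strip it before decoding. You could do something like

r = json.loads(req.get(url, params=params).content.decode('utf-8').strip('/**/angular.callbacks._0();'))
inside your request loop. Also, your nested loop does not match the JSON structure as I understand it: the articles are under the documents key, and blurb is only present some of the time, so try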

for loop in r['results']['documents']:
    WP_title.append(loop['headline'])
    WP_date.append(loop['pubdatetime'])
    WP_link.append(loop['contenturl'])
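    try:
        WP_article.append(loop['blurb'])
    except KeyError:
        # Append a placeholder so all four lists stay the same length;
        # otherwise assigning them as DataFrame columns fails.
        WP_article.append(None)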
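
Note that str.strip() treats its argument as a set of characters to remove from both ends, not as a literal prefix or suffix, so the one-liner above only works as long as the payload itself starts with { and ends with }. A slightly more defensive variant is sketched below. The helper names unwrap_jsonp and documents_to_rows are made up for illustration, the sample string is hand-written rather than real API output, and the field names are the ones used in the question.

```python
import json
import re

# Optional leading comment such as /**/, a callback name, then the payload in
# parentheses, e.g. /**/angular.callbacks._0({...});
_JSONP_RE = re.compile(
    r'^\s*(?:/\*.*?\*/)?\s*[\w.$]+\s*\((.*)\)\s*;?\s*$', re.DOTALL
)


def unwrap_jsonp(text):
    """Decode the JSON payload of a JSONP-style response body."""
    match = _JSONP_RE.match(text)
    if match is None:
        # No callback wrapper found: assume the body is already plain JSON.
        return json.loads(text)
    return json.loads(match.group(1))


def documents_to_rows(payload):
    """Build one dict per document; missing fields (e.g. blurb) become None."""
    rows = []
    for doc in payload["results"]["documents"]:
        rows.append({
            "title": doc.get("headline"),
            "date": doc.get("pubdatetime"),
            "article": doc.get("blurb"),
            "link": doc.get("contenturl"),
        })
    return rows


# Hand-written sample in the shape described above (not real API output).
sample = (
    '/**/angular.callbacks._0({"results": {"documents": ['
    '{"headline": "A", "pubdatetime": 1, "contenturl": "u1", "blurb": "b"},'
    '{"headline": "B", "pubdatetime": 2, "contenturl": "u2"}]}});'
)
rows = documents_to_rows(unwrap_jsonp(sample))
```

In the request loop this would be used as rows.extend(documents_to_rows(unwrap_jsonp(req.get(url, params=params).text))), and pd.DataFrame(rows) at the end replaces the four separate lists. Because each article is one dict, a missing blurb cannot push the columns out of alignment.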