How do I avoid getting a sporadic KeyError: 'data' when using the Reddit API in python?

Time: 2017-04-10 00:43:10

Tags: python error-handling runtime reddit

I have the following Python code, which works fine for using Reddit's API to look up the front page of different subreddits and their rising submissions.

from pprint import pprint
import requests
import json
import datetime
import csv
import time

subredditsToScan = ["Arts", "AskReddit", "askscience", "aww", "books", "creepy", "dataisbeautiful", "DIY", "Documentaries", "EarthPorn", "explainlikeimfive", "food", "funny", "gaming", "gifs", "history", "jokes", "LifeProTips", "movies", "music", "pics", "science", "ShowerThoughts", "space", "sports", "tifu", "todayilearned", "videos", "worldnews"]

ofilePosts = open('posts.csv', 'wb')
writerPosts = csv.writer(ofilePosts, delimiter=',')

ofileUrls = open('urls.csv', 'wb')
writerUrls = csv.writer(ofileUrls, delimiter=',')

for subreddit in subredditsToScan:
    front = requests.get(r'http://www.reddit.com/r/' + subreddit + '/.json')
    rising = requests.get(r'http://www.reddit.com/r/' + subreddit + '/rising/.json')

    front.text
    rising.text

    risingData = rising.json()
    frontData = front.json()

    print(len(risingData['data']['children']))
    print(len(frontData['data']['children']))
    for i in range(0, len(risingData['data']['children'])):
        author = risingData['data']['children'][i]['data']['author']
        score = risingData['data']['children'][i]['data']['score']
        subreddit = risingData['data']['children'][i]['data']['subreddit']
        gilded = risingData['data']['children'][i]['data']['gilded']
        numOfComments = risingData['data']['children'][i]['data']['num_comments']
        linkUrl = risingData['data']['children'][i]['data']['permalink']
        timeCreated = risingData['data']['children'][i]['data']['created_utc']

        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])



    for j in range(0, len(frontData['data']['children'])):
        author = frontData['data']['children'][j]['data']['author'].encode('utf-8').strip()
        score = frontData['data']['children'][j]['data']['score']
        subreddit = frontData['data']['children'][j]['data']['subreddit'].encode('utf-8').strip()
        gilded = frontData['data']['children'][j]['data']['gilded']
        numOfComments = frontData['data']['children'][j]['data']['num_comments']
        linkUrl = frontData['data']['children'][j]['data']['permalink'].encode('utf-8').strip()
        timeCreated = frontData['data']['children'][j]['data']['created_utc']

        writerPosts.writerow([author, score, subreddit, gilded, numOfComments, linkUrl, timeCreated])
        writerUrls.writerow([linkUrl])

It works well and scrapes the data accurately, but it keeps getting interrupted, seemingly at random, by a runtime crash that says:

Traceback (most recent call last):
  File "dataGather1.py", line 27, in <module>
    for i in range(0, len(risingData['data']['children'])):
KeyError: 'data'

I have no idea why this error occurs intermittently rather than consistently. I thought I might be calling the API too often and getting blocked, so I added a sleep to my code, but that did not help. Any ideas?

3 Answers:

Answer 0 (score: 1)

When the response from the API contains no data, the dictionary has no 'data' key, so you get a KeyError on some subreddits. You need to use a try/except block.
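A minimal sketch of that try/except approach follows. The helper name extract_children and both sample payloads are hypothetical, and the shape of the error object is an assumption about what Reddit returns when it refuses a request.

```python
def extract_children(payload):
    """Return the list of posts in a listing payload, or [] if it has none."""
    try:
        return payload['data']['children']
    except (KeyError, TypeError):
        # No 'data' key (for example an error object), or payload is not a dict.
        return []


# Hypothetical payloads for illustration: one normal listing, one error object.
listing = {'data': {'children': [{'data': {'author': 'alice', 'score': 10}}]}}
error = {'message': 'Too Many Requests', 'error': 429}

for payload in (listing, error):
    # The error payload yields an empty list, so its inner loop does not run.
    for child in extract_children(payload):
        print(child['data']['author'], child['data']['score'])
```

In the question's script, risingData and frontData would each be passed through such a helper, so a bad response skips that subreddit instead of crashing the whole run.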

Answer 1 (score: 0)

The json you are parsing doesn't contain the 'data' element. Thus you get an error. I think your hunch is correct though. It is probably rate limiting, or that you're asking for hidden/deleted entries.

Reddit is very strict about accessing their API without playing nice. That means you should register your app and send a meaningful User-Agent with your requests, and you should probably use the Python library for this kind of thing: https://praw.readthedocs.io/en/latest/

In my experience, without registering, the direct REST Reddit API is even stricter than the 1 request per 2 seconds rule they have (had?).
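If you stay with plain HTTP requests, a rough sketch of sending a descriptive User-Agent and retrying after a refused request could look like this. The function name fetch_listing, the User-Agent string, and the retry count and delays are all assumptions for illustration, not values documented by Reddit.

```python
import time

# Hypothetical User-Agent string; replace the app name and username with your own.
USER_AGENT = 'python:subreddit-scanner:v0.1 (by /u/your_username)'


def fetch_listing(get, url, retries=3, delay=2.0, sleep=time.sleep):
    """Call get(url, headers=...) until it answers HTTP 200, then return the parsed JSON.

    Returns None if every attempt fails.
    """
    for attempt in range(retries):
        if attempt:
            # Wait a little longer before each retry.
            sleep(delay * attempt)
        response = get(url, headers={'User-Agent': USER_AGENT})
        if response.status_code == 200:
            return response.json()
    return None
```

With requests installed, this could be called as fetch_listing(requests.get, 'https://www.reddit.com/r/aww/rising/.json'). Passing the get function in as an argument keeps the retry logic independent of the HTTP library.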

Answer 2 (score: 0)

Python raises a KeyError whenever a key is looked up in a dict object (using the form a = adict[key]) and the key is not in the dictionary.
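As a small illustration of that behaviour, the snippet below compares a plain lookup with the in operator and dict.get. The response dictionary is a made-up error payload, not a captured Reddit response.

```python
# Hypothetical error payload with no 'data' key, for illustration only.
response = {'message': 'Too Many Requests', 'error': 429}

try:
    response['data']              # plain indexing raises KeyError
except KeyError as exc:
    print('missing key:', exc)    # prints: missing key: 'data'

print('data' in response)         # prints: False
print(response.get('data'))       # prints: None
print(response.get('data', {}))   # prints: {}
```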

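It seems that when you get this error, the response has no 'data' key at all, probably because the API returned an error object instead of a listing.

You could fetch the list of children defensively and check its length before you execute the for loop. If it is empty, the loop simply does not run. Some additional error checking here, such as logging the unexpected response, might help.

children = risingData.get('data', {}).get('children', [])
size = len(children)
if size:
    for i in range(0, size):
    …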