Downloading a dict to CSV with Python

Time: 2016-08-05 11:21:41

Tags: python api csv

I am currently trying to download a large number of NY Times articles using a Python 2.7-based API. To do so, I was able to reuse a piece of code I found online:

from nytimesarticle import articleAPI
api = articleAPI('...')

articles = api.search( q = 'Brazil', 
     fq = {'headline':'Brazil', 'source':['Reuters','AP', 'The New York Times']}, 
     begin_date = '20090101' )

def parse_articles(articles):
    '''
    This function takes in a response to the NYT api and parses
    the articles into a list of dictionaries
    '''
    news = []
    for i in articles['response']['docs']:
        dic = {}
        dic['id'] = i['_id']
        if i['abstract'] is not None:
            dic['abstract'] = i['abstract'].encode("utf8")
        dic['headline'] = i['headline']['main'].encode("utf8")
        dic['desk'] = i['news_desk']
        dic['date'] = i['pub_date'][0:10] # cutting time of day.
        dic['section'] = i['section_name']
        if i['snippet'] is not None:
            dic['snippet'] = i['snippet'].encode("utf8")
        dic['source'] = i['source']
        dic['type'] = i['type_of_material']
        dic['url'] = i['web_url']
        dic['word_count'] = i['word_count']
        # locations
        locations = []
        for x in range(0,len(i['keywords'])):
            if 'glocations' in i['keywords'][x]['name']:
                locations.append(i['keywords'][x]['value'])
        dic['locations'] = locations
        # subject
        subjects = []
        for x in range(0,len(i['keywords'])):
            if 'subject' in i['keywords'][x]['name']:
                subjects.append(i['keywords'][x]['value'])
        dic['subjects'] = subjects   
        news.append(dic)
    return(news)

def get_articles(date,query):
    '''
    This function accepts a year in string format (e.g.'1980')
    and a query (e.g.'Amnesty International') and it will 
    return a list of parsed articles (in dictionaries)
    for that year.
    '''
    all_articles = []
    for i in range(0,100): #NYT limits pager to first 100 pages. But rarely will you find over 100 pages of results anyway.
        articles = api.search(q = query,
               fq = {'headline':'Brazil','source':['Reuters','AP', 'The New York Times']},
               begin_date = date + '0101',
               end_date = date + '1231',
               page = str(i))
        articles = parse_articles(articles)
        all_articles = all_articles + articles
    return(all_articles)

Download_all = []
for i in range(2009,2010):
    print 'Processing ' + str(i) + '...'
    Amnesty_year =  get_articles(str(i),'Brazil')
    Download_all = Download_all + Amnesty_year

import csv
keys = Download_all[0].keys()
with open('brazil-mentions.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(Download_all)

Without the last part (starting from "import csv") this seems to work fine. If I simply print my results ("print Download_all") I can see them, albeit in an unstructured way. Running the actual code, however, I get the message:

  File "C:\Users\xxx.yyy\AppData\Local\Continuum\Anaconda2\lib\csv.py", line 148, in _dict_to_list
    + ", ".join([repr(x) for x in wrong_fields]))

ValueError: dict contains fields not in fieldnames: 'abstract'  

Since I am quite a novice at this, I would greatly appreciate your help in guiding me on how to download the news articles into a csv file in a structured way.

Many thanks in advance! Best regards

1 answer:

Answer 0: (score: 0)

Where you have:

keys = Download_all[0].keys()

This takes the CSV's column headers from the dictionary for the first article. The problem is that the article dictionaries do not all have the same keys, so it fails when you reach the first dictionary that has an extra abstract key.
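The failure is easy to reproduce in isolation. A minimal sketch under Python 2, as in the question (this is illustrative code, not code from the original post):

import csv, StringIO

rows = [{'id': '1'}, {'id': '2', 'abstract': 'only this row has one'}]
out = StringIO.StringIO()
writer = csv.DictWriter(out, rows[0].keys())  # header taken from the first dict only
writer.writeheader()
writer.writerows(rows)  # ValueError: dict contains fields not in fieldnames: 'abstract'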

It looks like you will run into this with abstract and snippet, which are only added to a dictionary if they are present in the response.

You need to make keys equal to the superset of all possible keys:

keys = Download_all[0].keys() + ['abstract', 'snippet']
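Note that if the first article already happens to contain abstract or snippet, this concatenation would duplicate a column. A more defensive sketch (my own variant, not part of the original answer) builds the header from the union of keys across all articles:

keys = set()
for article in Download_all:
    keys.update(article.keys())  # collect every field that appears anywhere
keys = sorted(keys)              # a stable column order for the CSV header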

Or, make sure that every dictionary has a value for every field:

def parse_articles(articles):
    ...
    if i['abstract'] is not None:
        dic['abstract'] = i['abstract'].encode("utf8")
    else:
        dic['abstract'] = ""
    ...
    if i['snippet'] is not None:
        dic['snippet'] = i['snippet'].encode("utf8")
    else:
        dic['snippet'] = ""