How do I loop over scraped data and export the results to a CSV file, with each dictionary as a new row, in Python?

Asked: 2016-11-17 15:03:33

Tags: python dictionary

I'm very new to coding and am fiddling around with how to export scraped data to CSV.

The problem

My script works through a set of similar pages, scraping data from each one and storing it in a dictionary. Every dictionary has the same keys but different values, i.e. each scraped page has its own dictionary associated with it, and the keys are identical across pages.

I want to export the individual dictionaries (once scraped) to a CSV file, with each dictionary occupying one row, but I'm struggling to work out the syntax.

Do I need to create a dictionary of dictionaries? Or can each scraped dictionary be appended to a single CSV file? A rough sketch of the pattern I mean follows below.
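To make the output I'm after concrete, here is a minimal sketch of the pattern I mean (the data and file name are made up for illustration; I'm assuming each scraped page yields one dict with the same keys):

import csv

# Hypothetical example data: one dict per scraped page, identical keys.
rows = [
    {'Heading': 'Briefing A', 'Date': '1 Nov 2016'},
    {'Heading': 'Briefing B', 'Date': '8 Nov 2016'},
]

with open('example.csv', 'wb') as f:  # 'wb' for the csv module on Python 2
    writer = csv.DictWriter(f, fieldnames=rows[0].keys())
    writer.writeheader()              # header written once
    for row in rows:
        writer.writerow(row)          # each dict becomes one row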

Cheers,

Here's what I have so far:

import csv
import urllib2
from pprint import pprint
from bs4 import BeautifulSoup

papers = []
urls = []
# One dictionary per scraped page; the keys stay the same, only the values change.
record = {'Topics': 0, 'Link': 0, 'Heading': 0,
          'Summary Intro': 0, 'Summary Text': 0, 'Date': 0}

# Build the list of listing-page URLs.
# a and c hold the fixed parts of the URL (defined earlier, not shown here).
for i in range(1, 4):
    url = str(a) + str(i) + str(c)
    urls.append(url)
pprint(urls)

for url in urls:
    print url
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html, 'html.parser')
    soup.find_all('div', class_="bp-paper-item commons")  # result unused
    for link in soup.find_all('a', class_="title"):
        pdflink = 'https://researchbriefings.parliament.uk' + link.get('href')
        record['Link'] = pdflink
        pdfsoup = urllib2.urlopen(pdflink).read()  # fetch the briefing page
        pdfdata = BeautifulSoup(pdfsoup, 'html.parser')
        for date in pdfdata.find_all('div', id="bp-published-date"):
            record['Date'] = date.text.encode('utf-8').strip()
        for heading in pdfdata.find_all('h1'):
            record['Heading'] = heading.text.encode('utf-8').strip()
        for topics in pdfdata.find_all('div', id="bp-summary-metadata"):
            record['Topics'] = topics.text.encode('utf-8').strip()
        for downloadlink in pdfdata.find_all('div', id="bp-summary-fullreport"):
            dl = downloadlink.find('a', id="bp-summary-fullreport-link")
            print dl
        # This is where I'm stuck: the file is reopened ('wb' truncates it)
        # and the header rewritten on every link, so only the most recent
        # dictionary ever ends up in the CSV.
        toCSV = [record]
        keys = toCSV[0].keys()
        with open('UKPARL.csv', 'wb') as output_file:
            dict_writer = csv.DictWriter(output_file, keys)
            dict_writer.writeheader()
            dict_writer.writerows(toCSV)
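
For reference, the "append" variant I was wondering about above would look something like this (a sketch only; append_record is a hypothetical helper, and it assumes the header row was already written when the file was created):

import csv

keys = ['Topics', 'Link', 'Heading', 'Summary Intro', 'Summary Text', 'Date']

def append_record(record):
    # Open in append mode so earlier rows are kept rather than overwritten;
    # hypothetical helper, assumes UKPARL.csv already has its header row.
    with open('UKPARL.csv', 'ab') as output_file:
        writer = csv.DictWriter(output_file, keys)
        writer.writerow(record)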
