I'm new to coding and am playing around with how to export scraped data to CSV.
The problem:
My script loops through a set of similar pages, scrapes data from each one, and stores it in a dictionary. Every scraped page ends up with its own dictionary; the dictionaries all share the same keys but hold different values.
I want to export the individual dictionaries (once scraped) to a CSV file, one row per dictionary, but I'm struggling with the syntax.
Do I need to build a dictionary of dictionaries, or can each scraped dictionary simply be appended to a single CSV file as it's produced?
Cheers,
Here's what I've got so far:
import csv
import urllib2
from pprint import pprint
from bs4 import BeautifulSoup

papers = []
urls = []
dict = {'Topics':0, 'Link':0, 'Heading':0, "Summary Intro":0, "Summary Text":0, "Date":0}

# a and c hold the fixed parts of the listing URL (defined earlier, not shown);
# the page number i is slotted in between them
for i in range(1, 4):
    url = str(a) + str(i) + str(c)
    urls.append(url)
pprint(urls)

for url in urls:
    print url
    html = urllib2.urlopen(url).read()
    soup = BeautifulSoup(html)
    soup.find_all('div', class_="bp-paper-item commons")
    for link in soup.find_all('a', class_="title"):
        pdflist = []
        pdflink1 = 'https://researchbriefings.parliament.uk'
        pdflink2 = link.get('href')
        pdflink = pdflink1 + pdflink2
        x = str(pdflink)
        dict['Link'] = x
        pdfsoup = urllib2.urlopen(x).read()  # opens link to the pdf
        pdfdata = BeautifulSoup(pdfsoup)
        for date in pdfdata.find_all('div', id="bp-published-date"):
            dict['Date'] = date.text.encode('utf').strip()
        for heading in pdfdata.find_all('h1'):
            dict['Heading'] = heading.text.encode('utf').strip()
        for topics in pdfdata.find_all('div', id="bp-summary-metadata"):
            dict['Topics'] = topics.text.encode('utf').strip()
        for downloadlink in pdfdata.find_all('div', id="bp-summary-fullreport"):
            dl = downloadlink.find('a', id="bp-summary-fullreport-link")
            print dl

# this only ever holds the last dictionary, so only one row gets written
toCSV = [dict]
keys = toCSV[0].keys()
with open('UKPARL.csv', 'wb') as output_file:
    dict_writer = csv.DictWriter(output_file, keys)
    dict_writer.writeheader()
    dict_writer.writerows(toCSV)
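
Is something along these lines what I should be aiming for? This is just a rough sketch of what I have in mind (not tested): build a fresh dictionary for each page, append it to a list, and write the whole list out at the end with csv.DictWriter. scrape_page here is only a stand-in for the BeautifulSoup code above, and urls is the same list my script already builds.

import csv

FIELDS = ['Topics', 'Link', 'Heading', 'Summary Intro', 'Summary Text', 'Date']

def scrape_page(url):
    # placeholder for the scraping code above; it should build and return
    # a fresh dictionary (with the keys in FIELDS) for one page
    return {'Topics': '', 'Link': url, 'Heading': '', 'Summary Intro': '',
            'Summary Text': '', 'Date': ''}

rows = []                          # one dictionary per scraped page
for url in urls:                   # urls is the same list built above
    rows.append(scrape_page(url))

with open('UKPARL.csv', 'wb') as output_file:
    writer = csv.DictWriter(output_file, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)         # each dictionary becomes one CSV row

That way the header gets written once and writerows gives me one line per page, but I'm not sure whether this is the right/idiomatic approach.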