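I'm trying to fetch online reviews (multiple pages), extract the parts of each review (title, user, text, ...) and write that information to a CSV file. Yes, similar questions have been asked many times, but I couldn't find one that solves my problem:
First I create the CSV file and prepare its column headers at the start: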
with open('review-raw-data.csv', 'wb') as output:
    fieldnames = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']
    writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
That works fine. Later I try to append the extracted information to that CSV file:
def extract(data):
    with open('review-raw-data.csv', 'ab') as output:
        fieldnames = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']
        writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, lineterminator='\n', quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
        for review in data:
            # extraction happening...
            reviewobj = Review(title, text, helpfulscore, rating, date, user, reviewid, url)
            writer.writerow({'title': reviewobj.title, 'text': reviewobj.text, 'starRating': reviewobj.rating,
                             'helpfulScore': reviewobj.helpfulscore, 'date': reviewobj.date, 'user': reviewobj.user,
                             'id': reviewobj.reviewid, 'url': reviewobj.url})
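This function is called after each review page has been received. It may not be the smartest or simplest approach, but it works for a single page. The problem is that on the 2nd, 3rd, ... call, appending does not behave as expected: all rows appended in previous iterations get overwritten. The column headers are still there.
Example of what I want (columns separated by ','):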
title, user, id
title1, user1, id1
title2, user2, id2
title3, user3, id3
Example of what I get after the 2nd iteration:
title, user, id
title2, user2, id2 # row 1 is missing...
Example of what I get after the 3rd iteration:
title, user, id
title3, user3, id3 # rows 1 & 2 are missing...
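What am I doing wrong?
Answer 0 (score: 1)
Without the whole code, and without knowing how you call it, it's impossible to know exactly what goes wrong. But you are obviously running the "create and prepare the column headers" part of the code more than once, because the following works as expected: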
bruno@bigb:~/Work/playground$ cat appcsv.py
import csv

with open('review-raw-data.csv', 'wb') as output:
    fieldnames = ['a', 'b', 'c']
    writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
    writer.writeheader()

def extract(data):
    with open('review-raw-data.csv', 'ab') as output:
        fieldnames = ['a', 'b', 'c']
        writer = csv.DictWriter(output, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
        for row in data:
            writer.writerow(dict(zip(fieldnames, row)))

dataset = [
    [(1, 2, 3), (4, 5, 6)],
    [(5, 6, 7),]
]

for data in dataset:
    extract(data)
bruno@bigb:~/Work/playground$ python appcsv.py
bruno@bigb:~/Work/playground$ cat review-raw-data.csv
"a","b","c"
"1","2","3"
"4","5","6"
"5","6","7"
Avoiding the overwrite of an existing file is then easy: just check whether the file exists before opening it:
import os
filename = 'review-raw-data.csv'
flag = "ab" if os.path.exists(filename) else "wb"
with open(filename, flag) as output:
    # etc
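To see the effect of the two flags side by side, here is a minimal sketch. It is written for Python 3, where CSV files are opened in text mode with newline='' instead of 'wb'/'ab'. The helper names write_page and read_all, the shortened header and the scratch files are made up for the illustration and are not part of your code.

```python
import csv
import os
import tempfile

HEADER = ["title", "user", "id"]  # shortened, illustrative column list


def write_page(filename, rows, always_truncate):
    """Write one page of rows; return the open() mode that was used."""
    if always_truncate:
        mode = "w"  # recreates (truncates) the file on every call
    else:
        mode = "a" if os.path.exists(filename) else "w"
    with open(filename, mode, newline="") as output:
        writer = csv.writer(output, quoting=csv.QUOTE_ALL)
        if mode == "w":
            writer.writerow(HEADER)  # header only when the file is fresh
        writer.writerows(rows)
    return mode


def read_all(filename):
    """Return the whole file as a list of rows."""
    with open(filename, newline="") as f:
        return list(csv.reader(f))


pages = [
    [("title1", "user1", "id1")],
    [("title2", "user2", "id2")],
    [("title3", "user3", "id3")],
]
tmpdir = tempfile.mkdtemp()  # scratch directory, so no real file is touched

truncated = os.path.join(tmpdir, "truncated.csv")
for page in pages:
    write_page(truncated, page, always_truncate=True)

appended = os.path.join(tmpdir, "appended.csv")
for page in pages:
    write_page(appended, page, always_truncate=False)

print(read_all(truncated))  # header plus only the last page
print(read_all(appended))   # header plus all three pages
```

The first file ends up looking like the output you describe (header plus the last row only), which is why I suspect the header step runs before every page.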
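As a side note: you have a lot of duplicated code (the fieldnames definition, opening the file, and creating the DictWriter). You should factor this out into a function, and/or do it only once and pass the writer to extract: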
import csv
import os

def get_writer(outfile):
    fieldnames = [
        # etc
    ]
    writer = csv.DictWriter(outfile, delimiter=',', fieldnames=fieldnames, quoting=csv.QUOTE_ALL, restval='unknown', extrasaction='ignore')
    return writer

def extract(data, writer):
    for review in data:
        # extraction happening...
        reviewobj = Review(title, text, helpfulscore, rating, date, user, reviewid, url)
        writer.writerow({
            'title': reviewobj.title, 'text': reviewobj.text,
            'starRating': reviewobj.rating,
            'helpfulScore': reviewobj.helpfulscore,
            'date': reviewobj.date, 'user': reviewobj.user,
            'id': reviewobj.reviewid, 'url': reviewobj.url
        })

def main():
    filename = 'review-raw-data.csv'
    exists = os.path.exists(filename)
    # append if the file is already there, otherwise create it
    flag = "ab" if exists else "wb"
    with open(filename, flag) as outfile:
        writer = get_writer(outfile)
        if not exists:
            # write the header row only for a freshly created file
            writer.writeheader()
        for data in whereever_you_get_your_data_from():
            extract(data, writer)
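For reference, a self-contained variant of the same structure that should run under Python 3 could look like the sketch below. The Review namedtuple, make_review and fetch_pages are hypothetical stand-ins for your own class and scraping code, and the file is written to a temporary directory. Since mode 'a' creates a missing file, the flag switch is not needed here; only the header check remains.

```python
import csv
import os
import tempfile
from collections import namedtuple

FIELDNAMES = ['title', 'text', 'starRating', 'helpfulScore', 'date', 'user', 'id', 'url']

# Hypothetical stand-in for the Review class used in the question.
Review = namedtuple('Review', 'title text helpfulscore rating date user reviewid url')


def get_writer(outfile):
    return csv.DictWriter(outfile, delimiter=',', fieldnames=FIELDNAMES,
                          quoting=csv.QUOTE_ALL, restval='unknown',
                          extrasaction='ignore')


def extract(data, writer):
    for reviewobj in data:
        writer.writerow({
            'title': reviewobj.title, 'text': reviewobj.text,
            'starRating': reviewobj.rating,
            'helpfulScore': reviewobj.helpfulscore,
            'date': reviewobj.date, 'user': reviewobj.user,
            'id': reviewobj.reviewid, 'url': reviewobj.url,
        })


def make_review(n):
    # Hypothetical, already-parsed review number n.
    return Review('title%d' % n, 'text%d' % n, n, 5, '2015-01-0%d' % n,
                  'user%d' % n, 'id%d' % n, 'http://example.com/%d' % n)


def fetch_pages():
    # Hypothetical data source: two "pages" of reviews.
    yield [make_review(1), make_review(2)]
    yield [make_review(3)]


def main(filename):
    exists = os.path.exists(filename)
    # 'a' creates the file when it is missing and never truncates it.
    with open(filename, 'a', newline='') as outfile:
        writer = get_writer(outfile)
        if not exists:
            writer.writeheader()  # header only for a freshly created file
        for data in fetch_pages():
            extract(data, writer)


filename = os.path.join(tempfile.mkdtemp(), 'review-raw-data.csv')
main(filename)
main(filename)  # a second run appends instead of overwriting

with open(filename, newline='') as f:
    rows = list(csv.DictReader(f))
print(len(rows))
```

Opening the file once per run and handing the same writer to extract keeps the header logic in a single place, so no later call can recreate the file by accident.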