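When I read a CSV file with csv.reader in universal newline mode ('rU'), csv.writer produces \r\n as the line ending. Do you know how to ignore the newlines in csv.writer? I have to use 'rU' in the reader because my file contains line breaks.
Here is the code I am using: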
import csv
dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append(line[3])
        except:
            dict[line[2]]=[line[3]]
with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter='|',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
The input looks like this:
username, some of tweet that
want to be processed
by machine , label
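Because the field contains line breaks and universal newline mode is enabled, the data still has the same line breaks when I read it and write it back out with the csv writer.
The output I want looks like this: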
username, some of tweet that want to be processed by machine , label
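Should I remove all the line breaks from the CSV file? It is too big for that, though: the CSV is about 150 MB and contains 700,000 rows. Is there any way around this?
I have already played with reader options such as skipinitialspace and dialect, but I still can't solve the problem.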
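Answer 0: (score: 1)
I think this is the result you are looking for. You didn't mention your Python version. This is Python 3. I used the sample data you uploaded to Google Drive. The file parses as UTF-8.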
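Key things to note:
- csv has a DictReader to help select the columns to process.
- In Python 2, files for the csv module are opened with 'rb' or 'wb', but in Python 3 that means 'r' or 'w' together with newline='' and an encoding in the open call.
- line will be a dictionary of {'header': 'value'} pairs.
- extrasaction tells DictWriter to ignore extra fields in the dictionary that are not listed in fieldnames.
Sample data: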
twitter.place.full_name,twitter.user.location,interaction.author.username,interaction.content,interaction.created_at
"Gunungsari, Lombok Barat",Indonesia,__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje,"Mon, 16 Jun 2014 15:32:54 +0000"
"Cakranegara, Kota Mataram",NULL,__Waone,Mataram,"Mon, 24 Mar 2014 13:13:46 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"perdana, my first nephew from my lil sibling sister,,,
*moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c","Sat, 04 Jan 2014 04:20:45 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari","Sat, 04 Jan 2014 06:15:52 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?,"Sat, 04 Jan 2014 05:55:04 +0000"
"Keruak, Lombok Timur",Jakarta,_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^ *kurang_apa_lagi","Thu, 02 Jan 2014 00:02:47 +0000"
"Pujut, Lombok Tengah",Jakarta,_5at,"Doäó»a bepergian keluar rumah:
""Bismillaahitawakkaltu äó»alallooh""
*pasrah-pasrah-pasrah;
*bandara_international_lombok","Sun, 05 Jan 2014 03:36:48 +0000"
"Sakra, Lombok Timur",Jakarta,_5at,"Time for riding with my lil bro:
Mataram - Senggigi - Gili Terawangan
*jenguk_ponakan_baru;
*very_early","Fri, 03 Jan 2014 22:09:26 +0000"
"Sukamulia, Lombok Timur",,1821922,Salam friend,"Sun, 20 Jul 2014 19:23:53 +0000"
Code:
import csv
# Python 2 version of opens
#with open('training_data.csv','rb') as inp:
#    with open('training_result.csv','wb') as outp:
with open('training_data.csv','r',newline='',encoding='utf8') as inp:
    with open('training_result.csv','w',newline='',encoding='utf8') as outp:
        reader = csv.DictReader(inp)
        writer = csv.DictWriter(outp,
                                fieldnames=['interaction.author.username','interaction.content'],
                                extrasaction='ignore')
        writer.writeheader()
        for line in reader:
            line['interaction.content'] = line['interaction.content'].replace('\n',' ')
            writer.writerow(line)
Result:
interaction.author.username,interaction.content
__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje
__Waone,Mataram
_5at,"perdana, my first nephew from my lil sibling sister,,, *moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c"
_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari"
_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?
_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^ *kurang_apa_lagi"
_5at,"Doäó»a bepergian keluar rumah: ""Bismillaahitawakkaltu äó»alallooh"" *pasrah-pasrah-pasrah; *bandara_international_lombok"
_5at,Time for riding with my lil bro: Mataram - Senggigi - Gili Terawangan *jenguk_ponakan_baru; *very_early
1821922,Salam friend
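One thing worth adding to the answer above: with newline='' the reader leaves a \r\n inside a quoted field untouched, so replace('\n', ' ') alone would leave a stray \r behind if the embedded breaks use Windows line endings. Below is a minimal sketch, not the answerer's code, that uses str.splitlines() instead. The helper name flatten_fields and the three-column sample are made up for illustration, and io.StringIO stands in for the real files.

```python
import csv
import io


def flatten_fields(inp, outp, fieldnames):
    """Copy the selected columns from inp to outp, collapsing line breaks inside fields."""
    reader = csv.DictReader(inp)
    writer = csv.DictWriter(outp, fieldnames=fieldnames, extrasaction='ignore')
    writer.writeheader()
    for row in reader:
        for name in fieldnames:
            # splitlines() breaks on \n, \r\n and \r (plus a few other Unicode
            # line boundaries), so no stray \r survives the join.
            # A column missing from a short row comes back as None; treat it as empty.
            row[name] = ' '.join((row[name] or '').splitlines())
        writer.writerow(row)


# Hypothetical sample: the last row has a \r\n inside a quoted field.
sample = (
    'user,text,label\r\n'
    'alice,hello,pos\r\n'
    'bob,"line one\r\nline two",neg\r\n'
)
source = io.StringIO(sample, newline='')
target = io.StringIO(newline='')
flatten_fields(source, target, ['user', 'text'])

# Parse the output back to inspect the rows that were written.
result_rows = list(csv.reader(io.StringIO(target.getvalue(), newline='')))
print(result_rows)
```

With real files, pass the two handles opened with newline='' as in the answer. Rows are read and written one at a time, so the 150 MB file is never held in memory all at once.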
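Answer 1: (score: 0)
We can achieve this by replacing the newlines with ", " and starting a new line for each appended value. If you don't want any newlines at all, you can drop the "\n".
dict[line[2]].append(line[3].replace("\n", ", "))
Here is the code: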
import csv
dict={}
# 'rU' is the Python 2 universal newline mode
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append("\n"+line[3].replace("\n", ", "))
        except KeyError:
            # first value seen for this key
            dict[line[2]]=[line[3].replace("\n", ", ")]
with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter=',',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
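For comparison, here is a rough Python 3 sketch of the same grouping idea, using collections.defaultdict in place of the try/except and flattening the line breaks to spaces. It is not the code from either answer. The function name group_by_column and the four-column sample are hypothetical, the column indices 2 and 3 are taken from the question's code, and it assumes all groups fit in memory.

```python
import csv
import io
from collections import defaultdict


def group_by_column(inp, key_index, value_index):
    """Map each key to the list of its values, with line breaks flattened to spaces."""
    groups = defaultdict(list)
    for row in csv.reader(inp, skipinitialspace=True):
        # Join the pieces of a multi-line field with single spaces.
        groups[row[key_index]].append(' '.join(row[value_index].splitlines()))
    return groups


# Hypothetical sample: the first row has a line break inside a quoted field.
sample = (
    'p1,loc1,alice,"first\ntweet"\r\n'
    'p2,loc2,bob,hi\r\n'
    'p3,loc3,alice,second\r\n'
)
groups = group_by_column(io.StringIO(sample, newline=''), 2, 3)

# Write one line per key, using '|' as the delimiter as in the question.
out = io.StringIO(newline='')
writer = csv.writer(out, delimiter='|')
for key, values in groups.items():
    writer.writerow([key, ','.join(values)])
print(out.getvalue())
```

Because the delimiter is '|', the commas that join the values do not force quoting. With a real file, open it with newline='' instead of using io.StringIO.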