Universal newline mode in csv reader makes csv writer write broken lines to the file

Asked: 2014-12-17 12:10:25

Tags: python csv

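When I read a CSV file with csv.reader in universal newline mode ('rU'), csv.writer produces \r\n as the line ending. Do you know how to make csv.writer ignore the newlines? I have to use 'rU' in the reader because my file contains line breaks inside fields.

Here is the code I am using: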

import csv

dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            dict[line[2]].append(line[3])
        except:
            dict[line[2]]=[line[3]]

with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter='|',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])

The input looks like this:

username, some of tweet that
want to be processed
by machine , label

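Because the field contains line breaks and universal newline mode is enabled, the line breaks are still there when I take the data and write it back out with the csv writer.

The output I want looks like this: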

username, some of tweet that want to be processed by machine , label

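Should I strip all the line breaks from the CSV file first? It is too large for that: the CSV is about 150 MB and contains 700,000 rows. Is there any way to handle this?

I have already experimented with reader parameters such as skipinitialspace and dialect, but I still could not solve the problem.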

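2 answers:

Answer 0 (score: 1)

I think this is the result you are looking for. You did not mention your Python version; this is Python 3. I used the sample data you uploaded to Google Drive. The file was parsed as UTF-8.

Key things to note: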

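  • csv has a DictReader that helps with selecting the columns to process.
  • CSV files should be opened so that the csv module handles line endings itself. In Python 2 that means binary mode, simply 'rb' and 'wb'; in Python 3 the equivalent is text mode 'r' or 'w' with the newline='' and encoding parameters passed to the open call.
  • line will be a dictionary of {'header': 'value'} pairs.
  • extrasaction='ignore' tells DictWriter to ignore extra fields in the dictionary that are not listed in fieldnames.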
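The points above can be tried without touching any files. What follows is only a minimal sketch, assuming Python 3; the in-memory sample and the column names user, text and label are made up for illustration and are not taken from the asker's data.

```python
import csv
import io

# Hypothetical sample: the quoted "text" field spans two physical lines.
sample = 'user,text,label\r\nalice,"first line\nsecond line",pos\r\n'

# newline='' leaves line endings untouched so the csv module parses them itself.
reader = csv.DictReader(io.StringIO(sample, newline=''))
rows = list(reader)

out = io.StringIO(newline='')
# 'label' is not in fieldnames; extrasaction='ignore' drops it silently.
writer = csv.DictWriter(out, fieldnames=['user', 'text'], extrasaction='ignore')
writer.writeheader()
for row in rows:
    # Copy the row and replace the embedded line break with a space.
    cleaned = dict(row, text=row['text'].replace('\n', ' '))
    writer.writerow(cleaned)

result = out.getvalue()
print(repr(result))
```

The reader hands back the embedded line break as part of the field value, so the replacement has to happen on the field, not on the file as a whole.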

Sample data:

twitter.place.full_name,twitter.user.location,interaction.author.username,interaction.content,interaction.created_at
"Gunungsari, Lombok Barat",Indonesia,__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje,"Mon, 16 Jun 2014 15:32:54 +0000"
"Cakranegara, Kota Mataram",NULL,__Waone,Mataram,"Mon, 24 Mar 2014 13:13:46 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"perdana, my first nephew from my lil sibling sister,,,

*moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c","Sat, 04 Jan 2014 04:20:45 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari","Sat, 04 Jan 2014 06:15:52 +0000"
"Pemenang, Lombok Utara",Jakarta,_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?,"Sat, 04 Jan 2014 05:55:04 +0000"
"Keruak, Lombok Timur",Jakarta,_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^  *kurang_apa_lagi","Thu, 02 Jan 2014 00:02:47 +0000"
"Pujut, Lombok Tengah",Jakarta,_5at,"Doäó»a bepergian keluar rumah:

""Bismillaahitawakkaltu äó»alallooh""

*pasrah-pasrah-pasrah;
*bandara_international_lombok","Sun, 05 Jan 2014 03:36:48 +0000"
"Sakra, Lombok Timur",Jakarta,_5at,"Time for riding with my lil bro:
Mataram - Senggigi - Gili Terawangan
*jenguk_ponakan_baru;
*very_early","Fri, 03 Jan 2014 22:09:26 +0000"
"Sukamulia, Lombok Timur",,1821922,Salam friend,"Sun, 20 Jul 2014 19:23:53 +0000"

Code:

import csv

# Python 2 version of opens
#with open('training_data.csv','rb') as inp:
#    with open('training_result.csv','wb') as outp:

with open('training_data.csv','r',newline='',encoding='utf8') as inp:
    with open('training_result.csv','w',newline='',encoding='utf8') as outp:
        reader = csv.DictReader(inp)
        writer = csv.DictWriter(outp,
                                fieldnames=['interaction.author.username','interaction.content'],
                                extrasaction='ignore')
        writer.writeheader()
        for line in reader:
            line['interaction.content'] = line['interaction.content'].replace('\n',' ')
            writer.writerow(line)

Result:

interaction.author.username,interaction.content
__Thasya__,At Sheraton Senggigi Beach Resort äóî https://t.co/1FdTsMYWje
__Waone,Mataram
_5at,"perdana, my first nephew from my lil sibling sister,,,  *moga gäó» ketularan songong kayak pamannya >_< http://t.co/UBEwcxWY5c"
_5at,"@indiraputeri udah pinter bahasa sasak nih skrng,,, inaq rari"
_5at,@indiraputeri dalemmm bgt nih ndoro .. !!! mksd nya apaan?
_5at,"pagi2, hujan, holiday, nasi goreng hangat, kopi hangat, di rumah, + spesial: kumpul keluarga,,, ^_^  *kurang_apa_lagi"
_5at,"Doäó»a bepergian keluar rumah:  ""Bismillaahitawakkaltu äó»alallooh""  *pasrah-pasrah-pasrah; *bandara_international_lombok"
_5at,Time for riding with my lil bro: Mataram - Senggigi - Gili Terawangan *jenguk_ponakan_baru; *very_early
1821922,Salam friend
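The result above still has double spaces where the original field contained blank lines, and the writer ends every record with \r\n, which is the csv module's default lineterminator. If either of those matters, something along these lines may help. It is a sketch under assumptions: Python 3, a hypothetical helper named flatten (not part of the csv module), and an in-memory stand-in for the real file. It handles one row at a time, so a 150 MB input should not need to fit in memory.

```python
import csv
import io

def flatten(text):
    """Collapse any run of whitespace, line breaks included, into one space."""
    return ' '.join(text.split())

# Hypothetical in-memory stand-in for the input file.
src = io.StringIO('name,content\r\nbob,"line one\r\n\r\nline  two"\r\n', newline='')
dst = io.StringIO(newline='')

reader = csv.reader(src)
# lineterminator defaults to '\r\n'; override it to get plain '\n' endings.
writer = csv.writer(dst, lineterminator='\n')
for row in reader:
    writer.writerow([flatten(field) for field in row])

output = dst.getvalue()
print(output)
```

With real files, the two StringIO objects would be replaced by open(..., newline='', encoding='utf8') calls like the ones in the answer's code.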

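Answer 1 (score: 0)

We can achieve this by replacing each newline with ", " and adding a newline in front of every newly appended entry. If you do not want any newlines at all, you can drop the "\n".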

dict[line[2]].append(line[3].replace("\n", ", "))

Here is the code:

import csv

dict={}
with open('training_data.csv','rU') as f:
    reader = csv.reader(f,skipinitialspace=True)
    for line in reader:
        try:
            # Later entries for a key start on a new line; drop "\n"+ to avoid that.
            dict[line[2]].append("\n"+line[3].replace("\n", ", "))
        except KeyError:
            # First entry for this key.
            dict[line[2]]=[line[3].replace("\n", ", ")]


with open('training_result.csv','w') as f:
    writer = csv.writer(f, delimiter=',',dialect='excel-tab')
    for key in dict:
        writer.writerow([key,','.join(dict[key])])
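On Python 3 the 'rU' mode is deprecated and was removed in Python 3.11, so the same grouping idea could be written roughly as below. This is a sketch, not the answer author's code: group_fields, key_index and value_index are hypothetical names, the four-column sample is invented, and the '|' delimiter is borrowed from the question's code.

```python
import csv
import io
from collections import defaultdict

def group_fields(lines, key_index=2, value_index=3):
    """Group one column's values by another column, flattening line breaks."""
    groups = defaultdict(list)
    for row in csv.reader(lines, skipinitialspace=True):
        if len(row) <= max(key_index, value_index):
            continue  # skip short or blank rows instead of raising IndexError
        groups[row[key_index]].append(' '.join(row[value_index].split()))
    return groups

# Hypothetical four-column sample; real code would pass
# open('training_data.csv', newline='', encoding='utf8') instead.
sample = io.StringIO('1,a,x,"hello\nworld"\r\n2,b,y,bye\r\n3,c,x,again\r\n', newline='')
groups = group_fields(sample)

out = io.StringIO(newline='')
writer = csv.writer(out, delimiter='|')
for key, values in groups.items():
    writer.writerow([key, ','.join(values)])
print(out.getvalue())
```

Using defaultdict removes the need for the try/except around the first insertion, and newline='' takes over the job that 'rU' was doing in the original.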