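I'm new to Python and am trying to convert the text file below into a CSV file. The input text file has two columns: an id and a name. The second column may contain commas, digits, and spaces.
Input file: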
1134999 06Crazy Life
6821360 Pang Nakarin
10113088 Terfel, Bartoli- Mozart: Don
10151459 The Flaming Sidebur
6826647 Bodenstandig 3000
10186265 Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
10236364 U.S Bombs -
1135000 artist formaly know as Mat
I think this could be solved by:
Enclosing both columns in double quotes.
The expected result would be:
"1134999","06Crazy Life"
"6821360","Pang Nakarin"
"10113088","Terfel,Bartoli-Mozart: Don"
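Or by splitting each line after the first space and then applying double quotes to the second column only (the id column never contains spaces or commas).
The expected result would be: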
1134999,"06Crazy Life"
6821360,"Pang Nakarin"
10113088,"Terfel,Bartoli-Mozart: Don"
I tried the code below to put double quotes around both columns, but it quotes every space-separated word, which is not what I want:
# Python 2 code (StringIO module, itertools.izip)
import csv
import itertools
from StringIO import StringIO

quotedData = StringIO()

with open('demo.txt', 'r') as in_file:
    lines = in_file.read().splitlines()
    # commas become spaces, then split() breaks the line at every run of whitespace
    stripped = [line.replace(",", " ").split() for line in lines]
    grouped = itertools.izip(*[stripped]*1)
    with open('try.csv', 'w') as out_file:
        writer = csv.writer(out_file, quotedData, quoting=csv.QUOTE_ALL)
        writer.writerow(('artist_id', 'artist_name'))
        for group in grouped:
            writer.writerows(group)
Result:
"artist_id","artist_name"
"1134999","06Crazy","Life"
"6821360","Pang","Nakarin"
"10113088","Terfel","Bartoli-","Mozart:","Don"
"10151459","The","Flaming","Sidebur"
"6826647","Bodenstandig","3000"
"10186265","Jota","Quest","e","Ivete","Sangalo"
"6828986","Toto_XX","(1977"
"10236364","U.S","Bombs","-"
"1135000","artist","formaly","know","as","Mat"
"10299728","Kassierer","-","Musik","für","beide","Ohren"
Answer 0 (score: 0)
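CSV stands for "comma-separated values", so by definition ',' is what separates column values. A value that itself contains a comma therefore cannot simply be written as-is; it has to be quoted so that the comma is not read as a separator.
Alternatively, depending on how the output file will be opened afterwards, you can use a separator other than ',', such as '\t' (and perhaps save the file as .tsv).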
In Python, you can create such a file easily with pandas:
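import pandas as pd
with open('demo.txt') as in_file:
    # split each line once, at the first run of whitespace: [id, name]
    rows = [line.strip().split(None, 1) for line in in_file if line.strip()]
outputDataFrame = pd.DataFrame(rows, columns=['artist_id', 'artist_name'])
outputDataFrame.to_csv('try.csv', sep='\t', index=False)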
Note: this way you don't need to manually remove any ',' from the input.
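For comparison, the standard csv module can also keep the comma as the separator and put every field in double quotes, so a comma inside a name survives. This is only a minimal sketch, assuming Python 3: the function name to_quoted_csv and the in-memory sample list are made up for illustration, and each line is split once at the first run of whitespace.

```python
import csv
import io


def to_quoted_csv(lines):
    """Return CSV text with every field quoted, one row per input line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    writer.writerow(("artist_id", "artist_name"))
    for line in lines:
        # Split only once, at the first run of whitespace, so the name
        # keeps its own spaces and commas.
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # skip blank lines and lines without a name
            writer.writerow(parts)
    return buffer.getvalue()


# Hypothetical sample taken from the question's input file.
sample = ["1134999 06Crazy Life", "10113088 Terfel, Bartoli- Mozart: Don"]
print(to_quoted_csv(sample), end="")
```

To write a real file instead, the same writer could be built on open('try.csv', 'w', newline='') rather than on the in-memory buffer.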
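Answer 1 (score: 0)
Since the ids appear to be strictly numeric, a regular expression looks like a good approach. (Note that the following assumes you want to strip leading whitespace from the contents of the second column.)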
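import re

with open('demo.txt', mode='r') as inp, open('try.csv', 'w') as outp:
    for line in inp:
        # group 1: the numeric id; group 2: the rest of the line (the name)
        m = re.match(r'(\d+)\s+(.*)', line)
        if m:  # skip blank or malformed lines
            outp.write('"{}","{}"\n'.format(m.group(1), m.group(2)))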
Contents of try.csv after running:
"1134999","06Crazy Life"
"6821360","Pang Nakarin"
"10113088","Terfel, Bartoli- Mozart: Don"
"10151459","The Flaming Sidebur"
"6826647","Bodenstandig 3000"
"10186265","Jota Quest e Ivete Sangalo"
"6828986","Toto_XX (1977"
"10236364","U.S Bombs -"
"1135000","artist formaly know as Mat"