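Problem converting a text file to CSV when one column contains spaces and commas

Time: 2017-05-27 17:57:05

Tags: python csv

I'm new to Python and am trying to convert the text file below to a CSV file. The input text file has two columns: an id and a name. The second column may contain commas, digits, and spaces.

Input file: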

1134999 06Crazy Life
6821360 Pang Nakarin
10113088        Terfel, Bartoli- Mozart: Don
10151459        The Flaming Sidebur
6826647 Bodenstandig 3000
10186265        Jota Quest e Ivete Sangalo
6828986 Toto_XX (1977
10236364        U.S Bombs -
1135000 artist formaly know as Mat

I think this could be solved in one of these ways:

  1. Enclose both columns in double quotes.

    The expected result might be:

    "1134999","04Crazy Life"
    "6821360","Pang Nakarin"
    "10113088","Terfel,Bartoli-Mozart: Don"
    
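  2. Split each line after the first space, then apply double quotes to the second column only (the ID column contains no spaces or commas).

    The expected result might be: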

    1134999,"04Crazy Life"
    6821360,"Pang Nakarin"
    10113088,"Terfel,Bartoli-Mozart: Don"
    
  3. I tried the code below to double-quote both columns, but it quotes every whitespace-separated word, which is not what I want:

    import csv
    import itertools
    from StringIO import StringIO
    
    quotedData = StringIO()
    with open('demo.txt', 'r') as in_file:
        lines = in_file.read().splitlines()
        stripped = [line.replace(","," ").split() for line in lines]
        grouped = itertools.izip(*[stripped]*1)
        with open('try.csv', 'w') as out_file:
            writer = csv.writer(out_file, quotedData, quoting=csv.QUOTE_ALL)
            writer.writerow(('artist_id', 'artist_name'))
            for group in grouped:
                writer.writerows(group)
    

    Result:

    "artist_id","artist_name"
    "1134999","06Crazy","Life"
    "6821360","Pang","Nakarin"
    "10113088","Terfel","Bartoli-","Mozart:","Don"
    "10151459","The","Flaming","Sidebur"
    "6826647","Bodenstandig","3000"
    "10186265","Jota","Quest","e","Ivete","Sangalo"
    "6828986","Toto_XX","(1977"
    "10236364","U.S","Bombs","-"
    "1135000","artist","formaly","know","as","Mat"
    "10299728","Kassierer","-","Musik","für","beide","Ohren"
    
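The per-word quoting in the result above comes from calling `split()` with no arguments, which breaks a line at every run of whitespace. A small sketch of the difference, assuming Python 3 (the question's code targets Python 2): passing a `maxsplit` of 1 stops after the first whitespace run, so the rest of the line stays in one piece.

```python
line = "10113088        Terfel, Bartoli- Mozart: Don"

# The question's approach: commas replaced by spaces, then split on all whitespace
every_word = line.replace(",", " ").split()
print(every_word)

# Split only once: the id, then everything after the first whitespace run
id_and_name = line.split(None, 1)
print(id_and_name)
```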

2 Answers:

Answer 0: (score: 0)

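CSV stands for "comma-separated values", so by definition ',' is what separates column values. As a result, you cannot insert a value that contains a comma in a simple, direct way; the field would at least have to be quoted.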
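For reference, quoting is what makes a comma inside a value legal in CSV, and Python's standard `csv` module applies it automatically. A minimal sketch, assuming Python 3 and using an in-memory buffer in place of a real file (the sample row is taken from the question):

```python
import csv
import io

# In-memory buffer standing in for the output file
buffer = io.StringIO()
writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
writer.writerow(["10113088", "Terfel, Bartoli- Mozart: Don"])

csv_text = buffer.getvalue()
print(csv_text, end="")

# Reading the text back yields the original two fields, comma included
rows = list(csv.reader(io.StringIO(csv_text)))
print(rows)
```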

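Alternatively, depending on how the output file will be opened afterwards, you can use a delimiter other than ',', such as '\t' (and perhaps save the file as .tsv).

In Python, you can easily create such a file with pandas: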

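import pandas as pd

# Split each line once, at the first run of whitespace: id first, name second
with open('demo.txt') as in_file:
    rows = [line.strip().split(None, 1) for line in in_file if line.strip()]

outputDataFrame = pd.DataFrame(rows, columns=['artist_id', 'artist_name'])
outputDataFrame.to_csv('try.csv', sep='\t', index=False)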

Note that this way you do not need to manually strip any ',' from the input.
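The tab-separated idea does not strictly require pandas. A minimal sketch using only the standard `csv` module, assuming Python 3; the helper name `to_tsv` and the inline sample lines are illustrative and not part of the original answer:

```python
import csv
import io

def to_tsv(lines):
    """Return TSV text with one (artist_id, artist_name) row per input line."""
    out = io.StringIO()
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["artist_id", "artist_name"])
    for line in lines:
        # Split once, at the first run of whitespace: id first, name second
        parts = line.strip().split(None, 1)
        if len(parts) == 2:  # skip blank or incomplete lines
            writer.writerow(parts)
    return out.getvalue()

sample = [
    "1134999 06Crazy Life\n",
    "10113088        Terfel, Bartoli- Mozart: Don\n",
]
tsv_text = to_tsv(sample)
print(tsv_text, end="")
```

When writing to a real file instead of a buffer, the `csv` documentation recommends opening it with `newline=''`.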

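Answer 1: (score: 0)

Since the ids appear to be strictly numeric, a regular expression looks like a good fit. (Note that the following assumes you want to strip the leading whitespace from the second column.)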

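import re

with open('demo.txt', mode='r') as inp, open('try.csv', 'w') as outp:
    for line in inp:
        # Digits (the id), whitespace, then the rest of the line (the name)
        m = re.match(r'(\d+)\s+(.*)', line)
        if m is None:
            continue  # skip blank or malformed lines
        outp.write('"{}","{}"\n'.format(m.group(1), m.group(2)))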

Contents of the try.csv file after running:

"1134999","06Crazy Life"
"6821360","Pang Nakarin"
"10113088","Terfel, Bartoli- Mozart: Don"
"10151459","The Flaming Sidebur"
"6826647","Bodenstandig 3000"
"10186265","Jota Quest e Ivete Sangalo"
"6828986","Toto_XX (1977"
"10236364","U.S Bombs -"
"1135000","artist formaly know as Mat"