Text qualifiers end up in the wrong place when splitting a CSV file with Python

Time: 2018-07-03 04:51:57

Tags: python csv

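I am trying to split a large CSV file into smaller chunks and load the data into SQL so the chunks can be analyzed further. However, when I run the following code, the text qualifiers end up in the wrong place and corrupt the CSV files, so we cannot load the data: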

import csv

divisor = 500000

outfileno = 1
outfile = None

with open('mock_data.txt', 'r') as infile:
    infile_iter = csv.reader(infile)
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
            writer.writerow(header)
        writer.writerow(row)
    # Don't forget to close the last file
    if outfile is not None:
        outfile.close()

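The script runs fine on a smaller mock dataset (under 1,000 rows), but it does not work correctly on the large dataset. Suppose the dataset looks like this: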

"col1"  "col2"  "col3"  "col4"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"

When I run the script, it generates smaller chunks that look like this:

"col1   ""col2""    ""col3""    ""col4"""
"100    ""0100""    ""4900236731""  ""2019"""
"100    ""0100""    ""4900236731""  ""2019"""
"100    ""0100""    ""4900236731""  ""2019"""

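The text qualifiers are misplaced. Is there a way to fix this? Please note: I have tried other code to split the data, and it shows the same problem with this data.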

1 answer:

Answer 0: (score: 1)

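In Python 3.x, you should open CSV files with the argument newline=''. Your sample data is tab-delimited, but csv.reader and csv.writer assume a comma delimiter by default, so specify the tab delimiter with delimiter='\t' on both. For example: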

import csv

divisor = 500000
outfileno = 1
outfile = None

with open('mock_data.txt', 'r', newline='') as infile:
    infile_iter = csv.reader(infile, delimiter='\t')
    header = next(infile_iter)

    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile:
                outfile.close()

            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_ALL)
            writer.writerow(header)

        writer.writerow(row)

    # Don't forget to close the last file
    if outfile:
        outfile.close()
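
To see why the delimiter matters, here is a minimal sketch (not part of the original answer; the variable names are illustrative). It feeds one line of the sample data through csv.reader with the default comma delimiter, writes the result back, and compares that with reading the same line using delimiter='\t':

```python
import csv
import io

# One line of the tab-separated sample data from the question.
line = '"col1"\t"col2"\t"col3"\t"col4"'

# Default dialect: the reader looks for commas, finds none, and returns
# the whole line as a single field with the inner quotes kept literally.
wrong_row = next(csv.reader([line]))

# Writing that single field back: the writer has to quote it because it
# contains quote characters, and it doubles every embedded quote.
buf = io.StringIO()
csv.writer(buf).writerow(wrong_row)
mangled = buf.getvalue()

# With the correct delimiter the line splits into four clean fields.
right_row = next(csv.reader([line], delimiter='\t'))

print(wrong_row)      # a list with one long field
print(repr(mangled))  # the doubled-quote line seen in the broken chunks
print(right_row)      # ['col1', 'col2', 'col3', 'col4']
```

The mangled string has the same shape as the header line of the broken chunk files in the question, which suggests the default comma delimiter is what caused the misplaced qualifiers.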

To force every field to be quoted, use quoting=csv.QUOTE_ALL. That gives you output like the following, with every field enclosed in double quotes and separated by tabs:

"col1"  "col2"  "col3"  "col4"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"
"100"   "0100"  "4900236731"    "2019"

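As a small illustration of what quoting=csv.QUOTE_ALL changes, the sketch below writes the same row twice to an in-memory buffer, once with the writer's default quoting and once with QUOTE_ALL. The render helper is hypothetical and exists only for this comparison:

```python
import csv
import io

row = ['100', '0100', '4900236731', '2019']

def render(fields, **fmtparams):
    """Write one row with a tab-delimited csv.writer and return the text."""
    buf = io.StringIO()
    csv.writer(buf, delimiter='\t', **fmtparams).writerow(fields)
    return buf.getvalue()

# Default (csv.QUOTE_MINIMAL): quotes are added only where a field needs them.
minimal = render(row)

# csv.QUOTE_ALL: every field is wrapped in the quote character.
quoted = render(row, quoting=csv.QUOTE_ALL)

print(repr(minimal))
print(repr(quoted))
```

With the default quoting the text qualifiers from the input file would be dropped, because none of these fields needs quoting; QUOTE_ALL keeps the output in the same form as the input.
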
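You can verify the result by opening the chunk files in a text editor. If the data does not look as expected, the problem lies in your mock_data.txt file, and you would need to provide a link to a smaller sample that reproduces the issue.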
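
If a text editor makes tabs and spaces hard to tell apart, one option is to look at the raw lines from Python instead. The helper below is a hypothetical sketch, not part of the original answer; it returns the first few lines of a file without any CSV parsing, so printing them with repr() shows tabs as \t and line endings as \r\n or \n:

```python
def peek_raw_lines(path, count=3):
    """Return the first `count` lines of `path` exactly as stored on disk."""
    lines = []
    # newline='' disables newline translation, so line endings are preserved.
    with open(path, 'r', newline='') as f:
        for _ in range(count):
            line = f.readline()
            if not line:  # reached end of file early
                break
            lines.append(line)
    return lines

# Example usage (assumes the file exists):
# for raw in peek_raw_lines('mock_data.txt'):
#     print(repr(raw))
```

Running it on both mock_data.txt and one of the chunk files should make it clear which delimiter and quoting each file actually uses.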