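I am trying to split a large CSV file into smaller chunks and load the data into SQL so the chunks can be analyzed further. However, when I run the following code, the text qualifiers (the double quotes) end up in the wrong places and corrupt the CSV files, so the data cannot be loaded: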
import csv
divisor = 500000
outfileno = 1
outfile = None
with open('mock_data.txt', 'r') as infile:
    infile_iter = csv.reader(infile)
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile is not None:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w')
            outfileno += 1
            writer = csv.writer(outfile)
            writer.writerow(header)
        writer.writerow(row)
# Don't forget to close the last file
if outfile is not None:
    outfile.close()
The script runs fine on a small mock dataset (under 1,000 rows), but it does not work on the large dataset. Say the dataset looks like this:
"col1" "col2" "col3" "col4"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
When I run the script, it produces smaller chunks that look like this:
"col1 ""col2"" ""col3"" ""col4"""
"100 ""0100"" ""4900236731"" ""2019"""
"100 ""0100"" ""4900236731"" ""2019"""
"100 ""0100"" ""4900236731"" ""2019"""
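The text qualifiers are misplaced. Is there a way to fix this? Note: I have tried other code to split the data, and it has the same problem with this data.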
Answer 0 (score: 1)
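In Python 3.x, you should open CSV files with the argument newline=''. The input here is tab-separated, but csv.reader assumes commas by default, so each line is read as a single field whose embedded quotes are then doubled when it is written back out. A tab delimiter can be specified with delimiter='\t'. For example: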
import csv
divisor = 500000
outfileno = 1
outfile = None
with open('mock_data.txt', 'r', newline='') as infile:
    infile_iter = csv.reader(infile, delimiter='\t')
    header = next(infile_iter)
    for index, row in enumerate(infile_iter):
        if index % divisor == 0:
            if outfile:
                outfile.close()
            outfilename = 'big-{}.csv'.format(outfileno)
            outfile = open(outfilename, 'w', newline='')
            outfileno += 1
            writer = csv.writer(outfile, delimiter='\t', quoting=csv.QUOTE_ALL)
            writer.writerow(header)
        writer.writerow(row)
# Don't forget to close the last file
if outfile:
    outfile.close()
To force quoting of all fields, use quoting=csv.QUOTE_ALL. This gives output like the following, with every field wrapped in double quotes and separated by tabs:
"col1" "col2" "col3" "col4"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
"100" "0100" "4900236731" "2019"
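To see where the doubled quotes come from, here is a minimal in-memory sketch. It assumes the input really is tab-separated; the sample string and the round_trip helper are hypothetical stand-ins for mock_data.txt and the splitting loop, not part of the original script.

```python
import csv
import io

# Hypothetical two-line sample standing in for mock_data.txt (tab-separated).
sample = (
    '"col1"\t"col2"\t"col3"\t"col4"\r\n'
    '"100"\t"0100"\t"4900236731"\t"2019"\r\n'
)

def round_trip(text, delimiter=',', quoting=csv.QUOTE_MINIMAL):
    """Parse text with csv.reader and write the rows back with csv.writer."""
    reader = csv.reader(io.StringIO(text, newline=''), delimiter=delimiter)
    out = io.StringIO(newline='')
    writer = csv.writer(out, delimiter=delimiter, quoting=quoting)
    writer.writerows(reader)
    return out.getvalue()

# Default settings (comma delimiter): each line becomes ONE field, and the
# quotes embedded in that field are doubled when it is written back.
default_output = round_trip(sample)

# Tab delimiter plus QUOTE_ALL: the four fields survive the round trip.
fixed_output = round_trip(sample, delimiter='\t', quoting=csv.QUOTE_ALL)

print(default_output)
print(fixed_output)
```

With the default settings the output has the same shape as the broken chunks in the question, while the tab-delimited version reproduces the input unchanged.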
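This can be verified by opening the file in a text editor. If the data does not look as expected, there is a problem with your mock_data.txt file. You would need to provide a link to a smaller sample that reproduces the problem.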
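If a text editor hides the difference between tabs and spaces, one alternative is to print the first few raw lines with repr(). The sketch below is only an illustration: peek_raw_lines is a hypothetical helper name, and the demo runs on a throwaway temporary file rather than the real mock_data.txt.

```python
import itertools
import os
import tempfile

def peek_raw_lines(path, count=3):
    """Return the first `count` lines exactly as stored, line endings included."""
    # newline='' disables newline translation, so \r\n stays visible.
    with open(path, 'r', newline='') as f:
        return list(itertools.islice(f, count))

# Demo on a throwaway file; in practice, pass 'mock_data.txt' instead.
with tempfile.NamedTemporaryFile('w', newline='', suffix='.txt', delete=False) as tmp:
    tmp.write('"col1"\t"col2"\r\n"100"\t"0100"\r\n')
    demo_path = tmp.name

raw_lines = peek_raw_lines(demo_path)
for line in raw_lines:
    print(repr(line))  # repr() shows tabs as \t and line endings as \r\n

os.remove(demo_path)
```

If the printed lines show spaces or commas where \t was expected, the delimiter passed to csv.reader and csv.writer would need to be changed to match.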