Trying to split a CSV file and getting an "Error tokenizing data" error

Time: 2019-09-08 15:41:37

Tags: python pandas

I am trying to split one CSV file into multiple CSV files while keeping the header row in each of them.

The code I am trying is:

import pandas as pd

chunk_size = 500000
batch_no = 1
# file_path (the output path prefix) is not defined in the snippet as posted
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1

The error I get is:

Traceback (most recent call last):
  File "splitcsv.py", line 5, in <module>
    for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, encoding='utf-8'):
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1128, in __next__
    return self.get_chunk()
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1188, in get_chunk
    return self.read(nrows=size)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 1154, in read
    ret = self._engine.read(nrows)
  File "/usr/local/lib/python3.7/site-packages/pandas/io/parsers.py", line 2059, in read
    data = self._reader.read(nrows)
  File "pandas/_libs/parsers.pyx", line 881, in pandas._libs.parsers.TextReader.read
  File "pandas/_libs/parsers.pyx", line 908, in pandas._libs.parsers.TextReader._read_low_memory
  File "pandas/_libs/parsers.pyx", line 950, in pandas._libs.parsers.TextReader._read_rows
  File "pandas/_libs/parsers.pyx", line 937, in pandas._libs.parsers.TextReader._tokenize_rows
  File "pandas/_libs/parsers.pyx", line 2132, in pandas._libs.parsers.raise_parser_error
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 274, saw 2

1 Answer:

Answer 0: (score: 0)

You can try skipping the lines that cause the error by adding the error_bad_lines=False parameter to the pd.read_csv function. Your code would then look like this:

import pandas as pd

chunk_size = 500000
batch_no = 1
# file_path (the output path prefix) must be defined before this loop runs
for chunk in pd.read_csv('/Users/illys/Desktop/thefinal.csv', chunksize=chunk_size, error_bad_lines=False):
    chunk.to_csv(file_path + str(batch_no) + '.csv', index=False)
    batch_no += 1