Question

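When I read a ~35MB CSV with Pandas read_csv(), I get an error from the CParser saying the input file may be malformed. A sample is shown below; see the row containing "PNCBANK, NATL".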

UPDATE ----- When I save the file as a Windows CSV instead of the plain "comma-separated" file type, it runs completely fine with the 'c' engine.

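I also read in a sample of the CSV with the commas removed from every observation, and the problem persisted. So the commas that appear inside the strings below are not what causes this issue.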

685 201603 N 204602 0 1 O 80 44 134000 80 4.125 R N FRM IL SF 61900 F116Q1000024 P 360 2 Other sellers CENTRALMTGECO

776 201604 204603 0 1 O 46 47 108000 46 3.875 R N FRM CO SF 81200 F116Q1000025 C 360 1 Other sellers USBANKNA

693 201603 203102 0 1 S 21 44 81000 21 3.25 R N FRM CO PU 81100 F116Q1000026 N 180 2 Other sellers USBANKNA

715 201603 204602 0 1 S 75 46 63000 75 4.375 R N FRM CO CO 81100 F116Q1000027 P 360 1 Other sellers PNCBANK, NATL

691 201603 204602 30460 0 1 O 24 14 35000 24 3.875 R N FRM KY SF 40300 F116Q1000028 N 360 1 Other sellers Other servicers

758 201603 204602 0 2 I 75 36 85000 75 4.5 R N FRM KY SF 40300 F116Q1000029 P 360 2 Other sellers USBANKNA

However, when I try switching to the Python engine, I get a readlines error (the second error below).

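I believe this happens because one column in the file contains strings that occasionally include a comma, and the file delimiter is also a comma. If that is in fact the problem, how can I replace those commas with some other symbol, or remove them entirely, while leaving the rest of the file intact? I know which strings contain these commas, because they are a specific subset of that column's observations. Thanks!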

Error from the C engine of read_csv()

Traceback (most recent call last):
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 160, in <module>
    lender_by_msa = lender_PerformanceByMSA()
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 32, in lender_PerformanceByMSA
    date_col_fmt_dict={'firstPaymentDate': '%Y%m'}
  File "/Users/paltamura/Desktop/fmData/fmData/Load/load_loans.py", line 19, in load_data
    nrows=10000 if nrows == 'sample' else nrows
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

readlines() error from the Python engine of read_csv()

Traceback (most recent call last):
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 160, in <module>
    lender_by_msa = lender_PerformanceByMSA()
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 32, in lender_PerformanceByMSA
    date_col_fmt_dict={'firstPaymentDate': '%Y%m'}
  File "/Users/paltamura/Desktop/fmData/fmData/Load/load_loans.py", line 20, in load_data
    engine='python'
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1608, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1823, in _infer_columns
    line = self._buffered_line()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1975, in _buffered_line
    return self._next_line()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 2006, in _next_line
    orig_line = next(self.data)
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
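
The _csv.Error message above, together with the update that a Windows CSV loads fine, suggests that line endings rather than commas may be involved. The following is only a minimal sketch under that assumption: it supposes the file uses bare carriage returns (old Mac style), which the traceback does not confirm. The function name normalize_newlines and the two-row sample are hypothetical and are not taken from the original file.

```python
import csv
import io


def normalize_newlines(raw: bytes) -> bytes:
    """Convert CRLF and bare CR line endings to LF."""
    # Replace CRLF first so the CR of each pair is not turned into a second LF.
    return raw.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical two-row sample that uses bare CR line endings (assumption).
sample = b"685,201603,N\r776,201604,N\r"

cleaned = normalize_newlines(sample)

# Parse the normalized text with the standard csv module to check the rows.
rows = list(csv.reader(io.StringIO(cleaned.decode("utf-8"))))
```

If this assumption holds for the real file, the normalized bytes could be wrapped in io.BytesIO and handed to read_csv(), which accepts file-like objects.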

Answer 1

Python has a string replace method: cleaning_string = cleaning_string.replace(",", "")

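The command above replaces every comma with "" (nothing), or with whatever else you choose. Because replace() returns a new string, the result is assigned back to cleaning_string; the text is otherwise unchanged, just without the commas.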
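
Calling replace(",", "") on a whole line would also strip the delimiters, so here is a rough sketch of a narrower variant, assuming (as the question states) that the comma-containing strings are known in advance. The tuple KNOWN_COMMA_STRINGS, the function strip_known_commas, and the shortened four-column sample rows are hypothetical and used for illustration only.

```python
import csv
import io

# Hypothetical list of known values that contain a comma (assumption);
# extend it with the real subset from the affected column.
KNOWN_COMMA_STRINGS = ("PNCBANK, NATL",)


def strip_known_commas(text, known=KNOWN_COMMA_STRINGS):
    """Remove commas only inside the known strings, leaving delimiters alone."""
    for value in known:
        text = text.replace(value, value.replace(",", ""))
    return text


# Shortened, hypothetical rows; the real file has many more columns.
sample = (
    "715,201603,Other sellers,PNCBANK, NATL\n"
    "758,201603,Other sellers,USBANKNA\n"
)

cleaned = strip_known_commas(sample)

# Every row should now split into the same number of fields.
rows = list(csv.reader(io.StringIO(cleaned)))
```

The cleaned text could then be passed to read_csv() through io.StringIO, or written back to disk first. Passing a different replacement, such as a semicolon, would keep a visible separator inside the name.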

Remove commas from a CSV file before Pandas read_csv()
