Removing commas from a CSV file before Pandas read_csv()

Time: 2017-05-19 14:47:06

Tags: python csv pandas

When I read a ~35MB CSV with Pandas read_csv(), I get an error from the CParser saying the input file may be malformed. A sample is shown below; see "PNCBANK,NATL".

UPDATE ----- When I save the file as a Windows CSV instead of the "Comma Separated" file type, it runs perfectly fine with the 'c' engine.

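I read in a sample of the CSV with the commas removed from all observations, and the problem persists. So the commas that appear in the strings below are not what causes this issue.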

  

685 201603 N 204602 0 1 O 80 44 134000 80 4.125 R N FRM IL SF 61900 F116Q1000024 P 360 2 Other sellers CENTRALMTGECO

776 201604 204603 0 1 O 46 47 108000 46 3.875 R N FRM CO SF 81200 F116Q1000025 C 360 1 Other sellers USBANKNA

693 201603 203102 0 1 S 21 44 81000 21 3.25 R N FRM CO PU 81100 F116Q1000026 N 180 2 Other sellers USBANKNA

715 201603 204602 0 1 S 75 46 63000 75 4.375 R N FRM CO CO 81100 F116Q1000027 P 360 1 Other sellers PNCBANK,NATL

691 201603 204602 30460 0 1 O 24 14 35000 24 3.875 R N FRM KY SF 40300 F116Q1000028 N 360 1 Other sellers Other servicers

758 201603 204602 0 2 I 75 36 85000 75 4.5 R N FRM KY SF 40300 F116Q1000029 P 360 2 Other sellers USBANKNA

However, when I try switching to the Python engine, I get a readlines error (the second error below).

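I believe this is because one column in the file contains strings that occasionally include a comma, and the file delimiter is also a comma. If that is in fact the problem, how can I replace those commas with some other symbol, or remove them entirely, while keeping the rest of the file intact? I know which strings contain these commas, because they are a specific subset of that column's observations. Thanks!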

Error from the C engine of read_csv():
Traceback (most recent call last):
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 160, in <module>
    lender_by_msa = lender_PerformanceByMSA()
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 32, in lender_PerformanceByMSA
    date_col_fmt_dict={'firstPaymentDate': '%Y%m'}
  File "/Users/paltamura/Desktop/fmData/fmData/Load/load_loans.py", line 19, in load_data
    nrows=10000 if nrows == 'sample' else nrows
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 325, in _read
    return parser.read()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 815, in read
    ret = self._engine.read(nrows)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1314, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 805, in pandas.parser.TextReader.read (pandas/parser.c:8748)
  File "pandas/parser.pyx", line 827, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:9003)
  File "pandas/parser.pyx", line 881, in pandas.parser.TextReader._read_rows (pandas/parser.c:9731)
  File "pandas/parser.pyx", line 868, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:9602)
  File "pandas/parser.pyx", line 1865, in pandas.parser.raise_parser_error (pandas/parser.c:23325)
pandas.io.common.CParserError: Error tokenizing data. C error: Buffer overflow caught - possible malformed input file.

readlines() error from the Python engine of read_csv():

Traceback (most recent call last):
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 160, in <module>
    lender_by_msa = lender_PerformanceByMSA()
  File "/Users/paltamura/Desktop/fmData/fmData/exploratory/creditScore_descriptives.py", line 32, in lender_PerformanceByMSA
    date_col_fmt_dict={'firstPaymentDate': '%Y%m'}
  File "/Users/paltamura/Desktop/fmData/fmData/Load/load_loans.py", line 20, in load_data
    engine='python'
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 562, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 315, in _read
    parser = TextFileReader(filepath_or_buffer, **kwds)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 645, in __init__
    self._make_engine(self.engine)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 805, in _make_engine
    self._engine = klass(self.f, **self.options)
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1608, in __init__
    self.columns, self.num_original_columns = self._infer_columns()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1823, in _infer_columns
    line = self._buffered_line()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 1975, in _buffered_line
    return self._next_line()
  File "/anaconda/lib/python2.7/site-packages/pandas/io/parsers.py", line 2006, in _next_line
    orig_line = next(self.data)
_csv.Error: new-line character seen in unquoted field - do you need to open the file in universal-newline mode?
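
The `_csv.Error` above mentions universal-newline mode. Together with the update that a "Windows CSV" export parses fine, this suggests (an assumption, not something confirmed in the question) that the file uses bare carriage returns (`\r`) as line endings. A minimal sketch of normalizing line endings before parsing follows; the function name and the sample bytes are hypothetical, and only the standard library is used:

```python
import csv
import io


def normalize_newlines(raw_bytes):
    """Convert \r\n and bare \r line endings to \n."""
    # Replace \r\n first so it does not turn into two newlines.
    return raw_bytes.replace(b"\r\n", b"\n").replace(b"\r", b"\n")


# Hypothetical sample: two rows separated by bare carriage returns.
sample = b"685,201603,CENTRALMTGECO\r776,201604,USBANKNA\r"
cleaned = normalize_newlines(sample).decode("utf-8")

# The parser now sees two separate rows. The same text could be handed
# to pandas.read_csv through io.StringIO(cleaned).
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows)
```

Note that this also rewrites any carriage return inside a quoted field, so it only fits files where fields never contain line breaks.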

1 Answer:

Answer 0: (score: 0)

Python strings have a replace method:  cleaning_string = cleaning_string.replace(",", "")

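The command above replaces every comma with "" (nothing), or with anything else you like. The rest of the string stays as it was, just without the commas.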
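
Applied to the whole file, `replace(",", "")` would also strip the delimiters, so it only helps if it is limited to the values known to contain a comma. A rough sketch of that idea, assuming the affected values can be listed in advance (the tuple `KNOWN_VALUES_WITH_COMMAS`, the function name, and the shortened sample rows are all hypothetical):

```python
import csv
import io

# Hypothetical list of the field values known to contain a comma.
KNOWN_VALUES_WITH_COMMAS = ("PNCBANK,NATL",)


def strip_known_commas(text, known_values=KNOWN_VALUES_WITH_COMMAS):
    """Replace the comma inside each listed value with a space."""
    for value in known_values:
        text = text.replace(value, value.replace(",", " "))
    return text


# Shortened sample rows; the real file has many more columns.
sample = (
    "715,201603,Other sellers,PNCBANK,NATL\n"
    "758,201603,Other sellers,USBANKNA\n"
)
cleaned = strip_known_commas(sample)

# Every row now splits into the same number of fields. The cleaned text
# could be passed to pandas.read_csv through io.StringIO(cleaned).
rows = list(csv.reader(io.StringIO(cleaned)))
print(rows)
```

Replacing the comma with a space rather than deleting it keeps the two parts of the name readable; any character that is not the delimiter would work the same way.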