Unable to ... in python

Time: 2017-12-30 14:43:59

Tags: python pandas dataframe

I am trying to read the text data from "Pandas : populate column with if condition not working as expected" into a DataFrame. My code is:

dftxt = """
    0             1               2
1  10/1/2016    'stringvalue'     456
2  NaN          'anothersting'    NaN
3  NaN          'and another '    NaN
4  11/1/2016    'more strings'    943
5  NaN          'stringstring'    NaN
"""

import pandas as pd
from io import StringIO

df = pd.read_csv(StringIO(dftxt), sep='\s+')
print(df)

But I get the following error:

Traceback (most recent call last):
  File "mydf.py", line 16, in <module>
    df = pd.read_csv(StringIO(dftxt), sep='\s+')
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
    data = parser.read()
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
    ret = self._engine.read(nrows)
  File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
    data = self._reader.read(nrows)
  File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
  File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
  File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
  File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
  File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6

I cannot work out which 6 fields were read by mistake: Expected 4 fields in line 5, saw 6. Where is the problem, and how can I fix it?

1 answer:

Answer 0 (score: 1)

Line 5 is this one:

 3  NaN          'and another '    NaN
 1   2             3    4     5     6
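
To see where the six fields come from, the line can be split on whitespace by hand. This is only a sketch that uses Python's re module as a stand-in for the parser's tokenizer, not pandas' actual implementation; the names line and tokens are illustrative.

```python
import re

# Line 5 of the text in the question
line = "3  NaN          'and another '    NaN"

# Split on every run of whitespace, which is what the pattern \s+ matches
tokens = re.split(r"\s+", line.strip())

print(tokens)       # ['3', 'NaN', "'and", 'another', "'", 'NaN']
print(len(tokens))  # 6
```

The single spaces inside 'and another ' are matched as well, so that one quoted value is broken into three pieces.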

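The problem is with your separator. sep='\s+' treats every whitespace-separated word as a separate column, so the single spaces inside 'and another ' split that one value into three fields. In this case, you need to: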

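  • Change your sep parameter to \s{2,}, so that only runs of two or more whitespace characters separate fields
  • Set engine='python' to suppress the warning (the C engine does not support this regex separator, so pandas would otherwise fall back to the Python engine and warn about it)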

df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')
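
For reference, here is a self-contained sketch that combines the data from the question with this separator. It assumes the columns in the text are always separated by at least two spaces and that no value contains two consecutive spaces; if that does not hold for your data, the pattern would need adjusting.

```python
from io import StringIO

import pandas as pd

dftxt = """
    0             1               2
1  10/1/2016    'stringvalue'     456
2  NaN          'anothersting'    NaN
3  NaN          'and another '    NaN
4  11/1/2016    'more strings'    943
5  NaN          'stringstring'    NaN
"""

# Only runs of two or more whitespace characters separate fields, so the
# single spaces inside the quoted strings stay within one field.
df = pd.read_csv(StringIO(dftxt), sep=r"\s{2,}", engine="python")

print(df.shape)       # (5, 3)
print(df.iloc[2, 1])  # 'and another '
```

The header has three names while each data row has four fields, so the first field of each row is used as the index.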

In addition, I use str.strip to remove the quotes (they are redundant):

df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df

           0             1      2
1  10/1/2016   stringvalue  456.0
2        NaN  anothersting    NaN
3        NaN  and another     NaN
4  11/1/2016  more strings  943.0
5        NaN  stringstring    NaN

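Finally, from one pandas user to another, there is a handy little function called pd.read_clipboard that I think you should take a look at. It reads data from the clipboard and accepts every argument that read_csv does.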