I am trying to read the text data from "Pandas : populate column with if condition not working as expected" into a dataframe. My code is:
dftxt = """
    0          1               2
1   10/1/2016  'stringvalue'   456
2   NaN        'anothersting'  NaN
3   NaN        'and another '  NaN
4   11/1/2016  'more strings'  943
5   NaN        'stringstring'  NaN
"""
import pandas as pd
from io import StringIO
df = pd.read_csv(StringIO(dftxt), sep='\s+')
print(df)
But I get the following error:
Traceback (most recent call last):
File "mydf.py", line 16, in <module>
df = pd.read_csv(StringIO(dftxt), sep='\s+')
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 646, in parser_f
return _read(filepath_or_buffer, kwds)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 401, in _read
data = parser.read()
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 939, in read
ret = self._engine.read(nrows)
File "/usr/lib/python3/dist-packages/pandas/io/parsers.py", line 1508, in read
data = self._reader.read(nrows)
File "pandas/parser.pyx", line 848, in pandas.parser.TextReader.read (pandas/parser.c:10415)
File "pandas/parser.pyx", line 870, in pandas.parser.TextReader._read_low_memory (pandas/parser.c:10691)
File "pandas/parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas/parser.c:11437)
File "pandas/parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas/parser.c:11308)
File "pandas/parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas/parser.c:27037)
pandas.io.common.CParserError: Error tokenizing data. C error: Expected 4 fields in line 5, saw 6
I cannot work out which 6 fields are being read here: Expected 4 fields in line 5, saw 6. Where is the problem, and how can it be fixed?
Answer 0 (score: 1)
Line 5 is this one (the blank first line of the string counts as line 1):
3   NaN        'and another '  NaN
1   2          3    4       5  6
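The problem is your separator. With \s+, every whitespace-separated word is interpreted as a separate column, so the words inside the quoted string are split apart. In this case you need to change the sep argument to \s{2,} (two or more whitespace characters) and pass engine='python' to suppress the warning pandas raises when it falls back from the C engine for a regex separator:
df = pd.read_csv(StringIO(dftxt), sep=r'\s{2,}', engine='python')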
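To see why the two separators behave differently, here is a minimal sketch that uses only the standard re module rather than pandas. It assumes the columns in the sample text are separated by two or more spaces; the helper split_fields is a hypothetical name and only approximates how a regex separator splits a line.

```python
import re

# Sample text; the column spacing (two or more spaces) is an assumption.
dftxt = """
    0          1               2
1   10/1/2016  'stringvalue'   456
2   NaN        'anothersting'  NaN
3   NaN        'and another '  NaN
4   11/1/2016  'more strings'  943
5   NaN        'stringstring'  NaN
"""

def split_fields(line, pattern):
    """Hypothetical helper: split one line on a regex separator."""
    return re.split(pattern, line.strip())

# Line 5 of the text (the leading blank line is line 1).
line5 = dftxt.splitlines()[4]

fields_any_ws = split_fields(line5, r"\s+")     # any run of whitespace
fields_two_ws = split_fields(line5, r"\s{2,}")  # two or more whitespace chars

print(len(fields_any_ws), fields_any_ws)  # 6 fields: the quoted string is broken up
print(len(fields_two_ws), fields_two_ws)  # 4 fields: the quoted string stays whole
```

The single spaces inside 'and another ' are what push the count from 4 to 6 under \s+; requiring at least two whitespace characters leaves them alone.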
In addition, I use str.strip to remove the quotes (they are redundant):
df.iloc[:, 1] = df.iloc[:, 1].str.strip("'")
df
           0             1      2
1  10/1/2016   stringvalue  456.0
2        NaN  anothersting    NaN
3        NaN  and another     NaN
4  11/1/2016  more strings  943.0
5        NaN  stringstring    NaN
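As a small illustration of what stripping the quote character does, the sketch below applies Python's built-in str.strip to two of the sample values. The variable names are made up for this example. Note that strip("'") removes only leading and trailing quote characters, so the space before the closing quote in 'and another ' survives unless a space is added to the set of stripped characters.

```python
# Two sample values as they appear in the quoted column.
values = ["'stringvalue'", "'and another '"]

# Remove leading/trailing single quotes only.
without_quotes = [v.strip("'") for v in values]

# Remove leading/trailing single quotes and spaces.
without_quotes_or_spaces = [v.strip("' ") for v in values]

print(without_quotes)            # the trailing space in the second value remains
print(without_quotes_or_spaces)  # the trailing space is removed as well
```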
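Finally, from one pandas user to another: there is a handy little function called pd.read_clipboard that I think you should look at. It reads data from the clipboard and accepts every argument that read_csv does.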