I am having trouble reading some txt files with pandas.
The contents of my file look like this.
WNS 01.20
57039 108.8833 34.0833 445.8 LC 20150322120000
OOBS
00100 ///// ///// ////// /// /// ////////
00160 216.3 003.7 0006.5 100 100 -1.2E+02
00220 258.9 006.7 0006.6 100 100 -1.3E+02
00280 263.9 007.9 0006.6 100 100 -1.3E+02
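I do not need the first 3 lines, so I skip them and start reading from the "00100" line. Some rows have no data, which shows up as "////", and this can happen in any row.
Here is my code: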
import pandas as pd
data = pd.read_table(PathofMYFILE, delim_whitespace=True, skiprows=[0, 1, 2], header=None, comment='/')
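It works fine when "////" does not appear in the "00100" row (which is effectively the first data row). Wherever "///" appears, what I want is NaN.
However, as you can see, "///" appears in the first data row of this file, and then I get this error: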
File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 655, in parser_f
return _read(filepath_or_buffer, kwds)
File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 411, in _read
data = parser.read(nrows)
File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1005, in read
ret = self._engine.read(nrows)
File "D:\Anaconda3\lib\site-packages\pandas\io\parsers.py", line 1748, in read
data = self._reader.read(nrows)
File "pandas/_libs/parsers.pyx", line 890, in pandas._libs.parsers.TextReader.read (pandas\_libs\parsers.c:10862)
File "pandas/_libs/parsers.pyx", line 912, in pandas._libs.parsers.TextReader._read_low_memory (pandas\_libs\parsers.c:11138)
File "pandas/_libs/parsers.pyx", line 966, in pandas._libs.parsers.TextReader._read_rows (pandas\_libs\parsers.c:11884)
File "pandas/_libs/parsers.pyx", line 953, in pandas._libs.parsers.TextReader._tokenize_rows (pandas\_libs\parsers.c:11755)
File "pandas/_libs/parsers.pyx", line 2184, in pandas._libs.parsers.raise_parser_error (pandas\_libs\parsers.c:28765)
pandas.errors.ParserError: Error tokenizing data. C error: Expected 1 fields in line 5, saw 7
I need some help solving this. I even tried adding "error_bad_lines=False" to read_table, but it did not help.
Is there a better way to read these text files? Please help!
Answer 0: (score: 0)
Having saved the sample you posted as test.txt, I came up with a couple of solutions.
import pandas as pd
import functools

def main():
    data = pd.read_table(  # this will not fail, but doesn't produce NaNs
        'test.txt', delim_whitespace=True, skiprows=range(0, 3), header=None,
    )
    print(data)

    # force conversion to numbers on all rows, if it fails fills with NaNs
    data_numeric = data.apply(functools.partial(pd.to_numeric, errors='coerce'))
    print(data_numeric)

    # if you know all values to be read as NaN, you can just pass them...
    # to na_values
    data_with_na = pd.read_table(
        'test.txt', delim_whitespace=True, skiprows=range(0, 3), header=None,
        na_values=('/////', '//////', '///', '////////')
    )
    print(data_with_na)

if __name__ == '__main__':
    main()
Running it prints:
0 1 2 3 4 5 6
0 100 ///// ///// ////// /// /// ////////
1 160 216.3 003.7 0006.5 100 100 -1.2E+02
2 220 258.9 006.7 0006.6 100 100 -1.3E+02
3 280 263.9 007.9 0006.6 100 100 -1.3E+02
0 1 2 3 4 5 6
0 100 NaN NaN NaN NaN NaN NaN
1 160 216.3 3.7 6.5 100.0 100.0 -120.0
2 220 258.9 6.7 6.6 100.0 100.0 -130.0
3 280 263.9 7.9 6.6 100.0 100.0 -130.0
0 1 2 3 4 5 6
0 100 NaN NaN NaN NaN NaN NaN
1 160 216.3 3.7 6.5 100.0 100.0 -120.0
2 220 258.9 6.7 6.6 100.0 100.0 -130.0
3 280 263.9 7.9 6.6 100.0 100.0 -130.0
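On why the original call failed: with comment='/', everything from the first '/' to the end of a line is discarded, so the "00100" row is left with a single field. With header=None the parser appears to take its column count from that first row, which matches the "Expected 1 fields in line 5, saw 7" message. The sketch below only emulates that effect in plain Python; `fields_after_comment` is a hypothetical helper for illustration, not pandas' actual tokenizer.

```python
# Rough emulation of comment='/' followed by whitespace splitting.
# This is plain Python for illustration, not pandas' real tokenizer.
rows = [
    "00100 ///// ///// ////// /// /// ////////",
    "00160 216.3 003.7 0006.5 100 100 -1.2E+02",
    "00220 258.9 006.7 0006.6 100 100 -1.3E+02",
]


def fields_after_comment(line, comment="/"):
    """Drop everything from the first comment character on, then split."""
    return line.split(comment, 1)[0].split()


field_counts = [len(fields_after_comment(row)) for row in rows]
print(field_counts)  # [1, 7, 7]
```

The first data row ends up with 1 field while the following rows have 7, which is the mismatch the error reports. This is also why both solutions above drop the comment argument.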
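In summary, if you know beforehand which '/' strings should be parsed as NaN, it is best to pass them all to the na_values parameter of read_table.
The to_numeric solution is more of a brute-force approach, although you could improve it by restricting it to the rows that contain '/'.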
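One way to narrow the brute-force approach is sketched below, under a few assumptions: the sample is inlined through io.StringIO so the snippet runs on its own, read_csv with sep=r"\s+" stands in for read_table with delim_whitespace=True (which newer pandas releases deprecate), and the names `raw`, `is_placeholder`, and `clean` are made up for this example. Every field is read as text, only cells made up entirely of '/' are blanked, and the rest is converted strictly.

```python
import io

import pandas as pd

# Inline copy of the sample from the question, so the sketch is self-contained.
SAMPLE = """\
WNS 01.20
57039 108.8833 34.0833 445.8 LC 20150322120000
OOBS
00100 ///// ///// ////// /// /// ////////
00160 216.3 003.7 0006.5 100 100 -1.2E+02
00220 258.9 006.7 0006.6 100 100 -1.3E+02
00280 263.9 007.9 0006.6 100 100 -1.3E+02
"""

# Read every field as text so nothing is converted or dropped yet.
raw = pd.read_csv(
    io.StringIO(SAMPLE), sep=r"\s+", skiprows=3, header=None, dtype=str
)

# True where a cell consists only of '/' characters, whatever its length.
is_placeholder = raw.apply(lambda col: col.str.fullmatch(r"/+", na=False))

# Blank out the placeholders, then convert strictly: any other
# non-numeric text raises instead of silently becoming NaN.
clean = raw.mask(is_placeholder).apply(pd.to_numeric)
print(clean)
```

Compared with errors='coerce', this does not hide unexpected garbage in the file, and unlike a fixed na_values tuple it does not depend on knowing every possible run length of '/' in advance.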