Confusing error when pandas reads whitespace-delimited text

Time: 2017-06-26 14:55:30

Tags: pandas

pandas cannot read the following text:

NothGrassland Meteor Sites
MTCLIM v4.3 OUTPUT FILE : Mon Jun 26 16:57:31 2017
  year  yday    Tmax    Tmin    Tday    prcp      VPD     srad  daylen
             (deg C) (deg C) (deg C)    (cm)     (Pa)  (W m-2)     (s)
  1961     1  -24.08  -36.19  -27.41    0.00    36.81   128.45   28460
  1961     2  -16.08  -29.79  -19.85    0.02    75.12   135.12   28524
  1961     3  -16.08  -26.19  -18.86    0.05    65.86   118.79   28594
  1961     4  -23.58  -33.29  -26.25    0.00    34.87   116.98   28668
  1961     5  -24.28  -37.49  -27.91    0.00    37.27   163.75   28748
  1961     6  -20.68  -33.19  -24.12    0.01    49.79   133.63   28832
  1961     7  -19.48  -31.29  -22.73    0.18    53.78   131.91   28922

I read the text with the following code:

df=pd.read_csv(file,sep=' ',header=0,skiprows=[0,1,3])

It raises this error:

runfile('C:/temp/python/Models/GSI.py', wdir='C:/temp/python')
Traceback (most recent call last):

  File "<ipython-input-115-7bbdd08f49f8>", line 1, in <module>
    runfile('C:/temp/python/Models/GSI.py', wdir='C:/temp/python')

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/temp/python/Models/GSI.py", line 23, in <module>
    df=pd.read_csv(file,header=0,sep=' ')

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\pandas\io\parsers.py", line 646, in parser_f
    return _read(filepath_or_buffer, kwds)

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\pandas\io\parsers.py", line 401, in _read
    data = parser.read()

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\pandas\io\parsers.py", line 939, in read
    ret = self._engine.read(nrows)

  File "C:\Program Files\Winpython\python-3.6.1.amd64\lib\site-packages\pandas\io\parsers.py", line 1508, in read
    data = self._reader.read(nrows)

  File "pandas\parser.pyx", line 848, in pandas.parser.TextReader.read (pandas\parser.c:10415)

  File "pandas\parser.pyx", line 870, in  pandas.parser.TextReader._read_low_memory (pandas\parser.c:10691)

  File "pandas\parser.pyx", line 924, in pandas.parser.TextReader._read_rows (pandas\parser.c:11437)

  File "pandas\parser.pyx", line 911, in pandas.parser.TextReader._tokenize_rows (pandas\parser.c:11308)

  File "pandas\parser.pyx", line 2024, in pandas.parser.raise_parser_error (pandas\parser.c:27037)

CParserError: Error tokenizing data. C error: Expected 10 fields in line 3, saw 34

If I remove sep=' ' and use the following instead:

df=pd.read_csv(file,header=None,skiprows=4)

the code runs.

1 answer:

Answer 0 (score: 2)

For me, sep=r"\s+" or delim_whitespace=True works:

import pandas as pd
# pandas.compat.StringIO was removed in later pandas releases; use the standard library
from io import StringIO

temp=u"""NothGrassland Meteor Sites
MTCLIM v4.3 OUTPUT FILE : Mon Jun 26 16:57:31 2017
  year  yday    Tmax    Tmin    Tday    prcp      VPD     srad  daylen
             (deg C) (deg C) (deg C)    (cm)     (Pa)  (W m-2)     (s)
  1961     1  -24.08  -36.19  -27.41    0.00    36.81   128.45   28460
  1961     2  -16.08  -29.79  -19.85    0.02    75.12   135.12   28524
  1961     3  -16.08  -26.19  -18.86    0.05    65.86   118.79   28594
  1961     4  -23.58  -33.29  -26.25    0.00    34.87   116.98   28668
  1961     5  -24.28  -37.49  -27.91    0.00    37.27   163.75   28748
  1961     6  -20.68  -33.19  -24.12    0.01    49.79   133.63   28832
  1961     7  -19.48  -31.29  -22.73    0.18    53.78   131.91   28922"""
# after testing, replace StringIO(temp) with the file path, e.g. 'filename.csv'
# raw string so that \s is not treated as a Python string escape
df = pd.read_csv(StringIO(temp), sep=r"\s+", skiprows=[0,1,3], header=0)

print(df)
   year  yday   Tmax   Tmin   Tday  prcp    VPD    srad  daylen
0  1961     1 -24.08 -36.19 -27.41  0.00  36.81  128.45   28460
1  1961     2 -16.08 -29.79 -19.85  0.02  75.12  135.12   28524
2  1961     3 -16.08 -26.19 -18.86  0.05  65.86  118.79   28594
3  1961     4 -23.58 -33.29 -26.25  0.00  34.87  116.98   28668
4  1961     5 -24.28 -37.49 -27.91  0.00  37.27  163.75   28748
5  1961     6 -20.68 -33.19 -24.12  0.01  49.79  133.63   28832
6  1961     7 -19.48 -31.29 -22.73  0.18  53.78  131.91   28922
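
As for why sep=' ' fails: a single-space separator treats every individual space as a delimiter, so each run of padding spaces yields a series of empty fields. The sketch below imitates that with plain string splitting on the header line from the sample (it does not call the pandas parser itself, so treat it as an illustration only). The count it prints matches the "saw 34" in the error message, which suggests that this is where the number comes from.

```python
import re

# Header line copied from the sample file; columns are padded with runs of spaces.
header_line = "  year  yday    Tmax    Tmin    Tday    prcp      VPD     srad  daylen"

# Splitting on every single space: each extra padding space adds an empty field.
single_space_fields = header_line.split(" ")

# Splitting on runs of whitespace, which is what sep=r"\s+" asks for.
whitespace_fields = re.split(r"\s+", header_line.strip())

print(len(single_space_fields))  # 34
print(whitespace_fields)
# ['year', 'yday', 'Tmax', 'Tmin', 'Tday', 'prcp', 'VPD', 'srad', 'daylen']
```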

The same result with delim_whitespace=True:

# after testing, replace StringIO(temp) with the file path, e.g. 'filename.csv'
df = pd.read_csv(StringIO(temp), delim_whitespace=True, skiprows=[0,1,3], header=0)

print(df)
   year  yday   Tmax   Tmin   Tday  prcp    VPD    srad  daylen
0  1961     1 -24.08 -36.19 -27.41  0.00  36.81  128.45   28460
1  1961     2 -16.08 -29.79 -19.85  0.02  75.12  135.12   28524
2  1961     3 -16.08 -26.19 -18.86  0.05  65.86  118.79   28594
3  1961     4 -23.58 -33.29 -26.25  0.00  34.87  116.98   28668
4  1961     5 -24.28 -37.49 -27.91  0.00  37.27  163.75   28748
5  1961     6 -20.68 -33.19 -24.12  0.01  49.79  133.63   28832
6  1961     7 -19.48 -31.29 -22.73  0.18  53.78  131.91   28922
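
Note that the variant from the question without sep (header=None, skiprows=4) most likely runs only because the default separator is a comma and the file contains none, so each line ends up as one string in a single column rather than being parsed. A minimal sketch to check this, using a shortened copy of the sample (the names sample, raw and parsed are only for illustration):

```python
from io import StringIO

import pandas as pd

# Shortened copy of the sample file: two title lines, header, units line, two data rows.
sample = """NothGrassland Meteor Sites
MTCLIM v4.3 OUTPUT FILE : Mon Jun 26 16:57:31 2017
  year  yday    Tmax    Tmin    Tday    prcp      VPD     srad  daylen
             (deg C) (deg C) (deg C)    (cm)     (Pa)  (W m-2)     (s)
  1961     1  -24.08  -36.19  -27.41    0.00    36.81   128.45   28460
  1961     2  -16.08  -29.79  -19.85    0.02    75.12   135.12   28524
"""

# Default comma separator: no commas in the file, so every row is a single field.
raw = pd.read_csv(StringIO(sample), header=None, skiprows=4)
print(raw.shape)  # (2, 1)

# Whitespace separator: the nine columns are split as intended.
parsed = pd.read_csv(StringIO(sample), sep=r"\s+", skiprows=[0, 1, 3], header=0)
print(parsed.shape)  # (2, 9)
```

Checking df.shape or df.dtypes after reading is a quick way to tell a file that was parsed from one that was merely loaded.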