Cannot create a DataFrame with read_csv because the whitespace separation is not constant

Date: 2017-06-07 18:52:01

Tags: csv pandas dataframe

I want to turn this text file (philadelphia.txt) into a pandas DataFrame:

STATION           STATION_NAME                                       DATE     TAVG     TMAX     TMIN     
----------------- -------------------------------------------------- -------- -------- -------- -------- 
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999    74       47       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999    68       50       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999    72       50       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999    83       47       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999    86       55       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999    88       61       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999    83       70       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999    80       66       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999    80       64       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999    77       55       
GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999    79       49

However, if I use

data = pd.read_csv('philadelphia.txt', sep="\s+", header=0)

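it produces the correct header, but then runs into trouble because it splits the station name data. I want the full name kept under the column "STATION_NAME", but sep="\s+" splits it at every space and raises an error: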

pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
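The field counts in this error can be reproduced with plain string splitting. A minimal sketch, using the header and the first data row copied from philadelphia.txt above (the variable names are only illustrative):

```python
import re

# Header line and first data row, copied from philadelphia.txt.
header = "STATION           STATION_NAME                                       DATE     TAVG     TMAX     TMIN     "
row = "GHCND:USW00094732         PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999    74       47       "

# Split on runs of whitespace, which is what sep="\s+" asks read_csv to do.
header_fields = re.split(r"\s+", header.strip())
row_fields = re.split(r"\s+", row.strip())

print(len(header_fields))  # 6
print(len(row_fields))     # 11
```

The header and the row of dashes each yield 6 fields, so line 3 (the first data row) is the first line where the count disagrees.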

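How can I split the data into 6 columns without splitting the station name into individual words?

I would also like to be able to pass in other text files, for example (yellowknife.txt):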

STATION           STATION_NAME                                       DATE     TMAX     TMIN     
----------------- -------------------------------------------------- -------- -------- -------- 
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130117 -21      -35      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130118 -15      -21      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130119 -17      -29      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130120 -18      -28      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130121 -21      -34      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130122 -16      -30      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130123 -17      -28      
GHCND:CA002204101                                   YELLOWKNIFE A CA 20130124 -5       -17      

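1 Answer:

Answer 0: (score: 0)

Use the read_fwf() method, which reads fixed-width columns instead of splitting on a separator. The row of dashes is parsed as the first data row (index 0), so .drop(0) removes it: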

In [7]: df = pd.read_fwf(r'/path/to/file.csv').drop(0)

In [8]: df
Out[8]:
              STATION                                STATION_NAME      DATE   TAVG TMAX TMIN
1   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970605  -9999   74   47
2   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970606  -9999   68   50
3   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970608  -9999   72   50
4   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970609  -9999   83   47
5   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970610  -9999   86   55
6   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970611  -9999   88   61
7   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970612  -9999   83   70
8   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970613  -9999   80   66
9   GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970614  -9999   80   64
10  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970615  -9999   77   55
11  GHCND:USW00094732  PHILADELPHIA NE PHILADELPHIA AIRPORT PA US  19970616  -9999   79   49

Columns:

In [9]: df.columns.tolist()
Out[9]: ['STATION', 'STATION_NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
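Since the question also asks about files with a different set of columns (such as yellowknife.txt), one possible variation is to take the column boundaries directly from the row of dashes instead of relying on inference. This is only a sketch: read_dashed_fwf is a hypothetical helper name, and it assumes every input file has its header on line 1 and the dash ruler on line 2, as both samples in the question do.

```python
import re

import pandas as pd


def read_dashed_fwf(path):
    """Read a fixed-width text table whose second line is a ruler of dashes.

    Hypothetical helper: assumes line 1 is the header and line 2 is the ruler.
    """
    with open(path) as f:
        header = f.readline().rstrip("\n")
        ruler = f.readline()

    # Each run of dashes marks one column as a (start, end) pair, end exclusive.
    colspecs = [m.span() for m in re.finditer(r"-+", ruler)]
    # Slice the header with the same boundaries to recover the column names.
    names = [header[start:end].strip() for start, end in colspecs]

    # Skip the header and the ruler so that only data rows are parsed.
    return pd.read_fwf(path, colspecs=colspecs, names=names,
                       header=None, skiprows=2)
```

Calling read_dashed_fwf('philadelphia.txt') should give the six columns listed above, and read_dashed_fwf('yellowknife.txt') the five columns of that file. Because the dash line is skipped rather than parsed as data, the DATE and temperature columns can be inferred as integers.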