I want to turn this text file (philadelphia.txt) into a pandas DataFrame:
STATION STATION_NAME DATE TAVG TMAX TMIN
----------------- -------------------------------------------------- -------- -------- -------- --------
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999 74 47
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999 68 50
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999 72 50
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999 83 47
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999 86 55
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999 88 61
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999 83 70
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999 80 66
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999 80 64
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999 77 55
GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999 79 49
However, if I use
data = pd.read_csv('philadelphia.txt', sep="\s+", header=0)
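it produces the correct header, but then it runs into trouble with the station name. I want the whole name to end up under the "STATION_NAME" column, but sep="\s+" splits it at every space and raises an error: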
pandas.errors.ParserError: Error tokenizing data. C error: Expected 6 fields in line 3, saw 11
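How can I split the data into 6 columns without breaking the station name into individual words?
I would also like to be able to pass in other text files, for example (yellowknife.txt):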
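To see where the "saw 11" in the error comes from, here is a small sketch that splits the header and the first data row on whitespace, the same way sep="\s+" does. The variable names are only for illustration.

```python
import re

header = "STATION STATION_NAME DATE TAVG TMAX TMIN"
first_row = ("GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US "
             "19970605 -9999 74 47")

# Split on runs of whitespace, as sep="\s+" would
header_fields = re.split(r"\s+", header.strip())
row_fields = re.split(r"\s+", first_row.strip())

print(len(header_fields))  # 6 fields expected from the header
print(len(row_fields))     # 11 fields: the station name alone contributes 6
```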
STATION STATION_NAME DATE TMAX TMIN
----------------- -------------------------------------------------- -------- -------- --------
GHCND:CA002204101 YELLOWKNIFE A CA 20130117 -21 -35
GHCND:CA002204101 YELLOWKNIFE A CA 20130118 -15 -21
GHCND:CA002204101 YELLOWKNIFE A CA 20130119 -17 -29
GHCND:CA002204101 YELLOWKNIFE A CA 20130120 -18 -28
GHCND:CA002204101 YELLOWKNIFE A CA 20130121 -21 -34
GHCND:CA002204101 YELLOWKNIFE A CA 20130122 -16 -30
GHCND:CA002204101 YELLOWKNIFE A CA 20130123 -17 -28
GHCND:CA002204101 YELLOWKNIFE A CA 20130124 -5 -17
Answer 0 (score: 0)
Use the read_fwf() method, which reads fixed-width columns instead of splitting on a delimiter:
In [7]: df = pd.read_fwf(r'/path/to/file.csv').drop(0)
In [8]: df
Out[8]:
STATION STATION_NAME DATE TAVG TMAX TMIN
1 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970605 -9999 74 47
2 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970606 -9999 68 50
3 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970608 -9999 72 50
4 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970609 -9999 83 47
5 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970610 -9999 86 55
6 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970611 -9999 88 61
7 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970612 -9999 83 70
8 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970613 -9999 80 66
9 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970614 -9999 80 64
10 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970615 -9999 77 55
11 GHCND:USW00094732 PHILADELPHIA NE PHILADELPHIA AIRPORT PA US 19970616 -9999 79 49
Columns:
In [9]: df.columns.tolist()
Out[9]: ['STATION', 'STATION_NAME', 'DATE', 'TAVG', 'TMAX', 'TMIN']
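One thing to watch with the call above: the dashed row is read as data and only dropped afterwards, so the numeric columns will probably come back as strings. A possible variation, sketched below, takes the column boundaries from the dashed row itself and then skips that row while parsing. This is only a sketch under some assumptions: the second line of every file is the dashed ruler, -9999 marks a missing value, and each file has a DATE column in YYYYMMDD form. The function name read_station_file and the small sample file are made up for the demo.

```python
import re
import tempfile
from pathlib import Path

import pandas as pd


def read_station_file(path, missing="-9999"):
    """Read a fixed-width station file whose second line is a dashed ruler."""
    with open(path) as fh:
        fh.readline()          # first line: column names
        ruler = fh.readline()  # second line: runs of dashes, one per column
    # Each run of dashes gives one column as a half-open (start, end) span.
    colspecs = [m.span() for m in re.finditer(r"-+", ruler)]
    # Skip the ruler (row 1) and treat the assumed sentinel as missing.
    df = pd.read_fwf(path, colspecs=colspecs, skiprows=[1], na_values=[missing])
    df["DATE"] = pd.to_datetime(df["DATE"].astype(str), format="%Y%m%d")
    return df


# Demo: build a tiny file with the column widths implied by the dashed row
widths = [17, 50, 8, 8, 8, 8]
name = "PHILADELPHIA NE PHILADELPHIA AIRPORT PA US"
rows = [
    ["STATION", "STATION_NAME", "DATE", "TAVG", "TMAX", "TMIN"],
    ["-" * w for w in widths],
    ["GHCND:USW00094732", name, "19970605", "-9999", "74", "47"],
    ["GHCND:USW00094732", name, "19970606", "-9999", "68", "50"],
]
sample = "\n".join(
    " ".join(cell.ljust(w) for cell, w in zip(row, widths)) for row in rows
) + "\n"

with tempfile.TemporaryDirectory() as tmp:
    sample_path = Path(tmp) / "sample.txt"
    sample_path.write_text(sample)
    df = read_station_file(sample_path)

print(df.columns.tolist())
print(df.dtypes)
```

Because the column spans are taken from each file's own dashed row, the same function should also cope with yellowknife.txt, which has five columns instead of six.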