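I have a large time-series dataset with more than 200 recorded values (columns). Some values need to be averaged and some need to be summed, and I have a list that says which is which. I need help figuring out how to feed that list to the how= argument of resample.
Sample data: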
"Timestamp","TZ","TAO (degF)","RHO (%)","WS (mph)","WD (deg)","RAIN (mm)","OAP (hPa)","INSOL (W/m2)","HAIL (hits/cm2)"......."
2014/04/01 01:01:01.005,n,45.3,88.2,0,0.6,0.339,1.0108,-0.270342,0,68.147808,40.91662,68.15884,40.672356,66.55452,......
2014/04/01 01:02:01.027,n,45.3,88,0,3.4,0.339,1.0108,-0.124948,0,68.216736,40.929836,68.15884,40.656932,66.560072,.......
2014/04/01 01:03:01.050,n,45.3,88,0,1.7,0.34,1.0108,-0.145394,0,68.156064,40.890184,68.103736,40.68332,66.557296,......
The best I could come up with was to build a list of strings to pass to how=, but the string concatenation makes the call fail with a SeriesGroupBy error.
df = pandas.read_csv(parsedatafile, parse_dates = True, date_parser=lambda x: datetime.datetime.strptime(x, '%Y/%m/%d %H:%M:%S.%f') , index_col=0)
while i < len(recordname):
    if recordhow[i]=="Y":
        #parseavgsum[i]="sum"
        recordhow[i]=str(recordname[i])+str(": sum")
    else:
        recordhow[i]=str(recordname[i])+str(": mean")
        #parseavgsum[i]="mean"
    i+=1
df2=df.resample('60Min', how = recordhow)
Answer 0 (score: 2)
I would pass how a dictionary:
>>> df
WD (deg) RAIN (mm)
Timestamp
2014-04-01 01:01:01.005000 40.916620 68.158840
2014-04-01 01:02:01.027000 40.929836 68.158840
2014-04-01 01:03:01.050000 40.890184 68.103736
[3 rows x 2 columns]
>>> what_to_do = {"WD (deg)": "mean", "RAIN (mm)": "sum"}
>>> df.resample("60Min", how=what_to_do)
RAIN (mm) WD (deg)
Timestamp
2014-04-01 01:00:00 204.421416 40.912213
[1 rows x 2 columns]
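For anyone reading this on a newer pandas: the how= keyword of resample was deprecated and later removed, so the call above may not run as written. A minimal sketch of the same idea, assuming a recent pandas where the dictionary goes to .agg() on the resampler instead (the frame below just rebuilds the three sample rows; the variable name hourly is mine):

```python
import pandas as pd

# Rebuild the three sample rows from the example above.
index = pd.to_datetime(
    [
        "2014/04/01 01:01:01.005",
        "2014/04/01 01:02:01.027",
        "2014/04/01 01:03:01.050",
    ],
    format="%Y/%m/%d %H:%M:%S.%f",
)
df = pd.DataFrame(
    {
        "WD (deg)": [40.916620, 40.929836, 40.890184],
        "RAIN (mm)": [68.158840, 68.158840, 68.103736],
    },
    index=index,
)
df.index.name = "Timestamp"

# Same column -> aggregation mapping as in the answer.
what_to_do = {"WD (deg)": "mean", "RAIN (mm)": "sum"}

# Assumption: newer pandas, where the dict is passed to .agg()
# on the resampler rather than to how=.
hourly = df.resample("60min").agg(what_to_do)
print(hourly)
```

The dictionary itself is unchanged; only the place where it is handed to pandas differs.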
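I think using a recordhow list like yours is a bit dangerous, because it is easy for the columns to get shuffled accidentally, in which case your means and sums will be applied to the wrong columns. Using column names is safer. But if you do have recordhow, you can do something like this: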
>>> recordhow = ["N", "Y"]
>>> how_map = {"Y": "sum", "N": "mean"}
>>> what_to_do = dict(zip(df.columns, [how_map[x] for x in recordhow]))
>>> what_to_do
{'RAIN (mm)': 'sum', 'WD (deg)': 'mean'}
But again, I would recommend moving away, as quickly as possible, from a list where it is not clear what maps to what.
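One way to do that, sketched here under the assumption that recordname holds the exact column names and lines up one-to-one with recordhow (as the loop in the question suggests), is to key the dictionary on recordname itself instead of on df.columns, so the result does not depend on column order in the frame. The two short lists are stand-ins for the real 200+ entries:

```python
# Stand-in parallel lists; in the question these have 200+ entries.
recordname = ["WD (deg)", "RAIN (mm)"]
recordhow = ["N", "Y"]  # "Y" means sum, anything mapped to "N" means average

how_map = {"Y": "sum", "N": "mean"}

# Fail early if the two lists have drifted out of step.
if len(recordname) != len(recordhow):
    raise ValueError("recordname and recordhow must have the same length")

# Pair each name with its own flag, independent of DataFrame column order.
what_to_do = {name: how_map[flag] for name, flag in zip(recordname, recordhow)}
print(what_to_do)
```

Printing what_to_do once and reading it over is a cheap sanity check before resampling.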