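I have some weather datasets made up of several columns:
StationID, elevation, datetime, longitude, latitude, rainfall
I have multiple stations, each identified by its own ID. The rainfall column is cumulative. For example, for station X over 10 days I might have (in mm/day):
Station X, 0 0 0 1 5 6 6 8 8 15
and for station Y I might have
Station Y, 0 1 14 14 14 15 18 18 18 20
But what I need are intensity values, i.e. each day's amount minus the previous day's. That would give the following values for stations X and Y (the first value starts at 0):
Station X, 0 0 0 1 4 1 0 2 0 7
Station Y, 0 1 13 0 0 1 3 0 0 2
I wrote a function that takes a time series and computes this difference: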
def intensity(ts):
    # Difference between consecutive values; the first value is set to 0.
    ts2 = [0]
    for i in range(len(ts) - 1):
        ts2.append(ts[i + 1] - ts[i])
    return ts2
test = [1,2,3,4,5,10,10,10,20,25]
intensity(test)
Now, my question is: how do I apply this function to the 'rainfall' column of each station group in my DataFrame, i.e.:
dfg = df.groupby('station')
and then assign the output to a new column in the DataFrame (for example, a 'rain_intensity' column)?
Answer 0 (score: 1)
I think you need:
print (df.groupby('station')['rainfall'].apply(intensity))
But it is better to use diff, then fillna to replace the NaN values with 0, and finally cast to int if needed:
print (df.groupby('StationID')['rainfall'].diff().fillna(0))
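To make the role of each step visible, here is a minimal sketch on a small made-up DataFrame (the six values below are illustrative, not the asker's data). It shows that a grouped diff leaves NaN in the first row of every group and returns floats, which is why fillna and astype follow:

```python
import pandas as pd

# Illustrative data only: three cumulative readings per station.
df = pd.DataFrame({
    'StationID': ['station X'] * 3 + ['station Y'] * 3,
    'rainfall': [0, 1, 5, 0, 14, 15],
})

# Step 1: difference within each station; the first row of each group has
# no previous value, so it becomes NaN and the dtype becomes float.
raw = df.groupby('StationID')['rainfall'].diff()
print(raw.tolist())

# Step 2: replace those NaN values with 0, then cast back to int.
as_int = raw.fillna(0).astype(int)
print(as_int.tolist())
```

The difference never crosses from one station into the next, because diff is computed separately inside each group.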
Sample:
import pandas as pd

df = pd.DataFrame({'rainfall': [0, 0, 0, 1, 5, 6, 6, 8, 8, 15, 0, 1, 14, 14, 14, 15, 18, 18, 18, 20],
                   'StationID': ['station X'] * 10 + ['station Y'] * 10})
print (df)
StationID rainfall
0 station X 0
1 station X 0
2 station X 0
3 station X 1
4 station X 5
5 station X 6
6 station X 6
7 station X 8
8 station X 8
9 station X 15
10 station Y 0
11 station Y 1
12 station Y 14
13 station Y 14
14 station Y 14
15 station Y 15
16 station Y 18
17 station Y 18
18 station Y 18
19 station Y 20
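def intensity(ts):
    # ts is the rainfall Series of one station; work on a plain list.
    ts = ts.tolist()
    ts2 = [0]
    for i in range(len(ts) - 1):
        ts2.append(ts[i + 1] - ts[i])
    # Return a Series so the per-group results can be concatenated.
    return pd.Series(ts2)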
df['diff1'] = df.groupby('StationID')['rainfall'].apply(intensity).reset_index(drop=True)
df['diff2'] = df.groupby('StationID')['rainfall'].diff().fillna(0).astype(int)
print (df)
StationID rainfall diff1 diff2
0 station X 0 0 0
1 station X 0 0 0
2 station X 0 0 0
3 station X 1 1 1
4 station X 5 4 4
5 station X 6 1 1
6 station X 6 0 0
7 station X 8 2 2
8 station X 8 0 0
9 station X 15 7 7
10 station Y 0 0 0
11 station Y 1 1 1
12 station Y 14 13 13
13 station Y 14 0 0
14 station Y 14 0 0
15 station Y 15 1 1
16 station Y 18 3 3
17 station Y 18 0 0
18 station Y 18 0 0
19 station Y 20 2 2
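One caveat about the diff1 line above: after reset_index(drop=True) the values are assigned by position, so it relies on the rows of each station sitting together in the same order in which groupby sorts the groups. A minimal sketch of an alternative, assuming the rows may be interleaved (the data below is made up for illustration), is to return a Series that carries the group's own index and use transform:

```python
import pandas as pd

# Illustrative data only: readings of the two stations are interleaved.
df = pd.DataFrame({
    'StationID': ['station X', 'station Y', 'station X',
                  'station Y', 'station X', 'station Y'],
    'rainfall': [0, 0, 1, 14, 5, 15],
})

def intensity(ts):
    """Day-to-day increase of a cumulative series; the first value is 0."""
    values = ts.tolist()
    out = [0] + [b - a for a, b in zip(values, values[1:])]
    # Keep the group's original index so results align with the source rows.
    return pd.Series(out, index=ts.index)

# transform returns a result aligned with df's index, whatever the row order.
df['rain_intensity'] = df.groupby('StationID')['rainfall'].transform(intensity)

# The built-in version gives the same numbers and is also index-aligned.
df['rain_intensity_diff'] = (
    df.groupby('StationID')['rainfall'].diff().fillna(0).astype(int)
)
print(df)
```

Both new columns agree here; the diff version is shorter and avoids the Python-level loop.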