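Apologies for the title; I couldn't think of a better way to phrase it. This is a problem I've run into in several different forms without ever finding a satisfying answer.
For example, say I've been tracking how many cups of tea and coffee I drank over a week: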
In [17]: import random
    ...: import pandas as pd
    ...: test = pd.DataFrame({
    ...:     'drink' : ['tea'] * 5 + ['coffee'] * 5,
    ...:     'day' : ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    ...:     'cups' : [random.randrange(1, 10) for _ in range(10)]
    ...: })
    ...: test
Out[17]:
drink day cups
0 tea monday 1
1 tea tuesday 3
2 tea wednesday 1
3 tea thursday 7
4 tea friday 1
5 coffee monday 8
6 coffee tuesday 1
7 coffee wednesday 2
8 coffee thursday 1
9 coffee friday 1
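To compare the amounts, I'd like to normalize them. I can easily normalize by dividing by each day's total; this is pretty much the standard example of normalization in pandas: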
In [18]: test['day_norm'] = test.groupby('day')['cups'].transform(lambda x: x / x.sum())
In [19]: test
Out[19]:
drink day cups day_norm
0 tea monday 1 0.111111
1 tea tuesday 3 0.750000
2 tea wednesday 1 0.333333
3 tea thursday 7 0.875000
4 tea friday 1 0.500000
5 coffee monday 8 0.888889
6 coffee tuesday 1 0.250000
7 coffee wednesday 2 0.666667
8 coffee thursday 1 0.125000
9 coffee friday 1 0.500000
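But let's say I want to see how the values change over the week by dividing each group by its Monday value, i.e. I want Monday to be 1 and every other day expressed relative to it. I've managed to come up with two different approaches, both of which feel convoluted.
One: I can write a function that filters the group's data frame to find the Monday value and then divides the series by it: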
In [20]: def normalize(df):
    ...:     monday_cups = df[df['day'] == 'monday']['cups'].mean()
    ...:     return df['cups'] / monday_cups
    ...:
    ...: test['normalized cups'] = test.groupby('drink').apply(normalize).reset_index(level=0, drop=True)
    ...: test
Out[20]:
drink day cups day_norm normalized cups
0 tea monday 1 0.111111 1.000
1 tea tuesday 3 0.750000 3.000
2 tea wednesday 1 0.333333 1.000
3 tea thursday 7 0.875000 7.000
4 tea friday 1 0.500000 1.000
5 coffee monday 8 0.888889 1.000
6 coffee tuesday 1 0.250000 0.125
7 coffee wednesday 2 0.666667 0.250
8 coffee thursday 1 0.125000 0.125
9 coffee friday 1 0.500000 0.125
But getting the index to match the original data frame's index takes a lot of index juggling.
Two: I can reshape the data into a wide-format table:
In [14]: summary = test.drop(columns=['normalized cups']).groupby(['drink', 'day'])['cups'].mean().unstack()
In [15]: summary
Out[15]:
day friday monday thursday tuesday wednesday
drink
coffee 8 7 7 8 4
tea 9 9 4 8 4
The division then becomes simpler, but I have to do some work to get it back into the original format:
In [16]: summary.apply(lambda x: x / summary['monday']).stack().to_frame('normalized_cups').reset_index()
Out[16]:
drink day normalized_cups
0 coffee friday 1.142857
1 coffee monday 1.000000
2 coffee thursday 1.000000
3 coffee tuesday 1.142857
4 coffee wednesday 0.571429
5 tea friday 1.000000
6 tea monday 1.000000
7 tea thursday 0.444444
8 tea tuesday 0.888889
9 tea wednesday 0.444444
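For reference, the wide-format route can probably be shortened a little. The following is only a sketch under a few assumptions: the data is fixed to the values shown in the first table above (instead of random), and `normalized`, `long_form` and `result` are names invented for this illustration. It uses `DataFrame.div` with `axis=0` for the division and a left merge to attach the result back to the original rows.

```python
import pandas as pd

# Fixed sample data copied from the first table in the question (assumption:
# used instead of random values so the result is reproducible).
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
})

# Wide table: one row per drink, one column per day.
summary = test.groupby(['drink', 'day'])['cups'].mean().unstack()

# Divide every day column by the monday column, aligning on the drink index.
normalized = summary.div(summary['monday'], axis=0)

# Back to long format: one row per (drink, day) pair.
long_form = normalized.stack().rename('normalized_cups').reset_index()

# Attach the normalized values to the original rows, keeping their order.
result = test.merge(long_form, on=['drink', 'day'], how='left')
print(result)
```

The merge on `['drink', 'day']` replaces the manual index handling, at the cost of building an intermediate table.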
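Is there a more elegant way to do this? I have a vague idea about sorting the data frame so that Monday comes first and then doing something involving groupby and first, but I can't quite work it out!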
Answer 0 (score: 1)
This is what I would do:
t2=test.loc[test.day=='monday',['drink','cups']].groupby('drink').cups.mean()
t2
Out[1282]:
drink
coffee 8
tea 1
Name: cups, dtype: int64
test['normalized_cups']=test.cups/t2.reindex(test.drink).values
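A possible variation on the same idea, shown here only as a sketch: `Series.map` can look up each row's drink in the Monday baseline directly, which avoids the `reindex(...).values` step because the mapped series keeps the index of `test`. The sample values are copied from the first table in the question, and the variable name `monday` is an assumption made for this example.

```python
import pandas as pd

# Fixed sample data copied from the first table in the question (assumption:
# used instead of random values so the result is reproducible).
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
})

# Monday baseline per drink, indexed by drink.
monday = test.loc[test['day'] == 'monday'].groupby('drink')['cups'].mean()

# Map each row's drink to its baseline; the result is aligned with test's
# index, so the division is a plain element-wise operation.
test['normalized_cups'] = test['cups'] / test['drink'].map(monday)
print(test)
```

With this data, the tea rows are divided by 1 and the coffee rows by 8, matching the `normalized cups` column in the question.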
Answer 1 (score: 1)
Try:
df['normalized_cups'] = df.groupby('drink').cups.apply(lambda x: x/x.iloc[0])
This assumes that monday comes first in each group.
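If the rows are not guaranteed to have monday first, one way to drop that assumption (a sketch, not part of the answer above) is to blank out the non-monday rows and broadcast each drink's remaining value with `transform('first')`, which takes the first non-null entry per group. The sample values are copied from the first table in the question, the shuffle is there only to show that row order does not matter, and `monday_cups` and `baseline` are names invented for this illustration.

```python
import pandas as pd

# Fixed sample data copied from the first table in the question, then shuffled
# so that monday is no longer guaranteed to be the first row of each group.
test = pd.DataFrame({
    'drink': ['tea'] * 5 + ['coffee'] * 5,
    'day': ['monday', 'tuesday', 'wednesday', 'thursday', 'friday'] * 2,
    'cups': [1, 3, 1, 7, 1, 8, 1, 2, 1, 1],
}).sample(frac=1, random_state=0)

# Keep cups only on monday rows; every other row becomes NaN.
monday_cups = test['cups'].where(test['day'] == 'monday')

# 'first' returns the first non-null value per group, and transform
# broadcasts it back to every row of that group.
baseline = monday_cups.groupby(test['drink']).transform('first')

test['normalized_cups'] = test['cups'] / baseline
print(test.sort_index())
```

This relies on there being exactly one monday row per drink; with several monday rows, only the first one encountered would be used as the baseline.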
Answer 2 (score: 1)