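I am trying to multiply (repeat) rows in a DataFrame based on a condition column.
For example, when the value in the condition column is 2, I want to replace that row with two identical rows and set condition to 1 in each new row.
Example DataFrame: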
import pandas as pd

df = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                   'condition': [1, 1, 3, 2],
                   's': ['a', 'b', 'c', 'd']})
condition k s
1 K0 a
1 K1 b
3 K1 c
2 K2 d
Expected result:
condition k s
1 K0 a
1 K1 b
1 K1 c
1 K1 c
1 K1 c
1 K2 d
1 K2 d
Is it possible to do this efficiently in place, without creating a temporary df?
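Answer 0 (score: 1)
You can repeat the index labels with np.repeat and select the repeated labels with loc:
import numpy as np

df = df.loc[np.repeat(df.index.values, df.condition)].reset_index(drop=True)
df['condition'] = 1
print(df)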
condition k s
0 1 K0 a
1 1 K1 b
2 1 K1 c
3 1 K1 c
4 1 K1 c
5 1 K2 d
6 1 K2 d
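The same idea can be sketched with Index.repeat instead of np.repeat, wrapped in a small helper. The name expand_rows and its count_col parameter are hypothetical, not part of the answer above, and the sketch assumes the index is unique and the count column holds non-negative integers. As far as I know, pandas cannot change the number of rows truly in place, so the result is reassigned to df here as well.

```python
import pandas as pd


def expand_rows(frame, count_col="condition"):
    """Repeat each row frame[count_col] times, then set count_col to 1.

    Hypothetical helper for illustration; assumes a unique index and
    non-negative integer counts in count_col.
    """
    # Index.repeat yields each index label as many times as its count,
    # keeping the original row order.
    repeated_labels = frame.index.repeat(frame[count_col])
    # Selecting repeated labels with .loc duplicates the matching rows.
    out = frame.loc[repeated_labels].reset_index(drop=True)
    out[count_col] = 1
    return out


df = pd.DataFrame({'k': ['K0', 'K1', 'K1', 'K2'],
                   'condition': [1, 1, 3, 2],
                   's': ['a', 'b', 'c', 'd']})
df = expand_rows(df)
print(df)
```

A row whose count is 0 is dropped entirely, which may or may not be what you want.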
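Another solution uses groupby and concat on the condition column and sets the condition column to 1 at the end, but it is slower:
# Parentheses allow the method chain to span several lines.
df = (df.groupby('condition', as_index=False, sort=False)
        .apply(lambda x: pd.concat([x] * x.condition.values[0], ignore_index=True))
        .reset_index(drop=True))
df['condition'] = 1
print(df)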
condition k s
0 1 K0 a
1 1 K1 b
2 1 K1 c
3 1 K1 c
4 1 K1 c
5 1 K2 d
6 1 K2 d
Timings:
In [917]: %timeit df.loc[np.repeat(df.index.values,df.condition)].reset_index(drop=True)
The slowest run took 4.55 times longer than the fastest. This could mean that an intermediate result is being cached
1000 loops, best of 3: 1.04 ms per loop
In [918]: %timeit df.groupby('condition', as_index=False, sort=False).apply(lambda x: pd.concat([x]*x.condition.values[0], ignore_index=True)).reset_index(drop=True)
100 loops, best of 3: 7.78 ms per loop
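To check timings like these outside IPython, a rough sketch with the standard timeit module might look like the following. The 10,000-row input and the function names are assumptions made for illustration, and the second variant uses Index.repeat, which is not in the answer above. Absolute numbers depend on the machine and the pandas version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical larger input: 10,000 rows with repeat counts from 1 to 3.
rng = np.random.default_rng(0)
big = pd.DataFrame({'condition': rng.integers(1, 4, size=10_000),
                    's': np.arange(10_000)})


def with_np_repeat(frame):
    # Approach from the answer: repeat index values with np.repeat.
    labels = np.repeat(frame.index.values, frame['condition'])
    return frame.loc[labels].reset_index(drop=True)


def with_index_repeat(frame):
    # Variant: let the Index object repeat its own labels.
    labels = frame.index.repeat(frame['condition'])
    return frame.loc[labels].reset_index(drop=True)


# Both variants should produce the same rows in the same order.
assert with_np_repeat(big).equals(with_index_repeat(big))

for func in (with_np_repeat, with_index_repeat):
    seconds = timeit.timeit(lambda: func(big), number=20)
    print(f"{func.__name__}: {seconds / 20 * 1e3:.2f} ms per call")
```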
emp_num trans_date day_type
5667 2016-03-01 1
5667 2016-03-02 1
5667 2016-03-03 1
5667 2016-03-04 3
5667 2016-03-05 3
5667 2016-03-06 1
5667 2016-03-07 1
5667 2016-03-08 1
5667 2016-03-09 1
5667 2016-03-10 1
5667 2016-03-11 3
5667 2016-03-12 3
5667 2016-03-13 1
5667 2016-03-14 1
5667 2016-03-15 1
5667 2016-03-16 1
5667 2016-03-17 1
5667 2016-03-18 3
5667 2016-03-19 3
5667 2016-03-20 1
5667 2016-03-21 1
5667 2016-03-22 1
5667 2016-03-23 1
5667 2016-03-24 1
5667 2016-03-25 3
5667 2016-03-26 3
5667 2016-03-27 1
5667 2016-03-28 1
5667 2016-03-29 1
5667 2016-03-30 1
5667 2016-03-31 1