I have a pandas DataFrame:
import pandas as pd
df = pd.DataFrame([['@1','A',2],['@2','A',1],['@3','A',4],['@4','B',1],
                   ['@5','B',1],['@6','B',3],['@7','B',3],['@8','C',4]],
                  columns=['id','channel','people'])
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4
3  @4       B       1
4  @5       B       1
5  @6       B       3
6  @7       B       3
7  @8       C       4
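I want to take rows from the top so that the sum of their `people` values does not exceed a given number, trimming the last row if needed.
So my code is: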
num = 5  # the sum of column 'people' should be <= num
rows = []
for i in range(len(df)):
    num = num - df.loc[i, 'people']
    if num > 0:
        rows.append(df.loc[i].copy(deep=True))
    elif num == 0:
        rows.append(df.loc[i].copy(deep=True))
        break
    else:
        rows.append(df.loc[i].copy(deep=True))
        rows[i]['people'] = num + df.loc[i, 'people']  # trim the last row
        break
dfnew = pd.DataFrame(rows, columns=df.columns)
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       2
But I feel that what I wrote is too complicated.
Can you suggest a better algorithm?
Thanks.
Answer 0 (score: 1)
df = pd.DataFrame([['@1','A',2],['@2','A',1],['@3','A',4],['@4','B',1],
['@5','B',1],['@6','B',3],['@7','B',3],['@8','C',4]],
columns=['id','channel','people'])
>>> df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4
3  @4       B       1
4  @5       B       1
5  @6       B       3
6  @7       B       3
7  @8       C       4
# Get rows including the one that goes beyond the threshold
new_df = df[df.people.cumsum().shift(1).fillna(0) < 5].copy()
>>> new_df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4
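# Limit value of last row to match threshold condition
# (clip_upper was removed in pandas 1.0; clip(upper=...) is the replacement)
remaining = 5 - new_df.people.cumsum().shift(1).fillna(0)
new_df['people'] = new_df['people'].clip(upper=remaining).astype(int)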
>>> new_df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       2
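As a side note, the same result can probably be reached by a different route: cap the running total at the threshold and take row-to-row differences. This is not the approach used above, only a sketch of an alternative; it assumes all `people` values are positive, and the names `capped_total` and `contribution` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'id': ['@1', '@2', '@3', '@4', '@5'],
                   'people': [2, 1, 4, 1, 1]})
threshold = 5

# Cap the running total at the threshold, then take differences to get
# each row's contribution to the capped total.
capped_total = df['people'].cumsum().clip(upper=threshold)
contribution = capped_total.diff().fillna(capped_total).astype(int)

# Rows that contribute nothing lie past the threshold (assumes positive values).
keep = contribution > 0
result = df[keep].assign(people=contribution[keep])
print(result)
```

The first row has no predecessor, so its difference is filled from the capped total itself.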
Extracting the rows
# Get cumulative sum for `people`
>>> df.people.cumsum()
Out[]:
0 2
1 3
2 7
3 8
4 9
5 12
6 15
7 19
Name: people, dtype: int64
# Shift by 1 to include border value
>>> df.people.cumsum().shift(1)
Out[]:
0 NaN
1 2.0
2 3.0
3 7.0
4 8.0
5 9.0
6 12.0
7 15.0
Name: people, dtype: float64
# Fill `NaN` with 0 and create `bool` array with `< 5`
# this gives the index of rows to be extracted
>>> df.people.cumsum().shift(1).fillna(0) < 5
Out[]:
0 True
1 True
2 True
3 False
4 False
5 False
6 False
7 False
Name: people, dtype: bool
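To see why the shift matters, the shifted mask can be compared with a plain `cumsum() <= 5` mask. This is a small sketch that reuses the `people` values from the question; `plain_mask` and `shifted_mask` are illustrative names only.

```python
import pandas as pd

people = pd.Series([2, 1, 4, 1, 1, 3, 3, 4], name='people')
threshold = 5

running = people.cumsum()

# Without the shift, the row that crosses the threshold is dropped.
plain_mask = running <= threshold

# With the shift, each row is tested against the total before it,
# so the crossing row is kept and can be clipped afterwards.
shifted_mask = running.shift(1).fillna(0) < threshold

print(people[plain_mask].tolist())    # rows kept without the shift
print(people[shifted_mask].tolist())  # rows kept with the shift
```

The plain mask keeps only the first two rows (total 3), while the shifted mask also keeps the third row, which is the one that then gets trimmed.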
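Then, to limit the last value so that the total stays less than or equal to the threshold, `clip_upper` (`clip(upper=...)` in pandas 1.0 and later) is used together with the shifted cumulative sum of `new_df.people`.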
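Using `clip` instead of `5 - new_df.iloc[-1].people.sum()` also covers the case where the sum of `people` never reaches `5`: every row then fits within its remaining budget, so no value is altered.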
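Both cases can be checked with a small helper. This is only a sketch and not part of the answer as posted: `take_up_to` is a hypothetical name, and it relies on `Series.clip(upper=...)` rather than `clip_upper`.

```python
import pandas as pd

def take_up_to(df, column, limit):
    """Return the leading rows with the `column` total capped at `limit` (hypothetical helper)."""
    # Running total before each row (0 for the first row).
    before = df[column].cumsum().shift(1).fillna(0)
    out = df[before < limit].copy()
    # Budget still available when each kept row is reached.
    remaining = limit - before[out.index]
    out[column] = out[column].clip(upper=remaining).astype(df[column].dtype)
    return out

df = pd.DataFrame({'id': ['@1', '@2', '@3', '@4'],
                   'people': [2, 1, 4, 1]})

capped = take_up_to(df, 'people', 5)       # threshold crossed at the third row
untouched = take_up_to(df, 'people', 100)  # threshold never reached
print(capped['people'].tolist())
print(untouched['people'].tolist())
```

With a limit of 5 the third row is trimmed from 4 to 2; with a limit of 100 all rows come back unchanged.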
Note: the `inplace` parameter of `pandas.clip_upper` is new in version 0.21.
Edit: fixed `clip_upper` not working properly, as well as the pandas warning.