How to get DataFrame rows up to the point where a column's running total reaches a value

Time: 2017-11-04 13:39:12

Tags: python pandas

I have a pandas DataFrame:

import pandas as pd
df = pd.DataFrame([['@1','A',2],['@2','A',1],['@3','A',4],['@4','B',1],
                   ['@5','B',1],['@6','B',3],['@7','B',3],['@8','C',4]],
                  columns=['id','channel','people'])

   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4
3  @4       B       1
4  @5       B       1
5  @6       B       3
6  @7       B       3
7  @8       C       4

I want to take rows from the top such that the sum of the people column does not exceed a given value.

My code is:

num = 5           # the sum of the 'people' column must be <= num
rows = []         # renamed from `list` to avoid shadowing the builtin

for i in range(len(df)):
    num = num - df.loc[i, 'people']
    if num > 0:
        rows.append(df.loc[i].copy(deep=True))
    elif num == 0:
        rows.append(df.loc[i].copy(deep=True))
        break
    else:
        # This row overshoots the limit: keep it, but trim its value
        rows.append(df.loc[i].copy(deep=True))
        rows[-1]['people'] = num + df.loc[i, 'people']
        break
dfnew = pd.DataFrame(rows, columns=df.columns)

   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       2

But I feel that what I wrote is too complicated.

Can you suggest a better algorithm?

Thanks

1 answer:

Answer 0: (score: 1)

Solution

df = pd.DataFrame([['@1','A',2],['@2','A',1],['@3','A',4],['@4','B',1],
                   ['@5','B',1],['@6','B',3],['@7','B',3],['@8','C',4]],
                  columns=['id','channel','people'])

>>> df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4
3  @4       B       1
4  @5       B       1
5  @6       B       3
6  @7       B       3
7  @8       C       4

# Get rows including the one that goes beyond the threshold
new_df = df[df.people.cumsum().shift(1).fillna(0) < 5].copy()

>>> new_df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       4

# Limit value of last row to match threshold condition
new_df.loc[:, 'people'].clip_upper(5 - new_df.people.cumsum().shift(1).fillna(0),
                                   inplace=True)

>>> new_df
Out[]:
   id channel  people
0  @1       A       2
1  @2       A       1
2  @3       A       2
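
On recent pandas versions the snippet above no longer runs as written: clip_upper was deprecated in pandas 0.24 and removed in 1.0, where clip(upper=...) takes its place. The following is a minimal sketch of the same idea for current pandas, not the answer's original code. The helper name take_until and its parameters are hypothetical, and it assumes the column holds non-negative numbers so that the running total never decreases.

```python
import pandas as pd


def take_until(df, column, limit):
    """Hypothetical helper: keep leading rows until `column` sums to `limit`.

    The row that crosses the limit is kept, but its value is trimmed so the
    total equals `limit`. Assumes non-negative values in `column`.
    """
    # Running total *before* each row (0 for the first row).
    before = df[column].cumsum().shift(1, fill_value=0)
    out = df[before < limit].copy()
    # Each kept row may contribute at most what is still left of the limit.
    out[column] = out[column].clip(upper=limit - before[out.index])
    return out


df = pd.DataFrame([['@1', 'A', 2], ['@2', 'A', 1], ['@3', 'A', 4], ['@4', 'B', 1],
                   ['@5', 'B', 1], ['@6', 'B', 3], ['@7', 'B', 3], ['@8', 'C', 4]],
                  columns=['id', 'channel', 'people'])

new_df = take_until(df, 'people', 5)
print(new_df)
```

Assigning the clipped column back, instead of calling a method with inplace=True on a .loc selection, also avoids modifying a temporary object, which is one common source of pandas warnings.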

How it works

Extracting the rows

# Get cumulative sum for `people`
>>> df.people.cumsum()
Out[]:
0     2
1     3
2     7
3     8
4     9
5    12
6    15
7    19
Name: people, dtype: int64

# Shift by 1 to include border value
>>> df.people.cumsum().shift(1)
Out[]:
0     NaN
1     2.0
2     3.0
3     7.0
4     8.0
5     9.0
6    12.0
7    15.0
Name: people, dtype: float64

# Fill `NaN` with 0 and build a boolean mask with `< 5`;
# the mask marks the rows to be extracted
>>> df.people.cumsum().shift(1).fillna(0) < 5
Out[]:
0     True
1     True
2     True
3    False
4    False
5    False
6    False
7    False
Name: people, dtype: bool
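
To see what the shift contributes, the sketch below builds the mask with and without it on the same people values. The variable names (strict, inclusive, limit) are illustrative and not part of the answer.

```python
import pandas as pd

people = pd.Series([2, 1, 4, 1, 1, 3, 3, 4], name='people')
limit = 5

total = people.cumsum()

# Without the shift: a row is selected only while the running total,
# including that row, is still below the limit.
strict = total < limit

# With the shift: the comparison uses the running total *before* each row,
# so the row that crosses the limit is still selected.
inclusive = total.shift(1).fillna(0) < limit

print(people[strict].tolist())
print(people[inclusive].tolist())
```

The unshifted mask stops one row early, which is why the answer shifts before comparing and then trims the last selected row.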

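Then, to keep the last value less than or equal to the threshold, clip_upper is applied with the shifted cumulative sum of new_df.people. Using clip rather than 5 - new_df.iloc[-1].people.sum() also handles the case where the sum of people never reaches 5.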
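
The case where the total never reaches the threshold can be checked with a small example. This is a sketch under assumptions: the two-row frame small is made up for illustration, clip(upper=...) stands in for clip_upper so it runs on current pandas, and the "overwrite" variant is my reading of the alternative mentioned above (setting the last row to the threshold minus the sum of the rows before it).

```python
import pandas as pd

# Hypothetical data whose total (3) stays below the threshold (5).
small = pd.DataFrame({'id': ['@1', '@2'], 'people': [2, 1]})
limit = 5

# Running total before each row.
before = small['people'].cumsum().shift(1, fill_value=0)

# Clipping leaves every row untouched, because no row exceeds what is left.
clipped = small['people'].clip(upper=limit - before)

# Overwriting the last row with "limit minus the rest" would inflate it.
overwritten = small['people'].copy()
overwritten.iloc[-1] = limit - small['people'].iloc[:-1].sum()

print(clipped.tolist())
print(overwritten.tolist())
```

Clipping only ever lowers a value, so rows that already fit are left alone, whereas the overwrite forces the total up to the threshold.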

Note: the inplace argument of pandas.clip_upper is new in version 0.21.

Edit

Fixed clip_upper not working properly, along with the pandas warning.