Question

我有一个我想要遵循的相当具体的算法。

基本上我有一个如下数据框：

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       12
4       1       2       59
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       45
9       3       1       43
10      3       2       43
11      4       1       23
12      4       2       94

我想做到这一点，以便得分＆＃39;在连续= = 2的情况下，将列更改为100，直到该月末。因此，并非所有出现的= = 2都将其得分设置为100，如果在该月期间的任何一天取得== 1。

所以我想要的结果是：

        month   taken   score
1       1       2       23
2       1       1       34
3       1       2       100
4       1       2       100
5       2       1       12
6       2       2       23
7       2       1       43
8       2       2       100
9       3       1       43
10      3       2       43
11      3       1       23
12      3       2       100
13      4       1       32
14      4       2       100

我已经编写了这段代码，我觉得应该这样做：

#iterate through months
for month in range(12):
    #iterate through scores
    for score in range(len(df_report.loc[df_report['month'] == month+1])):
        #starting from the bottom, of that month, if 'taken' == 2...
        if df_report.loc[df_report.month==month+1, 'taken'].iloc[-score-1] == 2:
            #then set the score to 100
            df_report.loc[df_report.month==month+1, 'score'].iloc[-score-2] = 100
        #if you run into a 'taken' == 1, move on to next month
        else: break

但是，即使没有出现错误，这似乎也不会改变任何值...它也没有给我一个关于为复制的数据帧设置值的错误。

有人能解释我做错了吗？

Answer 1

您的值未更新的原因是，iloc的分配更新了前一个loc调用返回的副本，因此未触及原始内容。

以下是我如何解决这个问题。首先，定义一个函数foo。

def foo(df):
    for i in reversed(df.index):
        if df.loc[i, 'taken'] != 2:
            break
        df.loc[i, 'score'] = 100
        i -= 1
    return df

现在，groupby month并致电foo：

df = df.groupby('month').apply(foo)
print(df) 
    month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

显然，apply有它的缺点，但我想不出这个问题的矢量化方法。

Answer 2

你可以做到

import numpy as np
def get_value(x):
    s = x['taken']
    # Get a mask of duplicate sequeence and change values using np.where
    mask = s.ne(s.shift()).cumsum().duplicated(keep=False)
    news = np.where(mask,100,x['score'])

    # if last number is 2 then change the news value to 100
    if s[s.idxmax()] == 2: news[-1] = 100 
    return pd.Series(news)

df['score'] = df.groupby('month').apply(get_value).values

输出：

   month  taken  score
1       1      2     23
2       1      1     34
3       1      2    100
4       1      2    100
5       2      1     12
6       2      2     23
7       2      1     43
8       2      2    100
9       3      1     43
10      3      2    100
11      4      1     23
12      4      2    100

几乎相同的速度，但@coldspeed是赢家

ndf = pd.concat([df]*10000).reset_index(drop=True)

%%timeit
ndf['score'] = ndf.groupby('month').apply(foo)
10 loops, best of 3: 40.8 ms per loop


%%timeit  
ndf['score'] = ndf.groupby('month').apply(get_value).values
10 loops, best of 3: 42.6 ms per loop

根据行和列条件设置pandas数据帧值

2 个答案: