Python: maximum count of consecutive days

Time: 2017-12-12 08:34:48

Tags: python-2.7 pandas

I have an input file:

ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300

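I am trying to find the maximum number of consecutive days on which FEES is 0 for a particular ROLL_NO. If a ROLL_NO has no days with FEES equal to zero, its maximum count should be zero.

Expected output: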

ID,ROLL_NO,MAX_CNT --  First occurrence of ID for a particular ROLL_NO should come as ID in output 
1,12345,3
3,987654,0

This is what I have come up with so far:

import pandas as pd

df = pd.read_csv('I5.txt')
df['COUNT'] = df.groupby(['ROLL_NO','ADM_DATE'])['ROLL_NO'].transform(pd.Series.value_counts)
print df

But I don't believe this is the right way to approach the problem.

Can someone please help a Python newbie here?

1 answer:

Answer 0: (score: 1)

You can use:

#consecutive groups
r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()
print (a)
ID
1     1
2     1
3     1
4     2
5     2
6     3
7     4
8     4
9     4
10    5
11    5
dtype: int32

#keep rows with 0 FEES, count each run, get max per first level, then add missing ROLL_NO by reindex
mask = df['FEES'].eq(0)
df = (df[mask].groupby(['ROLL_NO',a[mask]])
              .size()
              .max(level=0)
              .reindex(df['ROLL_NO'].unique(), fill_value=0)
              .reset_index(name='MAX_CNT'))
print (df)

   ROLL_NO  MAX_CNT
0    12345        3
1   987654        0
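For anyone who wants to run the snippet above end to end, here is a minimal self-contained sketch. It assumes the input is inlined as a string instead of being read from `I5.txt`, wraps the steps in a hypothetical helper named `max_zero_run`, and uses `.groupby(level=0).max()` in place of `.max(level=0)`, since the `level` argument is not available in newer pandas releases.

```python
import io
import pandas as pd

# Sample input from the question, inlined so the sketch is self-contained.
CSV = """ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
"""


def max_zero_run(df):
    """Longest run of adjacent zero-FEES rows per ROLL_NO (hypothetical helper)."""
    mask = df['FEES'].eq(0)
    r = df['ROLL_NO'] * mask              # ROLL_NO where FEES is 0, else 0
    a = r.ne(r.shift()).cumsum()          # label runs of equal adjacent values
    return (df[mask].groupby(['ROLL_NO', a[mask]])
                    .size()               # length of each zero run
                    .groupby(level=0).max()
                    .reindex(df['ROLL_NO'].unique(), fill_value=0)
                    .reset_index(name='MAX_CNT'))


df = pd.read_csv(io.StringIO(CSV))
result = max_zero_run(df)
print(result)
```

Note that this counts adjacent rows in file order, so it relies on the rows of each ROLL_NO appearing in date order, as they do in the sample.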

Explanation:

First compare the FEES column with 0 (eq is the same as ==) and multiply the resulting mask by the ROLL_NO column:

mask = df['FEES'].eq(0)
r = df['ROLL_NO'] * mask
print (r)
0         0
1         0
2         0
3     12345
4     12345
5         0
6     12345
7     12345
8     12345
9         0
10        0
dtype: int64
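To see why the multiplication works, here is a tiny sketch on made-up values (the two Series below are illustrative, not the question's data). A boolean mask behaves like 1 and 0 in arithmetic, so the product keeps ROLL_NO where FEES is 0 and gives 0 everywhere else. This trick assumes ROLL_NO is numeric and never 0 itself.

```python
import pandas as pd

# Illustrative values only.
roll = pd.Series([12345, 12345, 987654, 12345])
fees = pd.Series([500, 0, 0, 0])

mask = fees.eq(0)      # same result as fees == 0
r = roll * mask        # True acts as 1, False as 0

print(mask.tolist())   # [False, True, True, True]
print(r.tolist())      # [0, 12345, 987654, 12345]
```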

Get the consecutive groups by comparing r with its shifted version (ne is the same as !=) and then taking cumsum:

a = r.ne(r.shift()).cumsum()
print (a)
0     1
1     1
2     1
3     2
4     2
5     3
6     4
7     4
8     4
9     5
10    5
dtype: int32
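The shift and cumsum step can be checked on a short made-up Series (the values are illustrative). Each row is compared with the previous one; a change marks the start of a new run, and the cumulative sum of those marks gives every run its own label.

```python
import pandas as pd

# Illustrative values only: two runs of 0 and two runs of 7.
r = pd.Series([0, 0, 7, 7, 0, 7, 7, 7])

# True where the value differs from the previous row.
# The first row is compared with NaN, so it is always True.
changed = r.ne(r.shift())

# Summing the True flags numbers the runs 1, 2, 3, ...
run_id = changed.cumsum()

print(changed.tolist())  # [True, False, True, False, True, True, False, False]
print(run_id.tolist())   # [1, 1, 2, 2, 3, 4, 4, 4]
```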

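Filter only the rows where FEES is 0, group by ROLL_NO together with a (filtered by the same mask so the indexes match), and get the size of each group: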

print (df[mask].groupby(['ROLL_NO',a[mask]]).size())
ROLL_NO   
12345    2    2
         4    3
dtype: int64

Get the max value per first level of the MultiIndex:

print (df[mask].groupby(['ROLL_NO',a[mask]]).size().max(level=0))
ROLL_NO
12345    3
dtype: int64
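If `.max(level=0)` raises an error on your installation (as far as I know the `level` argument was deprecated and later removed in pandas 2.0), grouping by the index level should give the same result. A small sketch, rebuilding the sizes from the output above by hand; the level name `run` is a made-up label for illustration:

```python
import pandas as pd

# Run sizes from the previous step, rebuilt by hand.
sizes = pd.Series(
    [2, 3],
    index=pd.MultiIndex.from_tuples(
        [(12345, 2), (12345, 4)], names=['ROLL_NO', 'run']
    ),
)

# Max per first index level, without the level= argument.
per_roll = sizes.groupby(level=0).max()
print(per_roll)
```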

Finally, add the missing ROLL_NO values with 0 by reindex:

print (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0))
ROLL_NO
12345     3
987654    0
dtype: int64

and use reset_index to turn the index into a column.
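A short sketch of these last two steps in isolation, using hand-built values: reindex adds every ROLL_NO that had no zero-FEES rows and fills it with 0, and reset_index moves the index into a regular column while naming the values MAX_CNT.

```python
import pandas as pd

# Result of the max step: only 12345 has zero-FEES rows.
per_roll = pd.Series([3], index=pd.Index([12345], name='ROLL_NO'))

# All ROLL_NO values in order of first appearance, as .unique() returns them.
all_rolls = pd.Series([12345, 12345, 987654, 12345]).unique()

full = per_roll.reindex(all_rolls, fill_value=0)   # missing labels get 0
out = full.reset_index(name='MAX_CNT')             # index becomes a column

print(out)
```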

EDIT:

To get the first ID of each ROLL_NO, use drop_duplicates with insert and map:

r = df['ROLL_NO'] * df['FEES'].eq(0)
a = r.ne(r.shift()).cumsum()

s = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']

mask = df['FEES'].eq(0)
df1 = (df[mask].groupby(['ROLL_NO',a[mask]])
               .size()
               .max(level=0)
               .reindex(df['ROLL_NO'].unique(), fill_value=0)
               .reset_index(name='MAX_CNT'))

df1.insert(0, 'ID', df1['ROLL_NO'].map(s))
print (df1)
   ID  ROLL_NO  MAX_CNT
0   1    12345        3
1   3   987654        0
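One caveat: the code above counts adjacent rows, so it depends on the file being in date order with no missing days. If that cannot be guaranteed, a variant along the following lines may be more robust. This is only a sketch and not part of the original answer: it assumes ADM_DATE is in dd/mm/yyyy format, the helper name `max_zero_streak_by_date` is made up, and it starts a new run whenever the zero flag changes or the gap to the previous date of the same ROLL_NO is not exactly one day.

```python
import io
import pandas as pd

# Sample input from the question, inlined so the sketch is self-contained.
CSV = """ID,ROLL_NO,ADM_DATE,FEES
1,12345,01/12/2016,500
2,12345,02/12/2016,200
3,987654,01/12/2016,1000
4,12345,03/12/2016,0
5,12345,04/12/2016,0
6,12345,05/12/2016,100
7,12345,06/12/2016,0
8,12345,07/12/2016,0
9,12345,08/12/2016,0
10,987654,02/12/2016,150
11,987654,03/12/2016,300
"""


def max_zero_streak_by_date(df):
    """Longest streak of calendar-consecutive zero-FEES days per ROLL_NO."""
    d = df.copy()
    # Assumption: dates are day/month/year.
    d['ADM_DATE'] = pd.to_datetime(d['ADM_DATE'], format='%d/%m/%Y')
    d = d.sort_values(['ROLL_NO', 'ADM_DATE'])

    zero = d['FEES'].eq(0)
    # Gap to the previous date within the same ROLL_NO (NaT on its first row).
    day_gap = d.groupby('ROLL_NO')['ADM_DATE'].diff()
    # New run when the gap is not one day or the zero flag flips.
    new_run = day_gap.ne(pd.Timedelta(days=1)) | zero.ne(zero.shift())
    run_id = new_run.cumsum()

    # First ID per ROLL_NO in original file order, as in the EDIT above.
    first_id = df.drop_duplicates('ROLL_NO').set_index('ROLL_NO')['ID']

    out = (d[zero].groupby(['ROLL_NO', run_id[zero]])
                  .size()
                  .groupby(level=0).max()
                  .reindex(df['ROLL_NO'].unique(), fill_value=0)
                  .reset_index(name='MAX_CNT'))
    out.insert(0, 'ID', out['ROLL_NO'].map(first_id))
    return out


df = pd.read_csv(io.StringIO(CSV))
df1 = max_zero_streak_by_date(df)
print(df1)
```

On the sample data this gives the same ID, ROLL_NO and MAX_CNT values as the expected output in the question; the two approaches only differ when rows are out of order or days are missing.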
