Question

I have the following dataframe:

import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}

df = pd.DataFrame(data=d)

If I run the following code:

df = (df.groupby(['ptin', 'tin', 'year'])
        .apply(lambda x: x['tin'].isin(x['ptin']).astype(int).sum())
        .reset_index(name='matches'))
df

I get the following result:

    ptin    tin   year   matches
0   12      3.0   1999   0
1   12      3.0   2001   0
2   22      1.0   2002   0
3   23      1.0   2002   0

This gives me the tin values that match ptin, grouped by year.

Now I want to find the last occurrence of a given tin; for example, for tin == 12 I should get 2001. I want to add that as a column, and also add the difference between 1999 and 2001 (which is two) as a separate column, so that my result looks like this:

    ptin    tin   year   matches    lastoccurence   length 
0   12      3.0   1999   0            0               0
1   12      3.0   2001   0            2001            2
2   22      1.0   2002   0            2002            1
3   23      1.0   2002   0            2002            1

Any help would be appreciated. A solution in either pandas or SQL would be fine.

Answer 1

I think this will do the trick (at least partially?):

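# Difference in years between consecutive rows within each ptin group
df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
# Drop the first row of each group, which has no earlier year to compare with
df = df.dropna(subset=['duration'])
print(df)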

     ptin  tin  year  matches  duration
2    12    12  2001        1       2.0
3    12    30  2004        0       3.0
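
To also get the last year per tin and the span the question calls length, one possible sketch is a groupby with transform. It starts from the original, ungrouped dataframe in the question, and it assumes that length means the last year minus the first year for each tin, which the question does not state precisely. The column names lastoccurence and length are copied from the desired output in the question.

```python
import pandas as pd

d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
     'tin': [12, 23, 24, 28, 30, 12, 7],
     'ptin': [12, 23, 28, 22, 12, 12, 0]}
df = pd.DataFrame(data=d)

# Group the year column by tin once and reuse the grouping
by_tin = df.groupby('tin')['year']

# Latest year in which each tin appears, broadcast back to every row
df['lastoccurence'] = by_tin.transform('max')

# Assumption: length = last year minus first year for the same tin
df['length'] = df['lastoccurence'] - by_tin.transform('min')

print(df.sort_values(['tin', 'year']))
```

With this data, the rows with tin == 12 get lastoccurence 2001 and length 2, and a tin that appears only once gets length 0. If the first row of each group should show 0 instead, as in the desired output, that would need an extra step.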
