In the following dataframe
d = {'year': [2001, 2002, 2005, 2002, 2004, 1999, 1890],
'tin': [12, 23, 24, 28,30, 12,7],
'ptin': [12, 23, 28, 22, 12, 12,0] }
df = pd.DataFrame(data=d)
If I run following code:
df = (df.groupby(['ptin', 'tin', 'year'])
.apply(lambda x : x['tin'].isin(x['ptin']).astype(int).sum())
.reset_index(name='matches'))
df
I get following result
ptin tin year matches
0 12 3.0 1999 0
1 12 3.0 2001 0
2 22 1.0 2002 0
3 23 1.0 2002 0
This gives me the matching tin to ptin and groups by year.
Now if I want to find the last occurence of say for example tin == 12, I should get 2001. I want add that column as well as difference between 1999 and 2001, which is two in different column, such that my answer looks like below
ptin tin year matches lastoccurence length
0 12 3.0 1999 0 0 0
1 12 3.0 2001 0 2001 2
2 22 1.0 2002 0 2002 1
3 23 1.0 2002 0 2002 1
Any help would be appreciated. I could take solution in either pandas or SQL if that is possible.
答案 0 :(得分:0)
我认为这会做魔术(至少是部分?):
df['duration'] = df.sort_values(['ptin','year']).groupby('ptin')['year'].diff()
df = df.dropna(subset=['duration'])
print (df)
ptin tin year matches duration
2 12 12 2001 1 2.0
3 12 30 2004 0 3.0