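Suppose each individual has several records, and each record has a date. I want to build a column that gives, per individual, the number of other records that are less than 2 months older. That is, I look only at the records of, say, individual 'A', and loop over his/her records to see whether there are other records of individual 'A' that are less than two months older than the current row/record.
Let's look at some test data to make this clearer: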
import numpy as np
import pandas as pd

testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': ['2017-04-28', '2017-04-05', '2017-08-05',
             '2016-02-01', '2016-02-05', '2017-10-05'] })
testdf.date = pd.to_datetime(testdf.date)
I'll add the expected counts column:
testdf['expected'] = [1, 0, 0, 0, 1, 0]
#Gives:
        date  id_indiv  id_record  expected
0 2017-04-28         1         12         1
1 2017-04-05         1         13         0
2 2017-08-05         1         14         0
3 2016-02-01         2         19         0
4 2016-02-05         2         20         1
5 2017-10-05         2         23         0
My first thought was to group by id_indiv and then use apply or transform with a custom function. To make things easier, I first add a variable that subtracts two months from the record date, and then write the count_months custom function for apply or transform:

testdf['2M_before'] = testdf['date'] - pd.Timedelta('{0}D'.format(30*2))

def count_months(chunk, month_var='2M_before'):
    counts = np.empty(len(chunk))
    for i, (ind, row) in enumerate(chunk.iterrows()):
        # Count records newer than two months before the current one
        # but strictly older than the current one
        counts[i] = ((chunk.date > row[month_var])
                     & (chunk.date < row.date)).sum()
    return counts

I first tried transform:

testdf.groupby('id_indiv').transform(count_months)

but it gives AttributeError: ("'Series' object has no attribute 'iterrows'", 'occurred at index date'), which I guess means that transform passes a Series object to the custom function, but I don't know how to fix that.
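One possible way around the error, offered only as a sketch: select the date column before calling transform, so the custom function receives one individual's dates as a Series and never needs iterrows. The names count_recent and WINDOW are hypothetical, and the 60-day window is an assumption carried over from the 30*2 approximation above.

```python
import pandas as pd

testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': pd.to_datetime(['2017-04-28', '2017-04-05', '2017-08-05',
                            '2016-02-01', '2016-02-05', '2017-10-05']),
})

# Assumption: "two months" is approximated as 60 days, as in the question.
WINDOW = pd.Timedelta(days=60)

def count_recent(dates):
    """Hypothetical helper. `dates` is a Series with one individual's dates."""
    # For each date, count the same individual's records that fall strictly
    # inside the 60 days before it.
    counts = [int(((dates > d - WINDOW) & (dates < d)).sum()) for d in dates]
    return pd.Series(counts, index=dates.index)

# Selecting ['date'] makes transform call the function once per individual
# with a Series, and the result is aligned back to the original rows.
testdf['mycount'] = testdf.groupby('id_indiv')['date'].transform(count_recent)
print(testdf)
```

Because transform aligns the result on the original index, no sorting or manual pasting of the new column is needed in this variant.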
Then I tried apply:

testdf.groupby('id_indiv').apply(count_months)
#Gives
id_indiv
1    [1.0, 0.0, 0.0]
2    [0.0, 1.0, 0.0]
dtype: object

This almost works, but it returns the result as one list per individual. To "unstack" those lists, I followed an answer on this question:

#First sort, just in case the order gets messed up when pasting back:
testdf = testdf.sort_values(['id_indiv', 'id_record'])
counts = (testdf.groupby('id_indiv').apply(count_months)
          .apply(pd.Series).stack()
          .reset_index(level=1, drop=True))
#Now create the new column
testdf.set_index('id_indiv', inplace=True)
testdf['mycount'] = counts.astype('int')
assert (testdf.expected == testdf.mycount).all()
#df now looks like this
               date  id_record  expected  2M_before  mycount
id_indiv
1        2017-04-28         12         1 2017-02-27        1
1        2017-04-05         13         0 2017-02-04        0
1        2017-08-05         14         0 2017-06-06        0
2        2016-02-01         19         0 2015-12-03        0
2        2016-02-05         20         1 2015-12-07        1
2        2017-10-05         23         0 2017-08-06        0

This seems to work, but it feels like there should be a much simpler way. Also, pasting the column back like this doesn't seem very robust.
Thanks for your time!
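Answer 0 (score: 1)
Here's one way to do it that counts all records strictly more recent than 2 months, using a lookback window of exactly two calendar months minus 1 day (as opposed to an approximate 2-month window of 60 days or some other length).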
# imports and setup
import pandas as pd
testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': ['2017-04-28', '2017-04-05', '2017-08-05',
             '2016-02-01', '2016-02-05', '2017-10-05'] })
# more setup
testdf['date'] = pd.to_datetime(testdf['date'])
testdf.set_index('date', inplace=True)
testdf.sort_index(inplace=True) # required for the index-slicing below
# solution
count_recent_records = lambda x: [x.loc[d - pd.DateOffset(months=2, days=-1):d].count() - 1 for d in x.index]
testdf['mycount'] = testdf.groupby('id_indiv').transform(count_recent_records)
# output
testdf
            id_indiv  id_record  mycount
date
2016-02-01         2         19        0
2016-02-05         2         20        1
2017-04-05         1         13        0
2017-04-28         1         12        1
2017-08-05         1         14        0
2017-10-05         2         23        0
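To make the difference between the two window definitions concrete, here is a small sketch comparing the calendar-based start used in this answer with the fixed 60-day start used in the question. The variable names are illustrative only.

```python
import pandas as pd

d = pd.Timestamp('2017-04-28')

# Calendar-based window start: two calendar months back, minus one day.
calendar_start = d - pd.DateOffset(months=2, days=-1)

# Fixed-length approximation from the question: 60 days back.
fixed_start = d - pd.Timedelta(days=60)

print(calendar_start.date())  # 2017-03-01
print(fixed_start.date())     # 2017-02-27
```

For this date the two windows start two days apart, so a record dated 2017-02-28 would be counted under the 60-day rule but not under the calendar rule.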
Answer 1 (score: 0)
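testdf = testdf.sort_values('date')
out_df = pd.DataFrame()
# For each individual, walk over every date in the frame and count that
# individual's records from the preceding 60 days.
for i in testdf.id_indiv.unique():
    for d in testdf.date:
        date_diff = (d - testdf.loc[testdf.id_indiv == i, 'date']).dt.days
        out_dict = {'person': i,
                    'entry_date': d,
                    'count': sum((date_diff > 0) & (date_diff <= 60))}
        # DataFrame.append was removed in pandas 2.0; use pd.concat there.
        out_df = out_df.append(out_dict, ignore_index=True)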
out_df
    count entry_date  person
0     0.0 2016-02-01     2.0
1     1.0 2016-02-05     2.0
2     0.0 2017-04-05     2.0
3     0.0 2017-04-28     2.0
4     0.0 2017-08-05     2.0
5     0.0 2017-10-05     2.0
6     0.0 2016-02-01     1.0
7     0.0 2016-02-05     1.0
8     0.0 2017-04-05     1.0
9     1.0 2017-04-28     1.0
10    0.0 2017-08-05     1.0
11    0.0 2017-10-05     1.0
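The loop above pairs every individual with every date in the frame, which is why the output has 12 rows. A variant that only visits each individual's own dates, and that avoids DataFrame.append, might look like the sketch below. It keeps this answer's 60-day rule (inclusive upper bound); the rows list is a name introduced here for illustration.

```python
import pandas as pd

testdf = pd.DataFrame({
    'id_indiv': [1, 1, 1, 2, 2, 2],
    'id_record': [12, 13, 14, 19, 20, 23],
    'date': pd.to_datetime(['2017-04-28', '2017-04-05', '2017-08-05',
                            '2016-02-01', '2016-02-05', '2017-10-05']),
})

rows = []  # collect plain dicts, build the frame once at the end
for person, group in testdf.groupby('id_indiv'):
    for d in group['date']:
        # Days between the current date and each of this person's records
        date_diff = (d - group['date']).dt.days
        rows.append({'person': person,
                     'entry_date': d,
                     'count': int(((date_diff > 0) & (date_diff <= 60)).sum())})

out_df = pd.DataFrame(rows)
print(out_df)
```

This yields one row per original record, with integer counts instead of floats.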