My DataFrame looks like this:
respondent_id,group_number,member_id
1,1,3
1,1,4
1,2,1
....
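My goal is to output two counts for each respondent ID: the number of groups that include the respondent as a member_id, and the number of groups that do not.
For example, the table above would output: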
respondent_id,my_groups,other_groups
1,1,1
My best guess is to do something like:
rg_g = df.groupby(['respondent_id','group_number'])
rg_g.apply(lambda g: g.respondent_id in g.id.values)
But I don't know where to go from there.
Answer 0 (score: 1)
Updated answer (not the best code, but it works):
Initialization:
import numpy as np
import pandas as pd
test_data = pd.DataFrame(np.random.randint(5, size=(10, 3)), columns=['respondent_id','group_number','member_id'])
# member_id must be float to hold missing values
test_data['member_id'] = test_data['member_id'].astype(float)
# blank out some member_id values; row labels only run 0..9, so label 10 is omitted
test_data.loc[[3, 5, 7, 8, 9], 'member_id'] = np.nan
Code:
# split the rows by whether the respondent has a member_id
d_nn = test_data[test_data.member_id.notnull()]
# or for example: test_data[test_data.member_id != 0]
d_is_n = test_data[test_data.member_id.isnull()]
# count rows per (respondent_id, group_number) in each subset
d_nn = pd.DataFrame({'count' : d_nn.groupby( [ "respondent_id","group_number"] ).size()}).reset_index()
d_is_n = pd.DataFrame({'count' : d_is_n.groupby( [ "respondent_id","group_number"] ).size()}).reset_index()
d_nn['is_member'] = 1
d_is_n['is_member'] = 0
# merge: keep a group from d_is_n only if it does not already appear in d_nn
result = d_nn.copy()
for idx1 in range(len(d_is_n)):
    merge = True
    for idx2 in range(len(d_nn)):
        if d_nn.iloc[idx2].respondent_id == d_is_n.iloc[idx1].respondent_id and \
           d_nn.iloc[idx2].group_number == d_is_n.iloc[idx1].group_number:
            merge = False
    if merge:
        temp_d = d_is_n.iloc[[idx1]]
        # DataFrame.append was removed in pandas 2.0; pd.concat replaces it
        result = pd.concat([result, temp_d], ignore_index=True)
# group by respondent_id and is_member
result = pd.DataFrame({'group_number' : result.groupby( [ "respondent_id", "is_member"] ).size()}).reset_index()
print(result)
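The nested loop in the merge step is effectively an anti-join: it keeps the rows of d_is_n whose (respondent_id, group_number) pair does not occur in d_nn. Here is a minimal sketch of that step using DataFrame.merge with indicator=True. The small d_nn and d_is_n frames below are hypothetical stand-ins made up for illustration, not data from the post.

```python
import pandas as pd

# Hypothetical stand-ins for d_nn and d_is_n (illustrative values only).
d_nn = pd.DataFrame({
    "respondent_id": [1, 1, 2],
    "group_number": [1, 2, 1],
    "count": [2, 1, 1],
    "is_member": [1, 1, 1],
})
d_is_n = pd.DataFrame({
    "respondent_id": [1, 2, 3],
    "group_number": [2, 3, 1],
    "count": [1, 1, 2],
    "is_member": [0, 0, 0],
})

keys = ["respondent_id", "group_number"]

# Left-join d_is_n against the key pairs of d_nn; "_merge" == "left_only"
# marks the pairs that are absent from d_nn.
marked = d_is_n.merge(d_nn[keys].drop_duplicates(), on=keys, how="left", indicator=True)
only_null = marked.loc[marked["_merge"] == "left_only"].drop(columns="_merge")

# Same result the loop builds: all of d_nn plus the unmatched d_is_n rows.
result = pd.concat([d_nn, only_null], ignore_index=True)

# Number of groups per respondent and is_member flag.
summary = result.groupby(["respondent_id", "is_member"]).size().reset_index(name="group_number")
print(summary)
```

This avoids the O(n*m) row-by-row comparison, which probably matters only once the frames get large.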
Answer 1 (score: 1)
So, this is what I ended up doing. Maybe not ideal, but it seems to work. :)
import pandas as pd
rg = pd.read_csv('./in_file.csv')
rg_g = rg.groupby(['respondent_id','group_number'])
# a group counts as "mine" if any member_id in it equals the respondent_id
in_g = rg_g.filter(lambda g: (g.member_id == g.respondent_id).any())
out_g = rg_g.filter(lambda g: not (g.member_id == g.respondent_id).any())
my_count = in_g.groupby('respondent_id').group_number.nunique().rename('my_groups')
other_count = out_g.groupby('respondent_id').group_number.nunique().rename('other_groups')
# respondents missing from one side get 0 instead of NaN
pd.concat([my_count,other_count], axis=1).fillna(0).astype(int).to_csv('./out_file.csv')
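For what it's worth, the same counts can probably be had without filter, by flagging each row and then reducing per group. This is only a sketch of an alternative, not what either answer above does. It uses just the three rows shown in the question (the elided rows are left out), and the names is_self, has_self and counts are made up for illustration.

```python
import pandas as pd

# Only the three rows visible in the question; the elided rows are not guessed.
df = pd.DataFrame({
    "respondent_id": [1, 1, 1],
    "group_number": [1, 1, 2],
    "member_id": [3, 4, 1],
})

# True on rows where the respondent is listed as the member.
is_self = df["member_id"].eq(df["respondent_id"])

# One boolean per (respondent_id, group_number): does the group contain the respondent?
has_self = is_self.groupby([df["respondent_id"], df["group_number"]]).any()

# Count groups with and without the respondent, per respondent.
counts = pd.DataFrame({
    "my_groups": has_self.groupby(level="respondent_id").sum(),
    "other_groups": (~has_self).groupby(level="respondent_id").sum(),
}).reset_index()
print(counts)
```

On these three rows it gives my_groups = 1 and other_groups = 1 for respondent 1, matching the expected output in the question. A respondent with no groups on one side gets 0 there rather than NaN.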