I am trying to extract information from duplicate rows.
import numpy as np
import pandas as pd

data = np.array([[100,1,0, 'GB'],[100,0,1, 'IT'],[101,1,0, 'CN'],[101,0,1, 'CN'],
                 [102,1,0, 'JP'],[102,0,1, 'CN'],[103,0,1, 'DE'],
                 [103,0,1, 'DE'],[103,1,0, 'VN'],[103,1,0, 'VN']])
df = pd.DataFrame(data, columns = ['wed_cert_id','spouse_1',
                                   'spouse_2', 'nationality'])
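I want to flag each wedding as cross-national or not. In my actual dataset, a marriage can have more than 2 spouses.
Or something like this:
I tried to filter the data with .duplicated(), and also tried negating .duplicated() with the not operator, but neither attempt worked: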
df = df.loc[df.wed_cert_id.duplicated(keep=False) ~df.nationality.duplicated(keep=False), :]
df = df.loc[df.wed_cert_id.duplicated(keep=False) not df.nationality.duplicated(keep=False), :]
Dropping duplicates removes too many observations. My dataset allows more than 2 spouses per wedding, which can produce duplicates:
df.drop_duplicates(subset=['wed_cert_id','nationality'], keep=False, inplace=True)
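To make the problem concrete, here is a minimal sketch of what that call does to the sample data. It rebuilds the frame from a plain list of lists (an assumption, so the numeric columns stay integers), and the names `rows`, `sample`, `kept` and `remaining_ids` are only illustrative:

```python
import pandas as pd

rows = [[100, 1, 0, 'GB'], [100, 0, 1, 'IT'], [101, 1, 0, 'CN'], [101, 0, 1, 'CN'],
        [102, 1, 0, 'JP'], [102, 0, 1, 'CN'], [103, 0, 1, 'DE'],
        [103, 0, 1, 'DE'], [103, 1, 0, 'VN'], [103, 1, 0, 'VN']]
sample = pd.DataFrame(rows, columns=['wed_cert_id', 'spouse_1',
                                     'spouse_2', 'nationality'])

# keep=False drops every row whose (wed_cert_id, nationality) pair
# occurs more than once, not just the extra copies.
kept = sample.drop_duplicates(subset=['wed_cert_id', 'nationality'], keep=False)

remaining_ids = sorted(kept['wed_cert_id'].unique().tolist())
print(remaining_ids)  # [100, 102]
```

Wedding 103 disappears entirely, even though it has two nationalities (DE and VN), because each of its nationalities appears twice.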
How can I do this?
Thanks in advance.
Answer 0: (score: 1)
I believe you need:
# 1 if the wedding has more than one distinct nationality, otherwise 0
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).astype('int8'))
print(df)
Or:
# Assumes spouse_1 and spouse_2 hold numeric values (e.g. after .astype(int))
df['cross_national'] = (df.groupby('wed_cert_id')['nationality']
                          .transform('nunique').gt(1).astype('int8')
                          .mul(df[['spouse_1','spouse_2']].prod(axis=1)))
print(df)
   wed_cert_id  spouse_1  spouse_2 nationality  cross_national
0          100         1         0          GB               1
1          100         0         1          IT               1
2          101         1         0          CN               0
3          101         0         1          CN               0
4          102         1         0          JP               1
5          102         0         1          CN               1
6          103         0         1          DE               1
7          103         0         1          DE               1
8          103         1         0          VN               1
9          103         1         0          VN               1
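For anyone who wants to reproduce this end to end, here is a self-contained sketch of the same idea. It counts distinct nationalities per wedding and flags any wedding with more than one. It builds the frame from a list of lists rather than `np.array` (an assumption, to keep the integer columns as integers), and the names `weddings`, `distinct` and `per_wedding` are only illustrative:

```python
import pandas as pd

rows = [[100, 1, 0, 'GB'], [100, 0, 1, 'IT'], [101, 1, 0, 'CN'], [101, 0, 1, 'CN'],
        [102, 1, 0, 'JP'], [102, 0, 1, 'CN'], [103, 0, 1, 'DE'],
        [103, 0, 1, 'DE'], [103, 1, 0, 'VN'], [103, 1, 0, 'VN']]
weddings = pd.DataFrame(rows, columns=['wed_cert_id', 'spouse_1',
                                       'spouse_2', 'nationality'])

# Number of distinct nationalities in each wedding, broadcast back to every row
distinct = weddings.groupby('wed_cert_id')['nationality'].transform('nunique')

# Row-level flag: 1 when the wedding has more than one nationality
weddings['cross_national'] = distinct.gt(1).astype('int8')

# Optional: one boolean per wedding instead of one flag per row
per_wedding = weddings.groupby('wed_cert_id')['nationality'].nunique().gt(1)

print(weddings['cross_national'].tolist())  # [1, 1, 0, 0, 1, 1, 1, 1, 1, 1]
print(per_wedding.to_dict())  # {100: True, 101: False, 102: True, 103: True}
```

Because `nunique` counts distinct values, this works regardless of how many spouses a wedding has or how often a nationality repeats within it.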