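I have two DataFrames with list columns, and I need to compare them to find out whether a row in one matches a row in the other, and then add the matching data.
The two datasets are:
Dataset 1
sex age zip_code weeks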
0 F 46 08204 [6, 40, 9, 44, 30]
1 F 40 08205 [33, 41, 48, 50, 30]
2 F 46 08206 [32, 33, 4, 37, 43, 21]
3 F 60 08304 [37, 39, 7, 42, 11, 49]
4 M 24 08507 [32, 15, 23, 25, 29]
5 M 42 08917 [9, 42, 41, 13, 45, 23]
6 F 50 08921 [10, 50, 29, 52]
7 F 37 10627 [41, 3, 29, 39]
Dataset 2
user_id weeks blood_levels
0 1101 [15, 23, 25, 29, 32] [126, 127, 120, 111, 107]
1 1122 [33, 41, 48, 50, 30] [70, 72, 69, 68, 74, 76, 72]
2 2112 [4, 10, 11, 16, 17, 18, 24, 31, 33, 36] [117, 96, 114, 99, 119, 100, 93, 87, 80, 74]
3 2200 [3, 5, 7, 14, 21, 22, 27, 28, 34] [152, 126, 165, 169, 167, 169, 140, 154, 157]
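What I want to check is whether the list in the "weeks" column of Dataset 2 appears among the lists in the "weeks" column of Dataset 1 (checking all rows). If it does, the "user_id" and "blood_levels" columns from Dataset 2 should be added to Dataset 1. So the expected output is: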
sex age zip_code weeks user_id blood_levels
0 F 46 08204 [6, 40, 9, 44, 30] 1101 [126, 127, 120, 111, 107]
1 F 40 08205 [33, 41, 48, 50, 30] 1122 [70, 72, 69, 68, 74, 76, 72]
2 F 46 08206 [32, 33, 4, 37, 43, 21]
3 F 60 08304 [37, 39, 7, 42, 11, 49]
4 M 24 08507 [32, 15, 23, 25, 29]
5 M 42 08917 [9, 42, 41, 13, 45, 23]
6 F 50 08921 [10, 50, 29, 52]
7 F 37 10627 [41, 3, 29, 39]
Note that not all of the week lists in Dataset 2 appear in Dataset 1. What I have tried so far is:
df_1['intersection'] = [list(set(a).intersection(set(b))) for a, b in zip(df_one.weeks, df_two.weeks)]
But it raises an error about different lengths.
Sample data added:
df1 = {'sex': {0: 'F', 1: 'F', 2: 'F', 3: 'F', 4: 'M'},
'age': {0: 46, 1: 40, 2: 46, 3: 60, 4: 24},
'zip_code': {0: '08204', 1: '08205', 2: '08206', 3: '08304', 4: '08507'},
'weeks': {0: [6, 40, 9, 44, 30],
1: [33, 41, 48, 50, 30],
2: [1, 2, 6, 10, 13, 19, 25, 26, 27, 29],
3: [4, 10, 11, 16, 17, 18, 24, 31, 33, 36],
4: [32, 15, 23, 25, 29]}}
df2 = {'user_id': {0: 1101, 1: 1122, 2: 2112, 3: 2200, 4: 3010},
'weeks': {0: [15, 23, 25, 29, 32],
1: [8, 9, 12, 20, 25, 30, 35],
2: [4, 10, 11, 16, 17, 18, 24, 31, 33, 36],
3: [3, 5, 7, 14, 21, 22, 27, 28, 34],
4: [1, 2, 6, 10, 13, 19, 25, 26, 27, 29]},
'blood_levels': {0: [126, 127, 120, 111, 107],
1: [70, 72, 69, 68, 74, 76, 72],
2: [117, 96, 114, 99, 119, 100, 93, 87, 80, 74],
3: [152, 126, 165, 169, 167, 169, 140, 154, 157],
4: [99, 82, 74, 69, 91, 96, 96, 68, 78, 89]}}
Can you help me achieve this?
Thanks
Answer 0: (score: 1)
After some discussion, the solution is:
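import pandas as pd
# the sample data above are plain dicts, so build DataFrames first
df1, df2 = pd.DataFrame(df1), pd.DataFrame(df2)
# cross join: pair every row of df1 with every row of df2
df = df1.assign(a=1).merge(df2.assign(a=1), on='a')
# convert the list columns to sets so element order is ignored
df['weeks_x'] = df['weeks_x'].apply(set)
df['weeks_y'] = df['weeks_y'].apply(set)
# keep only the pairs whose week sets are equal (weeks_x vs. weeks_y)
df = df[df['weeks_x'].eq(df['weeks_y'])].copy()
# intersection of the two week sets
df['intersection'] = [list(a.intersection(b)) for a, b in zip(df['weeks_x'], df['weeks_y'])]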
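To get all the way to the layout the question asks for, the matched rows still have to be attached back to Dataset 1. The following is a minimal sketch of one way to do that, assuming the sample data from the question and treating two week lists as "the same" when they contain the same values in any order. The names `pairs`, `same`, `matches` and `result` are only illustrative and do not come from the original answer.

```python
import pandas as pd

# Sample data from the question, written as DataFrames
df1 = pd.DataFrame({
    'sex': ['F', 'F', 'F', 'F', 'M'],
    'age': [46, 40, 46, 60, 24],
    'zip_code': ['08204', '08205', '08206', '08304', '08507'],
    'weeks': [[6, 40, 9, 44, 30],
              [33, 41, 48, 50, 30],
              [1, 2, 6, 10, 13, 19, 25, 26, 27, 29],
              [4, 10, 11, 16, 17, 18, 24, 31, 33, 36],
              [32, 15, 23, 25, 29]],
})
df2 = pd.DataFrame({
    'user_id': [1101, 1122, 2112, 2200, 3010],
    'weeks': [[15, 23, 25, 29, 32],
              [8, 9, 12, 20, 25, 30, 35],
              [4, 10, 11, 16, 17, 18, 24, 31, 33, 36],
              [3, 5, 7, 14, 21, 22, 27, 28, 34],
              [1, 2, 6, 10, 13, 19, 25, 26, 27, 29]],
    'blood_levels': [[126, 127, 120, 111, 107],
                     [70, 72, 69, 68, 74, 76, 72],
                     [117, 96, 114, 99, 119, 100, 93, 87, 80, 74],
                     [152, 126, 165, 169, 167, 169, 140, 154, 157],
                     [99, 82, 74, 69, 91, 96, 96, 68, 78, 89]],
})

# Cross join; reset_index() keeps df1's row label in a column named 'index'
pairs = (df1.reset_index().assign(a=1)
         .merge(df2.assign(a=1), on='a')
         .drop(columns='a'))

# True where both week lists hold the same values, regardless of order
same = pd.Series(
    [set(x) == set(y) for x, y in zip(pairs['weeks_x'], pairs['weeks_y'])],
    index=pairs.index,
)

# Keep the matched df2 columns, keyed by the original df1 row label
matches = pairs.loc[same, ['index', 'user_id', 'blood_levels']].set_index('index')

# Left join on the index: unmatched df1 rows get NaN in the new columns
result = df1.join(matches)
print(result)
```

With this sample data, rows 2, 3 and 4 of `df1` pick up `user_id` 3010, 2112 and 1101 respectively, and rows 0 and 1 stay empty. Because unmatched rows hold NaN, `user_id` comes back as a float column. Note that the cross join creates `len(df1) * len(df2)` rows, so for large frames it may be worth building a hashable key instead (for example a sorted tuple of the weeks) and merging on that.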