I have a DataFrame A that looks like this:
+---+------------+-----------+-----------+-----+-------+
| | time | uid | o_uid | msg | count |
+---+------------+-----------+-----------+-----+-------+
| 0 | 1433131357 | 191470529 | 191159572 | eis | 1 |
| 1 | 1433131410 | 191458009 | 160429326 | eis | 1 |
| 2 | 1433131504 | 191470523 | 153734142 | eis | 1 |
| 3 | 1433131685 | 191470551 | 191470546 | eis | 1 |
| 4 | 1433131782 | 191470565 | 187367195 | eis | 1 |
+---+------------+-----------+-----------+-----+-------+
and another DataFrame B that looks like this:
+---+------------+-----------+-------+
| | time | uid | count |
+---+------------+-----------+-------+
| 0 | 1433131967 | 191470529 | 1 |
| 1 | 1433132503 | 191466638 | 1 |
| 2 | 1433139333 | 191451858 | 1 |
| 3 | 1433141249 | 191470551 | 1 |
| 4 | 1433143867 | 191471209 | 1 |
+---+------------+-----------+-------+
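What I want is to take all the timestamps from B and put them in a column of A, next to the matching uid values. Where there is no match, the value should be NaN.
I tried:
df = pd.merge(A, B, left_on='uid', right_on='uid', how='outer')
but I think it just appended B to the bottom of A. It did not work as expected.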
Answer 0 (score: 2)
I think a left join fits your case best. You can get one by setting how='left':
import pandas as pd
# your data
# ============================
print(df_A)
Out[33]:
time uid o_uid msg count
0 1433131357 191470529 191159572 eis 1
1 1433131410 191458009 160429326 eis 1
2 1433131504 191470523 153734142 eis 1
3 1433131685 191470551 191470546 eis 1
4 1433131782 191470565 187367195 eis 1
print(df_B)
Out[35]:
time uid count
0 1433131967 191470529 1
1 1433132503 191466638 1
2 1433139333 191451858 1
3 1433141249 191470551 1
4 1433143867 191471209 1
# processing
# ============================
df = pd.merge(df_A, df_B, left_on='uid', right_on='uid', how='left', suffixes=['_A', '_B'])
print(df)
Out[45]:
time_A uid o_uid msg count_A time_B count_B
0 1433131357 191470529 191159572 eis 1 1.4331e+09 1
1 1433131410 191458009 160429326 eis 1 NaN NaN
2 1433131504 191470523 153734142 eis 1 NaN NaN
3 1433131685 191470551 191470546 eis 1 1.4331e+09 1
4 1433131782 191470565 187367195 eis 1 NaN NaN
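Since the question only asks for B's timestamps, here is a minimal sketch that merges just the uid and time columns of B. It assumes the frames are named df_A and df_B as above and that each uid appears at most once in df_B. The column name time_B and the cast to the nullable Int64 dtype are my own choices for illustration, not something the question requires.

```python
import pandas as pd

df_A = pd.DataFrame({
    'time': [1433131357, 1433131410, 1433131504, 1433131685, 1433131782],
    'uid': [191470529, 191458009, 191470523, 191470551, 191470565],
    'o_uid': [191159572, 160429326, 153734142, 191470546, 187367195],
    'msg': ['eis'] * 5,
    'count': [1] * 5,
})
df_B = pd.DataFrame({
    'time': [1433131967, 1433132503, 1433139333, 1433141249, 1433143867],
    'uid': [191470529, 191466638, 191451858, 191470551, 191471209],
    'count': [1] * 5,
})

# Keep only the key and the timestamp from B, renaming the timestamp
# so it does not collide with A's own 'time' column.
b_times = df_B[['uid', 'time']].rename(columns={'time': 'time_B'})

# Left join: every row of A is kept; uids missing from B get NaN in time_B.
result = df_A.merge(b_times, on='uid', how='left')

# Optional: a nullable integer dtype keeps the timestamps as integers
# (missing values show as <NA>) instead of being upcast to float.
result['time_B'] = result['time_B'].astype('Int64')

print(result)
```

Without the final cast, time_B is a float column because NaN cannot be stored in a plain integer column, which is why the output above shows values like 1.4331e+09.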
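Answer 1 (score: 1)
The time and count columns exist in both DataFrames, so you need to pass suffixes to tell them apart. In the example below I use an empty suffix for df_a and '_b' as the suffix for df_b.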
import pandas as pd
df_a = pd.DataFrame({'count': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'msg': {0: 'eis', 1: 'eis', 2: 'eis', 3: 'eis', 4: 'eis'},
'o_uid': {0: 191159572, 1: 160429326, 2: 153734142, 3: 191470546, 4: 187367195},
'time': {0: 1433131357, 1: 1433131410, 2: 1433131504, 3: 1433131685, 4: 1433131782},
'uid': {0: 191470529, 1: 191458009, 2: 191470523, 3: 191470551, 4: 191470565}})
df_b = pd.DataFrame({'count': {0: 1, 1: 1, 2: 1, 3: 1, 4: 1},
'time': {0: 1433131967, 1: 1433132503, 2: 1433139333, 3: 1433141249, 4: 1433143867},
'uid': {0: 191470529, 1: 191466638, 2: 191451858, 3: 191470551, 4: 191471209}})
>>> df_a.merge(df_b, how='outer', on='uid', suffixes=['', '_b'])
count msg o_uid time uid count_b time_b
0 1 eis 191159572 1433131357 191470529 1 1433131967
1 1 eis 160429326 1433131410 191458009 NaN NaN
2 1 eis 153734142 1433131504 191470523 NaN NaN
3 1 eis 191470546 1433131685 191470551 1 1433141249
4 1 eis 187367195 1433131782 191470565 NaN NaN
5 NaN NaN NaN NaN 191466638 1 1433132503
6 NaN NaN NaN NaN 191451858 1 1433139333
7 NaN NaN NaN NaN 191471209 1 1433143867
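Rows 5 to 7 in the output above are the uids that exist only in df_b, which is probably what looked like B being "appended to the bottom" in the question. As a rough sketch on small made-up frames (the names left and right and their values are only for illustration), the indicator=True option of merge labels where each row came from, which makes the difference between how='outer' and how='left' easy to see:

```python
import pandas as pd

# Small illustrative frames (made-up values, not the data above).
left = pd.DataFrame({'uid': [1, 2, 3], 'time': [10, 20, 30]})
right = pd.DataFrame({'uid': [1, 4], 'time': [100, 400]})

# Outer join: keeps uids from both sides, so uid 4 shows up as an extra row.
outer = left.merge(right, on='uid', how='outer',
                   suffixes=('', '_b'), indicator=True)

# Left join: keeps only the rows of `left`; uid 4 is dropped.
left_join = left.merge(right, on='uid', how='left',
                       suffixes=('', '_b'), indicator=True)

# The '_merge' column is 'both', 'left_only' or 'right_only' for each row.
print(outer)
print(left_join)
```

If the extra right_only rows are not wanted, how='left' avoids them altogether.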