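I am trying to analyze trends in my data with Pandas. I have two tables, and I want to create a new binary column in one of them that flags whether the row's UID and PID both exist in the other table. An example of the tables I currently have: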
>>> import pandas as pd
>>> df_a = pd.DataFrame({"UID": [123, 456, 789, 12], "PID": [12, 55, 56, 89], "TIM": [76, 54, 21, 25]})
>>> df_a
   PID  TIM  UID
0   12   76  123
1   55   54  456
2   56   21  789
3   89   25   12
>>> df_b = pd.DataFrame({'UID': [221, 12, 653, 456], 'PID': [17, 89, 51, 55], 'FOO': [2347, 32447, 3234, 7999]})
>>> df_b
     FOO  PID  UID
0   2347   17  221
1  32447   89   12
2   3234   51  653
3   7999   55  456
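I would like the end result to be:
>>> df_a
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
But I don't know how to go about it. I think a left join is the way to go, but I haven't been able to pull it off either. Any help would be greatly appreciated.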
Answer 0: (score: 3)
You can use a left join with join or merge, then test whether the FOO column is not NaN with notnull, and convert the resulting boolean mask to 0/1 with astype:
# Left-join df_b on the (PID, UID) pair; FOO is NaN wherever no match exists
df_a['PUR'] = (df_a.join(df_b.set_index(['PID','UID']), on=['PID','UID'])['FOO']
                   .notnull().astype(int))
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
Another solution uses merge:
df_a['PUR'] = pd.merge(df_a, df_b, how='left', on=['PID','UID'])['FOO'].notnull().astype(int)
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
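A left merge returns one row per match, so if df_b held the same PID/UID pair more than once, the merged frame would be longer than df_a and the new column would no longer line up. A minimal sketch of one way to guard against that, assuming a hypothetical df_b in which the pair (55, 456) is repeated:

```python
import pandas as pd

df_a = pd.DataFrame({"UID": [123, 456, 789, 12],
                     "PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25]})

# Hypothetical df_b for illustration: the pair (PID=55, UID=456) appears twice.
df_b = pd.DataFrame({"UID": [221, 12, 653, 456, 456],
                     "PID": [17, 89, 51, 55, 55],
                     "FOO": [2347, 32447, 3234, 7999, 8000]})

# Keep one row per (PID, UID) pair so the left merge cannot add rows.
b_unique = df_b.drop_duplicates(subset=["PID", "UID"])
merged = pd.merge(df_a, b_unique, how="left", on=["PID", "UID"])

# FOO is NaN where df_a's pair was not found in df_b.
df_a["PUR"] = merged["FOO"].notnull().astype(int).to_numpy()
print(df_a)
```

Passing the values through to_numpy() assigns by position, which is safe here because the deduplicated merge keeps df_a's row order and length.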
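Edit:
It seems drop_duplicates may be needed on both columns if df_b contains repeated PID/UID pairs, since a left merge would then return extra rows. A test with isin avoids that; note that Series.isin compares values only and ignores the index, so the line below checks UID membership alone: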
# isin ignores the PID index set here and compares UID values only
df_a['PUR'] = (df_a.set_index('PID')['UID'].isin(df_b.set_index('PID')['UID'])
                   .astype(int).values)
print(df_a)
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
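If the goal is to test the PID/UID pair as a unit rather than a single column, one possible sketch is to build a MultiIndex from each frame and call isin on it. The variable names pairs_a and pairs_b are illustrative and not part of the original answer:

```python
import pandas as pd

df_a = pd.DataFrame({"UID": [123, 456, 789, 12],
                     "PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25]})
df_b = pd.DataFrame({"UID": [221, 12, 653, 456],
                     "PID": [17, 89, 51, 55],
                     "FOO": [2347, 32447, 3234, 7999]})

# Each MultiIndex entry is one (PID, UID) tuple.
pairs_a = pd.MultiIndex.from_frame(df_a[["PID", "UID"]])
pairs_b = pd.MultiIndex.from_frame(df_b[["PID", "UID"]])

# True only where the whole pair from df_a also occurs in df_b.
df_a["PUR"] = pairs_a.isin(pairs_b).astype(int)
print(df_a)
```

Because no merge is involved, repeated pairs in df_b do not change the number of rows in the result.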
Answer 1: (score: 2)
merge with indicator=True gets you almost all the way there:
df_a.merge(df_b[['PID', 'UID']], how='left', indicator=True)
   PID  TIM  UID     _merge
0   12   76  123  left_only
1   55   54  456       both
2   56   21  789  left_only
3   89   25   12       both
Then use map:
m = dict(left_only=0, both=1)
df_a.assign(
PUR=df_a.merge(df_b[['PID', 'UID']], how='left', indicator=True)._merge.map(m))
   PID  TIM  UID  PUR
0   12   76  123    0
1   55   54  456    1
2   56   21  789    0
3   89   25   12    1
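As a variation on the indicator approach, the _merge column can be compared with "both" directly, which gives a boolean mask that astype(int) turns into 0/1. This is only a sketch; it assumes df_b has no repeated PID/UID pairs, and the name merged is introduced here for illustration:

```python
import pandas as pd

df_a = pd.DataFrame({"UID": [123, 456, 789, 12],
                     "PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25]})
df_b = pd.DataFrame({"UID": [221, 12, 653, 456],
                     "PID": [17, 89, 51, 55],
                     "FOO": [2347, 32447, 3234, 7999]})

# indicator=True adds a _merge column: "both" when the pair exists in df_b.
merged = df_a.merge(df_b[["PID", "UID"]], how="left",
                    on=["PID", "UID"], indicator=True)

# Compare against "both" and convert the boolean mask to integers.
result = df_a.assign(PUR=(merged["_merge"] == "both").astype(int).to_numpy())
print(result)
```

Like assign in the answer above, this returns a new frame and leaves df_a itself unchanged.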
Answer 2: (score: 0)
You can do it with a left join, but getting exactly that result is a little awkward.
df_b['PUR'] = 1
df_a = pd.merge(df_a, df_b, how='left', on=['PID', 'UID'])
df_a['PUR'] = df_a['PUR'].apply(lambda x: 1 if pd.notnull(x) else 0)
df_a = df_a.drop('FOO', axis=1)
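I would suggest using a plain apply instead. Note that this tests UID and PID independently (either one being present in df_b is enough) rather than as a pair: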
df_a['PUR'] = df_a.apply(lambda x: int(x['UID'] in df_b['UID'].values or
x['PID'] in df_b['PID'].values), axis=1)
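For comparison, a pair-wise membership test can be sketched with a set of (UID, PID) tuples. This is an assumption about what the question intends by "UID and PID exist in the other table", and the name pairs_b is hypothetical:

```python
import pandas as pd

df_a = pd.DataFrame({"UID": [123, 456, 789, 12],
                     "PID": [12, 55, 56, 89],
                     "TIM": [76, 54, 21, 25]})
df_b = pd.DataFrame({"UID": [221, 12, 653, 456],
                     "PID": [17, 89, 51, 55],
                     "FOO": [2347, 32447, 3234, 7999]})

# Collect every (UID, PID) pair that occurs in df_b.
pairs_b = set(zip(df_b["UID"], df_b["PID"]))

# Flag a row of df_a only when its exact pair is in that set.
df_a["PUR"] = [int(pair in pairs_b) for pair in zip(df_a["UID"], df_a["PID"])]
print(df_a)
```

Set lookups keep this linear in the number of rows, whereas the apply version rescans df_b's columns for every row of df_a.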
Answer 3: (score: 0)
You can use NumPy's in1d(). It can also handle other cases where a left join might fail.
import pandas as pd
import numpy as np

df_a = pd.DataFrame({"UID": [123, 456, 789, 12], "PID": [12, 55, 56, 89], "TIM": [76, 54, 21, 25]})
df_b = pd.DataFrame({'UID': [221, 12, 653, 456], 'PID': [17, 89, 51, 55], 'FOO': [2347, 32447, 3234, 7999]})

UID_a = df_a['UID'].values
UID_b = df_b['UID'].values
PID_a = df_a['PID'].values
PID_b = df_b['PID'].values

# Element-wise membership tests (np.isin is the newer equivalent of np.in1d)
x = np.in1d(UID_a, UID_b)
y = np.in1d(PID_a, PID_b)

# x | y flags a row when either column is found in df_b (adding boolean
# arrays is the same logical OR); use x & y to require both columns.
PUR = (x | y).astype(int)
df_a['PUR'] = PUR