New column based on matching values from another pandas DataFrame

Time: 2019-03-13 13:31:36

Tags: python pandas dataframe merge

In the example below, given two DataFrames, df1 and df2, how can we merge them to produce df3?

import pandas as pd
import numpy as np

# df1: each row holds a list of keys in column2
data1 = [("a1",["A","B"]),("a2",["A","B","C"]),("a3",["B","C"])]
df1 = pd.DataFrame(data1,columns = ["column1","column2"])
print(df1)

# df2: each key in column3 maps to a list of values in column4
data2 = [("A",["1","2"]),("B",["1","3","4"]),("C",["5"])]
df2 = pd.DataFrame(data2,columns=["column3","column4"])
print(df2)

# df3: desired result, where column5 is the union of the column4 lists for the keys in column2
data3 = [("a1",["A","B"],["1","2","3","4"]),
         ("a2",["A","B","C"],["1","2","3","4","5"]),
         ("a3",["B","C"],["1","3","4","5"])]
df3 = pd.DataFrame(data3,columns = ["column1","column2","column5"])
print(df3)

My goal is to avoid loops, because I am working with large datasets.

3 Answers:

Answer 0: (score: 7)

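Rebuild a DataFrame from the list column of df1, stack it, and then map the stacked values through df2.

Also, since you asked to avoid for loops, I am using sum here; note that in this case sum is much slower than a for loop or itertools.

As I mentioned, and as most of us would suggest, you can also check "For loops with pandas - When should I care?".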
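# Expand the list column into a wide frame, then stack it into one long Series
s=pd.DataFrame(df1.column2.tolist()).stack()
# Map each key to its list in df2, concatenate the lists per original row, then deduplicate
# Note: sum(level=0) was removed in pandas 2.0; the documented replacement is groupby(level=0).sum()
df1['New']=s.map(df2.set_index('column3').column4).sum(level=0).apply(set)
df1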
Out[36]: 
  column1    column2              New
0      a1     [A, B]     {2, 4, 3, 1}
1      a2  [A, B, C]  {3, 5, 4, 2, 1}
2      a3     [B, C]     {4, 3, 1, 5}
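
For readers on recent pandas versions, the same stack-and-map idea can be sketched with Series.explode instead of sum(level=0). This is only an illustration, not the answerer's code: it assumes pandas 0.25 or later (where explode is available), and the names lookup and result are made up for the example.

```python
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# Series mapping each key in column3 to its list in column4 (hypothetical helper name)
lookup = df2.set_index("column3")["column4"]

result = (
    df1["column2"]
    .explode()          # one row per key; the original row label repeats in the index
    .map(lookup)        # replace each key with its list from df2
    .explode()          # one row per individual value
    .groupby(level=0)   # regroup by the original df1 row label
    .agg(lambda values: sorted(set(values)))  # deduplicate and sort within each row
)

# Index alignment puts each list back on the row it came from
df1["column5"] = result
print(df1)
```

Sorting the deduplicated values gives a stable order, so the new column should line up with column5 of df3 from the question rather than with the unordered sets shown above.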

Answer 1: (score: 2)

You can do it as follows:

df2_dict = {i:j for i,j in zip(df2['column3'].values, df2['column4'].values)}
# print(df2_dict)

def func(val):
    return sorted(list(set(np.concatenate([df2_dict.get(i) for i in val]))))

df1['column5'] = df1['column2'].apply(func)
print(df1)

Output:

  column1    column2          column5
0      a1     [A, B]     [1, 2, 3, 4]
1      a2  [A, B, C]  [1, 2, 3, 4, 5]
2      a3     [B, C]     [1, 3, 4, 5]
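
A possible variant of the approach above, shown only as a sketch: it swaps np.concatenate for itertools.chain and skips keys that have no entry in df2 (with the original func, df2_dict.get(i) would return None for such a key and np.concatenate would fail). The names lookup and collect are hypothetical.

```python
from itertools import chain

import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# Plain dict: key in column3 -> list in column4
lookup = dict(zip(df2["column3"], df2["column4"]))


def collect(keys):
    """Merge the df2 lists for the given keys, deduplicated and sorted."""
    # lookup.get(k, []) contributes nothing for keys missing from df2
    return sorted(set(chain.from_iterable(lookup.get(k, []) for k in keys)))


# A list comprehension is still a Python-level loop, as is Series.apply
df1["column5"] = [collect(keys) for keys in df1["column2"]]
print(df1)
```

Whether this is faster than the apply-based version depends on the data, so it is worth timing both on a realistic sample.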

Answer 2: (score: 0)

This works:

df1['column2'].apply(lambda x: list(set((np.concatenate([df2.set_index('column3')['column4'][i] for i in list(x)])) )))
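
The one-liner above rebuilds df2.set_index('column3') for every key of every row and returns the values in arbitrary set order. A hedged sketch of the same idea, with the lookup built once and the result sorted and assigned to a column, might look like this (the names lookup and merge_values are assumptions for the example):

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(
    [("a1", ["A", "B"]), ("a2", ["A", "B", "C"]), ("a3", ["B", "C"])],
    columns=["column1", "column2"],
)
df2 = pd.DataFrame(
    [("A", ["1", "2"]), ("B", ["1", "3", "4"]), ("C", ["5"])],
    columns=["column3", "column4"],
)

# Build the key -> list lookup once instead of once per element
lookup = df2.set_index("column3")["column4"]


def merge_values(keys):
    """Concatenate the lists for all keys, then deduplicate and sort."""
    return sorted(set(np.concatenate([lookup[k] for k in keys])))


df1["column5"] = df1["column2"].apply(merge_values)
print(df1)
```

This keeps the behavior of the original expression, including raising KeyError when a key in column2 is absent from df2.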