Comparing two pandas DataFrames with lists in a column

Time: 2016-08-25 15:14:14

Tags: python pandas

I have two DataFrames, df1 and df2:

df1 : 
Name A_list
abcd (apple,orange,banana)
bcde (orange,mango)
cdef (apple,pineapple)

df2 :
City B_list
C1   (apple,mango,banana)
C2   (mango)
C3   (pineapple,banana)

I want to create a new DataFrame, df3:

Name A_list City
abcd (apple,orange,banana) (C1,C3)
bcde (orange,mango) (C1,C2)
cdef (apple,pineapple) (C1,C3)

That is, go through A_list in df1 and determine which cities each fruit comes from. I don't know how to merge df1 and df2 using the lists A_list and B_list.

1 Answer:

Answer 0: (score: 2)

Setup

import pandas as pd

df1 = pd.DataFrame([
        ['abcd', ('apple', 'orange', 'banana')],
        ['bcde', ('orange', 'mango')],
        ['cdef', ('apple', 'pineapple')]
    ], columns=['Name', 'A_list'])
df2 = pd.DataFrame([
        ['C1', ('apple', 'mango', 'banana')],
        ['C2', ('mango',)],  # trailing comma makes this a one-element tuple
        ['C3', ('pineapple', 'banana')]
    ], columns=['City', 'B_list'])
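One Python detail matters for single-fruit rows such as C2: parentheses alone do not make a tuple, the trailing comma does. A minimal sketch of the difference (the variable names are illustrative only):

```python
# Parentheses without a comma are just grouping, so this is a plain string
not_a_tuple = ('mango')
# The trailing comma is what makes a one-element tuple
one_tuple = ('mango',)

print(type(not_a_tuple).__name__)
print(type(one_tuple).__name__)

# Iterating shows why it matters: a string yields characters, a tuple yields items
print(list(not_a_tuple))
print(list(one_tuple))
```

Code that loops over the elements of B_list would see five single characters for the string form, so the tuple form is the safer way to write a one-fruit entry.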

Massage the data

s2 = df2.set_index('City').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(1, drop=True)

s2

City
C1        apple
C1        mango
C1       banana
C2        mango
C3    pineapple
C3       banana
dtype: object

s1 = df1.set_index('Name').squeeze() \
    .apply(pd.Series) \
    .stack().reset_index(1, drop=True)

s1

Name
abcd        apple
abcd       orange
abcd       banana
bcde       orange
bcde        mango
cdef        apple
cdef    pineapple
dtype: object

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])

df3
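On pandas 0.25 or later, the reshaping and merge above can probably be written more directly with DataFrame.explode, which gives each tuple element its own row. The sketch below is an alternative, not the answer's original code; it rebuilds the sample frames so it runs on its own, and the names long1, long2 and merged are illustrative.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['abcd', 'bcde', 'cdef'],
    'A_list': [('apple', 'orange', 'banana'),
               ('orange', 'mango'),
               ('apple', 'pineapple')],
})
df2 = pd.DataFrame({
    'City': ['C1', 'C2', 'C3'],
    'B_list': [('apple', 'mango', 'banana'),
               ('mango',),
               ('pineapple', 'banana')],
})

# One row per (Name, fruit) pair and one row per (City, fruit) pair
long1 = df1.explode('A_list').rename(columns={'A_list': 'fruit'})
long2 = df2.explode('B_list').rename(columns={'B_list': 'fruit'})

# Inner join on the shared 'fruit' column; fruits no city lists are dropped
merged = long1.merge(long2, on='fruit')
print(merged.sort_values(['Name', 'City', 'fruit']).to_string(index=False))
```

Each row of merged pairs a Name with a City through one shared fruit, which is the shape the grouping step needs.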


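def tuplify(series):
    # Deduplicate, then sort so the tuple order is deterministic
    return tuple(sorted(set(series)))

# Collapse each Name's rows into one tuple per remaining column.
# agg calls tuplify once per column per group; the nested DataFrame.apply
# it replaces can expand equal-length tuples into rows on newer pandas.
df3.groupby('Name') \
    .agg(tuplify) \
    .rename(columns=dict(fruit='A_list')).reset_index()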


Note that 'orange' is missing from A_list because no City lists it, so the inner merge drops it. If you want A_list unchanged from df1:

df3 = pd.merge(*[s.rename('fruit').reset_index() for s in [s1, s2]])
# agg applies tuplify to each column within each Name group
df3 = df3.groupby('Name') \
    .agg(tuplify) \
    .rename(columns=dict(fruit='A_list'))

# Overwrite A_list with the original tuples from df1 (aligned on Name)
df3['A_list'] = df1.set_index('Name')['A_list']
df3.reset_index()
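If the goal is only the table from the question (A_list kept intact, plus the matching cities), the reshape and merge can be skipped. The sketch below is a different technique from the answer above: it builds a plain dict from each fruit to the cities that list it, then looks up every fruit in A_list. The names fruit_to_cities and cities_for are made up for illustration, and the sample frames are rebuilt so the block runs on its own.

```python
import pandas as pd

df1 = pd.DataFrame({
    'Name': ['abcd', 'bcde', 'cdef'],
    'A_list': [('apple', 'orange', 'banana'),
               ('orange', 'mango'),
               ('apple', 'pineapple')],
})
df2 = pd.DataFrame({
    'City': ['C1', 'C2', 'C3'],
    'B_list': [('apple', 'mango', 'banana'),
               ('mango',),
               ('pineapple', 'banana')],
})

# Inverted index: fruit -> set of cities whose B_list contains it
fruit_to_cities = {}
for city, fruits in zip(df2['City'], df2['B_list']):
    for fruit in fruits:
        fruit_to_cities.setdefault(fruit, set()).add(city)

def cities_for(fruits):
    """Return the sorted, de-duplicated cities for a tuple of fruits."""
    found = set()
    for fruit in fruits:
        # Fruits that no city lists (e.g. 'orange') contribute nothing
        found |= fruit_to_cities.get(fruit, set())
    return tuple(sorted(found))

# A_list is left untouched; City is computed row by row
df3 = df1.assign(City=df1['A_list'].apply(cities_for))
print(df3)
```

Because df1 is never filtered here, a Name whose fruits match no city would still appear, with an empty tuple in City.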
