Question

I have two Pandas DataFrames:

import pandas as pd
a = pd.DataFrame( {'key' : [123, 234, 345, 456] } )
b = pd.DataFrame( {'key' : [     234, 345, 456, 567 ] } )

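What I want to do is merge them into a single DataFrame with two columns: key, the union of the two key columns; and source, a list of the original DataFrames that contain that key.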

For the input above, I want this:

+---+-----+--------+
|   | key | source |
+---+-----+--------+
| 0 | 123 | [a]    |
| 1 | 234 | [a, b] |
| 2 | 345 | [a, b] |
| 3 | 456 | [a, b] |
| 4 | 567 | [b]    |
+---+-----+--------+

I have a working implementation, but (I think) it is very slow for large tables:

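from functools import reduce  # reduce is no longer a builtin in Python 3

union = set( a.key )
union.update( b.key )
union_series = pd.Series( data=sorted(union) )

# Append the frame's name to acc if urow occurs in that frame's key column.
# (Python 3 dropped tuple parameters, so the pair is unpacked in the body.)
def append_ifin_src( urow, acc, name_src ):
    name, src = name_src
    acc.extend( [name] if len(src[src==urow]) != 0 else [] )
    return acc

source_series = union_series.apply( lambda urow : reduce( lambda acc, tocheck : append_ifin_src(urow, acc, tocheck), [('a', a.key), ('b', b.key)], [] ) )

pd.DataFrame( { 'key' : union_series, 'source' : source_series } )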

Is there a better way to do this?

Answer 1

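import pandas as pd

# Tag every row with the name of the frame it came from.
a = pd.DataFrame( {'key' : [123, 234, 345, 456],
                   'source': ['a','a','a','a'] } )
b = pd.DataFrame( {'key' : [     234, 345, 456, 567 ],
                   'source': ['b','b','b','b']} )

# The outer merge keeps keys from both frames; missing labels become "".
df = a.merge(b, how='outer', on='key').fillna("")
# Concatenate the two labels: 'a', 'ab', or 'b'.
df['source'] = df['source_x'] + df['source_y']
df[['key', 'source']]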

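Adding a column to the original DataFrames, as in the code above, is another idea... Note that source then holds a concatenated string ('a', 'ab', or 'b') rather than a list.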
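
One way to get actual lists in the source column, as the question asks, might be to tag each frame, stack them, and group by key. This is a minimal sketch rather than the answer's own method; it assumes keys are unique within each frame, and the names tagged and result are illustrative.

```python
import pandas as pd

a = pd.DataFrame({'key': [123, 234, 345, 456]})
b = pd.DataFrame({'key': [234, 345, 456, 567]})

# Tag each frame with its label, then stack the two frames.
tagged = pd.concat([a.assign(source='a'), b.assign(source='b')],
                   ignore_index=True)

# Collect the labels of each key into a list; groupby sorts by key by default.
result = tagged.groupby('key')['source'].agg(list).reset_index()
print(result)
```

If a key can repeat within one frame, its label would appear more than once in the list, so deduplicating each frame first may be needed.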

Answer 2

The "pandas-y" way is to first promote the column to the index:

aa = pd.DataFrame(['a']*len(a), index=a.key, columns=['a'])
bb = pd.Series(['b']*len(b), index=b.key, name='b')

Then join them and compute a new column:

aa.join(bb, how='outer')\
  .fillna('')\
  .apply(lambda x: x['a'] + x['b'], axis=1)
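
The row-wise apply can probably be replaced by adding the two columns directly, since + on string columns concatenates element-wise. A sketch under that assumption; joined and combined are hypothetical names, and a and b are the frames from the question.

```python
import pandas as pd

a = pd.DataFrame({'key': [123, 234, 345, 456]})
b = pd.DataFrame({'key': [234, 345, 456, 567]})

aa = pd.DataFrame(['a'] * len(a), index=a.key, columns=['a'])
bb = pd.Series(['b'] * len(b), index=b.key, name='b')

# Outer join on the key index; absent labels become empty strings.
joined = aa.join(bb, how='outer').fillna('')

# Element-wise string concatenation of the two label columns.
combined = (joined['a'] + joined['b']).rename('source').reset_index()
print(combined)
```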

If the initial ordering doesn't matter, I would also try a pure-Python solution:

def source(key):
    if key in sa and key in sb:
        return '[a, b]'
    if key in sa:
        return '[a]'
    if key in sb:
        return '[b]'

sa = set(a.key)
sb = set(b.key)
pd.DataFrame([[key, source(key)]
              for key in sa.union(sb)],
              columns=['key', 'source'])

Out[99]:
   key  source
0  456  [a, b]
1  234  [a, b]
2  567     [b]
3  345  [a, b]
4  123     [a]

In my quick test, pure Python was about 6 times faster, but you should check with your own data.
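
To check on your own data, something along these lines could be used. The sizes, seed, and function names (with_join, with_sets) are assumptions for illustration, and the timings will vary by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Hypothetical test data: unique integer keys drawn from an overlapping range.
a = pd.DataFrame({'key': rng.choice(50_000, size=20_000, replace=False)})
b = pd.DataFrame({'key': rng.choice(50_000, size=20_000, replace=False)})

def with_join():
    # Index-based outer join followed by a row-wise apply.
    aa = pd.DataFrame(['a'] * len(a), index=a.key, columns=['a'])
    bb = pd.Series(['b'] * len(b), index=b.key, name='b')
    joined = aa.join(bb, how='outer').fillna('')
    return joined.apply(lambda x: x['a'] + x['b'], axis=1)

def with_sets():
    # Pure-Python set membership, one list of labels per key.
    sa, sb = set(a.key), set(b.key)
    rows = [[key, [name for name, s in (('a', sa), ('b', sb)) if key in s]]
            for key in sa | sb]
    return pd.DataFrame(rows, columns=['key', 'source'])

for fn in (with_join, with_sets):
    seconds = timeit.timeit(fn, number=3)
    print(f'{fn.__name__}: {seconds / 3:.3f} s per run')
```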

Answer 3

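If adding a column isn't an option, you can use np.in1d after the merge. There are many ways to go from here, though you will want to remove the empty strings.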

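import numpy as np

df = pd.merge(a, b, how='outer')
# list(...) is needed on Python 3, where zip returns an iterator.
# np.in1d still matches the text; newer NumPy prefers np.isin.
df['source'] = list(zip(np.where(np.in1d(df['key'], a['key']), 'a', ''),
                        np.where(np.in1d(df['key'], b['key']), 'b', '')))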

   key  source
0  123   (a, )
1  234  (a, b)
2  345  (a, b)
3  456  (a, b)
4  567   (, b)
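
One possible way to avoid the empty strings is to build each row's list from boolean membership masks. A sketch, assuming the same a and b as in the question; Series.isin is used here in place of np.in1d, and in_a, in_b, and labels are illustrative names.

```python
import pandas as pd

a = pd.DataFrame({'key': [123, 234, 345, 456]})
b = pd.DataFrame({'key': [234, 345, 456, 567]})

df = pd.merge(a, b, how='outer')

# Boolean membership masks, one per original frame.
in_a = df['key'].isin(a['key'])
in_b = df['key'].isin(b['key'])

# Keep only the labels whose mask is True, so no empty strings appear.
labels = [[name for name, present in (('a', ia), ('b', ib)) if present]
          for ia, ib in zip(in_a, in_b)]
df['source'] = pd.Series(labels, index=df.index)
print(df)
```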
