Merge two DataFrames that share some columns

Asked: 2014-07-14 21:52:17

Tags: python csv pandas

I have two CSV files:

1.csv

id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4

2.csv

id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8

I load them like this:

>>> df1 = pd.read_csv('1.csv', encoding='utf-8').set_index('id')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8').set_index('id')
>>>
>>> print df1
       noteId                   text
id
id2  idNote19  This is my old text 2
id5  idNote13  This is my old text 5
id1  idNote12  This is my old text 1
id3  idNote10  This is my old text 3
id4  idNote11  This is my old text 4
>>> print df2
        noteId            text other
id
id3   idNote10      new text 3   On1
id2   idNote19   My new text 2  Pre8
id5        NaN   My new text 2   Hl0
id22  idNote22  My new text 22    M1

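I need to merge these two DataFrames (overwriting the values in df1 with those from df2, including values that are empty in df2, and adding the extra columns and the rows that do not exist in df1):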

        noteId                   text other
id
id2   idNote19          My new text 2  Pre8
id5        NaN          My new text 2   Hl0
id1   idNote12  This is my old text 1   NaN
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id22  idNote22         My new text 22    M1

My real DataFrames also have other columns that should be merged, not only text.

I tried using merge to get something similar:

>>> df1 = pd.read_csv('1.csv', encoding='utf-8')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8')
>>>
>>> print df1
    id    noteId                   text
0  id2  idNote19  This is my old text 2
1  id5  idNote13  This is my old text 5
2  id1  idNote12  This is my old text 1
3  id3  idNote10  This is my old text 3
4  id4  idNote11  This is my old text 4
>>> print df2
    id    noteId           text
0  id3  idNote10     new text 3
1  id2  idNote19  My new text 2
>>>
>>> print pd.merge(df1, df2, how='left', on=['id'])
    id  noteId_x                 text_x  noteId_y         text_y
0  id2  idNote19  This is my old text 2  idNote19  My new text 2
1  id5  idNote13  This is my old text 5       NaN            NaN
2  id1  idNote12  This is my old text 1       NaN            NaN
3  id3  idNote10  This is my old text 3  idNote10     new text 3
4  id4  idNote11  This is my old text 4       NaN            NaN
>>>

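But this is not what I need. I don't know whether I am on the right track and should merge the suffixed columns, or whether there is a better way to do this.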

Thanks!

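Update: Added the overwriting of values in df1 that are empty in df2, the extra columns in df2 that should be present in df1 after the "merge", and the rows that should be appended to df1.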
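On the question of whether the suffixed columns can be merged, a minimal sketch follows. It assumes an outer merge with `indicator=True` and takes the df2 value for every row that exists in df2, even when that value is NaN. The suffixes `_old`/`_new` and the variable names are illustrative choices, not part of the original post, and the two extra rows of 2.csv are taken from the printed df2 above.

```python
import io
import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

# Assumption: the empty noteId field for id5 stands for the NaN shown in df2.
csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))

# Columns present in both frames (other than the key) get suffixed by merge.
shared = [c for c in df1.columns if c in df2.columns and c != 'id']

merged = pd.merge(df1, df2, how='outer', on='id',
                  suffixes=('_old', '_new'), indicator=True)

# True for rows that exist in df2 ('both' or 'right_only').
in_df2 = merged['_merge'] != 'left_only'

for col in shared:
    # Take the df2 value (NaN included) where the row exists in df2,
    # otherwise keep the df1 value.
    merged[col] = merged[col + '_new'].where(in_df2, merged[col + '_old'])
    merged = merged.drop(columns=[col + '_old', col + '_new'])

result = merged.drop(columns='_merge').set_index('id')[['noteId', 'text', 'other']]
print(result)
```

The `_merge` indicator is what lets a NaN that comes from df2 be told apart from a NaN that only means "row missing in df2".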

-

Following the comment by @U2EF1 (thanks!), I found a solution:

df1.fillna(value='None', inplace=True)
df2.fillna(value='None', inplace=True)

pd.concat([df1, df2]).groupby('id').last().fillna(value='None')

In my case it is really important to define a default "empty" value; that is the reason for the fillna.
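To illustrate why the placeholder matters, here is a small sketch using the data from the question. `groupby(...).last()` returns the last non-null value per column, so without a placeholder the NaN from df2 would be skipped and the old df1 value would survive. The variable names `plain` and `filled` are illustrative, and the empty noteId field for id5 is assumed to stand for the NaN shown in df2.

```python
import io
import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(io.StringIO(csv1))
df2 = pd.read_csv(io.StringIO(csv2))

# Without a placeholder: last() skips NaN, so id5 keeps the old noteId.
plain = pd.concat([df1, df2]).groupby('id').last()

# With the 'None' placeholder: the empty df2 value wins over the df1 value.
filled = (
    pd.concat([df1.fillna('None'), df2.fillna('None')])
    .groupby('id')
    .last()
    .fillna('None')  # cells missing in every row of a group (e.g. 'other' for id1)
)

print(plain.loc['id5', 'noteId'])
print(filled.loc['id5', 'noteId'])
```

Note that the placeholder here is the string 'None', not Python's None, so downstream code has to treat it as the "empty" marker.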

2 answers:

Answer 0 (score: 3)

Edit: updated to add rows and columns and to update data, merging efficiently on the index.

Code to update df1 with the data from df2...

import io
import pandas as pd

df1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

df2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

# This is how you should set your index: in read_csv, not afterwards.
# (io.StringIO is the Python 3 replacement for StringIO.StringIO.)
df1 = pd.read_csv(io.StringIO(df1), sep=",", index_col='id')
df2 = pd.read_csv(io.StringIO(df2), sep=",", index_col='id')

**Solution**

df = pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
#joined on indexes for speed

**Output**

>>> print df

        noteId                   text other
id                                         
id1   idNote12  This is my old text 1   NaN
id2   idNote19          My new text 2  Pre8
id22  idNote22         My new text 22    M1
id3   idNote10             new text 3   On1
id4   idNote11  This is my old text 4   NaN
id5        NaN          My new text 2   Hl0

Rationale...

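pd.merge has a few multi-purpose parameters. The on key is actually only used to join the two DataFrames when left_index and right_index are set to False, the default. Otherwise, it just joins the columns with the same names found in the on value; in this case, the two columns 'text' and 'noteId'. (I used df1.columns.tolist() as the argument to make it more general: any column in df2 with the same name overwrites the data in df1 instead of being labeled text_y.)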

With the more general on key (df1.columns.tolist()) you can actually loop through a bunch of CSVs, updating the data in the DataFrame as you go.
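A sketch of such a loop is shown below. It does not reuse the pd.merge call from this answer: newer pandas releases may reject `on` combined with `left_index`/`right_index`, so the sketch swaps in a different technique, `pd.concat` followed by dropping the earlier duplicates of each index label. The helper name `overlay` and the list `csv_texts` are hypothetical, and in-memory strings stand in for the CSV files.

```python
import io
import pandas as pd

def overlay(base, update):
    """Hypothetical helper: rows of `update` replace rows of `base`
    that share an index label; new rows and columns are appended."""
    stacked = pd.concat([base, update])
    # Keep only the last occurrence of each index label (the newest row).
    return stacked[~stacked.index.duplicated(keep='last')]

# Stand-ins for a series of CSV files, oldest first (data from the question).
csv_texts = [
    """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4""",
    """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1""",
]

result = None
for text in csv_texts:
    frame = pd.read_csv(io.StringIO(text), index_col='id')
    result = frame if result is None else overlay(result, frame)

print(result)
```

Because whole rows are replaced, a NaN in the newer file overwrites the older value, which matches the behavior requested in the question; the row order differs from the output shown above.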

**3x faster than the accepted solution**

In [25]: %timeit       pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop

The accepted solution

In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop

Answer 1 (score: 2)

Usually you can solve this with an appropriate index:

df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)

(If you don't want that index afterwards, just call df1.reset_index(inplace=True).)
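As a rough illustration of what update does once the indexes line up, the sketch below indexes both frames on id alone (an assumption made here to keep NaN out of the index; the answer uses ['id', 'noteId']). `DataFrame.update` modifies the frame in place, ignores NaN values in the other frame, and adds neither rows nor columns, so on its own it may not cover every requirement in the question.

```python
import io
import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

# Assumption: the empty noteId field for id5 stands for the NaN shown in df2.
csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

# Both frames share the same index so that update can align them.
df1 = pd.read_csv(io.StringIO(csv1), index_col='id')
df2 = pd.read_csv(io.StringIO(csv2), index_col='id')

# In place: non-NaN values from df2 overwrite df1 where index and column match.
df1.update(df2)

print(df1)
```

Here the text of id5 is replaced, but its noteId keeps the old value because the df2 value is NaN, and neither the id22 row nor the other column is added.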