I have two CSV files:
1.csv
id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4
2.csv
id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
I load them like this:
>>> df1 = pd.read_csv('1.csv', encoding='utf-8').set_index('id')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8').set_index('id')
>>>
>>> print df1
       noteId                   text
id
id2  idNote19  This is my old text 2
id5  idNote13  This is my old text 5
id1  idNote12  This is my old text 1
id3  idNote10  This is my old text 3
id4  idNote11  This is my old text 4
>>> print df2
        noteId            text  other
id
id3   idNote10      new text 3    On1
id2   idNote19   My new text 2   Pre8
id5        NaN   My new text 2    Hl0
id22  idNote22  My new text 22     M1
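I need to merge these two DataFrames (overwriting values in df1 with the values from df2, even where the df2 value is empty, and adding the extra columns and the rows that do not exist in df1):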
        noteId                   text  other
id
id2   idNote19          My new text 2   Pre8
id5        NaN          My new text 2    Hl0
id1   idNote12  This is my old text 1    NaN
id3   idNote10             new text 3    On1
id4   idNote11  This is my old text 4    NaN
id22  idNote22         My new text 22     M1
I tried using merge and got something like this:
>>> df1 = pd.read_csv('1.csv', encoding='utf-8')
>>> df2 = pd.read_csv('2.csv', encoding='utf-8')
>>>
>>> print df1
    id    noteId                   text
0  id2  idNote19  This is my old text 2
1  id5  idNote13  This is my old text 5
2  id1  idNote12  This is my old text 1
3  id3  idNote10  This is my old text 3
4  id4  idNote11  This is my old text 4
>>> print df2
    id    noteId           text
0  id3  idNote10     new text 3
1  id2  idNote19  My new text 2
>>>
>>> print merge(df1, df2, how='left', on=['id'])
    id  noteId_x                 text_x  noteId_y         text_y
0  id2  idNote19  This is my old text 2  idNote19  My new text 2
1  id5  idNote13  This is my old text 5       NaN            NaN
2  id1  idNote12  This is my old text 1       NaN            NaN
3  id3  idNote10  This is my old text 3  idNote10     new text 3
4  id4  idNote11  This is my old text 4       NaN            NaN
>>>
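But this is not what I need. I don't know whether I am on the right track and should combine the suffixed columns, or whether there is a better way to do this.
Thanks!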
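UPDATE: Added the overwriting of df1 values with values that are empty in df2, the extra columns in df2 that should be present in df1 after the "merge", and the rows that should be appended to df1.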
-
Based on @U2EF1's comment (thanks!), I found a solution:
df1.fillna(value='None', inplace=True)
df2.fillna(value='None', inplace=True)
concat([df1, df2]).groupby('id').last().fillna(value='None')
In my case it is really important to define a default "empty" value, which is why the fillna calls are there.
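Why the sentinel matters can be sketched as follows: `groupby(...).last()` returns the last non-null value per column, so a NaN coming from df2 would otherwise be silently replaced by the older df1 value. This is only a minimal sketch, assuming Python 3 and in-memory copies of the two files; the names `csv1`, `csv2`, `plain` and `filled` are illustrative and not from the original post.

```python
import io

import pandas as pd

# In-memory stand-ins for 1.csv and 2.csv (df2 includes the rows from the update).
csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(io.StringIO(csv1), index_col="id")
df2 = pd.read_csv(io.StringIO(csv2), index_col="id")

# Without a sentinel: last() skips the NaN in df2 and keeps df1's noteId for id5.
plain = pd.concat([df1, df2]).groupby(level="id").last()

# With a sentinel: the empty df2 value becomes the string 'None' and wins.
filled = (
    pd.concat([df1.fillna("None"), df2.fillna("None")])
    .groupby(level="id")
    .last()
    .fillna("None")  # cells that never had a value, e.g. 'other' for id1
)

print(plain.loc["id5", "noteId"])
print(filled.loc["id5", "noteId"])
```

The first print shows the old df1 value surviving, the second shows the sentinel taking its place.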
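Answer 0 (score: 3)
EDIT: updated to add rows and columns and to update data, merging efficiently on the index.
Code to update df1 with the data from df2...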
import StringIO  # Python 2 module; on Python 3 use io.StringIO instead
import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""
csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""
# Set the index directly in read_csv rather than calling set_index afterwards
df1 = pd.read_csv(StringIO.StringIO(csv1), sep=",", index_col='id')
df2 = pd.read_csv(StringIO.StringIO(csv2), sep=",", index_col='id')
# Joined on the indexes for speed
df = pd.merge(df2, df1, how='outer', on=df1.columns.tolist(), left_index=True, right_index=True)
Output
>>> print df
        noteId                   text  other
id
id1   idNote12  This is my old text 1    NaN
id2   idNote19          My new text 2   Pre8
id22  idNote22         My new text 22     M1
id3   idNote10             new text 3    On1
id4   idNote11  This is my old text 4    NaN
id5        NaN          My new text 2    Hl0
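On recent pandas releases I believe `pd.merge` rejects `on` combined with `left_index`/`right_index`, so the call above may need replacing there. Below is a minimal sketch of one alternative that produces the same table, assuming Python 3. Note that this is a swapped-in technique based on `pd.concat` and `Index.duplicated`, not the merge call from this answer, and the names `csv1`, `csv2`, `combined` and `result` are only illustrative.

```python
import io

import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(io.StringIO(csv1), index_col="id")
df2 = pd.read_csv(io.StringIO(csv2), index_col="id")

# Stack df1 first and df2 second; columns are unioned, so 'other' is added.
combined = pd.concat([df1, df2])

# For ids present in both frames keep only the last occurrence, i.e. the whole
# df2 row replaces the df1 row, including cells that are NaN in df2.
result = combined[~combined.index.duplicated(keep="last")].sort_index()

print(result)
```

Because whole rows are replaced rather than individual cells, the empty noteId for id5 is carried over from df2 without needing a placeholder value.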
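Rationale...
pd.merge has several multi-purpose parameters. The on key is really only used to join the two DataFrames when left_index and right_index are set to False, which is the default. Otherwise, it just combines the columns with the same name that are listed in on, in this case the two columns 'text' and 'noteId'. (I passed df1.columns.tolist() as the argument to make it more generic: any column in df2 with the same name overwrites the data from df1 instead of being labelled text_y.)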
With the more generic on key (df1.columns.tolist()) you can actually loop through a bunch of CSVs to update the data in a DataFrame.
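Looping over several CSVs could look roughly like the sketch below. It assumes Python 3 and uses in-memory strings in place of real files. The helper `apply_update`, the list `sources`, and the third CSV are hypothetical and made up for illustration. The helper replaces rows via `pd.concat`, which is a stand-in for the merge call discussed here, not the same code.

```python
import functools
import io

import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

# Hypothetical third update, invented only to show a second pass of the loop.
csv3 = """id,noteId,text,other
id1,idNote12,Newest text 1,Zz9"""


def apply_update(base, update):
    """Return base with rows from update replacing rows that share an index value."""
    combined = pd.concat([base, update])
    return combined[~combined.index.duplicated(keep="last")]


# The first source is the base table, the rest are applied in order.
sources = [csv1, csv2, csv3]
frames = [pd.read_csv(io.StringIO(text), index_col="id") for text in sources]
result = functools.reduce(apply_update, frames)

print(result.sort_index())
```

With real files, the list comprehension would read paths instead of `io.StringIO` objects; later files win over earlier ones.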
In [25]: %timeit pd.merge(df2,df1,how='outer',on=df1.columns.tolist(),left_index=True,right_index=True)
1000 loops, best of 3: 1.11 ms per loop
The accepted solution:
In [30]: %timeit pd.concat([df1, df2]).groupby('noteId').last().fillna(value='None')
100 loops, best of 3: 3.29 ms per loop
Answer 1 (score: 2)
Usually you can solve this with proper indexing:
df1.set_index(['id', 'noteId'], inplace=True)
df1.update(df2)
(If you don't want that index afterwards, just call df1.reset_index(inplace=True).)
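To illustrate what `DataFrame.update` does and does not do with the question's data, here is a minimal sketch assuming Python 3 and an index on `id` alone (a simplification of the two-level index above; `csv1` and `csv2` are illustrative names). `update` aligns on the index, overwrites in place using only the non-NA values of the other frame, and leaves out rows and columns that exist only in df2.

```python
import io

import pandas as pd

csv1 = """id,noteId,text
id2,idNote19,This is my old text 2
id5,idNote13,This is my old text 5
id1,idNote12,This is my old text 1
id3,idNote10,This is my old text 3
id4,idNote11,This is my old text 4"""

csv2 = """id,noteId,text,other
id3,idNote10,new text 3,On1
id2,idNote19,My new text 2,Pre8
id5,NaN,My new text 2,Hl0
id22,idNote22,My new text 22,M1"""

df1 = pd.read_csv(io.StringIO(csv1), index_col="id")
df2 = pd.read_csv(io.StringIO(csv2), index_col="id")

# In-place update: matching cells in df1 take the non-NA values from df2.
df1.update(df2)

print(df1)
```

So this covers the "update existing values" part of the question; the NaN noteId for id5, the new row id22 and the new column other would still need a separate step.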