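I want to remove rows from one DataFrame based on the data in another DataFrame. I found a way to do it (see below), but I would like to know whether there is a more efficient approach. Here is the code I want to improve: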
# -*- coding: utf-8 -*-
import pandas as pd
# df1 is the dataframe I want to remove data from
d1 = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.], 'three': [5., 6., 7., 8.]}
df1 = pd.DataFrame(d1, columns=['one', 'two', 'three'])  # Keeping the order of the columns as defined
print('df1\n', df1)
# I want to remove all the rows from df1 that are also in df2
d2 = {'one': [2., 4.], 'two': [3., 1.], 'three': [6., 8.]}
df2 = pd.DataFrame(d2, columns=['one', 'two', 'three'])  # Keeping the order of the columns as defined
print('df2\n', df2)
# Create some keys to help identify rows to be dropped from df1
df1['key'] = df1['one'].astype(str) + '-' + df1['two'].astype(str) + '-' + df1['three'].astype(str)
print('df1 with key\n', df1)
df2['key'] = df2['one'].astype(str) + '-' + df2['two'].astype(str) + '-' + df2['three'].astype(str)
print('df2 with key\n', df2)
# df3 is the output I want to get: it should have the same data as df1, but without the data that is in df2
df3 = df1.copy()
# List of (positional) rows to remove from df1
rowsToDrop = []
# Building the list of rows to drop
for i in range(len(df1)):
    if df1['key'].iloc[i] in df2['key'].values:
        rowsToDrop.append(i)
# Dropping rows from df1 that are also in df2
for j in reversed(rowsToDrop):
    df3 = df3.drop(df3.index[j])
df3 = df3.drop(['key'], axis=1)
# Voilà!
print('df3\n', df3)
Thanks for your help.
Answer 0 (score: 1)
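This uses the DataFrame df1 and the dict d2. Note that isin() with a dict tests each column on its own, so a row is dropped as soon as any one of its values appears in the matching list of d2:
df3 = df1[~df1.isin(d2)].dropna()
You can also pass a DataFrame to isin(), but I don't think it will give you the result you are looking for, because I believe it also takes the index into account.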
http://pandas.pydata.org/pandas-docs/dev/generated/pandas.DataFrame.isin.html
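To illustrate the point about the index, here is a minimal sketch using the data from the question. It first shows that `df1.isin(df2)` flags nothing for this data, because values are compared only where index and column labels line up. It then shows one possible row-wise alternative, a left merge with `indicator=True`. The merge-based approach is not from the answer above; it is an assumption about what the asker wants (drop a row only when all three columns match a row of df2).

```python
import pandas as pd

df1 = pd.DataFrame({'one': [1., 2., 3., 4.],
                    'two': [4., 3., 2., 1.],
                    'three': [5., 6., 7., 8.]})
df2 = pd.DataFrame({'one': [2., 4.],
                    'two': [3., 1.],
                    'three': [6., 8.]})

# DataFrame.isin(DataFrame) aligns on index and column labels, so the rows
# of df2 (index 0 and 1) are compared only with rows 0 and 1 of df1.
aligned = df1.isin(df2)
any_match = bool(aligned.values.any())  # nothing is flagged for this data

# Anti-join sketch: left merge on all shared columns, then keep the rows
# that were found only in df1. drop_duplicates() keeps the merged frame
# the same length as df1, so the mask can be applied positionally.
merged = df1.merge(df2.drop_duplicates(), how='left', indicator=True)
df3 = df1[(merged['_merge'] == 'left_only').values]
print(df3)
```

With no `on=` argument, `merge` joins on every column the two frames have in common, which here means `one`, `two` and `three`.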
Answer 1 (score: 0)
You are looking for a syntax to select rows rather than a way to join dataframes.
A true left join looks like this:
import numpy as np
import pandas as pd
d1 = {'one' : [1., 2., 3., 4.], 'two' : [4., 3., 2., 1.], 'three' : [5.,6.,7.,8.] }
df1 = pd.DataFrame(d1)
df1['key'] = df1['one'].astype(str)+'-'+df1['two'].astype(str)+'-'+df1['three'].astype(str)
df1.set_index('key', inplace=True)
d2 = {'one' : [2., 4.], 'two' : [3., 1], 'three' : [6.,8.] }
df2 = pd.DataFrame(d2)
df2['key'] = df2['one'].astype(str)+'-'+df2['two'].astype(str)+'-'+df2['three'].astype(str)
df2.set_index('key', inplace=True)
df1.join(df2, how='left', lsuffix='_df1', rsuffix='_df2')
             one_df1  three_df1  two_df1  one_df2  three_df2  two_df2
key
1.0-4.0-5.0        1          5        4      NaN        NaN      NaN
2.0-3.0-6.0        2          6        3        2          6        3
3.0-2.0-7.0        3          7        2      NaN        NaN      NaN
4.0-1.0-8.0        4          8        1        4          8        1
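In the left join output above, the rows of df1 that have no counterpart in df2 are the ones with NaN in the `_df2` columns. A minimal sketch of turning that into the filter the question asks for might look like this. The helper `with_key` is a hypothetical name introduced here for illustration, and the sketch assumes that the keys in df2 are unique and that df2 itself contains no NaN values.

```python
import pandas as pd

def with_key(df):
    # Hypothetical helper: builds the same string key as above and uses it as the index.
    out = df.copy()
    out['key'] = (out['one'].astype(str) + '-'
                  + out['two'].astype(str) + '-'
                  + out['three'].astype(str))
    return out.set_index('key')

df1 = with_key(pd.DataFrame({'one': [1., 2., 3., 4.],
                             'two': [4., 3., 2., 1.],
                             'three': [5., 6., 7., 8.]}))
df2 = with_key(pd.DataFrame({'one': [2., 4.],
                             'two': [3., 1.],
                             'three': [6., 8.]}))

joined = df1.join(df2, how='left', lsuffix='_df1', rsuffix='_df2')

# Rows of df1 without a partner in df2 have NaN in every *_df2 column.
unmatched = joined['one_df2'].isna()

# With unique keys in df2 the left join has one row per row of df1, in the
# same order, so the mask can be applied positionally.
df3 = df1[unmatched.values].reset_index(drop=True)
print(df3)
```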
Doing a right join:
df1.join(df2, how='right', lsuffix='_df1', rsuffix='_df2')
produces this:
             one_df1  three_df1  two_df1  one_df2  three_df2  two_df2
key
2.0-3.0-6.0        2          6        3        2          6        3
4.0-1.0-8.0        4          8        1        4          8        1
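The right join returns the rows the two frames share, and the question asks for the complement of that set. Once both frames are indexed by the key, one way to get it without any join is a membership test on the index. This is a sketch under the same assumptions as the code above (string keys built from the three columns); `with_key` is again a hypothetical helper, not part of pandas.

```python
import pandas as pd

def with_key(df):
    # Hypothetical helper: same string key as in the answer, used as the index.
    out = df.copy()
    out['key'] = (out['one'].astype(str) + '-'
                  + out['two'].astype(str) + '-'
                  + out['three'].astype(str))
    return out.set_index('key')

df1 = with_key(pd.DataFrame({'one': [1., 2., 3., 4.],
                             'two': [4., 3., 2., 1.],
                             'three': [5., 6., 7., 8.]}))
df2 = with_key(pd.DataFrame({'one': [2., 4.],
                             'two': [3., 1.],
                             'three': [6., 8.]}))

# Keep the rows of df1 whose key does not occur in df2's index.
df3 = df1[~df1.index.isin(df2.index)]
print(df3)
```

This replaces both loops from the question with a single vectorized test, so it should scale better than dropping rows one at a time.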