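I have two DataFrames, df1 and df2, each with a single column. I want to compare them row by row. Where the values match, I want a new DataFrame containing all the matching values. Where they do not match, I want a DataFrame with two columns holding the mismatched value from each source.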
equal_index = []
equal_df1 = []
not_equal_index = []
not_equal_df1 = []
not_equal_df2 = []
for x in df1.index.tolist():
    if df1['column'].ix[x] == df2['column'].ix[x]:
        equal_index.append(x)
        equal_df1.append(df1['column'].ix[x])
    else:
        not_equal_index.append(x)
        not_equal_df1.append(df1['column'].ix[x])
        not_equal_df2.append(df2['column'].ix[x])
DF_equal = pd.DataFrame({"column": equal_df1}, index=equal_index)
DF_not_equal = pd.DataFrame({'column1': not_equal_df1, 'column2': not_equal_df2}, index=not_equal_index)
It seems like this should work, but I get the error: ValueError: The truth value of a Series is ambiguous.
If I try something basic like
for x in df1.index.tolist():
    print(df1['column'].ix[x] == df2['column'].ix[x])
I get True or False printed once for each x.
If I use is instead of ==, all of the values end up in DF_not_equal.
Answer 0 (score: 1)
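import pandas as pd

df = pd.DataFrame({0: ['test', 'test2', 'test3'], 1: ['foo', 'foo2', 'foo3'], 2: ['bar', 'bar2', 'bar3']})
df2 = pd.DataFrame({0: ['test', 'test2', 'test4'], 1: ['foo', 'foo2', 'foo3'], 2: ['bar', 'bar2', 'bar3']})
df_equal = pd.DataFrame()
df_not_equal = pd.DataFrame()
for i in range(df.shape[0]):
    # Compare the two rows element by element; the row counts as equal only if every value matches
    if all(df.loc[i].values == df2.loc[i].values):
        # DataFrame.append was removed in pandas 2.0; pd.concat is the replacement
        df_equal = pd.concat([df_equal, df.loc[[i]]], ignore_index=True)
    else:
        # Note: these assignments overwrite A and B on every mismatching row
        df_not_equal['A'] = df.loc[i]
        df_not_equal['B'] = df2.loc[i]
print(df_equal)
print(df_not_equal)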
This will give you:
       0     1     2
0   test   foo   bar
1  test2  foo2  bar2
       A      B
0  test3  test4
1   foo3   foo3
2   bar3   bar3
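Looking at the error you posted, the problem is in the line if df1['column'].ix[x] == df2['column'].ix[x]:. pandas does not allow a Series of booleans to be used as a single condition in an if statement, which is why the error says the truth value is "ambiguous".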
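To illustrate where the message comes from, here is a minimal sketch. The Series names left and right are made up for this example and are not from the question. Comparing two Series with == produces a boolean Series, and using that Series directly in an if raises the ValueError.

```python
import pandas as pd

# Hypothetical example data, not taken from the question
left = pd.Series(["a", "b", "c"])
right = pd.Series(["a", "b", "x"])

# Element-wise comparison returns a boolean Series, not a single bool
mask = left == right

message = ""
try:
    if mask:  # pandas cannot decide whether this means "all" or "any"
        pass
except ValueError as err:
    message = str(err)

# Reducing the Series explicitly removes the ambiguity
all_equal = bool(mask.all())  # True only if every element matches
any_equal = bool(mask.any())  # True if at least one element matches
```

With this data, all_equal is False and any_equal is True, because only the last pair differs.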
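To fix this, compare the actual values of the rows and reduce the result to a single boolean with all or any (I hope these are self-explanatory).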
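For the single-column case in the question, the loop can probably be avoided altogether. The following is only a sketch, assuming that df1 and df2 share the same index and that the column is literally named column as in the question's code. The sample values are borrowed from the answer's data.

```python
import pandas as pd

# Sample data; assumes both frames have identical indexes
df1 = pd.DataFrame({"column": ["test", "test2", "test3"]})
df2 = pd.DataFrame({"column": ["test", "test2", "test4"]})

# Boolean Series: True where the two columns hold the same value
mask = df1["column"] == df2["column"]

# Rows where the values match
df_equal = df1.loc[mask, ["column"]]

# Rows where the values differ, with one column per source frame
df_not_equal = pd.DataFrame({
    "column1": df1.loc[~mask, "column"],
    "column2": df2.loc[~mask, "column"],
})

print(df_equal)
print(df_not_equal)
```

Here the mask is used for boolean indexing rather than inside an if, so no all or any reduction is needed. If the two frames do not share the same index, the == comparison itself raises an error, so the indexes would need to be aligned first.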