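In the code below, I want to identify and report the values in Col1 that appear in Col2, the values in Col2 that appear in Col1, and the values that appear more than once overall.
In the example below, the values AAPL and GOOG appear in both Col1 and Col2. I expect these to be identified and reported in the next two columns, and a further column should then flag whether any of the Col1 or Col2 values in the row is a duplicate.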
{{1}}
Answer 0 (score: 1)
Here is a solution that works with your code. It just uses a few loops with iterrows(), nothing more.
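# Start with every flag set to False
df['Col3'] = False
df['Col4'] = False
df['Col5'] = False
# Col3: the row's Col1 value appears somewhere in Col2
for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in df.Col2.values:
        df.loc[i, 'Col3'] = True
# Col4: the row's Col2 value appears somewhere in Col1
for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in df.Col1.values:
        df.loc[i, 'Col4'] = True
# Col5: either of the two flags is set
for i, row in df.iterrows():
    if df.loc[i, 'Col3'] or df.loc[i, 'Col4']:
        df.loc[i, 'Col5'] = True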
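Answer 1 (score: 1)
Use numpy where
Check whether each column's values appear in the other column, then apply a boolean OR to those two results to determine whether the row contains a duplicate.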
df['Col1inCol2']=np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2inCol1']=np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe']= df.Col1inCol2 | df.Col2inCol1
   Col1  Col2  Col1inCol2  Col2inCol1   Dupe
0  AAPL  GOOG        True        True   True
1   NaN   IBM       False       False  False
2  GOOG  MSFT        True       False   True
3   MMM   NaN       False       False  False
4   NaN  GOOG       False        True   True
5  INTC  AAPL       False        True   True
6    FB    VZ       False       False  False
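The `~df.Col1.isnull()` term in the expressions above matters because pandas `isin` treats a NaN as matching another NaN. The following is a minimal sketch, assuming the same sample data used elsewhere in this thread, that compares the result with and without that guard; the names `unguarded` and `guarded` are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data assumed to match the frame used in this thread
df = pd.DataFrame({
    "Col1": ["AAPL", np.nan, "GOOG", "MMM", np.nan, "INTC", "FB"],
    "Col2": ["GOOG", "IBM", "MSFT", np.nan, "GOOG", "AAPL", "VZ"],
})

# Without a null guard, the NaN entries in Col1 match the NaN in Col2
unguarded = df["Col1"].isin(df["Col2"])

# With the guard, rows whose Col1 is missing are forced to False
guarded = df["Col1"].isin(df["Col2"]) & df["Col1"].notna()

print(pd.DataFrame({"Col1": df["Col1"],
                    "unguarded": unguarded,
                    "guarded": guarded}))
```

Since the guarded expression is already boolean, wrapping it in `np.where(..., True, False)` should not change the values; it is kept in the answer as written.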
Answer 2 (score: 0)
Here is the final script:
##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 04-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################
import pandas as pd
import numpy as np
data={'Col1':['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],'Col2':['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("Initial DataFrame\n")
print (df)
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
df['Col1_val_exists_in_Col2'] = False
df['Col2_val_exists_in_Col1'] = False
df['Dup_in_Frame'] = False
# Flag rows whose Col1 value appears somewhere in Col2
for i, row in df.iterrows():
    if df.loc[i, 'Col1'] in df.Col2.values:
        df.loc[i, 'Col1_val_exists_in_Col2'] = True
# Flag rows whose Col2 value appears somewhere in Col1
for i, row in df.iterrows():
    if df.loc[i, 'Col2'] in df.Col1.values:
        df.loc[i, 'Col2_val_exists_in_Col1'] = True
# A row is a duplicate if either flag is set
for i, row in df.iterrows():
    if df.loc[i, 'Col1_val_exists_in_Col2'] or df.loc[i, 'Col2_val_exists_in_Col1']:
        df.loc[i, 'Dup_in_Frame'] = True
print ("Final DataFrame\n")
print (df)
Answer 3 (score: 0)
Another way to accomplish the task is shown below, thanks to "skrubber":
##############################################################################
# Code to identify and report duplicates across columns
# np.nan values are handled
# Date: 05-JUL-2018
# Posted by: Salil V Gangal
# Forum: Stack OverFlow
##############################################################################
import pandas as pd
import numpy as np
data={
    'Col1':
        ['AAPL', np.nan, 'GOOG', 'MMM', np.nan, 'INTC', 'FB'],
    'Col2':
        ['GOOG', 'IBM', 'MSFT', np.nan, 'GOOG', 'AAPL', 'VZ']
}
df=pd.DataFrame(data,columns=['Col1','Col2'])
print ("\n\nInitial DataFrame\n")
print (df)
pd.set_option("display.max_rows",999)
pd.set_option("display.max_columns",999)
df['Col1_val_exists_in_Col2'] = np.where(df.Col1.isin(df.Col2) & ~df.Col1.isnull(), True, False)
df['Col2_val_exists_in_Col1'] = np.where(df.Col2.isin(df.Col1) & ~df.Col2.isnull(), True, False)
df['Dupe'] = df.Col1_val_exists_in_Col2 | df.Col2_val_exists_in_Col1
print ("\n\nFinal DataFrame\n")
print (df)
Initial DataFrame
   Col1  Col2
0  AAPL  GOOG
1   NaN   IBM
2  GOOG  MSFT
3   MMM   NaN
4   NaN  GOOG
5  INTC  AAPL
6    FB    VZ
Final DataFrame
   Col1  Col2  Col1_val_exists_in_Col2  Col2_val_exists_in_Col1   Dupe
0  AAPL  GOOG                     True                     True   True
1   NaN   IBM                    False                    False  False
2  GOOG  MSFT                     True                    False   True
3   MMM   NaN                    False                    False  False
4   NaN  GOOG                    False                     True   True
5  INTC  AAPL                    False                     True   True
6    FB    VZ                    False                    False  False
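The question also asks for values that occur more than once overall, which is slightly broader than a cross-column match: a value repeated only within one column would count too. One possible sketch, assuming the same sample data, pools both columns and counts occurrences; the names `all_values`, `repeated` and `Repeated_anywhere` are hypothetical and not part of the answers above.

```python
import numpy as np
import pandas as pd

# Sample data assumed to match the frame used in this thread
df = pd.DataFrame({
    "Col1": ["AAPL", np.nan, "GOOG", "MMM", np.nan, "INTC", "FB"],
    "Col2": ["GOOG", "IBM", "MSFT", np.nan, "GOOG", "AAPL", "VZ"],
})

# Pool both columns into one Series and drop missing values
all_values = pd.concat([df["Col1"], df["Col2"]]).dropna()

# Keep only the values seen more than once across the whole frame
counts = all_values.value_counts()
repeated = counts[counts > 1]

# Flag rows where either column holds one of those repeated values;
# no null guard is needed here because NaN was dropped before counting
df["Repeated_anywhere"] = (
    df["Col1"].isin(repeated.index) | df["Col2"].isin(repeated.index)
)

print(repeated)
print(df)
```

For this particular data the flag agrees with the Dupe column, because every repeated value happens to appear in both columns.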