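I want to compare column1 with column2 and get the unique values that make column2 differ from column1. In this case, the answer I should get is "Residence - Location", "-12", "NAN" and "NA" (blank). The comparison goes from the first column to the second.
Is it also possible to store the result in another column?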
Result
index  column1         column2               diff
1.     Admission Date  Residence - Location  Residence - Location
2.     Malnutrition    Malnutrition-12       -12
3.     TB              NAN                   NAN
4.     Anaemia         NA                    NA
The code can be in R or Python; I don't mind.
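import numpy as np
import pandas as pd

def FindDifference(Row):
    x = Row['column1']
    y = Row['column2']
    Difference = ""
    # Treat real missing values and the literal strings "nan"/"NA" as missing
    if pd.isnull(y) or y == "nan" or y == "NA":
        return np.nan
    # Character-by-character comparison: keep characters absent from the other string
    if len(x) <= len(y):
        for i in y:
            if i not in x:
                Difference += str(i)
    else:
        for i in x:
            if i not in y:
                Difference += str(i)
    return Difference

ReadDataT = Final_df[['column1', 'column2']].copy()
ReadDataT['diff'] = ReadDataT.apply(FindDifference, axis=1)
ReadDataT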
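The problem with this code is that it compares the two strings character by character and returns the characters that are not present in both columns. For the first row, for example, it gives 'RC-Lc' as the difference.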
Answer 0 (score: 3)
library(dplyr); library(stringr)
df %>% mutate(diff = str_remove(column2, column1))
  index  column1         column2               diff
1 1      Admission Date  Residence - Location  Residence - Location
2 2      Malnutrition    Malnutrition-12       -12
3 3      TB              NAN                   NAN
4 4      Anaemia         <NA>                  <NA>
Edit: without dplyr
df$diff = stringr::str_remove(df$column2, df$column1)
Answer 1 (score: 1)
For Python:
import numpy as np
df = df.fillna('')  # empty strings so str.replace works on every row
df['diff'] = df.apply(lambda x: x['column2'].replace(x['column1'], '').strip(), axis=1)
df = df.replace('', np.nan)  # restore NaN for cells that ended up empty
Output:
   column1         column2               diff
0  Admission Date  Residence - Location  Residence - Location
1  Malnutrition    Malnutrition-12       -12
2  TB              NaN                   NaN
3  Anaemia         NaN                   NaN
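If replacing NaN across the whole frame is undesirable, the missing-value check could live inside a small helper instead. This is only a minimal sketch: it assumes missing cells arrive as non-string values (such as float NaN) and that only the first occurrence should be removed, as str_remove does in R. The name strip_first and the sample rows are hypothetical.

```python
import math

def strip_first(needle, haystack):
    """Return haystack with the first literal occurrence of needle removed.

    Assumption: any non-string input (e.g. float NaN for a missing cell)
    is treated as missing and yields NaN.
    """
    if not isinstance(needle, str) or not isinstance(haystack, str):
        return math.nan
    # str.replace matches literally; the third argument limits it to one replacement
    return haystack.replace(needle, "", 1).strip()

# Hypothetical sample rows mirroring the question's data
rows = [
    ("Admission Date", "Residence - Location"),
    ("Malnutrition", "Malnutrition-12"),
    ("TB", math.nan),
]
diffs = [strip_first(a, b) for a, b in rows]
```

On a DataFrame the helper could be applied row-wise, for example df.apply(lambda r: strip_first(r['column1'], r['column2']), axis=1).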
Answer 2 (score: 0)
In base R, we can use sub together with mapply:
df$diff <- mapply(function(x, y) sub(x, "", y), df$column1, df$column2)
df
#  index  column1         column2               diff
#1 1      Admission Date  Residence - Location  Residence - Location
#2 2      Malnutrition    Malnutrition-12       -12
#3 3      TB              NAN                   NAN
#4 4      Anaemia         <NA>                  <NA>
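One caveat: sub and str_remove both interpret the pattern (here, column1) as a regular expression by default, so values containing characters such as ( or + may not match literally. In R, passing fixed = TRUE to sub, or wrapping the pattern in stringr::fixed(), should avoid this. The same pitfall can be sketched in Python with the standard re module; the sample strings below are made up for illustration and do not come from the question's data.

```python
import re

pattern = "Anaemia (Hb)"        # hypothetical column1 value with regex metacharacters
text = "Anaemia (Hb) - severe"  # hypothetical column2 value

# Treated as a regex, the parentheses form a group, so the literal "(Hb)" never matches
as_regex = re.sub(pattern, "", text, count=1)

# Escaping the pattern makes every character match literally
as_literal = re.sub(re.escape(pattern), "", text, count=1)
```

Here as_regex leaves the text unchanged, while as_literal removes the leading "Anaemia (Hb)".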