Finding the text that differs between two compared columns

Time: 2019-07-23 11:45:42

Tags: python r dataframe

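I want to compare column1 with column2 and extract the part of column2 that differs from column1. In this case, the answers I expect are "Residence - Location", "-12", "NAN" and "NA" (blank). The comparison is of the first column against the second.

Is it also possible to store the result in another column?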

Result
index   column1         column2                     diff
1.      Admission Date  Residence - Location        Residence - Location
2.      Malnutrition    Malnutrition-12             -12
3.      TB              NAN                         NAN
4.      Anaemia         NA                          NA

The code can be in R or Python. I don't mind which.

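import numpy as np
import pandas as pd

def FindDifference(Row):
    x = Row['column1']
    y = Row['column2']

    Difference = ""
    # Missing or placeholder values in column2 yield NaN
    if pd.isnull(y) or y == "nan" or y == "NA":
        return np.nan
    # Collect the characters of the longer string that do not appear in the shorter one
    if len(x) <= len(y):
        for i in y:
            if i not in x:
                Difference += str(i)
    else:
        for i in x:
            if i not in y:
                Difference += str(i)
    return Difference

# .copy() avoids a SettingWithCopyWarning when the new column is added
ReadDataT = Final_df[['column1', 'column2']].copy()
ReadDataT['diff'] = ReadDataT.apply(FindDifference, axis=1)
ReadDataT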

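The problem with this code is that it compares the two values character by character and returns every character that is not present in both columns. For example, the first row gives 'Rc-Lc' as the difference.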
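To make the distinction concrete, here is a minimal sketch contrasting the character-wise test used in the question with removing column1 from column2 as a whole substring. The helper names `char_diff` and `substring_diff` are hypothetical and only serve the illustration.

```python
def char_diff(x: str, y: str) -> str:
    """Character-wise comparison: keep the characters of y that never occur in x."""
    return "".join(ch for ch in y if ch not in x)


def substring_diff(x: str, y: str) -> str:
    """Whole-string comparison: remove the first occurrence of x from y."""
    return y.replace(x, "", 1).strip()


# Row 1: the values share many letters, so the character-wise result is scrambled
print(char_diff("Admission Date", "Residence - Location"))       # Rc-Lc
print(substring_diff("Admission Date", "Residence - Location"))  # Residence - Location

# Row 2: both approaches happen to agree here
print(char_diff("Malnutrition", "Malnutrition-12"))       # -12
print(substring_diff("Malnutrition", "Malnutrition-12"))  # -12
```

The character-wise version only looks right when the extra text shares no characters with column1, which is why row 2 works and row 1 does not.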

3 Answers:

Answer 0 (score: 3)

library(dplyr); library(stringr)
df %>% mutate(diff = str_remove(column2, column1))

  index        column1              column2                 diff
1     1 Admission Date Residence - Location Residence - Location
2     2   Malnutrition      Malnutrition-12                  -12
3     3             TB                  NAN                  NAN
4     4        Anaemia                 <NA>                 <NA>

Edit: without dplyr

df$diff = stringr::str_remove(df$column2, df$column1)

Answer 1 (score: 1)

For Python:

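import numpy as np
# Temporarily replace NaN with '' so that str.replace works on every row
df = df.replace(np.nan, '', regex = True)
df['diff'] = df.apply(lambda x: x['column2'].replace(x['column1'], '').strip(), axis = 1)
# Turn empty strings back into NaN
df = df.replace('', np.nan, regex = True)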

Output:

          column1               column2                  diff
0  Admission Date  Residence - Location  Residence - Location
1    Malnutrition       Malnutrition-12                   -12
2              TB                   NaN                   NaN
3         Anaemia                   NaN                   NaN
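As a variation on the same idea, the following sketch handles missing values per row instead of replacing NaN across the whole frame, so other columns are left untouched. The helper name `column_diff` and the sample frame are assumptions made for illustration. Unlike the snippet above, it removes only the first occurrence of column1, which mirrors the R answers.

```python
import numpy as np
import pandas as pd


def column_diff(a, b):
    """Return b with the first literal occurrence of a removed.

    Gives NaN when b is missing or when nothing is left after removal.
    """
    if pd.isna(b):
        return np.nan
    if pd.isna(a):
        return b
    rest = str(b).replace(str(a), "", 1).strip()
    return rest if rest else np.nan


# Sample data rebuilt from the question (assumed: "NA" was read in as NaN)
df = pd.DataFrame({
    "column1": ["Admission Date", "Malnutrition", "TB", "Anaemia"],
    "column2": ["Residence - Location", "Malnutrition-12", "NAN", np.nan],
})

# Pair the two columns row by row and store the result in a new column
df["diff"] = [column_diff(a, b) for a, b in zip(df["column1"], df["column2"])]
print(df)
```

Iterating over `zip` of the two columns avoids `apply(axis=1)`, which builds a Series for every row and tends to be slower on large frames.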

Answer 2 (score: 0)

In base R, we can use sub together with mapply

df$diff <- mapply(function(x, y) sub(x, "", y), df$column1, df$column2)

df
#  index        column1              column2                 diff
#1     1 Admission Date Residence - Location Residence - Location
#2     2   Malnutrition      Malnutrition-12                  -12
#3     3             TB                  NAN                  NAN
#4     4        Anaemia                 <NA>                 <NA>
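One caveat worth keeping in mind: `sub` (like stringr's `str_remove`) treats its pattern as a regular expression by default, so a column1 value containing characters such as parentheses or dots may not match literally. The sketch below illustrates the same pitfall in Python with the standard `re` module; the sample strings are hypothetical and do not come from the question's data.

```python
import re

pattern = "Cost (USD)"      # hypothetical column1 value with regex metacharacters
text = "Cost (USD) 2019"    # hypothetical column2 value

# Used as a raw regex, the parentheses form a group, so the pattern
# looks for "Cost USD" and finds nothing to remove.
as_regex = re.sub(pattern, "", text, count=1)

# Escaping the pattern makes every character match literally.
as_literal = re.sub(re.escape(pattern), "", text, count=1)

print(repr(as_regex))    # 'Cost (USD) 2019'
print(repr(as_literal))  # ' 2019'
```

In R, the corresponding switches are `fixed = TRUE` for `sub` and wrapping the pattern in `fixed()` for stringr.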