How to remove duplicate values within a single column, row by row, in a multi-column data frame in R

Asked: 2017-10-04 15:09:37

Tags: r machine-learning data-science

  

Sample data

           sessionid             qf      Office
                12                3       LON1,LON2,LON1,SEA2,SEA3,SEA3,SEA3
                12                4       DEL2,DEL1,LON1,DEL1
                13                5       MAn1,LON1,DEL1,LON1

Here I want to remove the duplicate values in the "Office" column, within each row.

  

Expected output

            sessionid             qf      Office
                12                3       LON1,LON2,SEA2,SEA3
                12                4       DEL2,DEL1,LON1
                13                5       MAn1,LON1,DEL1

2 answers:

Answer 0: (score: 2)

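We can use the tidyverse. Split 'Office' on the delimiter and expand it into 'long' format, take the distinct rows, group by 'sessionid' and 'qf', and then paste the contents of 'Office' back together.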

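library(tidyverse)
separate_rows(df1, Office) %>%          # split 'Office' into one row per value
  distinct() %>%                        # drop repeated rows
  group_by(sessionid, qf) %>%
  summarise(Office = toString(Office))  # rejoin the values, separated by ", "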
# A tibble: 3 x 3
# Groups:   sessionid [?]
#  sessionid    qf                 Office
#      <int> <int>                  <chr>
#1        12     3 LON1, LON2, SEA2, SEA3
#2        12     4       DEL2, DEL1, LON1
#3        13     5       MAn1, LON1, DEL1
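
Note that toString() joins the values with ", ", whereas the expected output in the question uses a bare comma. The following is a minimal, self-contained sketch of the same split / distinct / regroup idea that joins with "," instead. It assumes dplyr 1.0.0 or later (for the `.groups` argument) and rebuilds the question's sample data under the name `df1`; the `result` variable is only illustrative.

```r
library(dplyr)
library(tidyr)

# Sample data rebuilt from the question
df1 <- data.frame(
  sessionid = c(12L, 12L, 13L),
  qf        = c(3L, 4L, 5L),
  Office    = c("LON1,LON2,LON1,SEA2,SEA3,SEA3,SEA3",
                "DEL2,DEL1,LON1,DEL1",
                "MAn1,LON1,DEL1,LON1"),
  stringsAsFactors = FALSE
)

result <- df1 %>%
  separate_rows(Office, sep = ",") %>%   # one row per office code
  distinct() %>%                         # keep the first occurrence of each row
  group_by(sessionid, qf) %>%
  summarise(Office = paste(Office, collapse = ","), .groups = "drop")

print(result)
```

One caveat with this approach: rows are regrouped by 'sessionid' and 'qf', so if two input rows shared the same pair of values, their offices would be merged into a single output row.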

Answer 1: (score: 2)

Here is a base R approach that works the way you expect: split Office on commas, remove the duplicates, then paste the pieces back together.

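# as.character() guards against Office having been read in as a factor,
# which strsplit() does not accept
df$Office <- sapply(strsplit(as.character(df$Office), ","),
                    function(x) {
                      # keep the first occurrence of each value, then rejoin
                      paste(unique(x), collapse = ",")
                    })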

Or, using the %>% pipe:

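library(magrittr)  # provides %>%
df$Office <- df$Office %>%
  as.character() %>%              # in case Office is a factor
  strsplit(",") %>%
  lapply(unique) %>%              # drop repeats within each row
  sapply(paste, collapse = ",")   # rejoin into one string per row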
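
For reuse, the split / unique / paste steps can be wrapped in a small function. This is only a sketch: `dedupe_delimited` is a hypothetical helper name, not part of base R, and the `trimws()` call is an added assumption that stray spaces around values should be ignored. It uses `vapply()` so that the result is always a character vector with one string per row.

```r
# Hypothetical helper: remove repeated entries from each delimited string
dedupe_delimited <- function(x, sep = ",") {
  # fixed = TRUE treats sep as a literal string rather than a regex
  parts <- strsplit(as.character(x), sep, fixed = TRUE)
  vapply(parts,
         function(p) paste(unique(trimws(p)), collapse = sep),
         character(1))
}

# Sample data rebuilt from the question
df <- data.frame(
  sessionid = c(12L, 12L, 13L),
  qf        = c(3L, 4L, 5L),
  Office    = c("LON1,LON2,LON1,SEA2,SEA3,SEA3,SEA3",
                "DEL2,DEL1,LON1,DEL1",
                "MAn1,LON1,DEL1,LON1"),
  stringsAsFactors = FALSE
)

df$Office <- dedupe_delimited(df$Office)
print(df)
```

Because each row is processed independently, this version leaves the other columns and the row order untouched, and rows that share the same 'sessionid' and 'qf' stay separate.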