R - Remove duplicate values based on multiple columns while keeping the rows

Time: 2019-03-21 17:54:03

Tags: r dplyr duplicates

After a merge(), my dataset looks like this:

id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   page a  100     email   page a  300     Social
2   page b  130     social  page b  401     Email
3   page c  200     email   page c  234     Referral
4   page c  200     email   page c  345     Email
5   page c  200     email   page c  654     Social
6   page a  345     social  page d  237     Social
7   page e  200     social  page e  745     Email
8   page e  200     social  page e  675     Referral
9   page f  989     email   page f  123     social
10  page a  123     referral page g  132     email

I want to remove the duplicated values in the columns "ValueA", "ValueB" and "ValueC", but keep rows 4, 5 and 8, because their ValueD, ValueE and ValueF values are still valid.

The expected output is:

id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   page a  100     email   page a  300     Social
2   page b  130     social  page b  401     Email
3   page c  200     email   page c  234     Referral
4                           page c  345     Email
5                           page c  654     Social
6   page a  345     social  page d  237     Social
7   page e  200     social  page e  745     Email
8                           page e  675     Referral
9   page f  989     email   page f  123     social
10  page a  123     referral page g  132     email

I tried using distinct():

df <- df %>% distinct(ValueA, ValueB, ValueC, .keep_all = TRUE)

But it removes the entire row, not just the duplicated values.

4 answers:

Answer 0: (score: 1)

library(tidyverse)

# example data
dt = read.table(text = "
id  ValueA  ValueB  ValueC  ValueD  ValueE  ValueF
1   pagea  100     email   pagea  300     Social
2   pageb  130     social  pageb  401     Email
3   pagec  200     email   pagec  234     Referral
4   pagec  200     email   pagec  345     Email
5   pagec  200     email   pagec  654     Social
6   pagea  345     social  paged  237     Social
7   pagee  200     social  pagee  745     Email
8   pagee  200     social  pagee  675     Referral
9   pagef  989     email   pagef  123     social
10  pagea  123     referral pageg  132     email
", header=T, stringsAsFactors = F)

dt %>%
  group_by(ValueA, ValueB, ValueC) %>%    # for each combination of those variables
  mutate(flag = row_number()) %>%         # add the number of appearance (i.e. row number)
  ungroup() %>%                           # forget the grouping
  mutate_at(vars(ValueA, ValueB, ValueC), ~ifelse(flag > 1, "", .)) %>%  # update to empty cell if this is a duplicate row
  select(-flag) %>%                       # remove that column
  data.frame()                            # only for visualisation purpose

#    id ValueA ValueB   ValueC ValueD ValueE   ValueF
# 1   1  pagea    100    email  pagea    300   Social
# 2   2  pageb    130   social  pageb    401    Email
# 3   3  pagec    200    email  pagec    234 Referral
# 4   4                         pagec    345    Email
# 5   5                         pagec    654   Social
# 6   6  pagea    345   social  paged    237   Social
# 7   7  pagee    200   social  pagee    745    Email
# 8   8                         pagee    675 Referral
# 9   9  pagef    989    email  pagef    123   social
# 10 10  pagea    123 referral  pageg    132    email
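
In dplyr 1.0.0 and later, mutate_at() is superseded by across(). The same idea can be sketched as follows, assuming a recent dplyr is installed; the small data frame and the helper name key_cols are made up for illustration and are not part of the original answer.

```r
library(dplyr)

# Small made-up data frame with the same shape as the question's data
dt <- data.frame(
  id     = 1:4,
  ValueA = c("pagec", "pagec", "pagee", "pagee"),
  ValueB = c(200L, 200L, 200L, 200L),
  ValueC = c("email", "email", "social", "social"),
  ValueD = c("pagec", "pagec", "pagee", "pagee"),
  ValueE = c(234L, 345L, 745L, 675L),
  stringsAsFactors = FALSE
)

# Hypothetical helper: the columns that define a duplicate
key_cols <- c("ValueA", "ValueB", "ValueC")

result <- dt %>%
  group_by(across(all_of(key_cols))) %>%   # one group per key combination
  mutate(flag = row_number()) %>%          # position of each row within its group
  ungroup() %>%
  # blank the key columns for every row after the first in its group;
  # as.character() keeps the column type consistent (ValueB becomes character)
  mutate(across(all_of(key_cols), ~ ifelse(flag > 1, "", as.character(.x)))) %>%
  select(-flag) %>%
  as.data.frame()

print(result)
```

All four rows are kept; only the key columns of the repeated rows are blanked.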

Answer 1: (score: 1)

A base R (non-tidyverse) answer to the question is:

df[duplicated(df[, c('ValueA', 'ValueB', 'ValueC')]), 
   c('ValueA', 'ValueB', 'ValueC')] <- ""
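
A minimal sketch of how this behaves on a small, made-up data frame. The data and the names key_cols and is_dup are illustrative only. Note that assigning "" into a numeric column such as ValueB converts that column to character.

```r
# Small made-up data frame with the same shape as the question's data
df <- data.frame(
  id     = 1:4,
  ValueA = c("page c", "page c", "page c", "page e"),
  ValueB = c(200, 200, 200, 200),
  ValueC = c("email", "email", "email", "social"),
  ValueD = c("page c", "page c", "page c", "page e"),
  ValueE = c(234, 345, 654, 745),
  stringsAsFactors = FALSE
)

key_cols <- c("ValueA", "ValueB", "ValueC")

# duplicated() on a data frame is TRUE for each row that repeats an earlier row
is_dup <- duplicated(df[, key_cols])

# Blank only the key columns of the repeated rows; the rows themselves stay
df[is_dup, key_cols] <- ""

print(df)
```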

Answer 2: (score: 0)

Some of the operations described here may help (see the "Change column value conditionally" section). YMMV.

https://rstudio-pubs-static.s3.amazonaws.com/314427_a1a32bf219ea405c8728e35c72060f1a.html#change-column-value-conditionally

Answer 3: (score: 0)

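You can use dplyr to group by the columns whose duplicated values you want to remove. Because the grouping columns cannot be modified while the data is grouped, create new columns without the duplicates and copy them back after ungrouping.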

test1<-test %>%
  group_by(ValueA, ValueB, ValueC) %>%
  mutate(ValueAA = ifelse(duplicated(ValueA), NA, ValueA),
         ValueBB = ifelse(duplicated(ValueB), NA, ValueB),
         ValueCC = ifelse(duplicated(ValueC), NA, ValueC)) %>%
  ungroup() %>%
  mutate(ValueA = ValueAA,
         ValueB = ValueBB,
         ValueC = ValueCC) %>%
  select(1:7)

The duplicated values are now replaced with NA; you can go one step further and replace the NAs with blanks.
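
One way to do that last step in base R is sketched below. The data frame test1 here is a small made-up stand-in for the result of the pipeline above, and key_cols is a hypothetical helper name. Converting to character first is needed because a numeric column cannot hold "".

```r
# Made-up stand-in for the pipeline result: duplicates already set to NA
test1 <- data.frame(
  id     = 1:3,
  ValueA = c("page c", NA, NA),
  ValueB = c(200, NA, NA),
  ValueC = c("email", NA, NA),
  ValueD = c("page c", "page c", "page c"),
  stringsAsFactors = FALSE
)

key_cols <- c("ValueA", "ValueB", "ValueC")

# Convert each key column to character, then swap NA for an empty string
test1[key_cols] <- lapply(test1[key_cols], function(col) {
  col <- as.character(col)
  col[is.na(col)] <- ""
  col
})

print(test1)
```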