After merge(), my dataset looks like this:
id  ValueA  ValueB  ValueC    ValueD  ValueE  ValueF
1   page a  100     email     page a  300     Social
2   page b  130     social    page b  401     Email
3   page c  200     email     page c  234     Referral
4   page c  200     email     page c  345     Email
5   page c  200     email     page c  654     Social
6   page a  345     social    page d  237     Social
7   page e  200     social    page e  745     Email
8   page e  200     social    page e  675     Referral
9   page f  989     email     page f  123     social
10  page a  123     referral  page g  132     email
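I want to remove the duplicated values in the "ValueA", "ValueB" and "ValueC" columns, but keep rows 4, 5 and 8, because their ValueD, ValueE and ValueF values are still valid.
The expected output is: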
id  ValueA  ValueB  ValueC    ValueD  ValueE  ValueF
1   page a  100     email     page a  300     Social
2   page b  130     social    page b  401     Email
3   page c  200     email     page c  234     Referral
4                             page c  345     Email
5                             page c  654     Social
6   page a  345     social    page d  237     Social
7   page e  200     social    page e  745     Email
8                             page e  675     Referral
9   page f  989     email     page f  123     social
10  page a  123     referral  page g  132     email
I tried using distinct():
df <- df %>% distinct(ValueA, ValueB, ValueC, .keep_all = TRUE)
But it removes the entire row.
Answer 0 (score: 1)
library(tidyverse)
# example data
dt = read.table(text = "
id ValueA ValueB ValueC ValueD ValueE ValueF
1 pagea 100 email pagea 300 Social
2 pageb 130 social pageb 401 Email
3 pagec 200 email pagec 234 Referral
4 pagec 200 email pagec 345 Email
5 pagec 200 email pagec 654 Social
6 pagea 345 social paged 237 Social
7 pagee 200 social pagee 745 Email
8 pagee 200 social pagee 675 Referral
9 pagef 989 email pagef 123 social
10 pagea 123 referral pageg 132 email
", header=T, stringsAsFactors = F)
dt %>%
group_by(ValueA, ValueB, ValueC) %>% # for each combination of those variables
mutate(flag = row_number()) %>% # add the number of appearance (i.e. row number)
ungroup() %>% # forget the grouping
mutate_at(vars(ValueA, ValueB, ValueC), ~ifelse(flag > 1, "", .)) %>% # update to empty cell if this is a duplicate row
select(-flag) %>% # remove that column
data.frame() # only for visualisation purpose
# id ValueA ValueB ValueC ValueD ValueE ValueF
# 1 1 pagea 100 email pagea 300 Social
# 2 2 pageb 130 social pageb 401 Email
# 3 3 pagec 200 email pagec 234 Referral
# 4 4 pagec 345 Email
# 5 5 pagec 654 Social
# 6 6 pagea 345 social paged 237 Social
# 7 7 pagee 200 social pagee 745 Email
# 8 8 pagee 675 Referral
# 9 9 pagef 989 email pagef 123 social
# 10 10 pagea 123 referral pageg 132 email
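The same idea can also be written with across(), which replaces mutate_at() in newer dplyr releases. This is only a sketch, assuming dplyr 1.0.0 or later; the four-row data frame is a made-up subset for illustration, not the full table from the question.

```r
library(dplyr)

# Hypothetical reduced data set, used only to illustrate the technique
dt_small <- data.frame(
  id     = 1:4,
  ValueA = c("pagec", "pagec", "pagec", "pagee"),
  ValueB = c(200, 200, 200, 200),
  ValueC = c("email", "email", "email", "social"),
  ValueD = c(234, 345, 654, 745),
  stringsAsFactors = FALSE
)

result <- dt_small %>%
  group_by(ValueA, ValueB, ValueC) %>%
  mutate(flag = row_number()) %>%   # 1 for the first occurrence, 2, 3, ... for repeats
  ungroup() %>%                     # grouping columns can only be changed after ungrouping
  mutate(across(c(ValueA, ValueB, ValueC),
                ~ ifelse(flag > 1, "", as.character(.x)))) %>%  # blank out repeats
  select(-flag)

print(as.data.frame(result))
```

Note that the key columns all end up as character vectors, because a blank string cannot be stored in a numeric column.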
Answer 1 (score: 1)
A base R (non-tidyverse) answer to the question is:
df[duplicated(df[, c('ValueA', 'ValueB', 'ValueC')]),
c('ValueA', 'ValueB', 'ValueC')] <- ""
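To see what the base R assignment does, here is a minimal self-contained sketch. The small data frame and the names key_cols and dup are assumptions made for illustration; only duplicated() and the indexed assignment come from the answer itself.

```r
# Hypothetical reduced data set, used only to illustrate the technique
df <- data.frame(
  id     = 1:4,
  ValueA = c("pagec", "pagec", "pagee", "pagee"),
  ValueB = c(200, 200, 200, 200),
  ValueC = c("email", "email", "social", "social"),
  ValueF = c("Referral", "Email", "Email", "Referral"),
  stringsAsFactors = FALSE
)

key_cols <- c("ValueA", "ValueB", "ValueC")

# duplicated() marks every row whose key combination has already appeared
dup <- duplicated(df[, key_cols])

# Blank out the key columns on those rows; the rows themselves are kept
df[dup, key_cols] <- ""

print(df)
```

Because duplicated() flags only the second and later occurrences, the first row of each combination keeps its values, which matches the expected output in the question.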
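Answer 2 (score: 0)
Some of the operations here may be helpful (see the "Conditionally changing column values" section). YMMV.
Answer 3 (score: 0)
You can use dplyr to group by the columns whose duplicated values you want to remove. Because grouping columns cannot be modified while the data is grouped, create new columns without the duplicates, then copy them back over the originals after ungrouping.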
test1<-test %>%
group_by(ValueA, ValueB, ValueC) %>%
mutate(ValueAA = ifelse(duplicated(ValueA), NA, ValueA),
ValueBB = ifelse(duplicated(ValueB), NA, ValueB),
ValueCC = ifelse(duplicated(ValueC), NA, ValueC)) %>%
ungroup() %>%
mutate(ValueA = ValueAA,
ValueB = ValueBB,
ValueC = ValueCC) %>%
select(1:7)
The duplicated values are now replaced with NA, and you can go on to replace the NAs with blanks if you prefer.
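One possible way to do that last step is sketched below, assuming dplyr 1.0.0 or later for across(). The data frame here is a made-up stand-in for the test1 result above, and the name blanked is hypothetical.

```r
library(dplyr)

# Hypothetical stand-in for the NA-filled result of the previous step
test1 <- data.frame(
  id     = 1:3,
  ValueA = c("pagec", NA, NA),
  ValueB = c(200, NA, NA),
  ValueC = c("email", NA, NA),
  ValueD = c("pagec", "pagec", "pagec"),
  stringsAsFactors = FALSE
)

# Convert each key column to character, then fill the NAs with empty strings
blanked <- test1 %>%
  mutate(across(c(ValueA, ValueB, ValueC),
                ~ coalesce(as.character(.x), "")))

print(blanked)
```

The conversion to character comes first because a numeric column such as ValueB cannot hold an empty string.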