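In a previous SO post, How to identify partial duplicates of rows in R, I asked how to identify partially duplicated rows. This is what I asked:
I want to identify "partial" matches of rows in a data frame. Specifically, I want to create a new column with a value of 1 if a particular row has a duplicate elsewhere in the data frame, based on a match across a subset of columns. An added complication is that one of the columns is numeric, and I want it to count as a match if the absolute values match.
The problem is that I need a row to be flagged as a partial duplicate only when one of the matched columns holds mirror-opposite values (same magnitude, opposite sign), not merely when the absolute values match. To make things clearer, here is the sample data from the previous post: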
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California")
num<-c("-258", "123", "42", "258")
date<-c("day 2", "day 15", "day 3","day 45")
(df<-as.data.frame(cbind(name,state,num, date)))
name state num date
1 Richard Nixon California -258 day 2
2 Bill Clinton Indiana 123 day 15
3 George Bush Florida 42 day 3
4 Richard Nixon California 258 day 45
This is the solution from my previous post:
df$absnum = abs(as.numeric(as.character(df$num)))
df$newcol = duplicated(df[,c('name','state', 'absnum')]) |
duplicated(df[,c('name','state', 'absnum')], fromLast = T)
# name state num date absnum newcol
# 1 Richard Nixon California -258 day 2 258 TRUE
# 2 Bill Clinton Indiana 123 day 15 123 FALSE
# 3 George Bush Florida 42 day 3 42 FALSE
# 4 Richard Nixon California 258 day 45 258 TRUE
Note that rows 1 and 4 are flagged TRUE under newcol, which is good. Here is new sample data with the added complication:
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton")
state<-c("California", "Indiana", "Florida", "California", "Indiana")
num<-c("-258", "123", "42", "258", "123")
date<-c("day 2", "day 15", "day 3","day 45", "day 100")
(df<-as.data.frame(cbind(name,state,num, date)))
name state num date
1 Richard Nixon California -258 day 2
2 Bill Clinton Indiana 123 day 15
3 George Bush Florida 42 day 3
4 Richard Nixon California 258 day 45
5 Bill Clinton Indiana 123 day 100
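Note that observations 2 and 5 are partial duplicates, but not in the same way as 1 and 4. I need TRUE applied only to observations whose absolute values match while their original (signed) values do not. So I would like the result to look like this: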
name state num date newcol
1 Richard Nixon California -258 day 2 TRUE
2 Bill Clinton Indiana 123 day 15 FALSE
3 George Bush Florida 42 day 3 FALSE
4 Richard Nixon California 258 day 45 TRUE
5 Bill Clinton Indiana 123 day 100 FALSE
The solution from the previous SO post also applies TRUE to rows 2 and 5, when I only want it applied to rows 1 and 4.
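The behaviour described above can be reproduced with a minimal sketch. It assumes the data frame is built with data.frame(..., stringsAsFactors = FALSE) instead of cbind, and the helper name key is only illustrative:

```r
# New sample data from the question, built directly with data.frame()
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton"),
  state = c("California", "Indiana", "Florida", "California", "Indiana"),
  num   = c("-258", "123", "42", "258", "123"),
  date  = c("day 2", "day 15", "day 3", "day 45", "day 100"),
  stringsAsFactors = FALSE
)

# Previous solution: flag every row whose name/state/|num| occurs more than once
df$absnum <- abs(as.numeric(df$num))
key <- df[, c("name", "state", "absnum")]
df$newcol <- duplicated(key) | duplicated(key, fromLast = TRUE)

df$newcol
# [1]  TRUE  TRUE FALSE  TRUE  TRUE
```

Rows 2 and 5 come out TRUE because their absolute values match, even though their signed values are identical rather than mirrored.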
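Answer 0 (score: 2)
In base R, you can use the same duplicated test for "partial" duplicates as in the linked question, but exclude rows whose values are identical: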
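# Keep the signed value and its absolute value as separate numeric columns
df$numnum = as.numeric(as.character(df$num))
df$absnum = abs(df$numnum)
# Partial duplicate on name/state/absnum, minus rows whose signed value is itself repeated
df$newcol = (duplicated(df[,c('name','state', 'absnum')]) |
  duplicated(df[,c('name','state', 'absnum')], fromLast = TRUE)) &
  !(duplicated(df$numnum) | duplicated(df$numnum, fromLast = TRUE))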
# name state num date numnum absnum newcol
# 1 Richard Nixon California -258 day 2 -258 258 TRUE
# 2 Bill Clinton Indiana 123 day 15 123 123 FALSE
# 3 George Bush Florida 42 day 3 42 42 FALSE
# 4 Richard Nixon California 258 day 45 258 258 TRUE
# 5 Bill Clinton Indiana 123 day 100 123 123 FALSE
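One thing to watch for in the code above: the exclusion step, duplicated(df$numnum), compares signed values across the whole data frame, not within each name/state pair. The sketch below is one possible variant, not the answerer's method. It counts distinct signs per name/state/absnum group with ave. The sixth row ("Jane Doe", "Ohio") is hypothetical and is added only to show the difference.

```r
# Question data plus one hypothetical row that shares num = 258 with row 4
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush",
            "Richard Nixon", "Bill Clinton", "Jane Doe"),
  state = c("California", "Indiana", "Florida",
            "California", "Indiana", "Ohio"),
  num   = c("-258", "123", "42", "258", "123", "258"),
  stringsAsFactors = FALSE
)
df$numnum <- as.numeric(df$num)
df$absnum <- abs(df$numnum)

# Approach from the answer: the exclusion looks at numnum over all rows
key <- df[, c("name", "state", "absnum")]
global_excl <- (duplicated(key) | duplicated(key, fromLast = TRUE)) &
  !(duplicated(df$numnum) | duplicated(df$numnum, fromLast = TRUE))

# Variant: number of distinct signs within each name/state/absnum group
n_signs <- ave(sign(df$numnum), df$name, df$state, df$absnum,
               FUN = function(s) length(unique(s)))
df$newcol <- n_signs > 1

global_excl
# [1]  TRUE FALSE FALSE FALSE FALSE FALSE
df$newcol
# [1]  TRUE FALSE FALSE  TRUE FALSE FALSE
```

With the extra row, the global exclusion drops row 4 because 258 also appears for an unrelated person, while the per-group sign count still flags rows 1 and 4.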
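Answer 1 (score: 1)
One option is to first convert 'num' to numeric type, create another column ('num1') holding its absolute value (abs), group by 'name', 'state' and 'num1', and then, in mutate, create a logical column by checking that the group has exactly two rows (n() == 2) and that the number of distinct signs (sign) of 'num' is greater than 1: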
library(tidyverse)
df %>%
mutate(num = as.numeric(num), num1 = abs(num)) %>%
group_by(name, state, num1) %>%
mutate(newcol = n() == 2 & n_distinct(sign(num)) > 1) %>%
ungroup %>%
select(-num1)
# A tibble: 5 x 5
# name state num date newcol
# <chr> <chr> <dbl> <chr> <lgl>
#1 Richard Nixon California -258 day 2 TRUE
#2 Bill Clinton Indiana 123 day 15 FALSE
#3 George Bush Florida 42 day 3 FALSE
#4 Richard Nixon California 258 day 45 TRUE
#5 Bill Clinton Indiana 123 day 100 FALSE
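The n() == 2 check limits the flag to groups of exactly two rows. The following sketch shows what that means for a larger group. It uses a hypothetical three-row data frame (df3), and the column names strict and relaxed are illustrative only:

```r
library(dplyr)

# Hypothetical group of three rows: one negative value and two positive ones
df3 <- data.frame(
  name  = c("Richard Nixon", "Richard Nixon", "Richard Nixon"),
  state = c("California", "California", "California"),
  num   = c(-258, 258, 258),
  stringsAsFactors = FALSE
)

res <- df3 %>%
  group_by(name, state, num1 = abs(num)) %>%
  mutate(
    strict  = n() == 2 & n_distinct(sign(num)) > 1,  # condition from the answer
    relaxed = n_distinct(sign(num)) > 1              # same check without the size test
  ) %>%
  ungroup() %>%
  select(-num1)

res$strict
# [1] FALSE FALSE FALSE
res$relaxed
# [1] TRUE TRUE TRUE
```

Whether such a three-row group should be flagged depends on the data. If it should, dropping n() == 2 is one way to do it.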
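Note: cbind creates a matrix, and a matrix can hold only a single type. So if any column or element is character, the whole matrix becomes character. Wrapping it in as.data.frame carries that over, and the columns are then converted to factor (with stringsAsFactors = TRUE, the default before R 4.0.0) or kept as character (if it is FALSE). Building the data frame directly avoids the matrix step: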
df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)
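The coercion is easy to see on a small example. The sketch below uses a shortened copy of the data in which num is a numeric vector, which is an assumption: the question stores it as strings.

```r
name <- c("Richard Nixon", "Bill Clinton")
num  <- c(-258, 123)          # numeric here, unlike the strings in the question

# cbind() builds a matrix, so the numbers are coerced to character
m <- cbind(name, num)
typeof(m)
# [1] "character"

# Going through the matrix loses the numeric type ...
df_cbind <- as.data.frame(cbind(name, num), stringsAsFactors = FALSE)
class(df_cbind$num)
# [1] "character"

# ... while data.frame() keeps each column's own type
df_direct <- data.frame(name, num, stringsAsFactors = FALSE)
class(df_direct$num)
# [1] "numeric"
```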