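This question builds on two earlier questions I asked on SO, each more complex than the last. In the previous post, "How to identify mirrored duplicates of rows in R", I wanted to identify "partial" matches of rows in a data frame. Specifically, I wanted to create a new column with a value of TRUE whenever a row has a duplicate somewhere else in the data frame, based on a match across a subset of columns. An added complication is that one of the columns is numeric, and I wanted rows to match when the absolute values match. The problem is that a row should be identified as a partial duplicate only when that numeric column is an opposite (additive inverse) mirror of the other row, not merely a match on absolute value. Ultimately, what I am looking for is pairs of rows that are duplicates along 2 categorical variables and additive inverses along a third, numeric variable. To make things clearer, here is sample data: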
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton", "Richard Nixon", "Abe Lincoln","Richard Nixon", "Bill Clinton", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California", "Indiana", "California","Oregon","California", "Indiana", "California")
num<-c("-258", "123", "42", "258", "123", "-258", "87","258", "-123", "258")
date<-c("day 9", "day 2", "day 15", "day 3","day 45", "day 100", "day 99", "day 10", "day 11", "day 100")
(df <- data.frame(name, state, num, date, stringsAsFactors = FALSE))
            name      state  num    date
1  Richard Nixon California -258   day 9
2   Bill Clinton    Indiana  123   day 2
3    George Bush    Florida   42  day 15
4  Richard Nixon California  258   day 3
5   Bill Clinton    Indiana  123  day 45
6  Richard Nixon California -258 day 100
7    Abe Lincoln     Oregon   87  day 99
8  Richard Nixon California  258  day 10
9   Bill Clinton    Indiana -123  day 11
10 Richard Nixon California  258 day 100
If I run the excellent solution from my previous SO question, I get the following:
(df %>%
   mutate(num = as.numeric(num), num1 = abs(num)) %>%
   group_by(name, state, num1) %>%
   mutate(newcol = n() > 1 & n_distinct(sign(num)) > 1) %>%
   ungroup %>%
   select(-num1)) %>%
  arrange(name)
# A tibble: 10 x 5
   name          state        num date    newcol
   <chr>         <chr>      <dbl> <chr>   <lgl>
 1 Abe Lincoln   Oregon        87 day 99  FALSE
 2 Bill Clinton  Indiana      123 day 2   TRUE
 3 Bill Clinton  Indiana      123 day 45  TRUE
 4 Bill Clinton  Indiana     -123 day 11  TRUE
 5 George Bush   Florida       42 day 15  FALSE
 6 Richard Nixon California  -258 day 9   TRUE
 7 Richard Nixon California   258 day 3   TRUE
 8 Richard Nixon California  -258 day 100 TRUE
 9 Richard Nixon California   258 day 10  TRUE
10 Richard Nixon California   258 day 100 TRUE
The problem with the output above is that too many rows are TRUE for Richard Nixon and Bill Clinton. My desired output is:
   name          state        num date    newcol
 1 Abe Lincoln   Oregon        87 day 99  FALSE
 2 Bill Clinton  Indiana      123 day 2   TRUE
 3 Bill Clinton  Indiana      123 day 45  FALSE
 4 Bill Clinton  Indiana     -123 day 11  TRUE
 5 George Bush   Florida       42 day 15  FALSE
 6 Richard Nixon California  -258 day 9   TRUE
 7 Richard Nixon California   258 day 3   TRUE
 8 Richard Nixon California  -258 day 100 TRUE
 9 Richard Nixon California   258 day 10  TRUE
10 Richard Nixon California   258 day 100 FALSE
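Notice that rows count as duplicates only in mirrored pairs: the two rows match on everything except that their values in column num are additive inverses. So I am essentially trying to identify all rows that match along the name and state variables and are additive inverses of each other along the num variable, but only if that additive inverse is unique, in the sense that a value of num should be treated as the additive inverse of no more than one other row.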
If the explanation above needs clarification, read on:
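Some process would loop through each row to identify the first row that meets the partial match criteria (a match on absolute value where the values are additive inverses), assign TRUE to both rows, and then proceed to the next observation, and so on. For example, the code could start with Abe Lincoln and go through each subsequent row until it finds a partial match; if no row is found, the value generated in column newcol should be FALSE. Then it moves on to Bill Clinton from Indiana with 123 and loops through the rows to identify a partial match. The next row is not a partial match b/c 123 and 123 are not a partial match (they are a full match), but the row after that is a partial match (123 and -123), so TRUE is generated in newcol for that observation and for the row it was matched with. Then it moves on to the third row (Bill Clinton, Indiana, 123). The important part of this step is that the loop does not need to visit rows that already have a value for newcol. So for this row (the third row), the loop would skip the first row (with Abe Lincoln) b/c it already has a value of FALSE, and it would skip the second and fourth rows because those two rows have already been matched with each other. The result for the third row is FALSE b/c there are no rows left to partially match: the only partial match in the data frame has already been paired with another inverse.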
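The row-by-row procedure described above can be sketched as a plain loop in base R. This is only an illustration of the described logic, not code from the question or the answers: the helper name flag_mirror_pairs is hypothetical, num is assumed to be numeric already, and the date column is left out because it plays no part in the matching.

```r
# Sample data from the question (num stored as numeric, date omitted)
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon",
            "Bill Clinton", "Richard Nixon", "Abe Lincoln", "Richard Nixon",
            "Bill Clinton", "Richard Nixon"),
  state = c("California", "Indiana", "Florida", "California", "Indiana",
            "California", "Oregon", "California", "Indiana", "California"),
  num   = c(-258, 123, 42, 258, 123, -258, 87, 258, -123, 258),
  stringsAsFactors = FALSE
)

# Hypothetical helper: pair each row with the first later, still unpaired
# row that has the same name and state and the additive inverse of num.
flag_mirror_pairs <- function(df) {
  flagged <- rep(FALSE, nrow(df))
  idx <- seq_len(nrow(df))
  for (i in idx) {
    if (flagged[i]) next                  # already part of a pair, skip it
    candidates <- which(
      !flagged & idx > i &
        df$name == df$name[i] &
        df$state == df$state[i] &
        df$num == -df$num[i]              # exact comparison; fine for whole numbers
    )
    if (length(candidates) > 0) {
      flagged[c(i, candidates[1])] <- TRUE  # mark both members of the pair
    }
  }
  flagged
}

df$newcol <- flag_mirror_pairs(df)
df
```

Each row can be consumed by at most one pair, which is what keeps the second 123 for Bill Clinton and the third 258 for Richard Nixon at FALSE. A loop like this is quadratic in the number of rows, so it is mainly useful for checking a vectorised solution on small data.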
Answer 0 (score: 4)
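We may need a second grouping by sign to create another set of sequence numbers, which helps identify the rows that have no matching pair so they can be updated to FALSE: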
library(dplyr)
df %>%
  mutate(num = as.numeric(num), num1 = abs(num)) %>%
  group_by(name, state, num1) %>%
  mutate(newcol = n() > 1 & n_distinct(sign(num)) > 1) %>%
  # dplyr >= 1.0.0 names this argument `.add`; older versions used `add`
  group_by(grp = sign(num), .add = TRUE) %>%
  mutate(rn = row_number()) %>%
  group_by(name, state, num1, rn) %>%
  mutate(newcol = replace(newcol, n() == 1, FALSE)) %>%
  ungroup %>%
  select(-grp, -num1, -rn) %>%
  arrange(name)
# A tibble: 10 x 5
#    name          state        num date    newcol
#    <chr>         <chr>      <dbl> <chr>   <lgl>
#  1 Abe Lincoln   Oregon        87 day 99  FALSE
#  2 Bill Clinton  Indiana      123 day 2   TRUE
#  3 Bill Clinton  Indiana      123 day 45  FALSE
#  4 Bill Clinton  Indiana     -123 day 11  TRUE
#  5 George Bush   Florida       42 day 15  FALSE
#  6 Richard Nixon California  -258 day 9   TRUE
#  7 Richard Nixon California   258 day 3   TRUE
#  8 Richard Nixon California  -258 day 100 TRUE
#  9 Richard Nixon California   258 day 10  TRUE
# 10 Richard Nixon California   258 day 100 FALSE
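To see why numbering rows within each sign works, it may help to look at a single name/state/absolute-value group in isolation. The sketch below uses base R's ave() on the three Bill Clinton values instead of the dplyr pipeline; the variable names rn and has_partner are only illustrative.

```r
# One group from the question: Bill Clinton / Indiana, |num| == 123
num <- c(123, 123, -123)

# Number the rows separately within each sign: positives get 1, 2 and the
# single negative gets 1.
rn <- ave(seq_along(num), sign(num), FUN = seq_along)

# Rows sharing a sequence number form a candidate pair (k-th positive with
# k-th negative). A sequence number held by only one row has no partner.
has_partner <- ave(num, rn, FUN = length) == 2

data.frame(num, rn, has_partner)
```

The first 123 and the -123 share sequence number 1 and so form a pair, while the second 123 is alone under sequence number 2 and is reset to FALSE, which is the effect of the n() == 1 replacement in the pipeline.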
Answer 1 (score: 2)
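Here is a simple working solution that scales to situations where you have more than one excess, un-mirrored observation. The basic idea is simple: group, count the negatives and the positives, order the observations so that the negatives precede the positives, determine whether the negatives or the positives are in excess, then generate a vector of TRUE/FALSE. Because the observations are ordered from negative to positive, it is clear what the result vector should look like when there are unmatched negatives or unmatched positives.
Code below: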
# Load data and libraries
library(dplyr)
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton", "Richard Nixon", "Abe Lincoln","Richard Nixon", "Bill Clinton", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California", "Indiana", "California","Oregon","California", "Indiana", "California")
num<-c("-258", "123", "42", "258", "123", "-258", "87","258", "-123", "258")
date<-c("day 9", "day 2", "day 15", "day 3","day 45", "day 100", "day 99", "day 10", "day 11", "day 100")
# create dataframe
df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)
df %>%
  mutate(num = as.numeric(num), # convert so we can work with the values
         row = row_number()     # remember the original order for reordering
  ) %>%
  group_by(name, state) %>%
  arrange(num) %>% # order the observations so that all the negatives
                   # precede the positives
  mutate(negs = max(0, table(sign(num))["-1"], na.rm = TRUE), # number of negatives
         pos = max(0, table(sign(num))["1"], na.rm = TRUE),   # number of positives
         newcol = ifelse(negs > pos, # see which is in excess
                         c(rep(FALSE, negs[1] - pos[1]), rep(TRUE, 2 * pos[1])),
                         c(rep(TRUE, 2 * negs[1]), rep(FALSE, pos[1] - negs[1])))
  ) %>%
  arrange(name, row) %>%
  dplyr::select(-negs, -pos, -row)
#> # A tibble: 10 x 5
#> # Groups:   name, state [4]
#>    name          state        num date    newcol
#>    <chr>         <chr>      <dbl> <chr>   <lgl>
#>  1 Abe Lincoln   Oregon        87 day 99  FALSE
#>  2 Bill Clinton  Indiana      123 day 2   TRUE
#>  3 Bill Clinton  Indiana      123 day 45  FALSE
#>  4 Bill Clinton  Indiana     -123 day 11  TRUE
#>  5 George Bush   Florida       42 day 15  FALSE
#>  6 Richard Nixon California  -258 day 9   TRUE
#>  7 Richard Nixon California   258 day 3   TRUE
#>  8 Richard Nixon California  -258 day 100 TRUE
#>  9 Richard Nixon California   258 day 10  TRUE
#> 10 Richard Nixon California   258 day 100 FALSE
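The counting step can be checked on one group by hand. The sketch below takes the five Richard Nixon values, sorts them, and builds the TRUE/FALSE vector from the two counts; the variable names are illustrative, and it assumes the group holds a single absolute value, as every name/state group in the sample data does. Groups that mix several magnitudes would presumably need abs(num) added to the grouping.

```r
# One group from the question: Richard Nixon / California, sorted ascending
num <- sort(c(-258, 258, -258, 258, 258))

negs  <- sum(num < 0)       # number of negatives
pos   <- sum(num > 0)       # number of positives
pairs <- min(negs, pos)     # each pair uses one negative and one positive

# After sorting, any excess negatives sit at the front and any excess
# positives sit at the end, so the unmatched rows are easy to place.
flags <- if (negs > pos) {
  c(rep(FALSE, negs - pos), rep(TRUE, 2 * pairs))
} else {
  c(rep(TRUE, 2 * pairs), rep(FALSE, pos - negs))
}

data.frame(num, flags)
```

With two negatives and three positives there are two pairs, so the first four sorted rows are flagged and the last positive is left over.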
Created on 2019-02-13 by the reprex package (v0.2.1)