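Removing paired additive inverses

Time: 2019-02-13 08:16:37

Tags: r

This question builds on two earlier questions I asked on SO, each more complex than the last. In my previous post, "How to identify mirrored duplicates of rows in R", I wanted to identify "partial" matches among the rows of a data frame. Specifically, I want to create a new column with the value TRUE whenever a given row has a duplicate elsewhere in the data frame, based on a match across a subset of the columns. The added complication is that one of the columns is numeric, and I want it to match on absolute value. The problem is that a row should be flagged as a partial duplicate only when that column is a mirror image (additive inverse) of the other row's value, not merely a match in absolute value. Ultimately, I am looking for rows that are pairwise duplicates along two categorical variables and additive inverses along a third, numeric variable. To make things clearer, here is sample data: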

name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton", "Richard Nixon", "Abe Lincoln","Richard Nixon", "Bill Clinton", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California", "Indiana", "California","Oregon","California", "Indiana", "California")
num<-c("-258", "123", "42", "258", "123", "-258", "87","258", "-123", "258")
date<-c("day 9", "day 2", "day 15", "day 3","day 45", "day 100", "day 99", "day 10", "day 11", "day 100")

(df <- data.frame(name, state, num, date, stringsAsFactors = FALSE))
            name      state  num    date
1  Richard Nixon California -258   day 9
2   Bill Clinton    Indiana  123   day 2
3    George Bush    Florida   42  day 15
4  Richard Nixon California  258   day 3
5   Bill Clinton    Indiana  123  day 45
6  Richard Nixon California -258 day 100
7    Abe Lincoln     Oregon   87  day 99
8  Richard Nixon California  258  day 10
9   Bill Clinton    Indiana -123  day 11
10 Richard Nixon California  258 day 100

If I run the excellent solution from my previous SO question, I get the following:

library(dplyr)

(df %>%
    mutate(num = as.numeric(num), num1 = abs(num)) %>% 
    group_by(name, state, num1) %>% 
    mutate(newcol = n() > 1 & n_distinct(sign(num)) > 1) %>%
    ungroup %>% 
    select(-num1)) %>%
    arrange(name)
# A tibble: 10 x 5
   name          state        num date    newcol
   <chr>         <chr>      <dbl> <chr>   <lgl> 
 1 Abe Lincoln   Oregon        87 day 99  FALSE 
 2 Bill Clinton  Indiana      123 day 2   TRUE  
 3 Bill Clinton  Indiana      123 day 45  TRUE  
 4 Bill Clinton  Indiana     -123 day 11  TRUE  
 5 George Bush   Florida       42 day 15  FALSE 
 6 Richard Nixon California  -258 day 9   TRUE  
 7 Richard Nixon California   258 day 3   TRUE  
 8 Richard Nixon California  -258 day 100 TRUE  
 9 Richard Nixon California   258 day 10  TRUE  
10 Richard Nixon California   258 day 100 TRUE

The problem with the output above is that too many rows are TRUE for Richard Nixon and Bill Clinton. The output I want is:

   name          state        num date    newcol
 1 Abe Lincoln   Oregon        87 day 99  FALSE 
 2 Bill Clinton  Indiana      123 day 2   TRUE
 3 Bill Clinton  Indiana      123 day 45  FALSE
 4 Bill Clinton  Indiana     -123 day 11  TRUE  
 5 George Bush   Florida       42 day 15  FALSE 
 6 Richard Nixon California  -258 day 9   TRUE  
 7 Richard Nixon California   258 day 3   TRUE  
 8 Richard Nixon California  -258 day 100 TRUE  
 9 Richard Nixon California   258 day 10  TRUE  
10 Richard Nixon California   258 day 100 FALSE

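Note that rows count as duplicates only when they are mirror matches, that is, when they match on everything except that their num values are additive inverses. So I am essentially trying to identify all rows that match along the name and state variables and are additive inverses of each other along num, but only if the additive inverse is unique, in the sense that a given num should be treated as the additive inverse of no more than one other row.

If the explanation above needs clarification, read on:

So, some process would iterate over each row to identify the first row that meets the partial-match criterion (a partial match in the sense of equal absolute value, i.e. an additive inverse), assign TRUE to both rows, then move on to the next observation, and so on. For example, the code could start with Abe Lincoln and go through every subsequent row until it finds a partially matching row; if none is found, the value in the column newcol should be FALSE. It would then move on to Bill Clinton, Indiana, 123 and go through the rows to identify a partial match. The next row is not a partial match, because 123 and 123 are an exact match rather than a partial one, but the row after that is (123 and -123), so newcol becomes TRUE for that observation and for the row it partially matches. Then it continues to the third row (Bill Clinton, Indiana, 123). The important part of this step is that the loop does not need to visit a row that already has a value for newcol. So for this third row, the loop skips the first row (Abe Lincoln) because its value is already FALSE, and it skips the second and fourth rows because those two have already been matched to each other. The third row therefore gets FALSE: no rows are left for it to partially match, and the only partial match in the data frame is already paired with another inverse.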
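The row-by-row procedure described above can be sketched as a plain loop in base R. This is only an illustration of the description, not code from the original post: flag_inverse_pairs() is a hypothetical helper name, num is created as numeric up front, and the sketch assumes that rows with num equal to zero never pair with anything.

```r
# Sample data from the question (num stored as numeric here)
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon",
            "Bill Clinton", "Richard Nixon", "Abe Lincoln", "Richard Nixon",
            "Bill Clinton", "Richard Nixon"),
  state = c("California", "Indiana", "Florida", "California", "Indiana",
            "California", "Oregon", "California", "Indiana", "California"),
  num   = c(-258, 123, 42, 258, 123, -258, 87, 258, -123, 258),
  date  = c("day 9", "day 2", "day 15", "day 3", "day 45", "day 100",
            "day 99", "day 10", "day 11", "day 100"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: greedily pair each unmatched row with the first
# unmatched row that has the same name and state and the opposite num.
flag_inverse_pairs <- function(d) {
  matched <- rep(FALSE, nrow(d))
  for (i in seq_len(nrow(d))) {
    if (matched[i]) next                 # already paired with an earlier row
    candidates <- which(!matched &
                          d$name  == d$name[i] &
                          d$state == d$state[i] &
                          d$num   == -d$num[i] &
                          d$num   != 0)  # assumption: zeros never pair
    if (length(candidates) > 0) {
      matched[c(i, candidates[1])] <- TRUE  # each row is used at most once
    }
  }
  matched
}

df$newcol <- flag_inverse_pairs(df)
df[order(df$name), ]
```

The loop is quadratic in the number of rows, so it is mainly useful as a reference for checking a vectorised solution on small data.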

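2 answers:

Answer 0: (score: 4)

We may need a second grouping on sign to create a sequence within each sign group, which helps identify the rows that have no matching pair so they can be updated to FALSE.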

library(dplyr)
df %>%
     mutate(num = as.numeric(num), num1 = abs(num)) %>% 
     group_by(name, state, num1) %>% 
     mutate(newcol = n() > 1 & n_distinct(sign(num)) > 1) %>% 
     group_by(grp = sign(num), .add = TRUE) %>% # dplyr >= 1.0.0; older versions use add = TRUE
     mutate(rn = row_number()) %>% 
     group_by(name, state, num1, rn) %>% 
     mutate(newcol = replace(newcol, n()==1, FALSE)) %>%
     ungroup %>%
     select(-grp, -num1, -rn) %>% 
     arrange(name)
#A tibble: 10 x 5
#   name          state        num date    newcol
#   <chr>         <chr>      <dbl> <chr>   <lgl> 
# 1 Abe Lincoln   Oregon        87 day 99  FALSE 
# 2 Bill Clinton  Indiana      123 day 2   TRUE  
# 3 Bill Clinton  Indiana      123 day 45  FALSE 
# 4 Bill Clinton  Indiana     -123 day 11  TRUE  
# 5 George Bush   Florida       42 day 15  FALSE 
# 6 Richard Nixon California  -258 day 9   TRUE  
# 7 Richard Nixon California   258 day 3   TRUE  
# 8 Richard Nixon California  -258 day 100 TRUE  
# 9 Richard Nixon California   258 day 10  TRUE  
#10 Richard Nixon California   258 day 100 FALSE 
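
To make the role of the per-sign sequence easier to see, here is a minimal base R sketch of the same idea. It is an illustration rather than the answer's code: the variable names key, rn and pair_size are assumptions chosen for this example, and the "|" separator assumes that character does not occur in the data.

```r
df <- data.frame(
  name  = c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon",
            "Bill Clinton", "Richard Nixon", "Abe Lincoln", "Richard Nixon",
            "Bill Clinton", "Richard Nixon"),
  state = c("California", "Indiana", "Florida", "California", "Indiana",
            "California", "Oregon", "California", "Indiana", "California"),
  num   = c(-258, 123, 42, 258, 123, -258, 87, 258, -123, 258),
  stringsAsFactors = FALSE
)

# One key per (name, state, |num|) group
key <- paste(df$name, df$state, abs(df$num), sep = "|")

# rn: occurrence index of each row within its group and sign
# (1st negative, 2nd negative, ...; 1st positive, 2nd positive, ...)
rn <- ave(seq_along(key), paste(key, sign(df$num), sep = "|"), FUN = seq_along)

# Rows sharing a group and an occurrence index form a candidate pair;
# the pair is complete only if both signs contribute a row
pair_size <- ave(seq_along(key), paste(key, rn, sep = "|"), FUN = length)

df$newcol <- pair_size == 2
df
```

The k-th negative is paired with the k-th positive of the same magnitude, so any surplus row on either side ends up alone in its (group, rn) cell and is set to FALSE.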

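Answer 1: (score: 2)

Here is a simple working solution that extends to cases where you have more than one unmirrored observation. The basic idea is simple: group, count the negatives and positives, order the observations so that the negatives precede the positives, determine whether the negatives or the positives are in excess, and then generate the TRUE/FALSE vector. Because the observations are ordered from negative to positive, it is clear what the result vector should look like when there are unmatched negatives or unmatched positives.

The code follows: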

# Load data and libraries
library(dplyr)
name<-c("Richard Nixon", "Bill Clinton", "George Bush", "Richard Nixon", "Bill Clinton", "Richard Nixon", "Abe Lincoln","Richard Nixon", "Bill Clinton", "Richard Nixon")
state<-c("California", "Indiana", "Florida", "California", "Indiana", "California","Oregon","California", "Indiana", "California")
num<-c("-258", "123", "42", "258", "123", "-258", "87","258", "-123", "258")
date<-c("day 9", "day 2", "day 15", "day 3","day 45", "day 100", "day 99", "day 10", "day 11", "day 100")

# create dataframe
df <- data.frame(name, state, num, date, stringsAsFactors = FALSE)

df %>% 
  mutate(num = as.numeric(num), # to work with
              row = row_number() # for reordering
         ) %>%
  group_by(name, state) %>% 
  arrange(num) %>% # order the observations so that all the negatives
                   # precede the positives
  mutate(negs = max(0, table(sign(num))["-1"], na.rm = TRUE), # number of negatives in the group
         pos = max(0, table(sign(num))["1"], na.rm = TRUE), # number of positives in the group
         newcol = ifelse(negs > pos, # See which is in excess
                         c(rep(FALSE, negs[1]-pos[1]), rep(TRUE, 2*pos[1])),
                         c(rep(TRUE, 2*negs[1]), rep(FALSE, pos[1]-negs[1])))
         ) %>%
  arrange(name, row) %>%
  dplyr::select(-negs, -pos, -row)
#> # A tibble: 10 x 5
#> # Groups:   name, state [4]
#>    name          state        num date    newcol
#>    <chr>         <chr>      <dbl> <chr>   <lgl> 
#>  1 Abe Lincoln   Oregon        87 day 99  FALSE 
#>  2 Bill Clinton  Indiana      123 day 2   TRUE  
#>  3 Bill Clinton  Indiana      123 day 45  FALSE 
#>  4 Bill Clinton  Indiana     -123 day 11  TRUE  
#>  5 George Bush   Florida       42 day 15  FALSE 
#>  6 Richard Nixon California  -258 day 9   TRUE  
#>  7 Richard Nixon California   258 day 3   TRUE  
#>  8 Richard Nixon California  -258 day 100 TRUE  
#>  9 Richard Nixon California   258 day 10  TRUE  
#> 10 Richard Nixon California   258 day 100 FALSE

Created on 2019-02-13 by the reprex package (v0.2.1)
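
One caveat: the code above groups only by name and state, which works for the sample data because each person has a single magnitude of num. If a name/state group could contain several magnitudes, the counting idea can be applied per magnitude instead. The following is a sketch under that assumption, using made-up data; the column names mag, sgn, k and n_pairs are hypothetical helpers introduced for this example.

```r
library(dplyr)

# Hypothetical data: one name/state group with two different magnitudes
d <- data.frame(
  name  = "Richard Nixon",
  state = "California",
  num   = c(-258, 258, 258, 100, 100, -100),
  stringsAsFactors = FALSE
)

res <- d %>%
  group_by(name, state, mag = abs(num), sgn = sign(num)) %>%
  mutate(k = row_number()) %>%                      # occurrence index within each sign
  group_by(name, state, mag) %>%
  mutate(n_pairs = min(sum(num < 0), sum(num > 0)), # complete pairs for this magnitude
         newcol  = k <= n_pairs) %>%                # keep the first n_pairs rows of each sign
  ungroup() %>%
  select(-mag, -sgn, -k, -n_pairs)

res
```

Within each magnitude, only as many negatives and positives are flagged as there are complete pairs, so the surplus 258 and the surplus 100 both come out as FALSE.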