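How to make the next repeated value blank if a value appears more than 3 times

Time: 2017-05-15 05:03:52

Tags: r

I have a data frame as shown below. For "A", the value 45 is repeated more than 3 times, and the same holds for 67 in "B". For values that are repeated (frozen) more than 3 times, I need the repetitions after the first occurrence to become blank/NA in a new column ("New_Value").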

Name    Value   New_Value
A       24      24
A       45      45
A       45
A       45
A       45
A       45
A       93      93
A       19      19
A       10      10
B       29      29
B       67      67
B       67
B       67
B       67
C       201     201
C       993     993
C       396     396

3 answers:

Answer 0: (score: 2)

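FWIW, here is another solution, which uses rleid() from data.table instead of duplicated().

Note that the OP asked to blank the next repeated values if a value appears more than 3 times. This means that for a Value that is repeated only twice, no blank should appear in the result. I modified my sample data set to include the case of the same value repeated twice.

Edit: The OP has not stated unambiguously whether he is counting repetitions of the same Value in a given sequence regardless of Name, or counting repetitions within a sequence of Value for each Name group. See also this comment.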
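The two readings can be made concrete by numbering the runs directly. The sketch below is a base R illustration, not the answer's code: run_id() is a hypothetical helper that mimics what data.table's rleid() does, and the five rows are made up for the example.

```r
# Made-up rows for illustration; not the answer's data set.
name  <- c("A", "A", "C", "B", "C")
value <- c(19L, 19L, 19L, 67L, 67L)

# Hypothetical helper: consecutive rows with the same key share a run number.
run_id <- function(...) {
  key <- paste(..., sep = "|")
  cumsum(c(TRUE, key[-1L] != key[-length(key)]))
}

runs_value_only <- run_id(value)        # Name is ignored
runs_name_value <- run_id(name, value)  # a change in Name starts a new run

print(runs_value_only)
print(runs_name_value)
```

Under the first reading the three 19s form a single run of length 3. Under the second reading the same rows split into a run of two (A) and a run of one (C), so neither would count as a long repetition.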

In addition, the OP has not specified what result he expects if Name changes within a streak of repeated Value.

So I modified my sample data set to include additional use cases:

DT
#    Name Value
# 1:    A    24
# 2:    A    24
# 3:    A    45
# 4:    A    45
# 5:    A    45
# 6:    A    45
# 7:    A    45
# 8:    A    93
# 9:    A    19
#10:    A    19
#11:    A    10
#12:    B    29
#13:    B    67
#14:    B    67
#15:    B    67
#16:    B    67
#17:    C   201
#18:    C   993
#19:    C   396
#20:    A    19
#21:    A    19
#22:    C    19
#23:    B    29
#24:    B    67
#25:    B    67
#26:    B    67
#27:    B    67
#28:    C    67
#29:    C    67
#30:    C    67
#31:    C    67
#    Name Value

As in the other answers, NA is used for blank.

library(data.table)
setDT(DT)[, New := Value[.N < 3], by = rleid(Value)][rowid(rleid(Value)) == 1L, New := Value]
DT
#    Name Value New
# 1:    A    24  24
# 2:    A    24  24
# 3:    A    45  45
# 4:    A    45  NA
# 5:    A    45  NA
# 6:    A    45  NA
# 7:    A    45  NA
# 8:    A    93  93
# 9:    A    19  19
#10:    A    19  19
#11:    A    10  10
#12:    B    29  29
#13:    B    67  67
#14:    B    67  NA
#15:    B    67  NA
#16:    B    67  NA
#17:    C   201 201
#18:    C   993 993
#19:    C   396 396
#20:    A    19  19
#21:    A    19  NA
#22:    C    19  NA
#23:    B    29  29
#24:    B    67  67
#25:    B    67  NA
#26:    B    67  NA
#27:    B    67  NA
#28:    C    67  NA
#29:    C    67  NA
#30:    C    67  NA
#31:    C    67  NA
#    Name Value New

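The first expression copies Value for all RLE groups that occur only once or twice in a row. All RLE groups with more repetitions get NA. The second expression copies Value only for the first row in each RLE group.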
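The same two steps can be traced in base R. This is a sketch under assumptions: it swaps data.table's rleid() and rowid() for rle() and sequence(), and it uses a short made-up vector rather than DT. The threshold of 3 is taken from the answer's .N < 3.

```r
# Made-up vector for illustration; not the answer's data set.
value <- c(24L, 24L, 45L, 45L, 45L, 45L, 93L)

runs    <- rle(value)
run_len <- rep(runs$lengths, runs$lengths)  # length of the run each element belongs to
run_pos <- sequence(runs$lengths)           # position of each element within its run

# Step 1: keep the value only in runs shorter than 3; everything else becomes NA.
new <- ifelse(run_len < 3L, value, NA_integer_)

# Step 2: restore the value on the first element of every run.
first <- run_pos == 1L
new[first] <- value[first]

print(new)
```

After step 1 the whole run of 45s is NA; step 2 puts 45 back on the first element of that run only.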

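Note that each sequence of repeated values is treated separately, regardless of what Name is, so the change of Name from A to C in row 22 and from B to C between rows 27 and 28 is ignored.

This can be further improved so that the value is copied only where it has not been copied already:

setDT(DT)[, New := Value[.N < 3], by = rleid(Value)
          ][is.na(New) & rowid(rleid(Value)) == 1L, New := Value]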

If a change in Name is expected to "restart" the sequence of Value, this variant can be used as well (credit to Jaap):

setDT(DT)[, New := Value[.N < 3], by = rleid(Name, Value)
          ][is.na(New) & rowid(rleid(Name, Value)) == 1L, New := Value][]
#    Name Value New
# 1:    A    24  24
# 2:    A    24  24
# 3:    A    45  45
# 4:    A    45  NA
# 5:    A    45  NA
# ...
#18:    C   993 993
#19:    C   396 396
#20:    A    19  19
#21:    A    19  19
#22:    C    19  19
#23:    B    29  29
#24:    B    67  67
#25:    B    67  NA
#26:    B    67  NA
#27:    B    67  NA
#28:    C    67  67
#29:    C    67  NA
#30:    C    67  NA
#31:    C    67  NA
#    Name Value New

Note the differences in rows 21, 22, and 28.

Data

DT <- structure(list(Name = c("A", "A", "A", "A", "A", "A", "A", "A", 
"A", "A", "A", "B", "B", "B", "B", "B", "C", "C", "C", "A", "A", 
"C", "B", "B", "B", "B", "B", "C", "C", "C", "C"), Value = c(24L, 
24L, 45L, 45L, 45L, 45L, 45L, 93L, 19L, 19L, 10L, 29L, 67L, 67L, 
67L, 67L, 201L, 993L, 396L, 19L, 19L, 19L, 29L, 67L, 67L, 67L, 
67L, 67L, 67L, 67L, 67L)), .Names = c("Name", "Value"), row.names = c(NA, 
-31L), class = "data.frame")


Note that rows 1 and 8 have been duplicated w.r.t. the OP's data set to cover the case of two repetitions, and a few rows were added at the end.

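Answer 1: (score: 0)

Here is a data.table approach. I show two solutions, for the cases where:

  1. you look for duplicates within Names
  2. you look for duplicates in all data

Here is the code: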

    library(data.table)
    library(dplyr)  # provides the %>% pipe used below
    dt <- data.table(Names = LETTERS[1:5] %>% sample(100, replace = TRUE),
                     Value = sample(1:10, 100, replace = TRUE))
    dt <- dt[order(Names, Value)]
    
    # if you look for in-group duplicates
    dt[, count := .N, by = .(Names, Value)][, New_Value := Value]
    dt[ , dup_ingroup := duplicated(Value), by = Names]
    dt[dup_ingroup & count > 3, New_Value := NA]
    
    # if you look for all duplicates
    dt[, count := .N, by = Value][, New_Value := Value]
    dt[duplicated(Value) & count > 3, New_Value := NA]
    

    Note

    Following up on the comments:

    library(data.table)
    library(dplyr)
    set.seed(20170515)
    dt <- data.table(Names = LETTERS[1:5] %>% sample(100, replace = TRUE),
                     Value = sample(1:10, 100, replace = TRUE))
    dt <- dt[order(Names, Value)]
    dt_1 <- copy(dt)
    dt_2 <- copy(dt) 
    dt_Jaap <- copy(dt)
    # Method 1
    dt_1[, count := .N, by = .(Names, Value)][, New_Value := Value]
    dt_1[ , dup_ingroup := duplicated(Value), by = Names]
    dt_1[dup_ingroup & count > 3, New_Value := NA]
    dt_1[, .N, by = is.na(New_Value)] 
    ## is.na  N
    ## 1: FALSE 73
    ## 2:  TRUE 27
    
    # Method 2
    dt_2[, count := .N, by = Value][, New_Value := Value]
    dt_2[duplicated(Value) & count > 3, New_Value := NA]
    dt_2[, .N, by = is.na(New_Value)] 
    ## is.na  N
    ## 1: FALSE 12
    ## 2:  TRUE 88
    
    # Method suggested by @Jaap
    dt_Jaap[, New_Value := Value][duplicated(Value) & .N > 3, New_Value := NA_integer_, by = .(Names, Value)]
    dt_Jaap[, .N, by = is.na(New_Value)]  
    ## is.na  N
    ## 1: FALSE 10
    ## 2:  TRUE 90
    

    dt_Jaap keeps only the value of the first element for each value of Value.
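The difference between the two variants in this answer (duplicates within Names versus duplicates across all rows) shows up on a small example. The sketch below is a base R illustration under assumptions, not the answer's data.table code: count_by() is a hypothetical helper standing in for count := .N, by = ..., and the eight rows are made up.

```r
# Made-up rows for illustration.
name  <- c("A", "A", "A", "A", "B", "B", "B", "B")
value <- c(1L, 1L, 1L, 1L, 1L, 2L, 2L, 2L)

# Hypothetical helper: number of rows sharing the same key, aligned to the rows.
count_by <- function(key) {
  idx <- match(key, unique(key))
  tabulate(idx)[idx]
}

# Variant 1: duplicates are counted within each name.
key_in <- paste(name, value)
new_in <- ifelse(duplicated(key_in) & count_by(key_in) > 3L, NA_integer_, value)

# Variant 2: duplicates are counted across all rows, ignoring name.
new_all <- ifelse(duplicated(value) & count_by(value) > 3L, NA_integer_, value)

print(new_in)
print(new_all)
```

Row 5 (name B, value 1) is the only row where the two variants disagree: within B the value 1 occurs once and is kept, while across all rows it is the fifth occurrence of 1 and is blanked. Both variants count total occurrences rather than consecutive runs.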

Answer 2: (score: -1)

A dplyr / tidyverse way, assuming the order of the data frame doesn't matter...

df <- data.frame(Name = c("A","A","A","A","A","A","A","A","A","B","B","B","B","B","C","C","C","C","C"),
                 Value = c(24,45,45,45,45,45,93,19,10,29,67,67,67,67,201,993,396,396,396),
                 stringsAsFactors = F)

library(dplyr)

df %>% 
  group_by(Name, Value) %>% 
  mutate(New_Value = ifelse(n() > 3 & row_number() > 1, NA, Value))

Update

A more robust approach that can handle multiple runs of the same value...

df <- read.table(header = T, stringsAsFactors = F, text = "
Name    Value
A       45
A       45
A       45
A       82
A       45
A       45
A       45
A       45
A       12
A       45
A       45
A       45
A       45
A       45
B       29
B       67
B       67
B       67
B       67
")

library(dplyr)

df %>%
  group_by(Name) %>%
  mutate(run_length = with(rle(Value), rep(lengths, lengths))) %>%
  mutate(run_start = seq_along(Value) %in% cumsum(c(1, rle(Value)$lengths))) %>%
  mutate(New_Value = ifelse(run_length < 4 | run_start, Value, NA)) %>%
  ungroup() %>% select(-run_length, -run_start)
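To see why the update behaves differently from the first attempt, the two rules can be compared on a single vector. This is a base R sketch under assumptions, not the dplyr code above: it reimplements the group-wise rule with duplicated() and the run-wise rule with rle(), on the first eight values of group A.

```r
# First eight values of group A from the update's data.
value <- c(45, 45, 45, 82, 45, 45, 45, 45)

# Group-wise rule (first attempt): total count per value, order ignored.
idx      <- match(value, unique(value))
total    <- tabulate(idx)[idx]
by_group <- ifelse(total > 3 & duplicated(value), NA, value)

# Run-wise rule (update): only consecutive repetitions count.
runs    <- rle(value)
run_len <- rep(runs$lengths, runs$lengths)
run_pos <- sequence(runs$lengths)
by_run  <- ifelse(run_len > 3 & run_pos > 1, NA, value)

print(by_group)
print(by_run)
```

The group-wise rule blanks every 45 after the first one, including the initial run of three. The run-wise rule leaves that run alone and blanks only the tail of the later run of four.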