Keep only the first row of each run of repeated rows

Time: 2017-11-20 14:34:12

Tags: r

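In my data frame, whenever the string Position appears in several consecutive rows, I want to keep only the first row of that run. See my example output below. I have been trying the duplicated function, but I am not sure how to keep just the first row of each run.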

Time    Pos
2006-01-12  Position
2006-01-16  Position
2006-01-17  Position
2006-02-01  
2006-02-01  Position
2006-02-02  
2006-02-02  Position
2006-02-02  Position
2006-02-02  Position
2006-04-04  Position
2006-04-06  Position
2006-04-06  Position
2006-10-11  
2006-10-17  Position
2006-10-18  
2006-10-18  Position
2006-10-18  
2006-10-18  Position
2006-10-18  
2006-10-18  Position
2006-10-18  Position
2006-10-18  Position
2006-10-18  Position
2006-10-19  Position

Desired output:

Time    Pos
2006-01-12  Position
2006-02-01  
2006-02-01  Position
2006-02-02  
2006-02-02  Position
2006-10-11  
2006-10-17  Position
2006-10-18  
2006-10-18  Position
2006-10-18  
2006-10-18  Position
2006-10-18  
2006-10-18  Position

3 answers:

Answer 0 (score: 3)

Here is a solution using dplyr + data.table::rleid:

library(dplyr)

df %>%
  mutate(ID = data.table::rleid(Pos)) %>%
  group_by(ID) %>%
  slice(1) %>%
  ungroup() %>%
  select(-ID)

Result:

# A tibble: 13 x 2
         Time      Pos
        <chr>    <chr>
 1 2006-01-12 Position
 2 2006-02-01         
 3 2006-02-01 Position
 4 2006-02-02         
 5 2006-02-02 Position
 6 2006-10-11         
 7 2006-10-17 Position
 8 2006-10-18         
 9 2006-10-18 Position
10 2006-10-18         
11 2006-10-18 Position
12 2006-10-18         
13 2006-10-18 Position
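
To see what the grouping key looks like, here is a minimal sketch on a toy vector. The vector `x` is made up for illustration and is not the question's data; the only library call used is `data.table::rleid`.

```r
library(data.table)

# Hypothetical toy vector, not the question's data
x <- c("Position", "Position", "", "Position", "", "", "Position")

# rleid() gives every run of identical consecutive values its own id
ids <- rleid(x)
ids
# 1 1 2 3 4 4 5

# The first element of each run is where a new id appears
first_of_run <- !duplicated(ids)
x[first_of_run]
```

Grouping by this id and taking the first row of each group is what `slice(1)` does in the pipeline above. Note that a run of blank values is collapsed to one row as well, not only runs of Position.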

data.table equivalent:

library(data.table)
setDT(df)[, .SD[1], by = rleid(Pos), .SDcols = c("Time", "Pos")]

Result:

    rleid       Time      Pos
 1:     1 2006-01-12 Position
 2:     2 2006-02-01         
 3:     3 2006-02-01 Position
 4:     4 2006-02-02         
 5:     5 2006-02-02 Position
 6:     6 2006-10-11         
 7:     7 2006-10-17 Position
 8:     8 2006-10-18         
 9:     9 2006-10-18 Position
10:    10 2006-10-18         
11:    11 2006-10-18 Position
12:    12 2006-10-18         
13:    13 2006-10-18 Position

Data:

df = structure(list(Time = c("2006-01-12", "2006-01-16", "2006-01-17", 
"2006-02-01", "2006-02-01", "2006-02-02", "2006-02-02", "2006-02-02", 
"2006-02-02", "2006-04-04", "2006-04-06", "2006-04-06", "2006-10-11", 
"2006-10-17", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", 
"2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", 
"2006-10-19"), Pos = c("Position", "Position", "Position", "", 
"Position", "", "Position", "Position", "Position", "Position", 
"Position", "Position", "", "Position", "", "Position", "", "Position", 
"", "Position", "Position", "Position", "Position", "Position"
)), .Names = c("Time", "Pos"), class = "data.frame", row.names = c(NA, 
-24L))

Answer 1 (score: 2)

df[head(cumsum(c(1, (rle(df$Pos)$lengths))), -1),]
#         Time      Pos
#1  2006-01-12 Position
#4  2006-02-01         
#5  2006-02-01 Position
#6  2006-02-02         
#7  2006-02-02 Position
#13 2006-10-11         
#14 2006-10-17 Position
#15 2006-10-18         
#16 2006-10-18 Position
#17 2006-10-18         
#18 2006-10-18 Position
#19 2006-10-18         
#20 2006-10-18 Position
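
The one-liner above can be unpacked step by step. This is a sketch on a made-up toy vector `x` (an assumption for illustration, not the question's data), using only base R:

```r
# Hypothetical toy vector, not the question's data
x <- c("Position", "Position", "", "Position", "", "", "Position")

# rle() computes the length of each run of identical consecutive values
runs <- rle(x)
runs$lengths
# 2 1 1 2 1

# Start index of each run: begin at 1 and add the run lengths cumulatively,
# then drop the last value because it points one past the end of x
starts <- head(cumsum(c(1, runs$lengths)), -1)
starts
# 1 3 4 5 7

# Indexing with the start positions keeps the first element of every run
x[starts]
```

Used as row indices, `starts` selects the first row of each run, which is why the original row names (1, 4, 5, ...) survive in the output above.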

Answer 2 (score: 1)

You can try using lag:

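library(dplyr)

df2 <- df %>%
     # flag "Position" rows, then look up the flag of the previous row
     mutate(pos = ifelse(Pos == "Position", 1, 0),
            prev = lag(pos, n = 1)) %>%
     # keep blank rows, the first row, and any "Position" row that follows a blank row
     filter(pos == 0 | is.na(prev) | prev == 0) %>%
     select(-pos, -prev)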
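
The previous-row comparison can also be written in base R. The following is a minimal sketch on a made-up toy vector `x` (an assumption for illustration, not the question's data); it drops a Position value only when the value directly before it is also Position:

```r
# Hypothetical toy vector, not the question's data
x <- c("Position", "Position", "", "Position", "", "", "Position")

# Shift the vector down by one so prev[i] holds x[i - 1]; the first element has no predecessor
prev <- c(NA, head(x, -1))

# Keep blanks, the first element, and any "Position" not preceded by "Position"
keep <- x != "Position" | is.na(prev) | prev != "Position"
which(keep)
# 1 3 4 5 6 7
```

Unlike the run-based approaches, this keeps consecutive blank values (positions 5 and 6 here). On the question's data the two give the same rows, since blanks never occur twice in a row there.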