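In my data frame, when the string Position appears in several consecutive rows, I want to keep only the first row of each such run. Please see my example output below.
I am trying the duplicated function, but I am not sure how to keep only the first row of each run.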
Time Pos
2006-01-12 Position
2006-01-16 Position
2006-01-17 Position
2006-02-01
2006-02-01 Position
2006-02-02
2006-02-02 Position
2006-02-02 Position
2006-02-02 Position
2006-04-04 Position
2006-04-06 Position
2006-04-06 Position
2006-10-11
2006-10-17 Position
2006-10-18
2006-10-18 Position
2006-10-18
2006-10-18 Position
2006-10-18
2006-10-18 Position
2006-10-18 Position
2006-10-18 Position
2006-10-18 Position
2006-10-19 Position
Output:
Time Pos
2006-01-12 Position
2006-02-01
2006-02-01 Position
2006-02-02
2006-02-02 Position
2006-10-11
2006-10-17 Position
2006-10-18
2006-10-18 Position
2006-10-18
2006-10-18 Position
2006-10-18
2006-10-18 Position
Answer 0 (score: 3)
Here is a solution using dplyr + data.table::rleid:
library(dplyr)

df %>%
  mutate(ID = data.table::rleid(Pos)) %>%  # one ID per run of identical consecutive values
  group_by(ID) %>%
  slice(1) %>%                             # keep the first row of each run
  ungroup() %>%
  select(-ID)
Result:
# A tibble: 13 x 2
Time Pos
<chr> <chr>
1 2006-01-12 Position
2 2006-02-01
3 2006-02-01 Position
4 2006-02-02
5 2006-02-02 Position
6 2006-10-11
7 2006-10-17 Position
8 2006-10-18
9 2006-10-18 Position
10 2006-10-18
11 2006-10-18 Position
12 2006-10-18
13 2006-10-18 Position
Or the data.table equivalent:
library(data.table)
setDT(df)[, .SD[1], by = rleid(Pos), .SDcols = c("Time", "Pos")]
Result:
rleid Time Pos
1: 1 2006-01-12 Position
2: 2 2006-02-01
3: 3 2006-02-01 Position
4: 4 2006-02-02
5: 5 2006-02-02 Position
6: 6 2006-10-11
7: 7 2006-10-17 Position
8: 8 2006-10-18
9: 9 2006-10-18 Position
10: 10 2006-10-18
11: 11 2006-10-18 Position
12: 12 2006-10-18
13: 13 2006-10-18 Position
Data:
df = structure(list(Time = c("2006-01-12", "2006-01-16", "2006-01-17",
"2006-02-01", "2006-02-01", "2006-02-02", "2006-02-02", "2006-02-02",
"2006-02-02", "2006-04-04", "2006-04-06", "2006-04-06", "2006-10-11",
"2006-10-17", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18",
"2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18", "2006-10-18",
"2006-10-19"), Pos = c("Position", "Position", "Position", "",
"Position", "", "Position", "Position", "Position", "Position",
"Position", "Position", "", "Position", "", "Position", "", "Position",
"", "Position", "Position", "Position", "Position", "Position"
)), .Names = c("Time", "Pos"), class = "data.frame", row.names = c(NA,
-24L))
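To make the role of the run ID in both versions above more concrete, here is a minimal base R sketch. It assumes a toy vector `pos` (hypothetical, not the original data) and rebuilds a run ID with `rle()` instead of `rleid()`, so it is an illustration of the idea rather than what `rleid()` does internally.

```r
# Toy vector shaped like the Pos column (hypothetical data).
pos <- c("Position", "Position", "", "Position", "", "", "Position")

# rle() summarises consecutive runs; repeating each run's index by the
# run's length yields one ID per element, analogous to rleid().
runs <- rle(pos)
run_id <- rep(seq_along(runs$lengths), runs$lengths)
run_id
# 1 1 2 3 4 4 5

# duplicated() applied to the run ID (not to the raw values) marks every
# row after the first within each run.
first_of_run <- !duplicated(run_id)
pos[first_of_run]
# "Position" "" "Position" "" "Position"
```

This also suggests why `duplicated()` on the `Pos` column alone is not enough: it flags every repeat of a value anywhere in the column, not only repeats within a consecutive run.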
Answer 1 (score: 2)
df[head(cumsum(c(1, (rle(df$Pos)$lengths))), -1),]
# Time Pos
#1 2006-01-12 Position
#4 2006-02-01
#5 2006-02-01 Position
#6 2006-02-02
#7 2006-02-02 Position
#13 2006-10-11
#14 2006-10-17 Position
#15 2006-10-18
#16 2006-10-18 Position
#17 2006-10-18
#18 2006-10-18 Position
#19 2006-10-18
#20 2006-10-18 Position
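The one-liner above can be unpacked step by step. The sketch below uses a small hypothetical vector `pos` rather than the original data frame, and the intermediate variable names are only for illustration.

```r
# Toy vector shaped like the Pos column (hypothetical data).
pos <- c("Position", "Position", "", "Position", "", "", "Position")

# Length of each run of identical consecutive values.
run_lengths <- rle(pos)$lengths
# 2 1 1 2 1

# Prepending 1 and taking the cumulative sum gives the start index of
# every run, plus one extra index just past the end of the vector.
starts_plus_one <- cumsum(c(1, run_lengths))
# 1 3 4 5 7 8

# Dropping that last element leaves only the run starts.
run_starts <- head(starts_plus_one, -1)
run_starts
# 1 3 4 5 7

pos[run_starts]
# "Position" "" "Position" "" "Position"
```

Indexing the data frame with `run_starts` as row numbers is what keeps the original row names (1, 4, 5, ...) visible in the output above.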
Answer 2 (score: 1)
You can try using lag:
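library(dplyr)

df2 <- df %>%
  mutate(pos = ifelse(Pos == "Position", 1, 0),
         prev = lag(pos, n = 1)) %>%   # dplyr::lag() takes `n`, not `k`
  # keep the first row and every row whose value differs from the previous row
  filter(is.na(prev) | pos != prev)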
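The same previous-row comparison can be sketched in base R without dplyr. The data frame `df_toy` below is hypothetical, and the sketch assumes at least one row. It keeps a row whenever its `Pos` differs from the row above, so blank rows are retained as in the desired output.

```r
# Small hypothetical data frame with the same two columns as the question.
df_toy <- data.frame(
  Time = c("2006-01-12", "2006-01-16", "2006-02-01", "2006-02-01", "2006-02-02"),
  Pos  = c("Position", "Position", "", "Position", "Position"),
  stringsAsFactors = FALSE
)

n <- nrow(df_toy)

# TRUE for the first row and for every row whose Pos differs from the
# row directly above it.
is_run_start <- c(TRUE, df_toy$Pos[-1] != df_toy$Pos[-n])

result <- df_toy[is_run_start, ]
result
# keeps rows 1, 3 and 4
```

Comparing the values directly, rather than testing whether the previous row was blank, means the filter works the same way for runs of blanks and runs of "Position".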