I have the following data frame:
a a a b c c d e a a b b b e e d d
The desired result is:
a b c d e a b e d
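In other words, no two consecutive rows should have the same value. How can this be done without a loop?
My dataset is very large, so a loop takes a long time to run.
The data frame is structured as follows: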
a 1
a 2
a 3
b 2
c 4
c 1
d 3
e 9
a 4
a 8
b 10
b 199
e 2
e 5
d 4
d 10
Result:
a 1
b 2
c 4
d 3
e 9
a 4
b 10
e 2
d 4
The entire row should be removed.
Answer 0 (score: 20)
A simple way is to use rle. Here is your sample data:
x <- scan(what = character(), text = "a a a b c c d e a a b b b e e d d")
# Read 17 items
rle returns a list with two components: the run lengths ("lengths") and the value repeated in each run ("values").
rle(x)$values
# [1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
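To make the structure of that list concrete, here is a small base R sketch (the variable name r is only for illustration) that looks at both components and rebuilds the original vector with inverse.rle:

```r
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
       "a", "b", "b", "b", "e", "e", "d", "d")

r <- rle(x)   # run-length encoding of x
r$lengths     # length of each run of identical values
r$values      # the value repeated within each run

# inverse.rle() expands the encoding back into the original vector
roundtrip_ok <- identical(inverse.rle(r), x)
roundtrip_ok
```

The lengths always sum to length(x), which is what makes it possible to turn them into row positions.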
data.frame
If you are working with a data.frame, try the following:
## Sample data
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10)
)
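## Use rle, as before (as.character() guards against V1 being a factor,
## which is the data.frame default before R 4.0.0; rle() rejects factors)
X <- rle(as.character(mydf$V1))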
## Identify the rows you want to keep
Y <- cumsum(c(1, X$lengths[-length(X$lengths)]))
Y
# [1] 1 4 5 7 8 9 11 13 15
mydf[Y, ]
# V1 V2
# 1 a 1
# 4 b 2
# 5 c 4
# 7 d 3
# 8 e 9
# 9 a 4
# 11 b 10
# 13 e 2
# 15 d 4
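As a cross-check on those indices, the same rows can be selected by comparing each value with the one in the row above it. This is a minimal base R sketch, assuming V1 contains no NA values; keep and result are names introduced here for illustration:

```r
mydf <- data.frame(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10),
  stringsAsFactors = FALSE
)

n <- nrow(mydf)
# TRUE for the first row and for every row whose V1 differs from the row above
keep <- c(TRUE, mydf$V1[-1] != mydf$V1[-n])
result <- mydf[keep, ]
result
```

Because keep is a plain logical vector, the whole row (V1 and V2) is retained or dropped together.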
The "data.table" package has a function, rleid, that makes this easy. Using mydf from above, try:
library(data.table)
as.data.table(mydf)[, .SD[1], by = rleid(V1)]
# rleid V2
# 1: 1 1
# 2: 2 2
# 3: 3 4
# 4: 4 3
# 5: 5 9
# 6: 6 4
# 7: 7 10
# 8: 8 2
# 9: 9 4
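Note that the output above no longer contains the V1 column. If it should be kept, one possible variation (a sketch, not part of the original answer; dt and first_rows are illustrative names) is to drop every row whose run id has already been seen:

```r
library(data.table)

dt <- data.table(
  V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
         "a", "a", "b", "b", "e", "e", "d", "d"),
  V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
         4, 8, 10, 199, 2, 5, 4, 10)
)

# rleid() gives each run of identical V1 values its own integer id;
# keeping only the first occurrence of each id keeps the first row of each run
first_rows <- dt[!duplicated(rleid(V1))]
first_rows
```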
Answer 1 (score: 7)
library(dplyr)
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
# character sentinel as default: newer dplyr versions require default to match the type of x
x[x != lag(x, default = "1")]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
Edit: for a data.frame
mydf <- data.frame(
V1 = c("a", "a", "a", "b", "c", "c", "d", "e",
"a", "a", "b", "b", "e", "e", "d", "d"),
V2 = c(1, 2, 3, 2, 4, 1, 3, 9,
4, 8, 10, 199, 2, 5, 4, 10),
stringsAsFactors=FALSE)
The dplyr solution is a one-liner:
mydf %>% filter(V1!= lag(V1, default="1"))
# V1 V2
#1 a 1
#2 b 2
#3 c 4
#4 d 3
#5 e 9
#6 a 4
#7 b 10
#8 e 2
#9 d 4
P.S.
lead(x, 1), suggested by @Carl Witthoft, compares in the opposite direction, so it keeps the last row of each run instead of the first.
leadit<-function(x) x!=lead(x, default="what")
rows <- leadit(mydf[ ,1])
mydf[rows, ]
# V1 V2
#3 a 3
#4 b 2
#6 c 1
#7 d 3
#8 e 9
#10 a 8
#12 b 199
#14 e 5
#16 d 10
Answer 2 (score: 6)
For base R, I like fun algorithms:
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a", "a", "b", "b", "b", "e", "e", "d", "d")
x[x!=c(x[-1], FALSE)]
#[1] "a" "b" "c" "d" "e" "a" "b" "e" "d"
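The comparison shifts x one position to the left and checks each element against its successor, so it keeps the last element of every run. The values come out the same either way, but the positions differ, which matters when indexing rows of a data frame. A small sketch of both directions, assuming x has no NA values and at least one element (last_of_run and first_of_run are illustrative names):

```r
x <- c("a", "a", "a", "b", "c", "c", "d", "e", "a",
       "a", "b", "b", "b", "e", "e", "d", "d")
n <- length(x)

# TRUE where an element differs from the next one (the last element is always kept)
last_of_run <- c(x[-n] != x[-1], TRUE)
# TRUE where an element differs from the previous one (the first element is always kept)
first_of_run <- c(TRUE, x[-1] != x[-n])

which(last_of_run)    # positions of the last element of each run
which(first_of_run)   # positions of the first element of each run
same_values <- identical(x[last_of_run], x[first_of_run])
```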
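Answer 3 (score: 3)
As much as I like, ...err, love rle, here's a shootout:
Edit: I couldn't figure out exactly what's going on with dplyr, so I used dplyr::lead. I'm on OSX, R 3.1.2, with the latest dplyr from CRAN.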
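library(dplyr)           # for lead()
library(microbenchmark)  # for microbenchmark()
xlet <- sample(letters, 1e5, replace = TRUE)
rleit <- function(x) rle(x)$values
# character default: newer dplyr versions require default to match the type of x
lagit <- function(x) x[x != lead(x, default = "1")]
# note: the last element is compared with itself, so it is always dropped
tailit <- function(x) x[x != c(tail(x, -1), tail(x, 1))]
microbenchmark(rleit(xlet), lagit(xlet), tailit(xlet), times = 20)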
Unit: milliseconds
         expr      min       lq   median       uq      max neval
  rleit(xlet) 27.43996 30.02569 30.20385 30.92817 37.10657    20
  lagit(xlet) 12.44794 15.00687 15.14051 15.80254 46.66940    20
 tailit(xlet) 12.48968 14.66588 14.78383 15.32276 55.59840    20