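I have a very large data set with 3.7 million rows and 76 string columns.
I want to compare each row with the row below it to see whether they match, and I wrote the code below. It should report how many identical patterns the upper and lower row share.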
a <- c("a","a","a","a","a","a","a","a","a")
b <- c("b","b","b","b","a","b","b","b","b")
c <- c("c","c","c","c","a","a","a","b","b")
d <- c("d","d","d","d","d","d","d","d","d")
library(data.table)  # needed for as.data.table()
features_split <- data.frame(a, b, c, d); features_split
ncol = max(sapply(features_split, length))
safe <- as.data.table(lapply(1:ncol, function(i) sapply(features_split, "[", i)))
nrow(safe)
df <- safe
LIST <-list()
LIST2 <-list()
for(i in 1:(nrow(df)-1))
{
LIST[[i]] <-df[i+1,] %in% df[i,]
LIST2[[i]] <- length(LIST[[i]][LIST[[i]]==TRUE])
}
safe2 <- unlist(LIST2)
not_available <- rowSums(!is.na(safe))
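This loop takes forever to run. How can I improve it? (It needs about 1 hour for 100,000 rows, but I have more than 3.7 million.)
Grateful for any help, Tobi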
Answer 0 (score: 2)
Proof of concept, using a data.frame:
set.seed(4)
nr <- 1000
mydf <- data.frame(a=sample(letters[1:3], nr, repl=TRUE),
b=sample(letters[1:3], nr, repl=TRUE),
c=sample(letters[1:3], nr, repl=TRUE),
d=sample(letters[1:3], nr, repl=TRUE),
stringsAsFactors=FALSE)
matches <- vapply(seq.int(nrow(mydf)-1),
function(ii) sum(mydf[ii,] == mydf[ii+1,]),
integer(1))
head(matches)
## [1] 0 3 4 2 1 0
sum(matches == 4) # total number of perfect row-matches
## 16
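In matches, the integer at position i is the number of strings in row i that exactly match the corresponding strings in row i+1. A value of 0 means there was no match at all, and (in this case) 4 means the two rows match perfectly.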
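To make the meaning of each count concrete, here is a minimal sketch on a tiny hand-written matrix. The matrix `m` and its contents are made up for illustration and are not part of the original data; the counting expression is the same one used above.

```r
# Hypothetical 4 x 4 character matrix, one observation per row.
m <- rbind(c("a", "b", "c", "d"),
           c("a", "b", "c", "d"),
           c("a", "a", "a", "d"),
           c("a", "b", "a", "d"))

# For each row ii, count the positions where it equals row ii + 1.
matches <- vapply(seq_len(nrow(m) - 1),
                  function(ii) sum(m[ii, ] == m[ii + 1, ]),
                  integer(1))
print(matches)
## [1] 4 2 3
```

Rows 1 and 2 agree in all four columns, rows 2 and 3 agree only in the first and last column, and rows 3 and 4 agree in three columns.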
Making it larger to demonstrate the timing:
nr <- 100000
nc <- 76
mydf2 <- as.data.frame(matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc),
stringsAsFactors=FALSE)
dim(mydf2)
## [1] 100000 76
system.time(
matches2 <- vapply(seq.int(nrow(mydf2)-1),
function(ii) sum(mydf2[ii,] == mydf2[ii+1,]),
integer(1))
)
## user system elapsed
## 370.63 12.14 385.36
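If you can work with it as a matrix instead of a data.frame (which is possible because your data has the homogeneous type "character"), you get much better performance: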
nr <- 100000
nc <- 76
mymtx2 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx2)
## [1] 100000 76
system.time(
matches2 <- vapply(seq.int(nrow(mymtx2)-1),
function(ii) sum(mymtx2[ii,] == mymtx2[ii+1,]),
integer(1))
)
## user system elapsed
## 0.81 0.00 0.81
(Compare this with the 370.63 user seconds of the previous run.) Scaling it up to the full size:
nr <- 3.7e6
nc <- 76
mymtx3 <- matrix(sample(letters[1:4], nr*nc, repl=TRUE), nc=nc)
dim(mymtx3)
## [1] 3700000 76
system.time(
matches3 <- vapply(seq.int(nrow(mymtx3)-1),
function(ii) sum(mymtx3[ii,] == mymtx3[ii+1,]),
integer(1))
)
## user system elapsed
## 35.32 0.05 35.81
length(matches3)
## [1] 3699999
sum(matches3 == nc)
## [1] 0
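Unfortunately there are still no perfect matches, but I think 36 seconds for 3.7M rows is much better than an hour for 100K. (If I have made wrong assumptions, please correct me.)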
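One assumption worth checking is that values are compared position by position with `==`, whereas the loop in the question uses `%in%`, which tests set membership regardless of column. The sketch below shows how the two counts can differ; the vectors `upper` and `lower` are hypothetical and only serve as an illustration.

```r
# Two hypothetical adjacent rows, written as plain character vectors.
upper <- c("a", "b", "c", "d")
lower <- c("a", "a", "a", "d")

# Positional comparison: column k of one row against column k of the other.
positional <- sum(lower == upper)    # 2

# Set membership: does each value of `lower` occur anywhere in `upper`?
membership <- sum(lower %in% upper)  # 4

print(c(positional = positional, membership = membership))
```

If membership is what is actually wanted, the `==` inside the `vapply` call would have to be replaced by `%in%`, and the timings above would no longer apply as measured.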
(For reference: win7 x64, R-3.0.3-64bit, intel i7-2640M 2.8GHz, 8GB RAM)
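As a possible further step, the per-row loop can be replaced by a single comparison of two shifted copies of the matrix followed by `rowSums`. This is a sketch of a different technique from the `vapply` approach above and it was not timed on the machine listed; the helper name `count_adjacent_matches` is made up for this example. It builds temporary copies of the matrix, so it needs noticeably more memory than the loop.

```r
# Hypothetical helper: number of positional matches between each row of a
# character matrix and the row directly below it.
count_adjacent_matches <- function(m) {
  stopifnot(is.matrix(m), nrow(m) >= 2)
  n <- nrow(m)
  # m[-n, ] holds rows 1..(n-1); m[-1, ] holds rows 2..n.
  as.integer(rowSums(m[-n, , drop = FALSE] == m[-1, , drop = FALSE]))
}

# Check against the loop-based version on a small random matrix.
set.seed(4)
small <- matrix(sample(letters[1:4], 1000 * 76, replace = TRUE), ncol = 76)
looped <- vapply(seq_len(nrow(small) - 1),
                 function(ii) sum(small[ii, ] == small[ii + 1, ]),
                 integer(1))
vectorised <- count_adjacent_matches(small)
stopifnot(identical(looped, vectorised))
```

A data.frame whose columns are all character can be converted first with `as.matrix()`. If the data contains `NA` values, both versions return `NA` for the affected rows; passing `na.rm = TRUE` to `rowSums` (or to `sum` in the loop) would count those positions as non-matches instead.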