如何在申请中提高申请表现

时间:2017-07-21 23:10:09

标签: r performance parallel-processing apply

在询问R merged loop performance问题一段时间后,我已经达到了下面的代码。

我的.csv文件长1250行,宽2500多列。列中的数据类型事先是未知的,可以是正数或字符串等。

我想要实现的是将具有95%相等项目的行留给任何其他行(2500中的2375列中的任何一列中的相等值)。

最初我试图检查相似之处,但后来我意识到通过消除而不是聚合将需要更少的计算。由于我正在寻找95%的相似性,如果一条线与数据集中的每一条线都有2500-2375 = 125 + 1个不相等的列,那么我可以确定这条线与其他线的相似性永远不会达到95%可以从进一步处理中删除。

下面的代码成功查看所有行的最左边的checkColCount(%6)列,然后删除与所有其他行(如果存在的话)不相似的行,然后移到下一个checkColCount列。 xdat(如果有)中的结果行是具有95%以上相似列的行。

问题是时间,我试图把它放到parApply中,但结果与此不同(中间有很多空行)。我假设,理想情况下我应该能够并行化以下两个应用程序。

注意:我在Windows OS上

工作代码:

xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal** 
perc <- 95

#replacing NA's with unique random negative numbers to make sures that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE) 
xdat<-rbind(xdat,xdat[1,])

compareRow <- function(a,b)
  {
    sum(a!=b,na.rm=TRUE)
  }
#adding the first row to the end for testing
xdat<-rbind(xdat,xdat[1,])
system.time( try(
for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
  lastColToCheck <- 100-perc +1 + firstColToCheck
  checkColCount <- lastColToCheck- firstColToCheck +1
  xx <- apply(xdat[,firstColToCheck:lastColToCheck],1, function(a)
    {
      apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b)  })
    })
  sumXX <-rowSums(xx)
  checkValue <- checkColCount*(nrow(xdat)-1)
  xdat<-xdat[rowSums(xx)<checkValue,]
  cat(firstColToCheck , " ")
  gc(verbose = FALSE)
} , silent = TRUE )
)
xdat0 <- xdat
#   user  system elapsed
#   5.22    0.00    5.21

我尝试了parApply:

xdat <- read.csv("ttt.csv", header =TRUE, stringsAsFactors = FALSE )
#Normally, I'm interested in at least 95% of the data to be **equal** 
perc <- 95
#replacing NA's with unique random negative numbers to make sures that NA's don't appear as identical between rows
xdat[is.na(xdat)] <- -1 * sample(dim(xdat)[1]*dim(xdat)[2], size=sum(is.na(xdat)), replace=FALSE) 
xdat<-rbind(xdat,xdat[1,])

library(parallel)
no_cores <- detectCores() - 1
cl1 <- makeCluster(no_cores)
clusterExport(cl1, c('compareRow','firstColToCheck','xdat','lastColToCheck','checkColCount'))
clusterEvalQ(cl1, library(parallel))


system.time( try(
  for (firstColToCheck in seq(1,100-(100-perc),100-perc)){
    lastColToCheck <- 100-perc +1 + firstColToCheck
    checkColCount <- lastColToCheck- firstColToCheck +1
    xx <- parApply(cl1,xdat[,firstColToCheck:lastColToCheck],1, function(a)
    {
      apply(xdat[,firstColToCheck:lastColToCheck],1, function(b) {compareRow(a,b)  })
    })
    sumXX <-rowSums(xx)
    checkValue <- checkColCount*(nrow(xdat)-1)

    xdat<-xdat[rowSums(xx)<checkValue,]
    cat(firstColToCheck , " ")
  } , silent = TRUE )
)
stopCluster(cl1)
#   user  system elapsed
#   0.14    0.00    0.63
xdat1 <- xdat

示例数据:

id  group   hs.grad race    gender  age m.status    political   n.kids  income
1   treat   yes white   male    19  never   republican  1   4716
2   control yes black   male    30  divorced    independent 2   4724
3   control yes black   female  32  married republican  3   1096
4   control no  white   male    35  divorced    republican  4   1084
5   control yes white   female  18  married republican  5   4720
6   control yes asian   male    22  married independent 6   2577
7   control yes white   female  26  never   democrat    7   3154
8   control yes asian   male    40  married republican  8   3267
9   control yes asian   female  23  married independent 9   3603
10  treat   yes white   male    19  divorced    republican  1   4716

示例结果假设我不需要95%但70%相等

id  group   hs.grad race    gender  age m.status    political   n.kids  income
1   treat   yes white   male    19  never   republican  1   4716
10  treat   yes white   male    19  divorced    republican  1   4716

只剩下第1行和第10行,因为它们有8列相等(超过70%)所有其他行与任何其他行相差至少4列,因此它们已从集合中删除。

0 个答案:

没有答案