For loop to modify a factor

Time: 2016-02-05 15:24:37

Tags: r performance

So I have the following for loop:

for(i in 1:dim(d)[1])
{
  if(d$countryname[i] %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador"))
   {next}
   else
   {d$countryname[i] <- "Others"}
}

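The data frame d has more than 6.5 million rows, and d$countryname is a factor.

Is there any way to make this faster? It is very slow. Thanks.

3 answers:

Answer 0 (score: 4)

Work on the levels: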

x <- factor(c("a", "a", "b", "b", "c", "d"))
levels(x)[levels(x) %in% c("b", "d")] <- "other"
x
#[1] a     a     other other c     other
#Levels: a other c

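This should be fast because it only touches the levels and avoids scanning the whole vector. Of course, if you use package data.table, you can be even faster.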
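Applied to the question's data, the same idea might look like the sketch below. The toy data frame and the name keep are made up for illustration; only the country list comes from the question.

```r
# Countries to keep, taken from the question
keep <- c("Italy", "Spain", "Canada", "Brazil", "United States", "France",
          "Mexico", "Colombia", "Peru", "Chile", "Argentina", "Ecuador")

# Toy stand-in for the real 6.5-million-row data frame (assumption)
d <- data.frame(countryname = factor(c("Italy", "Germany", "Peru", "Japan", "Italy")))

# Rename every level that is NOT in the keep list; R merges the duplicates
levels(d$countryname)[!(levels(d$countryname) %in% keep)] <- "Others"

as.character(d$countryname)
#[1] "Italy"  "Others" "Peru"   "Others" "Italy"
```

Only the handful of levels is compared against the list, not each of the rows.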

Benchmark:

set.seed(42)
test <- data.frame(abc = factor(sample(letters, 6.5e6, replace = TRUE)))
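# function by user164385 (vectorized ifelse approach)
g <- function(test) {
  test$log <- test$abc %in% c("a", "e", "i", "o", "u")
  # note: ifelse() drops factor attributes, so kept values come back as integer codes;
  # wrapping test$abc in as.character() would keep the labels (left as timed below)
  test$abc <- ifelse(test$log, test$abc, "x")
  test
}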

rol <- function(test) {
  levels(test$abc)[levels(test$abc) %in% c("a", "e", "i", "o", "u")] <- "other"
  test
}

library(microbenchmark)
microbenchmark(test1 <- data.table:::copy(test), 
               {test1 <- test; g(test1)}, 
               {test1 <- test; rol(test)}, times = 5, unit = "ms")
#Unit: milliseconds
#                                expr         min          lq        mean      median          uq         max neval cld
#    test1 <- data.table:::copy(test)    5.645598    5.848151    6.044557    5.915754    5.964407    6.848877     5 a  
#  {     test1 <- test     g(test1) } 1966.524342 1971.394814 1988.507992 1978.835983 1987.284023 2038.500796     5   c
# {     test1 <- test     rol(test) }  141.646732  152.205054  154.106125  155.589032  159.307184  161.782623     5  b 

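Answer 1 (score: 2)

Loops in R can be very slow, but there are a number of built-in R functions that give better performance. My favourite is ifelse:

country_check <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")
# as.character() is needed: ifelse() would otherwise return the factor's integer codes
d$countryname <- factor(ifelse(country_check, as.character(d$countryname), "Others"))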

test <- data.frame(abc = factor(sample(letters, 100000, replace = TRUE)))
g <- function() {
   test$log <- test$abc %in% c("a", "e", "i", "o", "u")
   test$abc <- ifelse(test$log, test$abc, "x")
}
f <- function() {
  for (i in 1:dim(test)[1]) {
    if (test$abc[i] %in% c("a", "e", "i", "o", "u")) {
      next
    } else {
      test$abc[i] <- "x"
    }
  }
}

> system.time(g())
   user  system elapsed 
   0.04    0.00    0.05 
> system.time(f())
   user  system elapsed 
  22.51    7.78   30.57 

The timings above compare the vectorized g() with the loop f() on 100,000 rows.

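This is a substantial improvement, although there may be better solutions. My feeble little computer can't handle the loop on a data frame with more than 100,000 rows, so I can't provide a proper benchmark for an example of the actual size.

Using built-in functions that hide their internals in C code will generally give better performance than doing all the hard work in R.

Answer 2 (score: 1)

How about: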
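As a rough sanity check, the sketch below confirms that a loop and a vectorized ifelse call produce the same values on a small vector. The vector size, the seed, and the names loop_result and vec_result are assumptions chosen for illustration; no timings are claimed.

```r
set.seed(1)
x <- factor(sample(letters, 1000, replace = TRUE))
vowels <- c("a", "e", "i", "o", "u")

# Loop version: work on the character labels, one element at a time
loop_result <- as.character(x)
for (i in seq_along(loop_result)) {
  if (!(loop_result[i] %in% vowels)) loop_result[i] <- "x"
}

# Vectorized version: one call, the element-wise work happens inside R's internals
vec_result <- ifelse(x %in% vowels, as.character(x), "x")

identical(loop_result, vec_result)
```

Wrapping either version in system.time() on your own machine shows how the gap grows with the number of rows.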

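log <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")

# "Others" must exist as a level first; otherwise the assignment yields NA with a warning
levels(d$countryname) <- c(levels(d$countryname), "Others")
d$countryname[!log] <- "Others"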
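A small self-contained sketch of this logical-index approach on toy data follows. A factor only accepts values that are among its levels, so the sketch registers "Others" before assigning and then uses droplevels() to discard the levels that are no longer used. The toy data frame, the shortened keep list, and the name is_kept are made up for illustration.

```r
# Shortened keep list and toy data (assumptions for illustration)
keep <- c("Italy", "Peru")
d <- data.frame(countryname = factor(c("Italy", "Germany", "Peru", "Japan")))

is_kept <- d$countryname %in% keep

# Register the new level, then overwrite the rows that are not kept
levels(d$countryname) <- c(levels(d$countryname), "Others")
d$countryname[!is_kept] <- "Others"

# Remove levels such as "Germany" that no longer occur
d$countryname <- droplevels(d$countryname)

levels(d$countryname)
#[1] "Italy"  "Peru"   "Others"
```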