I have the following for loop:
for(i in 1:dim(d)[1])
{
if(d$countryname[i] %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador"))
{next}
else
{d$countryname[i] <- "Others"}
}
The data frame d has more than 6.5 million rows, and d$countryname is a factor.
Is there a way to make this faster? It is very slow. Thanks.
Answer 0 (score: 4)
Work on the levels:
x <- factor(c("a", "a", "b", "b", "c", "d"))
levels(x)[levels(x) %in% c("b", "d")] <- "other"
x
#[1] a a other other c other
#Levels: a other c
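This should be fast because it only touches the levels and avoids scanning the whole vector. Of course, if you use package data.table, you could be even faster.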
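Applied to the question, the same idea needs the condition inverted, since every level that is not in the keep list should become "Others". A minimal sketch, assuming a small made-up data frame in place of the real 6.5-million-row d and a shortened keep list (both are illustrative, not from the question):

```r
# Hypothetical stand-in for the real data frame
d <- data.frame(countryname = factor(c("Italy", "Germany", "Spain",
                                       "Japan", "Canada", "Germany")))
# Shortened keep list for illustration
keep <- c("Italy", "Spain", "Canada")

# Rename every level outside the keep list; levels that end up
# with the same name are merged into a single level
levels(d$countryname)[!levels(d$countryname) %in% keep] <- "Others"

d$countryname
```

Only the level labels are rewritten here, so the cost depends on the number of distinct countries rather than the number of rows.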
Benchmarks
set.seed(42)
test <- data.frame(abc = factor(sample(letters, 6.5e6, replace = TRUE)))
#function by user164385
g <- function(test) {
test$log <- test$abc %in% c("a", "e", "i", "o", "u")
test$abc <- ifelse(test$log, test$abc, "x")
test
}
rol <- function(test) {
levels(test$abc)[levels(test$abc) %in% c("a", "e", "i", "o", "u")] <- "other"
test
}
library(microbenchmark)
microbenchmark(test1 <- data.table:::copy(test),
{test1 <- test; g(test1)},
{test1 <- test; rol(test)}, times = 5, unit = "ms")
#Unit: milliseconds
# expr min lq mean median uq max neval cld
# test1 <- data.table:::copy(test) 5.645598 5.848151 6.044557 5.915754 5.964407 6.848877 5 a
# { test1 <- test g(test1) } 1966.524342 1971.394814 1988.507992 1978.835983 1987.284023 2038.500796 5 c
# { test1 <- test rol(test) } 141.646732 152.205054 154.106125 155.589032 159.307184 161.782623 5 b
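Answer 1 (score: 2)
Use ifelse in R
Loops can be very slow in R, but many built-in functions perform better. My favourite is ifelse:
country_check <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")
# as.character() is needed here: on a factor, ifelse() would return the integer codes instead of the labels
d$countryname <- factor(ifelse(country_check, as.character(d$countryname), "Others"))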
test <- data.frame(abc = factor(sample(letters, 100000, replace = TRUE)))
g <- function() {
test$log <- test$abc %in% c("a", "e", "i", "o", "u")
test$abc <- ifelse(test$log, test$abc, "x")
}
f <- function() {
for(i in 1:dim(test)[1]) {
if(test$abc[i] %in% c("a", "e", "i", "o", "u"))
{next}
else
{test$abc[i] <- "x"}
}}
> system.time(g())
user system elapsed
0.04 0.00 0.05
> system.time(f())
user system elapsed
22.51 7.78 30.57
The timings above compare the ifelse version, g(), with the original loop, f(), on 100,000 rows.
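This is a major improvement, although there may be better solutions. My feeble little computer cannot handle the loop on a data frame with more than 100,000 rows, so I cannot offer a proper benchmark at the real data size.
Using built-in functions, which do the heavy lifting in C code, usually performs better than doing all the hard work in R.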
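One thing to watch when using ifelse on a factor, as in the question: ifelse drops the factor attributes, so the kept values come back as integer codes unless the factor is converted first. A minimal sketch on a made-up vector (the names abc, is_vowel, codes and labels are illustrative):

```r
abc <- factor(c("a", "b", "e"))
is_vowel <- abc %in% c("a", "e", "i", "o", "u")

# Passing the factor directly yields its integer codes as strings
codes <- ifelse(is_vowel, abc, "x")

# Converting to character first keeps the labels
labels <- ifelse(is_vowel, as.character(abc), "x")

print(codes)
print(labels)
```

Wrapping the second result in factor() gives back a factor with the collapsed levels.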
Answer 2 (score: 1)
How about:
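keep <- d$countryname %in% c("Italy","Spain","Canada","Brazil","United States","France","Mexico","Colombia","Peru","Chile","Argentina","Ecuador")
# countryname is a factor, so "others" must exist as a level before it can be assigned
levels(d$countryname) <- c(levels(d$countryname), "others")
d$countryname[!keep] <- "others"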
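A caveat worth checking, since the question says countryname is a factor: assigning a string that is not yet a level produces NA with a warning. A minimal sketch on a made-up three-element factor (the names f, g and h are illustrative):

```r
f <- factor(c("Italy", "Germany", "Spain"))

# Without the level: the assignment warns and stores NA
g <- f
suppressWarnings(g[g == "Germany"] <- "Others")
print(g)

# With the level added first: the assignment works as intended
h <- f
levels(h) <- c(levels(h), "Others")
h[h == "Germany"] <- "Others"
h <- droplevels(h)  # drop the now-unused "Germany" level
print(h)
```

The droplevels call is optional and only tidies up levels that no longer occur in the data.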