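I have two data frames in R, one of which is a subset of the other. I need to do some operations on them and, for each of 6 x-values (DayTreat in the code), compute the subset data as a percentage of the main data frame. So I wrote a function that does the calculation and creates a new column. My problem is that it runs slowly. Any suggestions?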
Answer 0 (score: 1)
Looking at your code, it seems you are doing redundant computation. This line:
for (i in fullDat$DayTreat)
should be:
for (i in unique(fullDat$DayTreat))
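To make the saving concrete, here is a minimal sketch on made-up data (the values are assumptions for illustration, not your data). The original loop runs once per row, while the fixed loop runs once per distinct DayTreat value. It also shows that base R's `tapply()` can produce all per-day sums without an explicit loop.

```r
# Toy data, invented for illustration: 4 rows, 2 distinct DayTreat values
fullDat <- data.frame(Abundance = c(10, 20, 30, 40),
                      DayTreat  = c(1, 2, 1, 2))

nOriginal <- length(fullDat$DayTreat)          # iterations in the original loop
nFixed    <- length(unique(fullDat$DayTreat))  # iterations after the fix

# Per-day sums in one call, named by DayTreat value
daySums <- tapply(fullDat$Abundance, fullDat$DayTreat, sum)

print(c(original = nOriginal, fixed = nFixed))
print(daySums)
```

With 6 DayTreat values and many rows per value, the original loop recomputes the same sum many times over.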
After that, you can use data.table instead of separate data frames, if, as you say, one is a subset of the other:
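library(data.table)
setDT(fullDat)  # convert to a data.table by reference
fullDat[, subsetI := Abundance > 30]  # for example; replace with your own condition
fullDat[, DaySum := sum(Abundance), by = DayTreat]  # total abundance per DayTreat
fullDat[, DayPerc := Abundance / DaySum]  # each row's share of its day's total
# get the subset:
fullDat[subsetI == TRUE]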
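As a quick check of the steps above, here is the same approach on a small made-up table. The data and the `Abundance > 30` condition are placeholders I chose for illustration.

```r
library(data.table)

# Toy data, invented for illustration: two DayTreat groups
fullDat <- data.table(Abundance = c(10, 40, 50, 25, 75),
                      DayTreat  = c("A", "A", "A", "B", "B"))

fullDat[, subsetI := Abundance > 30]                # placeholder subset condition
fullDat[, DaySum := sum(Abundance), by = DayTreat]  # per-day totals: A = 100, B = 100
fullDat[, DayPerc := Abundance / DaySum]            # share of the day's total

res <- fullDat[subsetI == TRUE]                     # keep only the subset rows
print(res)
```

The three rows kept are 40 and 50 from day A and 75 from day B, with DayPerc 0.4, 0.5 and 0.75.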
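If you provide sample data and the desired output, I can give more specific code.
Answer 1 (score: 0)
So, from a high-level perspective, I think the solution is:
Example: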
require(tidyverse)
require(data.table)

percDay <- function(fullDat, subDat)
{
  subDat$DaySum <- NULL
  for (i in fullDat$DayTreat) # for each DayTreat value in fullDat. Must be `psmelt()` made phyloseq object
  {
    r <- sum(fullDat$Abundance[fullDat$DayTreat == i]) # Take the sum of all the taxa for that day
    subDat$DaySum[subDat$DayTreat == i] <- r # Add the value to the subset of data
  }
  subDat$DayPerc <- (subDat$Abundance/subDat$DaySum) # Make the percentage of the subset
  subDat
}

# My simulation of your data.frame:
fullDat <- data.frame(Abundance=rnorm(200),
                      DayTreat=c(1:100,1:100))
subDat <- dplyr::sample_frac(fullDat, .25)

# Your function modifies the data, so I'll make a copy. For a potential
# speed improvement I'll try data.table class
fullDat0 <- as.data.table(fullDat)
subDat0 <- as.data.table(subDat)

require(rbenchmark)
benchmark("original" = {
            percDay(fullDat, subDat)
          },
          "example_improvement" = {
            # Tidy approach: one sum per DayTreat, then join onto the subset
            tmp <- fullDat0 %>%
              group_by(DayTreat) %>%
              summarize(DaySum = sum(Abundance))
            subDat0 <- merge(subDat, tmp, by="DayTreat") # could use dplyr::inner_join
            subDat0$DayPerc <- (subDat0$Abundance/subDat0$DaySum) # could use mutate
          },
          replications = 100,
          columns = c("test", "replications", "elapsed",
                      "relative", "user.self", "sys.self"))
                test replications elapsed relative user.self sys.self
 example_improvement          100    0.22    1.000      0.22     0.00
            original          100    1.42    6.455      1.23     0.01
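In general, the data.table approach will be the fastest. The tibble-based "tidy" approach has cleaner syntax and is usually faster than data.frame, but slower than data.table. An experienced data.table expert such as @akrun could probably get the best performance with a single data.table statement.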
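For what it's worth, here is a rough sketch of how a pure data.table version might look when the full and subset data live in two separate tables. I have not benchmarked it. The table names `fullDT`, `subDT` and `daySums` and the toy values are my own assumptions. It aggregates once per DayTreat and then uses an update join to add the sums to the subset by reference.

```r
library(data.table)

# Toy data, invented for illustration
fullDT <- data.table(Abundance = c(10, 30, 20, 40),
                     DayTreat  = c(1, 1, 2, 2))
subDT  <- fullDT[Abundance > 15]   # stand-in for your real subset

# One sum per DayTreat, computed from the full table
daySums <- fullDT[, .(DaySum = sum(Abundance)), by = DayTreat]

# Update join: look up each subset row's DaySum and add it by reference
subDT[daySums, on = "DayTreat", DaySum := i.DaySum]
subDT[, DayPerc := Abundance / DaySum]

print(subDT)
```

The update join avoids both the row-by-row loop and the extra copy that `merge()` makes, which is where I would expect most of the gain on larger data.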