Currently I am running into computation-time problems, because I run a triple for loop in R to create an anomaly threshold at the hour level for each day of the week, for every unique ID.
My original data frame: unique ID, event date time, event date, event day of week, event hour, numeric variable 1, numeric variable 2, etc.
df <- read.csv("mm.csv", header = TRUE, sep = ",")
for (i in unique(df$customer_id)) {
  # I initialize the output data frame so I can rbind as I loop through the grains.
  # This data frame is always emptied out once we move on to the next customer_id.
  output.df <- data_frame(seller_name = factor(), is_anomaly_date = integer(), event_date_hr = double(), event_day_of_wk = integer(), event_day = double(), ...)
  for (k in unique(df$event_day_of_wk)) {
    for (z in unique(df$event_hr)) {
      # Columns 10:19 are the 9 different numeric variables I am creating anomaly thresholds for
      merchant.df = df[df$customer_id == i & df$event_day_of_wk == k & df$event_hr == z, 10:19]
      # 1st anomaly threshold - I have multiple different anomaly thresholds
      # TRANSFORM VARIABLES - sometimes within the for loop I run another loop that transforms the subset of data within it.
      for (j in names(merchant.df)) {
        merchant.df[[paste(j, "_log")]] <- log(merchant.df[[j]] + 1)
        #merchant.df[[paste(j, "_scale")]] <- scale(merchant.df[[j]])
        #merchant.df[[paste(j, "_cube")]] <- merchant.df[[j]]**3
        #merchant.df[[paste(j, "_cos")]] <- cos(merchant.df[[j]])
      }
      mu_vector = apply(merchant.df, 2, mean)
      sigma_matrix = cov(merchant.df, use = "complete.obs", method = 'pearson')
      inv_sigma_matrix = ginv(sigma_matrix)  # MASS::ginv
      det_sigma_matrix = det(sigma_matrix)
      z_probas = apply(merchant.df, 1, mv_gaussian, mu_vector, det_sigma_matrix, inv_sigma_matrix)
      eps = quantile(z_probas, 0.01)
      mv_outliers = z_probas < eps
      # 2nd anomaly threshold
      nov = ncol(merchant.df)
      pca_result <- PCA(merchant.df, graph = F, ncp = nov, scale.unit = T)  # FactoMineR::PCA
      pca.var <- pca_result$eig[, 'cumulative percentage of variance'] / 100
      lambda <- pca_result$eig[, 'eigenvalue']
      anomaly_score = (as.matrix(pca_result$ind$coord) ^ 2) %*% (1 / as.matrix(lambda, ncol = 1))
      significance <- 0.99
      thresh = qchisq(significance, nov)
      pca_outliers = anomaly_score > thresh
      # This is where I bind the anomaly points with the original data frame and then row bind
      # onto the final output data frame; the code then goes back to the top and loops through
      # the next hour and then day of the week. temp.output.df is constantly remade and
      # output.df slowly grows bigger.
      temp.output.df <- cbind(merchant.df, mv_outliers, pca_outliers)
      output.df <- rbind(output.df, temp.output.df)
    }
  }
  # Again this is where I write the output for a particular unique ID; output.df is then
  # recreated at the top for the next unique ID.
  write.csv(output.df, file = paste0(i, ".csv"), row.names = FALSE)
}
The code above shows the idea of what I am doing. As you can see, I run three nested for loops where I compute multiple anomaly detections at the lowest grain, i.e. the hour level by day of week, and once that is done I output the results for each unique customer_id to a csv.
On the whole the code runs fairly quickly; however, the triple for loop is killing my performance. Given my original data frame, and the need to output a csv at each unique_id level, does anyone know a better way to do this?
Answer 0 (score: 1)
- Don't do explicit looping; use dplyr::group_by(customer_id, event_day_of_wk, event_hr) or the data.table equivalent. Both should be much faster.
- You're explicitly appending on every iteration with rbind and cbind, and that is what is killing your performance.
- You don't need to cbind() your entire input df; your only actual outputs are mv_outliers, pca_outliers. You can join() the input and output dfs on customer_id, event_day_of_wk, event_hr.
- Since you want to write.csv() all the results for each customer_id separately, that needs to go at the outer grouping level, with group_by(event_day_of_wk, event_hr) at the inner level.
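As a minimal runnable sketch of the grouped-threshold-plus-join idea (toy data; the single numeric column `x` and the 1% quantile cutoff stand in for the real variables and thresholds):

```r
library(dplyr)

# Toy data standing in for the real input: two customers at the
# (day-of-week, hour) grain with one numeric variable
set.seed(1)
df <- data.frame(
  customer_id     = rep(c("A", "B"), each = 50),
  event_day_of_wk = sample(1:7, 100, replace = TRUE),
  event_hr        = sample(0:23, 100, replace = TRUE),
  x               = rnorm(100)
)

# Instead of rbind() inside a triple loop, compute each group's
# threshold in a single grouped pass
thresholds <- df %>%
  group_by(customer_id, event_day_of_wk, event_hr) %>%
  summarize(eps = quantile(x, 0.01), .groups = "drop")

# Join the per-group thresholds back onto the input and flag outliers,
# rather than cbind()-ing the whole input to the output each iteration
flagged <- df %>%
  inner_join(thresholds, by = c("customer_id", "event_day_of_wk", "event_hr")) %>%
  mutate(is_outlier = x < eps)
```

Because every input row matches exactly one threshold row on the three grouping keys, `flagged` has the same number of rows as `df`, with the flags attached.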
# Here is pseudocode, you can figure out the rest, do things incrementally
# It looks like seller_name, is_anomaly_date, event_date_hr, event_day_of_wk, event_day, ... are variables from your input
require(dplyr)
output.df <- df %>%
  group_by(customer_id) %>%
  group_by(event_day_of_wk, event_hr, add = TRUE) %>%  # add = TRUE keeps the customer_id grouping
  # columns 10:19 ('foo','bar','baz'...) are the 9 different numeric variables I am creating anomaly thresholds for
  # Either a) you can hardcode their names in mutate(), summarize() calls
  # or b) you can reference the vars by string in mutate_(), summarize_() calls
  # TRANSFORM VARIABLES
  mutate(foo_log = log1p(foo), bar_log = log1p(bar), ...) %>%
  mutate(mu_vector = c(mean(foo_log), mean(bar_log), ...)) %>%
  # compute sigma_matrix, inv_sigma_matrix, det_sigma_matrix ...
  summarize(
    z_probas = mv_gaussian(mu_vector, det_sigma_matrix, inv_sigma_matrix),
    eps = quantile(z_probas, 0.01),
    mv_outliers = (z_probas < eps)
  ) %>%
  # similarly, use mutate() and do.call() for your PCA invocation...
  # Your outputs are mv_outliers, pca_outliers
  # You don't necessarily need to cbind(merchant.df, mv_outliers, pca_outliers), i.e. cbind all your input data together with your output
  # Now remove all your temporary variables from your output:
  select(-foo_log, -bar_log, ...) %>%
  # or else just select(mv_outliers, pca_outliers) the variables you want to keep
  ungroup() %>%  # this ends the group_by(event_day_of_wk, event_hr) and binds all the intermediate data frames for you
  write.csv(c(.$mv_outliers, .$pca_outliers), file = '<this_customer_id>.csv')
  ungroup()  # this ends the group_by(customer_id)
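To make the "write.csv() at the outer grouping level" step concrete, here is a hedged, runnable sketch: the anomaly logic is reduced to a single quantile flag for illustration, the toy column `x` and the `<customer_id>.csv` file names are assumptions, and `split()` stands in for the per-customer step.

```r
library(dplyr)

# Toy data: two customers at the (day-of-week, hour) grain
set.seed(42)
df <- data.frame(
  customer_id     = rep(c("A", "B"), each = 60),
  event_day_of_wk = sample(1:7, 120, replace = TRUE),
  event_hr        = sample(0:23, 120, replace = TRUE),
  x               = rnorm(120)
)

# Flag outliers at the inner (day-of-week, hour) grain within each customer...
out <- df %>%
  group_by(customer_id, event_day_of_wk, event_hr) %>%
  mutate(mv_outliers = x < quantile(x, 0.01)) %>%
  ungroup()

# ...then split on the outer level and write one csv per customer_id
for (grp in split(out, out$customer_id)) {
  write.csv(grp, file = paste0(grp$customer_id[1], ".csv"), row.names = FALSE)
}
```

This replaces the triple for loop with one grouped pass over the data and a single split-and-write step, so nothing is rbind()-ed row by row.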