I'm trying to speed up some code in R. I think my looping methods could be replaced (possibly with some form of lapply or with sqldf), but I can't seem to figure out how.
The basic premise is that I have a parent directory with ~50 subdirectories, and each of those subdirectories contains ~200 CSV files (10,000 CSVs in total). Each of those CSV files contains ~86,400 rows (one row per second for a day).
The goal of the script is to calculate the mean and stdev of two time intervals for each file, and then make one summary plot per subdirectory, like so:
library(timeSeries)
library(ggplot2)

# list subdirectories in parent directory
dir <- list.dirs(path = "/ParentDirectory", full.names = TRUE, recursive = FALSE)
num <- length(dir)

# iterate through all subdirectories
for (idx in 1:num){
  # declare empty vectors to fill for each subdirectory
  DayVal <- c()
  DayStd <- c()
  NightVal <- c()
  NightStd <- c()
  date <- as.Date(character())

  setwd(dir[idx])
  filenames <- list.files(path = getwd())
  numfiles <- length(filenames)

  # for each file in the subdirectory
  for (i in c(1:numfiles)){
    day <- read.csv(filenames[i], sep = ',')
    today <- as.Date(day$time[1], "%Y-%m-%d")

    # setting interval for times of day we care about <- SQL seems like it may be
    # useful here but I couldn't get read.csv.sql to recognize hourly intervals
    nightThreshold <- as.POSIXct(paste(today, "03:00:00"))
    dayThreshold <- as.POSIXct(paste(today, "15:00:00"))
    nightInt <- day[(as.POSIXct(day$time) >= nightThreshold & as.POSIXct(day$time) <= (nightThreshold + 3600)) , ]
    dayInt <- day[(as.POSIXct(day$time) >= dayThreshold & as.POSIXct(day$time) <= (dayThreshold + 3600)) , ]

    # check some thresholds in the data for that time period
    if (sum(nightInt$val, na.rm = TRUE) < 5){
      NightMean <- mean(nightInt$val, na.rm = TRUE)
      NightSD <- sd(nightInt$val, na.rm = TRUE)
    } else {
      NightMean <- NA
      NightSD <- NA
    }
    if (sum(dayInt$val, na.rm = TRUE) > 5){
      DayMean <- mean(dayInt$val, na.rm = TRUE)
      DaySD <- sd(dayInt$val, na.rm = TRUE)
    } else {
      DayMean <- NA
      DaySD <- NA
    }

    NightVal <- c(NightVal, NightMean)
    NightStd <- c(NightStd, NightSD)
    DayVal <- c(DayVal, DayMean)
    DayStd <- c(DayStd, DaySD)
    date <- c(date, as.Date(today))
  }

  df <- data.frame(date, DayVal, DayStd, NightVal, NightStd)

  # plot for the subdirectory
  p1 <- ggplot() +
    geom_point(data = df, aes(x = date, y = DayVal, color = "Day Average")) +
    geom_point(data = df, aes(x = date, y = DayStd, color = "Day Standard Dev")) +
    geom_point(data = df, aes(x = date, y = NightVal, color = "Night Average")) +
    geom_point(data = df, aes(x = date, y = NightStd, color = "Night Standard Dev")) +
    scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))
}
Thanks so much for any advice you can offer!
Answer 0 (score: 1)
When managing this much data in flat files, consider an SQL database solution. A relational database management system (RDBMS) easily handles millions of records and can even aggregate them as needed with its scalable database engine, rather than processing everything in R's memory. If not for speed and efficiency, databases also provide security, robustness, and the organization of a central repository. Each daily csv could even be scripted to import directly into the database.
Fortunately, practically all RDBMSs ship with CSV handlers that can bulk-load multiple files. Below are open-source solutions: SQLite (a file-level database), and MySQL and PostgreSQL (both server-level databases), all of which have corresponding libraries in R. Each example iterates through the directory file list and imports the csv files into a database table named timeseriesdata (assumed to have the same named fields/data types as the csv files). At the end is one SQL call to pull the aggregated means and standard deviations for the night and day intervals (adjust the filters as needed). The only challenge is designating a file and subdirectory indicator (which may or may not exist in the actual data) while appending the csv files (possibly by running an update query on a FileID column after each iteration).
dir <- list.dirs(path = "/ParentDirectory",
full.names = TRUE, recursive = FALSE)
# SQLITE DATABASE
library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")
# (CONNECTION NOT NEEDED DUE TO CMD LINE LOAD BELOW)
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT VIA COMMAND LINE OR BASH (ASSUMES SQLITE3 IS PATH VARIABLE)
cmd <- paste0("(echo .separator ,; echo .import ' ", csvfile , " ' timeseriesdata ')",
" '| sqlite3 ' /path/to/databasename.db")
system(cmd)
}
}
# CLOSE CONNECTION
dbDisconnect(sqconn)
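If shelling out to the sqlite3 CLI is awkward (e.g., it is not on the PATH), a pure-R alternative is to append each file with RSQLite's dbWriteTable. This is only a sketch under the same assumptions (a timeseriesdata table whose columns match the csv files); the FileID tagging shown here is one possible way to handle the file-indicator challenge mentioned above:

library(RSQLite)
sqconn <- dbConnect(RSQLite::SQLite(), dbname = "/path/to/database.db")

for (d in dir){
  for (f in list.files(d, full.names = TRUE)){
    day <- read.csv(f)
    day$FileID <- basename(f)  # tag each row with its source file
    dbWriteTable(sqconn, "timeseriesdata", day, append = TRUE)
  }
}

dbDisconnect(sqconn)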
# MYSQL DATABASE
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING LOAD DATA INFILE COMMAND
sql <- paste0("LOAD DATA INFILE '", csvfile, "'
INTO TABLE timeseriesdata
FIELDS TERMINATED BY ','
ENCLOSED BY '\"'
ESCAPED BY '\"'
LINES TERMINATED BY '\\n'
IGNORE 1 LINES
(col1, col2, col3, col4, col5);")
dbSendQuery(myconn, sql)
dbCommit(myconn)
}
}
# CLOSE CONNECTION
dbDisconnect(myconn)
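Note that LOAD DATA INFILE reads the file on the MySQL server itself. If the csv files sit on the client machine instead, MySQL's LOCAL variant is the usual fix (this assumes local_infile is enabled on both client and server); only the first line of the statement changes:

# client-side load: the LOCAL keyword makes MySQL read from the connecting machine
sql <- paste0("LOAD DATA LOCAL INFILE '", csvfile, "'
               INTO TABLE timeseriesdata
               FIELDS TERMINATED BY ','
               ENCLOSED BY '\"'
               LINES TERMINATED BY '\\n'
               IGNORE 1 LINES
               (col1, col2, col3, col4, col5);")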
# POSTGRESQL DATABASE
library(RPostgreSQL)
pgconn <- dbConnect(PostgreSQL(), dbname="databasename", host="myhost",
user= "postgres", password="***")
for (d in dir){
filenames <- list.files(d)
for (f in filenames){
csvfile <- paste0(d, '/', f)
# IMPORT USING COPY COMMAND
sql <- paste("COPY timeseriesdata(col1, col2, col3, col4, col5)
FROM '", csvfile , "' DELIMITER ',' CSV;")
dbSendQuery(pgconn, sql)
}
}
# CLOSE CONNECTION
dbDisconnect(pgconn)
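Similarly, COPY ... FROM runs server-side and typically requires superuser rights plus a file readable by the server process. When loading from a client machine, psql's \copy meta-command is the standard workaround (a sketch, assuming psql is installed and on the PATH):

# build a shell call to psql; \copy streams the file from the client side
cmd <- paste0('psql -d databasename -h myhost -U postgres -c ',
              '"\\copy timeseriesdata(col1, col2, col3, col4, col5) FROM \'',
              csvfile, '\' DELIMITER \',\' CSV"')
system(cmd)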
# CREATE PLOT DATA FRAME (MYSQL EXAMPLE)
# (ADD INSIDE SUBDIRECTORY LOOP OR INCLUDE SUBDIR COLUMN IN GROUP BY)
library(RMySQL)
myconn <- dbConnect(RMySQL::MySQL(), dbname="databasename", host="hostname",
username="username", password="***")
# AGGREGATE QUERY USING TWO DERIVED TABLE SUBQUERIES
# (FOR NIGHT AND DAY, ADJUST FILTERS PER NEEDS)
strSQL <- "SELECT dt.FileID, NightMean, NightSTD, DayMean, DaySTD
FROM
(SELECT nt.FileID, Avg(nt.time) As NightMean, STDDEV(nt.time) As NightSTD
FROM timeseriesdata nt
WHERE nt.time >= '15:00:00' AND nt.time <= '21:00:00'
GROUP BY nt.FileID
HAVING Sum(nt.val) < 5) AS ng
INNER JOIN
(SELECT dt.FileID, Avg(dt.time) As DayMean, STDDEV(dt.time) As DaySTD
FROM timeseriesdata dt
WHERE dt.time >= '03:00:00' AND dt.time <= '09:00:00'
GROUP BY dt.FileID
HAVING Sum(dt.val) > 5) AS dy
ON ng.FileID = dy.FileID;"
res <- dbSendQuery(myconn, strSQL)
df <- dbFetch(res)
dbClearResult(res)
dbDisconnect(myconn)
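With the aggregates fetched into df, the per-group summary plot from the question reduces to a few geom_point layers (a sketch; the column names follow the aliases in the query above):

library(ggplot2)

p1 <- ggplot(df, aes(x = FileID)) +
  geom_point(aes(y = DayMean, color = "Day Average")) +
  geom_point(aes(y = DaySTD, color = "Day Standard Dev")) +
  geom_point(aes(y = NightMean, color = "Night Average")) +
  geom_point(aes(y = NightSTD, color = "Night Standard Dev")) +
  scale_colour_manual(values = c("steelblue", "turquoise3", "purple3", "violet"))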
Answer 1 (score: 1)
One thing would be to do the conversion to time once per day rather than every time you reference the column, as you are doing now. Also look at the lubridate package, because if you are doing lots of time conversions it is much faster than as.POSIXct.
Also size the variables you store results in (e.g., DayVal, DayStd) to the proper length up front (DayVal <- numeric(numfiles)) and then assign each result into the appropriate index, rather than growing the vectors with c().
If the CSV files are large, consider the fread function from the data.table package. A sketch of the inner loop with all three suggestions applied follows below.
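Putting those three suggestions together, the inner loop might look like this (a sketch only; it assumes the time column is stored as "YYYY-MM-DD HH:MM:SS" so that lubridate's ymd_hms can parse it):

library(data.table)
library(lubridate)

filenames <- list.files(dir[idx], full.names = TRUE)
numfiles <- length(filenames)

# preallocate result vectors instead of growing them with c()
DayVal <- numeric(numfiles)
DayStd <- numeric(numfiles)
NightVal <- numeric(numfiles)
NightStd <- numeric(numfiles)
date <- as.Date(rep(NA, numfiles))

for (i in seq_len(numfiles)){
  day <- fread(filenames[i])      # fread is much faster than read.csv
  day$time <- ymd_hms(day$time)   # convert the whole column once per file
  today <- as.Date(day$time[1])
  nightThreshold <- ymd_hms(paste(today, "03:00:00"))
  dayThreshold <- ymd_hms(paste(today, "15:00:00"))
  nightInt <- day[time >= nightThreshold & time <= nightThreshold + 3600]
  dayInt <- day[time >= dayThreshold & time <= dayThreshold + 3600]

  # same threshold checks as before, but results assigned by index
  NightVal[i] <- ifelse(sum(nightInt$val, na.rm = TRUE) < 5,
                        mean(nightInt$val, na.rm = TRUE), NA)
  NightStd[i] <- ifelse(sum(nightInt$val, na.rm = TRUE) < 5,
                        sd(nightInt$val, na.rm = TRUE), NA)
  DayVal[i] <- ifelse(sum(dayInt$val, na.rm = TRUE) > 5,
                      mean(dayInt$val, na.rm = TRUE), NA)
  DayStd[i] <- ifelse(sum(dayInt$val, na.rm = TRUE) > 5,
                      sd(dayInt$val, na.rm = TRUE), NA)
  date[i] <- today
}

df <- data.frame(date, DayVal, DayStd, NightVal, NightStd)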