I want to summarize a CSV file that has too many rows to fit in memory. This is what I would like to do:
library(plyr)
dat = read.csv("./myfile.csv",stringsAsFactors=FALSE,header = TRUE)
dat2 = ddply(dat,~colA+colB,summarise,mean=mean(colC),se=sd(colC)/sqrt(length(colC)))
I could change the code to read the file line by line with readLines, but then it is no longer clear how to use ddply.
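For what it's worth, the readLines idea can be made to work without ddply if each chunk is reduced to per-group running totals (count, sum, sum of squares), from which the mean and standard error follow. This is a rough base-R sketch, not a tested solution for the real file. The helper name `summarise_in_chunks` and its `chunk_rows` argument are made up for illustration. It assumes the columns are called colA, colB and colC, and that no quoted field contains a line break.

```r
# Hypothetical helper (not from the original post): group-wise mean and
# standard error of colC by colA/colB, reading `chunk_rows` lines at a time.
summarise_in_chunks <- function(path, chunk_rows = 100000L) {
  con <- file(path, open = "r")
  on.exit(close(con))
  header <- readLines(con, n = 1L)  # kept so it can be re-attached to each chunk
  totals <- NULL
  repeat {
    lines <- readLines(con, n = chunk_rows)
    if (length(lines) == 0L) break  # end of file
    chunk <- read.csv(text = c(header, lines), stringsAsFactors = FALSE)
    # Per-row contributions to the running count, sum and sum of squares.
    chunk$n  <- 1
    chunk$s  <- chunk$colC
    chunk$ss <- chunk$colC^2
    # Fold this chunk into the running per-group totals.
    totals <- aggregate(
      cbind(n, s, ss) ~ colA + colB,
      data = rbind(totals, chunk[, c("colA", "colB", "n", "s", "ss")]),
      FUN = sum
    )
  }
  # Sample variance from the totals; NaN for groups with a single row.
  variance <- pmax(totals$ss - totals$s^2 / totals$n, 0) / (totals$n - 1)
  data.frame(colA = totals$colA, colB = totals$colB,
             mean = totals$s / totals$n,
             se   = sqrt(variance / totals$n))
}

# Small made-up data set, written to a temporary file to exercise the helper.
demo <- data.frame(colA = rep(c("x", "y"), times = 25),
                   colB = rep(1:5, times = 10),
                   colC = sin(seq_len(50)))
tmp <- tempfile(fileext = ".csv")
write.csv(demo, tmp, row.names = FALSE)
res <- summarise_in_chunks(tmp, chunk_rows = 7L)
print(res)
```

Only one chunk plus the small totals table is held in memory at a time. The sum-of-squares formula can lose precision when the values are large relative to their spread, so treat the se column as approximate in that case.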
Answer 0 (score: 4)
Not with ddply.
There are many options:
RODBC / sqldf / dplyr (load the data into a database and work there)
data.table
See https://stackoverflow.com/a/4335739/1385941
sqldf
library(sqldf)
# create database
sqldf("attach my_db as new")
# read data from csv directly to database
read.csv.sql("./myfile.csv", sql = "create table main.mycsv as select * from file",
dbname = "my_db")
# perform the query in SQL
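# SQLite's average is avg(), not mean(); group by gives one row per colA/colB pair
dat2 <- sqldf("select colA, colB, avg(colC) as mean, stdev(colC) / sqrt(count(*)) as se
               from main.mycsv group by colA, colB",
              dbname = "my_db")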
dplyr (a complete rewrite of plyr's ddply-like facilities)
library(dplyr)
library(RSQLite)
# reference database (created in previous example)
my_db <- src_sqlite('my_db')
# reference the table created from mycsv.csv
dat <- tbl(my_db ,"mycsv")
dat2 <- dat %>%
  group_by(colA, colB) %>%
  summarize(mean = mean(colC), se = sd(colC) / sqrt(n()))
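data.table
library(data.table)
# fread is a fast way to read in files!
dat <- fread('./myfile.csv')
# column names are case sensitive in R, so use colA/colB/colC as in the question
dat2 <- dat[, list(mean = mean(colC), se = sd(colC) / sqrt(.N)), by = list(colA, colB)]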
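If the file is wide, one thing that may help the data.table route is to read only the three columns the summary needs, using fread's `select` argument. Whether that is enough to fit in memory depends on the file, so this is only a sketch. The temporary file and the `unused` column are invented here to make the example self-contained.

```r
library(data.table)

# Made-up data with an extra column that the summary does not need.
demo <- data.table(colA   = rep(c("x", "y"), times = 25),
                   colB   = rep(1:5, times = 10),
                   colC   = sin(seq_len(50)),
                   unused = seq_len(50))
tmp <- tempfile(fileext = ".csv")
fwrite(demo, tmp)

# Read only the columns used in the summary; `unused` is never loaded.
dat <- fread(tmp, select = c("colA", "colB", "colC"))

# Same grouped mean / standard error as above; .() is shorthand for list().
dat2 <- dat[, .(mean = mean(colC), se = sd(colC) / sqrt(.N)), by = .(colA, colB)]
print(dat2)
```

This reduces the number of columns held in memory, not the number of rows, so for a file that is long rather than wide the database options are probably the safer bet.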