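I have a list of 701 csv files. Each file has the same number of columns (7) but a different number of rows (between 25000 and 28000).
Here is an excerpt from the first file: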
Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase
18/03/2011,11,5,1,-3000.00,17416,Sell
18/03/2011,11,5,1,-1001.10,17427,Sell
18/03/2011,11,5,1,-1000.00,18055,Sell
18/03/2011,11,5,1,-500.10,18057,Sell
18/03/2011,11,5,1,-500.00,18064,Sell
18/03/2011,11,5,1,-400.10,18066,Sell
18/03/2011,11,5,1,-400.00,18066,Sell
18/03/2011,11,5,1,-300.10,18068,Sell
18/03/2011,11,5,1,-300.00,18118,Sell
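Now I am trying to plot Volume against Date for the rows where Price is exactly 200.00. After that, I want to find a window in which I can see how the volume changes over time.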
allenamen <- dir(pattern="*.csv")
alledat <- lapply(allenamen, read.csv, header = TRUE,
sep = ",", stringsAsFactors = FALSE)
verlauf <- function(a) {plot(Volume ~ Date, a,
data=subset(a, (Price=="200.00")),
ylim = c(15000, 45000),
xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")}
lapply(alledat, verlauf)
But I get this error:
error in strsplit(log, NULL): non-character argument
How can I avoid this error?
Answer 0 (score: 2)
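Here are some suggestions.
Use list.files rather than dir to find the files. dir is an alias for list.files, so both behave the same; called without a path, as here, they list the current directory. list.files simply states the intent more clearly.
header = TRUE and sep = "," are the defaults for read.csv, so you do not need to spell them out.
Subset each file as you read it, so that only the rows with Price == 200 are kept.
Here is the suggested approach.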
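fnames <- list.files(pattern = "\\.csv$")  # pattern is a regular expression
read <- lapply(fnames, function(x) {
  rd <- read.csv(x, stringsAsFactors = FALSE)
  subset(rd, Price == 200)  # keep only the rows of interest
})
dat <- do.call(rbind, read)  # stack the per-file subsets into one data frame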
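You should then be able to plot dat, once its Date column has been converted with as.Date(dat$Date, format = "%d/%m/%Y").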
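The steps above can be tried end to end on a couple of throwaway files. This is only a sketch: the file names, the temporary directory, and the sample rows are made up for illustration, and only the column layout is taken from the question.

```r
# Hypothetical sample data: two small files written to a temporary directory
tmp <- file.path(tempdir(), "csv_demo")
dir.create(tmp, showWarnings = FALSE)
header <- "Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase"
writeLines(c(header,
             "18/03/2011,11,5,1,199.90,20100,Sell",
             "18/03/2011,11,5,1,200.00,20150,Sell"),
           file.path(tmp, "day1.csv"))
writeLines(c(header,
             "19/03/2011,11,6,1,200.00,21300,Sell",
             "19/03/2011,11,6,1,200.10,21350,Sell"),
           file.path(tmp, "day2.csv"))

# 'pattern' is a regular expression, so anchor it on the ".csv" suffix
fnames <- list.files(tmp, pattern = "\\.csv$", full.names = TRUE)

# Read each file and keep only the rows with Price == 200
parts <- lapply(fnames, function(f) {
  rd <- read.csv(f, stringsAsFactors = FALSE)
  subset(rd, Price == 200)
})
dat <- do.call(rbind, parts)

# Dates are stored as day/month/year text, so convert before plotting
dat$Date <- as.Date(dat$Date, format = "%d/%m/%Y")
plot(Volume ~ Date, data = dat, type = "l")
```

With the real data, tmp would simply be the directory that holds the 701 files.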
Answer 1 (score: 2)
If you want to combine all the subsets with Price==200 into one plot, you can use the following function:
plotprice <- function(x) {
  files <- list.files(pattern = "\\.csv$")  # pattern is a regular expression
  df <- data.frame()
  for (i in seq_along(files)) {
    xx <- read.csv(files[i])
    xx <- subset(xx, Price == x)  # keep rows at the requested price
    df <- rbind(df, xx)           # append to the combined data frame
  }
  # dates are stored as day/month/year text
  df$Date <- as.Date(as.character(df$Date), format = "%d/%m/%Y")
  plot(Volume ~ Date, df, ylim = c(15000, 45000),
       xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
}
With plotprice(200), you get all the rows with Price==200 in a single plot.
If you need one plot per csv file, you can use:
ploteach <- function(x) {
  files <- list.files(pattern = "\\.csv$")  # pattern is a regular expression
  for (i in seq_along(files)) {
    df <- read.csv(files[i])
    df <- subset(df, Price == x)  # keep rows at the requested price
    df$Date <- as.Date(as.character(df$Date), format = "%d/%m/%Y")
    # one plot per file
    plot(Volume ~ Date, df, ylim = c(15000, 45000),
         xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
  }
}
ploteach(200)
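One detail worth knowing when looping over files by index: if list.files finds nothing, 1:length(files) counts down from 1 to 0 instead of being empty, whereas seq_along(files) yields no iterations. A small sketch (the visited counter is only there for illustration):

```r
files <- character(0)   # what list.files() returns when nothing matches

1:length(files)         # counts down: 1 0
seq_along(files)        # empty: integer(0)

# The loop body never runs when the index comes from seq_along()
visited <- 0
for (i in seq_along(files)) visited <- visited + 1
visited
```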
Answer 2 (score: 0)
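OK, first you need to turn the result of lapply with read.csv from a list of 701 data frames into a single data frame.
I added a function that reads and subsets each file, to avoid running out of RAM: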
# function to read and subset data to avoid running out of RAM
read.subset <- function(dateiname){
  a <- read.csv(file = dateiname, header = TRUE, sep = ",",
                stringsAsFactors = FALSE)
  a <- a[a$Price == 200.00,]
  print(gc()) # monitor and clean RAM after each file is read
  return(a)
}
* Update 2: added a faster read.subset implementation that uses scan
# function to read and subset data to avoid running out of RAM
read.subset.fast <- function(dateiname){
  # get data from csv into a list with one element per column
  a <- scan(file = dateiname,
            what = c(list(character()),
                     rep(list(numeric()),5),
                     list(character())),
            skip = 1, # skip header (equivalent to header = TRUE)
            sep = ",")
  # transform efficiently list into data.frame
  attributes(a) <- list(class = "data.frame",
                        row.names = c(NA_integer_, length(a[[1]])),
                        names = scan(file = dateiname,
                                     what = character(),
                                     skip = 0,
                                     nlines = 1, # just read first line to extract column names
                                     sep = ","))
  # subset data
  a <- a[a$Price == 200.00,]
  print(gc())
  return(a)
}
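To see what the scan call in read.subset.fast hands back before it is turned into a data frame, here is a small sketch on a throwaway file. The file and its two rows are made up for illustration; only the column layout follows the question.

```r
# Hypothetical two-row file in the same layout as the question's data
f <- tempfile(fileext = ".csv")
writeLines(c("Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase",
             "18/03/2011,11,5,1,-3000.00,17416,Sell",
             "18/03/2011,11,5,1,200.00,20150,Sell"), f)

# One template per column: character, five numeric columns, character
cols <- scan(file = f,
             what = c(list(character()),
                      rep(list(numeric()), 5),
                      list(character())),
             skip = 1, sep = ",", quiet = TRUE)

length(cols)   # one list element per column
cols[[5]]      # the Price column, already numeric
```

Because the column types are given up front, scan does not have to guess them, which is presumably where the speed-up over read.csv comes from.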
Now let's read, subset, and combine the data into one data frame:
allenamen <- list.files(pattern = "\\.csv$") # updated (@Richard Scriven); pattern is a regular expression
# get a single data frame, instead of a list of 701 data frames
alledat <- do.call(rbind, lapply(allenamen, read.subset.fast))
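A note on the pattern argument of list.files: it is interpreted as a regular expression rather than a shell-style wildcard. The sketch below shows how glob2rx (from the utils package) translates a wildcard, and how an anchored expression behaves on a few made-up file names.

```r
# Translate a shell-style wildcard into the equivalent regular expression
glob2rx("*.csv")

# Hypothetical file names, to show what an anchored regex matches
candidates <- c("a.csv", "notes_csv.txt", "b.csv.bak")
grepl("\\.csv$", candidates)
```

Either glob2rx("*.csv") or "\\.csv$" can be passed as pattern to pick up only the files that end in .csv.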
Convert the dates to the proper format:
# get dates in dates format
alledat$Date <- as.Date(as.character(alledat$Date), format="%d/%m/%Y")
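As a quick check of the format string, here is the conversion applied to the date from the question's excerpt. The format argument tells as.Date that the text is day/month/year.

```r
# Date string taken from the excerpt in the question
d <- as.Date("18/03/2011", format = "%d/%m/%Y")

format(d)         # ISO form of the parsed date
format(d, "%u")   # weekday number, Monday = 1
```

The weekday number comes out as 5 (a Friday), which agrees with the Week Day column in the excerpt.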
Then you are good to go, no function needed. Just plot it:
plot(Volume ~ Date,
data = alledat,
ylim = range(Volume),
xlim = range(Date),
type = "l")
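One caveat with type = "l": the line joins the points in row order. If the file names do not happen to sort chronologically (an assumption, since the question does not show them), the combined data may need sorting by Date first. A minimal sketch with made-up values:

```r
# Hypothetical combined data whose rows are not in date order
alledat_demo <- data.frame(
  Date   = as.Date(c("2012-03-01", "2012-01-01", "2012-02-01")),
  Volume = c(21000, 20000, 20500)
)

# Sort by Date so the line runs from the earliest to the latest point
sorted <- alledat_demo[order(alledat_demo$Date), ]
plot(Volume ~ Date, data = sorted, type = "l")
```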