Question

I have a list of 701 csv files. Each file has the same number of columns (7) but a different number of rows (between 25000 and 28000).

Here is an extract of the first file:

Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase
18/03/2011,11,5,1,-3000.00,17416,Sell
18/03/2011,11,5,1,-1001.10,17427,Sell
18/03/2011,11,5,1,-1000.00,18055,Sell
18/03/2011,11,5,1,-500.10,18057,Sell
18/03/2011,11,5,1,-500.00,18064,Sell
18/03/2011,11,5,1,-400.10,18066,Sell
18/03/2011,11,5,1,-400.00,18066,Sell
18/03/2011,11,5,1,-300.10,18068,Sell
18/03/2011,11,5,1,-300.00,18118,Sell

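Now I am trying to plot Volume against Date for the rows where Price is exactly 200.00. I would then like to get these plots into one window, so that I can see how the volume changes over time.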

allenamen <- dir(pattern="*.csv")
alledat <- lapply(allenamen, read.csv, header = TRUE, 
   sep = ",", stringsAsFactors = FALSE)
verlauf <- function(a) {plot(Volume ~ Date, a, 
  data=subset(a, (Price=="200.00")), 
  ylim = c(15000, 45000), 
  xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")}
lapply(alledat, verlauf)

But I get this error:

error in strsplit(log, NULL): non-character argument

How can I avoid this error?

Answer 1

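Here are a few suggestions.

Use list.files rather than dir to find the files. dir is an alias for list.files, so both behave the same and both default to the current directory; list.files is simply the more descriptive name.
header = TRUE and sep = "," are the defaults for read.csv, so they are not needed in the code.
Subset each file as you read it, rather than reading everything first.

Here is the suggested approach.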

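fnames <- list.files(pattern  = "*.csv")
# read each file and keep only the rows where Price is 200
read <- lapply(fnames, function(x){
    rd <- read.csv(x, stringsAsFactors = FALSE)
    subset(rd, Price == 200)
    })
# stack the per-file subsets into one data frame
dat <- do.call(rbind, read)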

You should then be able to plot dat.
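Note that dat$Date is still a character column at this point, so it has to be converted before it can serve as a time axis. A minimal sketch, assuming the dd/mm/yyyy format shown in the question's extract; the small data frame below is made-up stand-in data, not taken from the real files:

```r
# Made-up stand-in for the combined data frame `dat` from the step above
dat <- data.frame(
  Date   = c("20/03/2011", "18/03/2011", "19/03/2011"),
  Price  = c(200, 200, 200),
  Volume = c(21000, 20000, 20500),
  stringsAsFactors = FALSE
)

# Convert the dd/mm/yyyy strings to Date objects
dat$Date <- as.Date(dat$Date, format = "%d/%m/%Y")

# Sort by date so that type = "l" joins the points in time order
dat <- dat[order(dat$Date), ]

plot(Volume ~ Date, data = dat, type = "l")
```

Sorting matters because rbind keeps the rows in file order, which is not necessarily date order.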

Answer 2

If you want to combine all the Price==200 subsets into one plot, you can use the following function:

plotprice <- function(x) {
  files <- list.files(pattern="*.csv")
  df <- data.frame()
  # read each file, keep only the rows at price x, and append them to df
  for(i in seq_along(files)){   # seq_along is safe when no files are found
    xx <- read.csv(files[i])
    xx <- subset(xx, Price==x)
    df <- rbind(df, xx)
  }
  # convert the dd/mm/yyyy strings to Date before plotting
  df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
  plot(Volume ~ Date, df, ylim = c(15000, 45000), xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
}

With plotprice(200) you get everything for Price==200 in a single plot.

If you want one plot per csv file, you can use:

ploteach <- function(x) {
  files <- list.files(pattern="*.csv")
  # one plot per file, each restricted to the rows at price x
  for(i in seq_along(files)){   # seq_along is safe when no files are found
    df <- read.csv(files[i])
    df <- subset(df, Price==x)
    df$Date <- as.Date(as.character(df$Date), format="%d/%m/%Y")
    plot(Volume ~ Date, df, ylim = c(15000, 45000), xlim = as.Date(c("2011-12-30", "2013-01-20")), type = "l")
  }
}

ploteach(200)
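As written, ploteach draws every plot on a fresh page, so on an interactive device only the last one stays visible. One way to get several plots into a single window is par(mfrow = ...). The sketch below is an illustration under assumptions: plot_grid is a hypothetical helper, and the list of data frames is made-up data standing in for the per-file subsets.

```r
# Hypothetical helper: draw one panel per data frame in a single window.
# `frames` is a list of data frames with Date (class Date) and Volume columns.
plot_grid <- function(frames, ncol = 2) {
  nrow <- ceiling(length(frames) / ncol)
  old <- par(mfrow = c(nrow, ncol), mar = c(3, 3, 2, 1))
  on.exit(par(old))                      # restore the previous layout
  for (i in seq_along(frames)) {
    plot(Volume ~ Date, data = frames[[i]], type = "l",
         main = paste("File", i))
  }
  invisible(c(nrow, ncol))               # return the grid dimensions used
}

# Made-up data standing in for four per-file subsets
days <- as.Date("2012-01-01") + 0:4
frames <- lapply(1:4, function(i) {
  data.frame(Date = days, Volume = 20000 + 100 * i * (0:4))
})
dims <- plot_grid(frames)
```

A grid of 701 panels would not be readable, so in practice this suits a handful of files at a time; for the full set, the single combined plot is probably the better fit.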

Answer 3

OK, first you need to convert the result of lapply with read.csv from a list of 701 data frames into a single data frame.

Added a function that reads and subsets in one step, to avoid running out of RAM:

#
# function to read and subset data to avoid running out of RAM
read.subset <- function(dateiname){
   a <- read.csv(file = dateiname, header = TRUE, sep = ",",
                 stringsAsFactors = FALSE)
   a <- a[a$Price == 200.00,]
   print(gc())    # monitor and clean RAM after each file is read
   return(a)
}
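The question's code compared Price with the string "200.00", whereas the answers compare it with the number 200. The difference matters: when a number is compared with a string, R converts the number to character first, so that test never matches. A small illustration with made-up values:

```r
prices <- c(-3000.00, 199.90, 200.00, 200.10)   # made-up values

# Number vs. string: the numbers become "-3000", "199.9", "200", "200.1",
# and none of those equals the string "200.00"
as_string <- prices == "200.00"

# Number vs. number: matches the intended entry
as_number <- prices == 200

# If the prices might carry rounding error (e.g. after arithmetic),
# a small tolerance is safer than exact equality
with_tol <- abs(prices - 200) < 1e-8
```

Here as_string is FALSE everywhere, while as_number and with_tol pick out the third value only.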

* Update 2: using scan

Added a faster implementation of read.subset

# function to read and subset data to avoid running out of RAM
read.subset.fast <- function(dateiname){
   # get data from csv into a data.frame
   a <- scan(file          = dateiname,
             what          = c(list(character()),
                               rep(list(numeric()),5),
                               list(character())),
             skip          = 1,  # skip header (equivalent to header = TRUE)
             sep           = ",")
   # transform efficiently list into data.frame
   attributes(a) <- list(class      = "data.frame",
                         row.names  = c(NA_integer_, length(a[[1]])),
                         names      = scan(file          = dateiname,
                                           what          = character(),
                                           skip          = 0,  
                                           nlines        = 1,  # just read first line to extract column names
                                           sep           = ","))
   # subset data
   a <- a[a$Price == 200.00,]
   print(gc())
   return(a)
}
#
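A less invasive alternative to scan, offered here as an assumption rather than something measured on these files, is to declare the column types up front via the colClasses argument of read.csv, which spares it from guessing them. read.subset.typed is a hypothetical name, and the temporary file holds made-up rows in the question's layout:

```r
# Hypothetical variant: read.csv with the column types declared up front
read.subset.typed <- function(dateiname, price = 200) {
  a <- read.csv(dateiname,
                colClasses = c("character", rep("numeric", 5), "character"))
  a[a$Price == price, ]
}

# Made-up file in the same layout as the question's extract
tmp <- tempfile(fileext = ".csv")
writeLines(c("Date,Week,Week Day,Hour,Price,Volume,Sale/Purchase",
             "18/03/2011,11,5,1,-3000.00,17416,Sell",
             "18/03/2011,11,5,1,200.00,20416,Sell"), tmp)

res <- read.subset.typed(tmp)
```

Unlike the scan version, read.csv rewrites the header into syntactic names, so "Week Day" becomes Week.Day and "Sale/Purchase" becomes Sale.Purchase.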

Now let's read, subset and combine the data into one data frame:

#
allenamen <- list.files(pattern="*.csv") # updated (@Richard Scriven)
# get a single data frame, instead of a list of 701 data frames
alledat <- do.call(rbind, lapply(allenamen, read.subset.fast))
#
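One detail about pattern: list.files treats it as a regular expression, not a shell wildcard, so "*.csv" is not interpreted the way a shell would interpret it. A stricter sketch, using a temporary directory with made-up file names:

```r
# Temporary directory with made-up file names
d <- file.path(tempdir(), "csv_demo")
dir.create(d, showWarnings = FALSE)
file.create(file.path(d, c("a.csv", "b.csv", "notes_csv.txt")))

# "\\.csv$" means: a literal dot, then "csv", at the end of the name
strict <- list.files(d, pattern = "\\.csv$")

# glob2rx() converts a shell-style wildcard into the equivalent regex
globbed <- list.files(d, pattern = glob2rx("*.csv"))
```

Both calls return only a.csv and b.csv, leaving out the .txt file whose name merely contains "csv".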

Convert the dates to the proper format:

# get dates in dates format
alledat$Date <- as.Date(as.character(alledat$Date), format="%d/%m/%Y")

Then you are all set, and no function is needed. Just plot it:

plot(Volume ~ Date, 
     data = alledat,
     ylim = range(Volume),
     xlim = range(Date),
     type = "l")

Plotting many csv files in one window
