Question

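I have several .csv files containing hourly data. Each file holds the data for one point in space, and the start and end dates differ from file to file.

The data can be read into R with: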

lstf1<- list.files(pattern=".csv")

lst2<- lapply(lstf1,function(x) read.csv(x,header = TRUE,stringsAsFactors=FALSE,sep = ",",fill=TRUE, dec = ".",quote = "\""))

head(lst2[[800]])
             datetime precip code
1 2003-12-30 00:00:00     NA    M
2 2003-12-30 01:00:00     NA    M
3 2003-12-30 02:00:00     NA    M
4 2003-12-30 03:00:00     NA    M
5 2003-12-30 04:00:00     NA    M
6 2003-12-30 05:00:00     NA    M

datetime is in YYYY-MM-DD HH:MM:SS format, precip is the data value, and code can be ignored.
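Since read.csv is called with stringsAsFactors = FALSE, the datetime column arrives as plain text. A minimal sketch of converting it to a real date-time before comparing against the period boundaries follows; the toy values and the choice of tz = "UTC" are assumptions, not something taken from the files.

```r
# Toy data frame mimicking the layout shown above (values are made up)
df <- data.frame(
  datetime = c("2015-03-31 23:00:00", "2015-04-01 00:00:00", "2015-04-01 01:00:00"),
  precip   = c(NA, 0.2, NA),
  stringsAsFactors = FALSE
)

# Parse the text timestamps; tz = "UTC" is an assumption that avoids DST gaps
df$datetime <- as.POSIXct(df$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Comparisons now operate on date-times rather than on character strings
in_range <- df$datetime >= as.POSIXct("2015-04-01", tz = "UTC") &
            df$datetime <  as.POSIXct("2015-12-01", tz = "UTC")
df[in_range, ]
```

Using a half-open interval (strictly before 2015-12-01) avoids having to spell out the last timestamp of 2015-11-30.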

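For each data frame (df) in lst2, I want to select the data for the period 2015-04-01 to 2015-11-30, subject to the following conditions:

1) If precip in the df is all NA within this period, drop the df (do not select it).
2) If precip is not all NA, select it.

The desired output (lst3) contains the data subset to the period 2015-04-01 to 2015-11-30.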

All data frames in lst3 should have the same length in days and hours, with missing precip values represented as NA.
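One way to get data frames of equal length is to build the full hourly sequence for the period once and left-join each data frame onto it, so that hours missing from a file show up as NA. The sketch below is only an illustration under assumptions: pad_to_grid is a hypothetical helper, the toy stations are made up, tz = "UTC" is assumed, and each file is assumed to have at most one row per hour.

```r
# Full hourly sequence for the target period (tz = "UTC" is an assumption)
grid <- data.frame(
  datetime = seq(as.POSIXct("2015-04-01 00:00:00", tz = "UTC"),
                 as.POSIXct("2015-11-30 23:00:00", tz = "UTC"),
                 by = "hour")
)

# Hypothetical helper: pad one data frame to the full grid, or return NULL
# when precip has no non-NA value inside the period
pad_to_grid <- function(df, grid) {
  df$datetime <- as.POSIXct(df$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
  out <- merge(grid, df[, c("datetime", "precip")], by = "datetime", all.x = TRUE)
  if (all(is.na(out$precip))) NULL else out
}

# Toy input (made-up values): one station with data, one with only NA in the period
station_a <- data.frame(datetime = c("2015-04-01 00:00:00", "2015-06-15 12:00:00"),
                        precip = c(0.4, 1.2), stringsAsFactors = FALSE)
station_b <- data.frame(datetime = c("2015-04-01 00:00:00", "2016-01-01 00:00:00"),
                        precip = c(NA, 3.0), stringsAsFactors = FALSE)

padded <- lapply(list(a = station_a, b = station_b), pad_to_grid, grid = grid)

# Drop the NULL entries, i.e. the stations that were all NA in the period
lst3_example <- Filter(Negate(is.null), padded)
```

The period covers 244 days, so every retained data frame has 244 * 24 = 5856 rows.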

I can write the data frames in lst3 to my directory with:

sapply(names(lst2),function (x)  write.csv(lst3[[x]],file = paste0(names(lst2[x]), ".csv"),row.names = FALSE))
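Note that lapply over an unnamed character vector returns an unnamed list, so names(lst2) is NULL unless the names are set explicitly, and the loop above would then write nothing. A minimal sketch of naming the list after the input files and writing to a separate directory is shown here; the file names, the stand-in data frames, and the output directory are all hypothetical.

```r
# Hypothetical file names standing in for the result of list.files()
files <- c("stationA.csv", "stationB.csv")

# Stand-in for the lapply(read.csv) step: one list element per file
dfs <- list(data.frame(precip = 1), data.frame(precip = NA))

# Name each element after its file, minus the extension
names(dfs) <- tools::file_path_sans_ext(basename(files))

# Write into a separate directory so the input files are not overwritten
out_dir <- file.path(tempdir(), "subset")
dir.create(out_dir, showWarnings = FALSE)
for (nm in names(dfs)) {
  write.csv(dfs[[nm]], file = file.path(out_dir, paste0(nm, ".csv")), row.names = FALSE)
}
```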

The link to a sample file can be found here (~200 KB)

Answer 1

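Based on what you've written, it sounds like you want to keep a data frame only if data exists in the precip column for this specific date range: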

# TRUE if df has at least one non-NA precip value between start and end.
# datetime is compared as text, which relies on the "YYYY-MM-DD HH:MM:SS" layout.
valuesExist <- function(df, start = "2015-04-01 00:00:00", end = "2015-11-30 23:59:59") {
  sub.df <- df[df$datetime >= start & df$datetime <= end, ]
  nrow(sub.df) > 0 && !all(is.na(sub.df$precip))
}
# vapply returns a logical vector, which can be used to index the list
lst2.bool <- vapply(lst2, valuesExist, logical(1))
lst2 <- lst2[lst2.bool]
lst3 <- lapply(lst2, function(x) {
  x[x$datetime >= "2015-04-01 00:00:00" & x$datetime <= "2015-11-30 23:59:59", ]
})
sapply(names(lst2), function(x) write.csv(lst3[[x]], file = paste0(names(lst2[x]), ".csv"), row.names = FALSE))

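If you want dynamic start and end times, pass variables holding those values to the valuesExist function, and replace the hard-coded timestamp strings in the lst3 assignment with the same variables.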
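As a rough sketch of that idea, the period can be defined once and passed to both the check and the subsetting step. The helpers subsetPeriod and hasValues, the toy list, and tz = "UTC" are assumptions made for illustration; here datetime is assumed to be POSIXct already.

```r
# Hypothetical helper: rows of df whose datetime lies in [start, end]
subsetPeriod <- function(df, start, end) {
  df[df$datetime >= start & df$datetime <= end, ]
}

# Hypothetical variant of the check above: any non-NA precip in the period?
hasValues <- function(df, start, end) {
  any(!is.na(subsetPeriod(df, start, end)$precip))
}

# Define the period once (tz = "UTC" is an assumption)
start <- as.POSIXct("2015-04-01 00:00:00", tz = "UTC")
end   <- as.POSIXct("2015-11-30 23:59:59", tz = "UTC")

# Toy list (made-up values): each data frame has one row before and one inside the period
mk <- function(p) {
  data.frame(
    datetime = as.POSIXct(c("2015-03-31 23:00:00", "2015-04-01 00:00:00"), tz = "UTC"),
    precip   = p
  )
}
toy <- list(keep = mk(c(NA, 1)), drop = mk(c(5, NA)))

# The same start/end variables drive both the filter and the subset
keep     <- vapply(toy, hasValues, logical(1), start = start, end = end)
lst3_toy <- lapply(toy[keep], subsetPeriod, start = start, end = end)
```

The second toy data frame is dropped because its only non-NA value falls outside the period.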

If you want to merge the two lapply loops into one, be my guest, but I prefer having a boolean variable around when I'm subsetting.

Answer 2

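It's a little hard to follow exactly what you are trying to do, but this example (using dplyr, which has a nice filter syntax) on the file you provided should get you close: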

library(dplyr)
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")

# Get the required date range and delete the NAs
df.sub <- filter(df, !is.na(precip), 
                     datetime >= as.POSIXct("2015-04-01"),
                     datetime < as.POSIXct("2015-12-01"))

# Check if the subset has any rows left (it will be empty if it was full of NA for precip)
if (nrow(df.sub) > 0) {
    df.result <- filter(df, datetime >= as.POSIXct("2015-04-01"), 
                            datetime < as.POSIXct("2015-12-01"))
    # Then add df.result to your list of data frames...
} # else, don't add it to your list

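I think you are saying that you want to keep the NAs in the data frame as long as there are also valid precipitation values, and that you only want to discard a data frame when precip is NA for the whole period. If you simply want to strip out all NAs, the first filter statement alone will do that. If your dates are already correctly encoded some other way, you obviously won't need the as.POSIXct conversion.

Edit: with a function wrapper, so you can use lapply: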

library(dplyr)

# Get some example data
df <- read.csv ("L112FN0M.262.csv")
df$datetime <- as.POSIXct(df$datetime, format="%d/%m/%Y %H:%M")
dfnull <- df
dfnull$precip <- NA

# list of 3 input data frames to test, 2nd one has precip all NA
df.list <- list(df, dfnull, df)  

# Function to do the filtering; returns list of data frames to keep or null
filterprecip <- function(d) {
    if (nrow(filter(d, !is.na(precip), datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01"))) > 
        0) {
        return(filter(d, datetime >= as.POSIXct("2015-04-01"), datetime < as.POSIXct("2015-12-01")))
    }
}

# Function to remove NULLS in returned list
# (Credit to Hadley Wickham: http://tolstoy.newcastle.edu.au/R/e8/help/09/12/8102.html)
compact <- function(x) Filter(Negate(is.null), x) 

# Filter the list
results <- compact(lapply(df.list, filterprecip))

# Check that you got a list of 2 data frames in the right date range
str(results)

Conditional data subsetting in R based on date
