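How do I select files in a directory based on a specific string in the file name?

Asked: 2015-10-22 08:50:00

Tags: r

I am reading some netCDF files from a directory into R. The netCDF files are named according to certain characteristics of the data.

Here is an example: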

aa <- c("dayavg_fcst_surf125.011_tmp.1962010100_1962123121.nc",
        "dayavg_fcst_surf125.011_tmp.1972010100_1972123121.nc",
        "dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc",
        "dayavg_fcst_surf125.011_tmp.1992010100_1992123121.nc",
        "dayavg_fcst_surf125.011_tmp.2002010100_2002123121.nc",
        "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc",
        "dayavg_fcst_surf125.011_tmp.2012010100_2012123121.nc",
        "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc",
        "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc",
        "dayavg_fcst_surf125.011_tmp.2015020100_2015022821.nc")

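These were collected using the list.files function.

I would like to select (keep) a subset of these file names (as character strings), specifically the files that refer to data collected in 2010 and 2014.

The year is indicated in the file name right after the "tmp." string. For example, the first entry is for 1962, and so on.

To achieve this, I tried the following: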

iyears <- c(2010,2014)
ll <- list()
for (i in 1:length(iyears)){
  ll[[i]] <- aa[grepl(iyears[i],aa)]
}
ll <- c(ll[[1]],ll[[2]])

which returns:

> ll
 [1] "dayavg_fcst_surf125.011_tmp.1962010100_1962123121.nc" "dayavg_fcst_surf125.011_tmp.1972010100_1972123121.nc"
 [3] "dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc" "dayavg_fcst_surf125.011_tmp.1992010100_1992123121.nc"
 [5] "dayavg_fcst_surf125.011_tmp.2002010100_2002123121.nc" "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc"
 [7] "dayavg_fcst_surf125.011_tmp.2012010100_2012123121.nc" "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc"
 [9] "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc" "dayavg_fcst_surf125.011_tmp.2015020100_2015022821.nc"
[11] "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc" "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc"

whereas the answer should be:

> ll
[1] "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc" "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc"
[3] "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc"

The problem is that the date strings in the file names have the following format:

YYYYMMDDHH

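So "2010" also appears in

"dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc",

because of the substring 198[2 01 0]100, where the last digit of the year, the month, and the first digit of the day together spell 2010.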
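The mismatch can be checked directly. The following is a minimal sketch, assuming only base R; the names x, hit, match_start, and year_start are illustrative and not part of the original post. It compares where "2010" is found with where the year field actually begins.

```r
x <- "dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc"

# A plain substring search succeeds even though this file is for 1982
hit <- grepl("2010", x, fixed = TRUE)

# Position at which "2010" was found
match_start <- as.integer(regexpr("2010", x, fixed = TRUE))

# Position at which the year field begins: just after the literal "tmp."
year_start <- as.integer(regexpr("tmp.", x, fixed = TRUE)) + 4L

print(hit)
print(c(match_start = match_start, year_start = year_start))
print(substr(x, year_start, year_start + 3L))  # the real year of the file
```

The match starts a few characters after the year field, so the search has to be tied to the position of the year rather than to the digits alone.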

Can anyone suggest a way to get the desired result?

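3 Answers:

Answer 0: (score: 4)

Since the "tmp." part seems to be a regular feature of the file names, a very direct way to solve this is to use it as part of the search string: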

> grep("tmp.2010|tmp.2014", aa, value = TRUE)
[1] "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc"
[2] "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc"
[3] "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc"
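The same idea extends to any number of years by building the alternation programmatically. This is a sketch rather than part of the original answer: iyears comes from the question, pattern and selected are illustrative names, and aa is shortened to five of the question's file names. The dot is escaped here so that it matches only a literal period; in the pattern above the unescaped "." matches any character, which happens to be harmless for these names.

```r
aa <- c("dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc",
        "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc",
        "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc",
        "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc",
        "dayavg_fcst_surf125.011_tmp.2015020100_2015022821.nc")
iyears <- c(2010, 2014)

# Build one regular expression: a literal "tmp." followed by any of the years
pattern <- paste0("tmp\\.(", paste(iyears, collapse = "|"), ")")

# value = TRUE returns the matching strings instead of their indices
selected <- grep(pattern, aa, value = TRUE)
print(selected)
```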

Answer 1: (score: 2)

The main trick is to specify where in the string the actual year is located. The following should work:

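iyears <- c(2010, 2014)
ll <- list()

# Anchor each year to its position right after the fixed file-name prefix
for (i in seq_along(iyears)) {
  ll[[i]] <- aa[grepl(paste0("^dayavg_fcst_surf125\\.011_tmp\\.", iyears[i]), aa)]
}

# Flatten the per-year results into a single character vector
ll <- unlist(ll)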

# [1] "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc"
# [2] "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc"
# [3] "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc"
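A loop-free variant of the same idea is to extract the year from each name once and then filter on it. This is only a sketch under the assumption that every file name contains "_tmp." followed by a four-digit year; file_year and selected are illustrative names, and aa is shortened to five of the question's file names.

```r
aa <- c("dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc",
        "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc",
        "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc",
        "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc",
        "dayavg_fcst_surf125.011_tmp.2015020100_2015022821.nc")
iyears <- c(2010, 2014)

# Replace each whole name by the four digits captured right after "_tmp."
file_year <- as.integer(sub(".*_tmp\\.([0-9]{4}).*", "\\1", aa))

# Keep the names whose extracted year is one of the requested years
selected <- aa[file_year %in% iyears]
print(selected)
```

If a name does not follow the assumed layout, sub returns it unchanged and as.integer yields NA with a warning, so such files are simply not selected.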

Answer 2: (score: 0)

Why not use the pattern argument of list.files:

list.files(path = ".", pattern = NULL, all.files = FALSE,
           full.names = FALSE, recursive = FALSE,
           ignore.case = FALSE, include.dirs = FALSE, no.. = FALSE)

pattern: an optional regular expression. Only file names which match the regular expression will be returned.

Reference: R help
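As a rough illustration of this suggestion, the sketch below creates a scratch directory with empty files named like those in the question and filters them at listing time. The directory, the demo_dir, fnames, and found names, and the particular pattern are assumptions made for the example; with real data, path would point at the directory holding the netCDF files.

```r
# Scratch directory with a few empty files named like those in the question
demo_dir <- tempfile("ncdemo")
dir.create(demo_dir)
fnames <- c("dayavg_fcst_surf125.011_tmp.1982010100_1982123121.nc",
            "dayavg_fcst_surf125.011_tmp.2010010100_2010123121.nc",
            "dayavg_fcst_surf125.011_tmp.2014020100_2014022821.nc",
            "dayavg_fcst_surf125.011_tmp.2014120100_2014123121.nc",
            "dayavg_fcst_surf125.011_tmp.2015020100_2015022821.nc")
invisible(file.create(file.path(demo_dir, fnames)))

# Only names matching the regular expression are returned
found <- list.files(path = demo_dir, pattern = "tmp\\.(2010|2014)")
print(found)

# Remove the scratch directory again
unlink(demo_dir, recursive = TRUE)
```

Filtering in list.files avoids a separate grep step, at the cost of having to list the directory again whenever the set of years changes.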