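R: parse and select directories whose names contain dates

Time: 2015-04-08 17:22:23

Tags: r function parsing sorting date

I am new to R and am trying to write a function that parses a directory containing several subdirectories, each named after a time period. I want to determine the smallest set of subdirectories that together form one continuous time period. The function should return a character vector that can be used to select the subdirectories of interest.

An example: suppose the directory "~" contains the following 6 subdirectories, whose names hold a start date and an end date in "ddmmyy" format: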

- "01_231014_190115" 
- "02_231014_190215" 
- "03_190215_200215"
- "04_200215_220215"
- "05_220215_130315" 
- "06_220215_270315"

The function would return:

"02_231014_190215", "03_190215_200215", "04_200215_220215", "06_220215_270315"
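
To make the goal concrete, the expected selection can be checked for continuity. The following is a minimal sketch, assuming that "continuous" means each selected period starts on the date the previous one ends, as in the example. The helper name is_continuous is hypothetical and not part of the question.

```r
# Hypothetical helper: TRUE if, once sorted by start date, every period
# ends exactly on the date the next period starts.
is_continuous <- function(dir_names) {
  parts  <- do.call(rbind, strsplit(dir_names, "_", fixed = TRUE))
  starts <- as.Date(parts[, 2], format = "%d%m%y")
  ends   <- as.Date(parts[, 3], format = "%d%m%y")
  ord    <- order(starts)
  starts <- starts[ord]
  ends   <- ends[ord]
  n <- length(starts)
  if (n < 2) return(TRUE)
  all(ends[-n] == starts[-1])
}

expected <- c("02_231014_190215", "03_190215_200215",
              "04_200215_220215", "06_220215_270315")
is_continuous(expected)
```

A selection with a gap, such as c("01_231014_190115", "03_190215_200215"), fails this check because 19 January 2015 is not the start of the next period.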

So far I have only gotten as far as this code, which cleanly extracts the start and end dates:

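# List the entries of the parent directory (list.files() also returns plain files).
foldernames  <- list.files("~")
# Split each "ID_start_end" name into its three fields.
listsplitted <- strsplit(foldernames, "_")
df <- data.frame(matrix(unlist(listsplitted), nrow = length(foldernames), byrow = TRUE), stringsAsFactors = FALSE)
colnames(df) <- c("ID", "D.start", "D.end")
# Parse the ddmmyy strings as dates; origin is only used for numeric input, so it is dropped.
df[, 2:3]    <- lapply(df[, 2:3], as.Date, format = "%d%m%y")
df$d.range   <- df[, 3] - df[, 2]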

This currently returns:

> df
  ID    D.start      D.end  d.range
1 01 2014-10-23 2015-01-19  88 days
2 02 2014-10-23 2015-02-19 119 days
3 03 2015-02-19 2015-02-20   1 days
4 04 2015-02-20 2015-02-22   2 days
5 05 2015-02-22 2015-03-13  19 days
6 06 2015-02-22 2015-03-27  33 days

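I would appreciate a little help with this.

1 answer:

Answer 0: (score: 0)

Edit: this may be one way to do it.

Here I created dir_list from your question, but you can use list.dirs() to get the list of directories, with recursive = FALSE so that subdirectories inside those directories are not listed. Pass full.names = FALSE as well; otherwise each entry is prefixed with its path and the ID can no longer be parsed from the start of the name.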
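
As a quick illustration of those list.dirs() arguments, here is a small sketch that builds a throwaway directory tree under tempdir(). The folder name periods_demo and the nested subdirectory are made up for the demonstration.

```r
# Build a throwaway directory tree so the example does not touch real data.
root <- file.path(tempdir(), "periods_demo")
dir.create(file.path(root, "01_231014_190115", "nested"), recursive = TRUE)
dir.create(file.path(root, "02_231014_190215"))

# full.names = FALSE returns bare directory names;
# recursive = FALSE leaves out "nested" and the root itself.
dir_names <- list.dirs(path = root, full.names = FALSE, recursive = FALSE)
print(dir_names)

# Remove the demo tree again.
unlink(root, recursive = TRUE)
```

With the default full.names = TRUE, each entry would instead carry the path of root as a prefix.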

# dir_list = list.dirs(path = ".", full.names = FALSE, recursive = FALSE)

dir_list = c("01_231014_190115", "02_231014_190215", "03_190215_200215", "04_200215_220215", "05_220215_130315", "06_220215_270315")

df1 <- data.frame(ID = integer(), D.start = character(), D.end = character(), d.range = numeric(), stringsAsFactors = FALSE)

counter = 0

for (i in dir_list) {

  counter = counter + 1

  # Split "ID_start_end" into its three fields with a back-referenced regex.
  id = as.integer(sub("(.*)(_)(.*)(_)(.*)", "\\1", i))
  start_date = as.Date(sub("(.*)(_)(.*)(_)(.*)", "\\3", i), format = "%d%m%y")
  end_date = as.Date(sub("(.*)(_)(.*)(_)(.*)", "\\5", i), format = "%d%m%y")

  df1[counter, 1] = id
  df1[counter, 2:3] = c(as.character(start_date), as.character(end_date))
  # Subtract Date objects so the range is a whole number of days.
  # Calling difftime() on the character strings goes through local time,
  # which adds fractions of a day when a daylight saving change falls in the range.
  df1[counter, 4] = as.numeric(end_date - start_date)

}

uniq_start_dates = unique(df1[, 2])

df3 <- data.frame(ID = integer(), D.start = character(), D.end = character(), d.range = numeric(), stringsAsFactors = FALSE)

# For each distinct start date, keep only the directory with the longest range.
for (j in uniq_start_dates) {

  df2 = df1[which(df1[, 2] %in% j), ]

  df3 <- do.call("rbind", list(df3, head(df2[with(df2, order(d.range, decreasing = TRUE)), ], 1)))
}

# Clean up the intermediate objects; df1 and df3 are kept so they can be printed.
rm("counter", "id", "end_date", "start_date", "dir_list", "j", "i", "df2", "uniq_start_dates")

Output:

print(df1)
  ID    D.start      D.end d.range
1  1 2014-10-23 2015-01-19      88
2  2 2014-10-23 2015-02-19     119
3  3 2015-02-19 2015-02-20       1
4  4 2015-02-20 2015-02-22       2
5  5 2015-02-22 2015-03-13      19
6  6 2015-02-22 2015-03-27      33

print(df3)
  ID    D.start      D.end d.range
2  2 2014-10-23 2015-02-19     119
3  3 2015-02-19 2015-02-20       1
4  4 2015-02-20 2015-02-22       2
6  6 2015-02-22 2015-03-27      33
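
Keeping the longest range per start date works for this example, but it retains one directory for every distinct start date, so it may keep more directories than necessary when periods with different start dates overlap. A possible alternative is a greedy interval cover, sketched below as a function that returns the character vector the question asks for. This is not the method used above. The name select_dirs is hypothetical, and the sketch assumes the goal is to cover everything from the earliest start date to the latest end date, with every name following the ID_ddmmyy_ddmmyy pattern.

```r
# Hypothetical helper: greedy interval cover over the periods in the names.
select_dirs <- function(dir_names) {
  parts  <- do.call(rbind, strsplit(dir_names, "_", fixed = TRUE))
  starts <- as.Date(parts[, 2], format = "%d%m%y")
  ends   <- as.Date(parts[, 3], format = "%d%m%y")

  covered_to <- min(starts)  # the period is covered up to this date
  target     <- max(ends)    # the period must be covered up to this date
  chosen     <- integer(0)

  while (covered_to < target) {
    # Candidates start no later than the covered date and extend beyond it.
    candidates <- which(starts <= covered_to & ends > covered_to)
    if (length(candidates) == 0) {
      stop("gap in coverage after ", format(covered_to))
    }
    # Take the candidate that reaches furthest.
    best       <- candidates[which.max(as.numeric(ends[candidates]))]
    chosen     <- c(chosen, best)
    covered_to <- ends[best]
  }
  dir_names[chosen]
}

dirs <- c("01_231014_190115", "02_231014_190215", "03_190215_200215",
          "04_200215_220215", "05_220215_130315", "06_220215_270315")
select_dirs(dirs)
```

For the six directories in the question this selects directories 02, 03, 04 and 06, the same set as df3. If the periods leave a gap, the function stops with an error instead of returning a partial selection, which is a design choice rather than a requirement of the question.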