How do I create a list of data frames in R?

Time: 2015-06-05 23:02:52

Tags: r sqldf

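For each row of an input data frame, I want to create a new data frame containing all rows of the data that have the same RETAIL_WEEK and DAY_OF_WEEK values. For example, if I have matchday = matchweek = 3, I can use the following to find the desired data frame: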

library(sqldf); library(gsubfn) # second one may not be needed.
fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
     CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009           3        2009    2334547   FALSE           3
2 1/19/2010           3        2010    9854269   FALSE           3
3 1/18/2011           3        2011    1951332   FALSE           3
4 1/17/2012           3        2012    8419327    TRUE           3
5 1/15/2013           3        2013    7788004    TRUE           3
6 1/14/2014           3        2014    2130731    TRUE           3

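But I want to loop over the rows and return a list of data frames, where each data frame consists of the matches for that particular row. For some reason, this code does not produce the desired output: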

  find_dates <- function(file,length){
  data <- alignment(file)
  matches <- list()
  #extract dataset from file and split by aligned dates
  for (i in 1:8){
    #find matching days with corresponding day_of_week and retail_week
    matchday <- data[i,]$DAY_OF_WEEK
    matchweek <- data[i,]$RETAIL_WEEK
    matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
  }
  return(matches)
}

But then I get:

[[1]]
character(0)

[[2]]
character(0)

[[3]]
character(0)

[[4]]
[1] "1/4/2009"   "1/3/2010"   "1/2/2011"   "1/1/2012"   "12/30/2012" "12/29/2013"

[[5]]
[1] "1/11/2009" "1/10/2010" "1/9/2011"  "1/8/2012"  "1/6/2013"  "1/5/2014" 

[[6]]
[1] "1/18/2009" "1/17/2010" "1/16/2011" "1/15/2012" "1/13/2013" "1/12/2014"

[[7]]
[1] "1/25/2009" "1/24/2010" "1/23/2011" "1/22/2012" "1/20/2013" "1/19/2014"

[[8]]
[1] "2/1/2009"  "1/31/2010" "1/30/2011" "1/29/2012" "1/27/2013" "1/26/2014"

Warning messages:
1: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
2: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
3: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
4: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
5: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
6: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
7: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length
8: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
  number of items to replace is not a multiple of replacement length

Can someone tell me where I am going wrong? I also don't understand why I am getting the warnings.

1 answer:

Answer 0 (score: 1)

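In general, if you want to split a data frame in R into a list of data frames, you can use the split function. For example, to split some sample data by retail week and day of week, you could use: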

# Sample data
data <- read.table(text="     CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009           3        2009    2334547   FALSE           1
2 1/19/2010           3        2010    9854269   FALSE           1
3 1/18/2011           3        2011    1951332   FALSE           2
4 1/17/2012           4        2012    8419327    TRUE           2
5 1/15/2013           4        2013    7788004    TRUE           2
6 1/14/2014           4        2014    2130731    TRUE           1", header=TRUE)
spl <- split(data, paste(data$RETAIL_WEEK, data$DAY_OF_WEEK))
# $`3 1`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# $`3 2`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011           3        2011    1951332   FALSE           2
# 
# $`4 1`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014           4        2014    2130731    TRUE           1
# 
# $`4 2`
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2

This is a list whose names consist of the retail week, followed by a space, followed by the day of week. You can access an individual data frame with:

spl[["3 1"]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
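
As a possible alternative to pasting the keys together, split also accepts a list of grouping vectors. The sketch below uses a small made-up data frame (not the asker's data). With drop = TRUE, combinations that never occur are omitted, and the group names are joined with a dot rather than a space.

```r
# Hypothetical sample data, for illustration only
data <- data.frame(
  CAL_DT      = c("1/20/2009", "1/19/2010", "1/18/2011", "1/17/2012"),
  RETAIL_WEEK = c(3, 3, 3, 4),
  DAY_OF_WEEK = c(1, 1, 2, 2)
)

# Split on both columns at once; drop = TRUE omits unused combinations
spl2 <- split(data, list(data$RETAIL_WEEK, data$DAY_OF_WEEK), drop = TRUE)

names(spl2)     # group names of the form "<week>.<day>"
spl2[["3.1"]]   # rows with RETAIL_WEEK 3 and DAY_OF_WEEK 1
```

Which form to use is mostly a matter of taste; the list form avoids building the key string by hand.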

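In your question, you actually want a list of data frames in which each entry corresponds to one row of the original data frame and holds all rows with the same retail week and day of week. That can now be done by simply indexing the list: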

(processed <- unname(spl[paste(data$RETAIL_WEEK, data$DAY_OF_WEEK)]))
# [[1]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# [[2]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009           3        2009    2334547   FALSE           1
# 2 1/19/2010           3        2010    9854269   FALSE           1
# 
# [[3]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011           3        2011    1951332   FALSE           2
# 
# [[4]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2
# 
# [[5]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012           4        2012    8419327    TRUE           2
# 5 1/15/2013           4        2013    7788004    TRUE           2
# 
# [[6]]
#      CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014           4        2014    2130731    TRUE           1

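As you can see, the first and second entries both contain the first two rows, the third entry contains only the third row, and so on. You can access the data frame for a particular row with processed[[1]], processed[[2]], ...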
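
The split and index steps could be wrapped in a small helper. This is only a sketch: the function name matches_per_row is made up for illustration, and the sample data is hypothetical.

```r
# Hypothetical helper: returns one data frame per input row, holding all
# rows that share that row's RETAIL_WEEK and DAY_OF_WEEK
matches_per_row <- function(data) {
  key <- paste(data$RETAIL_WEEK, data$DAY_OF_WEEK)
  spl <- split(data, key)   # one data frame per distinct key
  unname(spl[key])          # look up the group for every row, in row order
}

# Made-up sample data, for illustration only
data <- data.frame(
  CAL_DT      = c("1/20/2009", "1/19/2010", "1/18/2011", "1/17/2012"),
  RETAIL_WEEK = c(3, 3, 3, 4),
  DAY_OF_WEEK = c(1, 1, 2, 2)
)

processed <- matches_per_row(data)
length(processed) == nrow(data)   # one entry per input row
processed[[3]]                    # matches for the third row
```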

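Although I obviously can't test this because I don't have your data, I would expect that loading all the data into a data frame up front, splitting it, and finally grabbing the relevant pieces will be faster than running a separate SQL query for every input row.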
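
On the warnings in the question: a likely cause is the single-bracket assignment matches[i] <- ..., which tries to fit the data frame's columns into one list slot and keeps only the first column (here CAL_DT). That would match the character vectors of dates in the output. The sketch below uses a made-up two-column data frame to show the difference from [[i]]. Separately, the query in the question compares RETAIL_WEEK with $matchday and DAY_OF_WEEK with $matchweek, which looks swapped.

```r
# Made-up data frame, for illustration only
df <- data.frame(CAL_DT = c("1/4/2009", "1/3/2010"), RETAIL_WEEK = c(1, 1))

wrong <- list()
wrong[1] <- df     # warns: "number of items to replace is not a multiple
                   # of replacement length"; only the first column is kept

right <- list()
right[[1]] <- df   # stores the whole data frame as one list element

is.data.frame(wrong[[1]])   # FALSE
is.data.frame(right[[1]])   # TRUE
```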