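For each row of an input data frame, I want to create a new data frame containing all rows of the data that have the same RETAIL_WEEK and DAY_OF_WEEK values. For example, if I have matchday = matchweek = 3, I can find the desired data frame with: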
library(sqldf); library(gsubfn) # second one may not be needed.
fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009 3 2009 2334547 FALSE 3
2 1/19/2010 3 2010 9854269 FALSE 3
3 1/18/2011 3 2011 1951332 FALSE 3
4 1/17/2012 3 2012 8419327 TRUE 3
5 1/15/2013 3 2013 7788004 TRUE 3
6 1/14/2014 3 2014 2130731 TRUE 3
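But I want to iterate over the rows and return a list of data frames, where each data frame consists of the matches for that particular row. For some reason, this code does not produce the desired output: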
find_dates <- function(file,length){
  data <- alignment(file)
  matches <- list()
  #extract dataset from file and split by aligned dates
  for (i in 1:8){
    #find matching days with corresponding day_of_week and retail_week
    matchday <- data[i,]$DAY_OF_WEEK
    matchweek <- data[i,]$RETAIL_WEEK
    matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek")
  }
  return(matches)
}
But then I get:
[[1]]
character(0)
[[2]]
character(0)
[[3]]
character(0)
[[4]]
[1] "1/4/2009" "1/3/2010" "1/2/2011" "1/1/2012" "12/30/2012" "12/29/2013"
[[5]]
[1] "1/11/2009" "1/10/2010" "1/9/2011" "1/8/2012" "1/6/2013" "1/5/2014"
[[6]]
[1] "1/18/2009" "1/17/2010" "1/16/2011" "1/15/2012" "1/13/2013" "1/12/2014"
[[7]]
[1] "1/25/2009" "1/24/2010" "1/23/2011" "1/22/2012" "1/20/2013" "1/19/2014"
[[8]]
[1] "2/1/2009" "1/31/2010" "1/30/2011" "1/29/2012" "1/27/2013" "1/26/2014"
Warning messages:
1: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
2: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
3: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
4: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
5: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
6: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
7: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
8: In matches[i] <- fn$sqldf("select * from data where RETAIL_WEEK = $matchday and DAY_OF_WEEK = $matchweek") :
number of items to replace is not a multiple of replacement length
Can someone tell me where I went wrong here? I also don't understand why I am getting the warnings.
Answer 0 (score: 1)
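In general, if you want to split a data frame in R into a list of data frames, you can use the split function. For example, to split some sample data by retail week and day of week, you could use: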
# Sample data
data <- read.table(text=" CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
1 1/20/2009 3 2009 2334547 FALSE 1
2 1/19/2010 3 2010 9854269 FALSE 1
3 1/18/2011 3 2011 1951332 FALSE 2
4 1/17/2012 4 2012 8419327 TRUE 2
5 1/15/2013 4 2013 7788004 TRUE 2
6 1/14/2014 4 2014 2130731 TRUE 1", header=TRUE)
spl <- split(data, paste(data$RETAIL_WEEK, data$DAY_OF_WEEK))
# $`3 1`
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009 3 2009 2334547 FALSE 1
# 2 1/19/2010 3 2010 9854269 FALSE 1
#
# $`3 2`
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011 3 2011 1951332 FALSE 2
#
# $`4 1`
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014 4 2014 2130731 TRUE 1
#
# $`4 2`
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012 4 2012 8419327 TRUE 2
# 5 1/15/2013 4 2013 7788004 TRUE 2
This is a list whose names are the retail week, followed by a space, followed by the day of week. You can access the individual data frames with:
spl[["3 1"]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009 3 2009 2334547 FALSE 1
# 2 1/19/2010 3 2010 9854269 FALSE 1
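As a variation on the pasted key, split also accepts a list of grouping vectors, which avoids building the key string by hand. The following is a minimal sketch, not part of the original answer: the data frame is a cut-down, hypothetical copy of the sample data, and spl2 is an illustrative name. With this form the list names join the two values with a dot instead of a space.

```r
# Hypothetical cut-down copy of the sample data (only the columns needed here)
data <- data.frame(
  CAL_DT      = c("1/20/2009", "1/19/2010", "1/18/2011",
                  "1/17/2012", "1/15/2013", "1/14/2014"),
  RETAIL_WEEK = c(3, 3, 3, 4, 4, 4),
  DAY_OF_WEEK = c(1, 1, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Passing a list of vectors groups by their combination;
# drop = TRUE discards combinations that never occur in the data
spl2 <- split(data, list(data$RETAIL_WEEK, data$DAY_OF_WEEK), drop = TRUE)

# Names have the form "<RETAIL_WEEK>.<DAY_OF_WEEK>"
names(spl2)

# Rows with retail week 3 and day of week 1
spl2[["3.1"]]
```

The pasted-key version has the advantage that the same paste call can be reused later to index the list, so pick whichever naming scheme is easier to look up in your code.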
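In your question, you actually want a list of data frames in which each entry corresponds to one row of the original data frame and holds all rows with the same retail week and day of week. This can now be done by simple indexing on the list: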
(processed <- unname(spl[paste(data$RETAIL_WEEK, data$DAY_OF_WEEK)]))
# [[1]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009 3 2009 2334547 FALSE 1
# 2 1/19/2010 3 2010 9854269 FALSE 1
#
# [[2]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 1 1/20/2009 3 2009 2334547 FALSE 1
# 2 1/19/2010 3 2010 9854269 FALSE 1
#
# [[3]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 3 1/18/2011 3 2011 1951332 FALSE 2
#
# [[4]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012 4 2012 8419327 TRUE 2
# 5 1/15/2013 4 2013 7788004 TRUE 2
#
# [[5]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 4 1/17/2012 4 2012 8419327 TRUE 2
# 5 1/15/2013 4 2013 7788004 TRUE 2
#
# [[6]]
# CAL_DT RETAIL_WEEK RETAIL_YEAR METRIC_AMT ANOMALY DAY_OF_WEEK
# 6 1/14/2014 4 2014 2130731 TRUE 1
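As you can see, the first and second entries both contain the first two rows, the third entry contains only the third row, and so on. You can access the data frame for a particular row with processed[[1]], processed[[2]], ...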
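A quick sanity check on such a list is to count the rows in each entry and confirm that every entry contains the row it was built for. The sketch below assumes a cut-down, hypothetical copy of the sample data; sizes and contains_self are illustrative names only.

```r
# Hypothetical cut-down copy of the sample data
data <- data.frame(
  CAL_DT      = c("1/20/2009", "1/19/2010", "1/18/2011",
                  "1/17/2012", "1/15/2013", "1/14/2014"),
  RETAIL_WEEK = c(3, 3, 3, 4, 4, 4),
  DAY_OF_WEEK = c(1, 1, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Same split-then-index approach as in the answer
key <- paste(data$RETAIL_WEEK, data$DAY_OF_WEEK)
spl <- split(data, key)
processed <- unname(spl[key])

# Number of matching rows for each input row
sizes <- vapply(processed, nrow, integer(1))

# Each entry should include the input row it corresponds to
contains_self <- vapply(
  seq_len(nrow(data)),
  function(i) data$CAL_DT[i] %in% processed[[i]]$CAL_DT,
  logical(1)
)
all(contains_self)
```

Because processed has one entry per input row, its length should always equal nrow(data).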
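Although I obviously cannot test this because I do not have your data, I would expect that loading all the data into a data frame once, splitting it, and then grabbing the relevant pieces is faster than running a separate SQL query for each input row.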
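As for the warnings in the question, they appear to come from the single-bracket assignment matches[i] <- ...: a data frame is itself a list of columns, so R tries to fit six columns into one list slot, keeps only the first (CAL_DT) and warns. Assigning with matches[[i]] stores the whole data frame. The query in the question also compares RETAIL_WEEK with $matchday and DAY_OF_WEEK with $matchweek, which looks swapped and would explain the empty results. The sketch below shows a corrected loop under those assumptions; it uses base R subsetting instead of sqldf so that it is self-contained, and find_matches and the sample data are hypothetical.

```r
# Hypothetical cut-down copy of the sample data
data <- data.frame(
  CAL_DT      = c("1/20/2009", "1/19/2010", "1/18/2011",
                  "1/17/2012", "1/15/2013", "1/14/2014"),
  RETAIL_WEEK = c(3, 3, 3, 4, 4, 4),
  DAY_OF_WEEK = c(1, 1, 2, 2, 2, 1),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one data frame of matching rows per input row
find_matches <- function(data) {
  matches <- vector("list", nrow(data))
  for (i in seq_len(nrow(data))) {
    matchweek <- data$RETAIL_WEEK[i]
    matchday  <- data$DAY_OF_WEEK[i]
    # Compare each column with its own value (not swapped)
    keep <- data$RETAIL_WEEK == matchweek & data$DAY_OF_WEEK == matchday
    # Double brackets store the whole data frame in slot i
    matches[[i]] <- data[keep, , drop = FALSE]
  }
  matches
}

res <- find_matches(data)
res[[1]]
```

The same two changes (double brackets and un-swapped variables) should carry over to the fn$sqldf version, although the split approach avoids the loop altogether.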