我一直在努力为(时间序列)分析正确清理和格式化原始物品价格数据,我很好奇你们这些专业人士如何处理这种设置。每两列代表一个日期列表和一个价格列表。这些日期(不幸的是)独立于同一行中的任何其他日期(尽管偶然可能有许多日期相同)。
我的策略是创建一个新的数据框,其中行代表天数,列代表价格,并运行一个循环,将项目日期与正确的行匹配并填写正确的价格。
但是,我相信我可能效率低下而且我的在线搜索没有给我这个程序的其他例子。
下面请查找示例代码。
df <- structure(list(Date1 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item1 = c(650L, 650L, 635L, 640L, 640L, 625L, 620L, 580L, 550L, 520L, 530L), Date2 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item2 = c(590L, 590L, 590L, 580L, 580L, 580L, 580L, 580L, 460L, 460L, 395L), Date3 = c("12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012", "1/23/2012", "1/30/2012", "2/6/2012", "2/13/2012", "2/20/2012"), Item3 = c(775L, 775L, 775L, 750L, 750L, 750L, 750L, 750L, 725L, 725L, 740L), Date4 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item4 = c(660L, 700L, 700L, 700L, 700L, 700L, 650L, 650L, 650L, 650L, 610L), Date5 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item5 = c(705L, 705L, 705L, 650L, 650L, 650L, 650L, 555L, 555L, 555L, 555L), Date6 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item6 = c(612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L, 612L), Date7 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item7 = c(630L, 630L, 625L, 635L, 625L, 615L, 620L, 560L, 550L, 540L, 530L), Date8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Item8 = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA), Date9 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item9 = c(622L, 622L, 650L, 650L, 650L, 660L, 660L, 660L, 665L, 665L, 665L), Date10 = c("10/31/2011", "11/7/2011", "11/14/2011", "11/21/2011", "11/28/2011", "12/5/2011", "12/12/2011", "12/19/2011", "1/2/2012", "1/9/2012", "1/16/2012"), Item10 = c(1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L, 1040L)), .Names = c("Date1", "Item1", "Date2", "Item2", "Date3", "Item3", "Date4", "Item4", "Date5", "Item5", "Date6", "Item6", "Date7", "Item7", "Date8", "Item8", "Date9", "Item9", "Date10", "Item10"), row.names = 95:105, class = "data.frame")
df
class(df)
# visual inspection for first and last date (10/31/2011, 2/20/2012)
mdyyyy <- function(x){as.Date(x,"%m/%d/%Y")}
days <- seq.Date(from = mdyyyy("10/31/2011"), # first date
to = mdyyyy("2/20/2012"), # last date
by = "day")
head(days)
datecolumns <- seq(1,ncol(df),by=2) # (odds) date columns
pricecolumns <- seq(2,ncol(df),by=2) # (evens) index columns
# Creating a new, cleaned matrix of data where the
# rows = days and columns = indices
newdat <- matrix(NA,
length(days),
ncol(df[,pricecolumns])) # indices wide
# Name rows
rownames(newdat) <- format(days,"%m/%d/%Y")
# Each row is a new day
head(newdat[,1:10])
# Placing prices into the appropriate rows
for(i in 1:length(datecolumns)){
pricedates <- 0 # initialize/reset
pricedates <- mdyyyy(df[,datecolumns[i]]) # column's price dates
rowlocations <- 0 # initialize/reset
rowlocations <- match(pricedates, days) # date's new row number
for(j in 1:length(rowlocations)){
# within each cell, place appropriate price
newdat[rowlocations[j],i] <- df[j,pricecolumns[i]]
}
}
colnames(newdat) <- colnames(df[,pricecolumns])
head(newdat)
之后我一直在寻找xts包来帮助我通过apply.monthly()和rollapply()进行分析,因为原始数据要广泛得多。
非常感谢您的想法和批评。
答案 0 :(得分:0)
这是一种方法,使用数组索引,这是使用值填充矩阵的最有效方法,AFAIK:
## convert data to long format
long <- within(reshape(df,
varying = list(datecolumns, pricecolumns),
v.names = c('Date', 'Item'),
new.row.names = seq(prod(dim(df[datecolumns]))),
times = paste0('Item', seq(datecolumns)),
timevar = 'Id',
direction = 'long')[-4],
Date <- mdyyyy(Date))
long <- na.omit(long) # remove NAs
## create empty matrix
out <- matrix(NA, length(days), length(pricecolumns),
dimnames=list(as.character(days), names(df)[pricecolumns]))
## fill it with values from long
out[with(long, cbind(as.character(Date), Id))] <- long$Item
答案 1 :(得分:0)
不完全确定这是否是你所追求的,但这里有一个方法,它使用dplyr和tidyr包将你的数据结构转换成一个长格式,单独的Date
和Item
(什么我认为是价格)列。无论你想做什么,你应该发现它更容易使用它。请注意,df
是问题中提供的数据框。
library(tidyr)
library(dplyr)
d <- df %>%
mutate(row = 1:n()) %>%
gather(key, value, -row) %>%
extract(key, c("var", "ref"), "(Date|Item)([0-9]*)") %>%
spread(var, value)
head(d)
#> row ref Date Item
#> 1 1 1 10/31/2011 650
#> 2 1 10 10/31/2011 1040
#> 3 1 2 10/31/2011 590
#> 4 1 3 12/5/2011 775
#> 5 1 4 10/31/2011 660
#> 6 1 5 10/31/2011 705
除此之外,这是基于对上一篇文章的回答:Gather multiple sets of columns
如果你想将它传播到一个类似于表格的结构中,这里的内容与上面相同,只有几行:
d <- df %>%
mutate(row = 1:n()) %>%
gather(key, value, -row) %>%
extract(key, c("var", "ref"), "(Date|Item)([0-9]*)") %>%
spread(var, value) %>%
mutate(ref = paste0("Item", ref)) %>%
spread(ref, Item) %>%
select(-row)
head(d)
#> Date Item1 Item10 Item2 Item3 Item4 Item5 Item6 Item7 Item8 Item9
#> 1 10/31/2011 650 1040 590 <NA> 660 705 612 630 <NA> 622
#> 2 12/5/2011 <NA> <NA> <NA> 775 <NA> <NA> <NA> <NA> <NA> <NA>
#> 3 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>
#> 4 11/7/2011 650 1040 590 <NA> 700 705 612 630 <NA> 622
#> 5 12/12/2011 <NA> <NA> <NA> 775 <NA> <NA> <NA> <NA> <NA> <NA>
#> 6 <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA> <NA>