I'm working on a project in R and I'm a bit stuck. I have four time series in this format:
x <- data.frame(Id = rep(c(1,2,3,4),2),
Date = c(rep("1980-01-01",4), rep("1980-01-02",4)),
Freq = c(2,3,1,2,4,5,2,3))
Id Date       Freq
1  1980-01-01 2
2  1980-01-01 3
3  1980-01-01 1
4  1980-01-01 2
1  1980-01-02 4
2  1980-01-02 5
3  1980-01-02 2
4  1980-01-02 3
My goal is to create a new variable that holds the previous day's Freq value for the same Id.
Id Date       Freq YestFreq
1  1980-01-01 2    NA
2  1980-01-01 3    NA
3  1980-01-01 1    NA
4  1980-01-01 2    NA
1  1980-01-02 4    2
2  1980-01-02 5    3
3  1980-01-02 2    1
4  1980-01-02 3    2
The solution I tried is:
# Build a composite key for each row and for the row we want to look up
x$DateID = paste(x$Id, x$Date)
x$yesterday = as.Date(x$Date) - 1
x$YesterdayDateID = paste(x$Id, x$yesterday)
result = numeric(nrow(x))
# For every row, scan the whole table for the matching Id/previous-day key
for(i in 1:nrow(x)){
answer = x$Freq[which(x$DateID == x$YesterdayDateID[i])]
if(length(answer) != 0){result[i] = answer} else{result[i] = NA}
}
x = cbind(x, result)
My actual dataset has ~600,000 rows (~300 Ids and ~2,000 unique dates), so the solution above takes 2 hours to run. Any help would be greatly appreciated.
Answer 0 (score: 5)
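This takes into account that there may be gaps in the dates. I use match to find the row for the previous day, then index Freq with that position, by Id: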
data.table
library(data.table)
setDT(x)[, Date := as.IDate(Date)][
, YestFreq := Freq[match(Date-1L, Date)], by=Id][]
# Id Date Freq YestFreq
# 1: 1 1980-01-01 2 NA
# 2: 2 1980-01-01 3 NA
# 3: 3 1980-01-01 1 NA
# 4: 4 1980-01-01 2 NA
# 5: 1 1980-01-02 4 2
# 6: 2 1980-01-02 5 3
# 7: 3 1980-01-02 2 1
# 8: 4 1980-01-02 3 2
dplyr
library(dplyr)
x$Date <- as.Date(x$Date)
x %>% group_by(Id) %>% mutate(YestFreq = Freq[match(Date - 1L, Date)])
# Id Date Freq YestFreq
# 1 1 1980-01-01 2 NA
# 2 2 1980-01-01 3 NA
# 3 3 1980-01-01 1 NA
# 4 4 1980-01-01 2 NA
# 5 1 1980-01-02 4 2
# 6 2 1980-01-02 5 3
# 7 3 1980-01-02 2 1
# 8 4 1980-01-02 3 2
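The same match idea can also be sketched in base R, in case neither package is available. This is only an illustration under the assumption that each Id has at most one row per date; the names key_today and key_yesterday are hypothetical and do not appear in the answer above. Instead of grouping, it matches on a composite Id/Date key in a single vectorized call, which replaces the row-by-row loop from the question.

```r
x <- data.frame(Id = rep(c(1, 2, 3, 4), 2),
                Date = as.Date(c(rep("1980-01-01", 4), rep("1980-01-02", 4))),
                Freq = c(2, 3, 1, 2, 4, 5, 2, 3))

# Composite keys (hypothetical names): one describing each row,
# one describing the row we want to look up for it.
key_today     <- paste(x$Id, x$Date)
key_yesterday <- paste(x$Id, x$Date - 1)

# match() gives the position of each "yesterday" key among the "today" keys,
# or NA when that Id has no row for the previous day.
x$YestFreq <- x$Freq[match(key_yesterday, key_today)]
x
```

Because match is called once on the whole vector rather than once per row, this should scale far better than the loop, though the grouped data.table version is likely faster still on large data.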
Answer 1 (score: 2)
We can try
library(dplyr)
x %>%
arrange(as.Date(Date), Id) %>%
group_by(Id) %>%
mutate(YestFreq = lag(Freq))
# Id Date Freq YestFreq
# (dbl) (fctr) (dbl) (dbl)
#1 1 1980-01-01 2 NA
#2 2 1980-01-01 3 NA
#3 3 1980-01-01 1 NA
#4 4 1980-01-01 2 NA
#5 1 1980-01-02 4 2
#6 2 1980-01-02 5 3
#7 3 1980-01-02 2 1
#8 4 1980-01-02 3 2
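One caveat worth illustrating: a lag takes the previous row within the group, which is only "yesterday" when no dates are missing. The base R sketch below uses made-up data for a single hypothetical Id with no row for 1980-01-03, and compares a positional lag with a lookup by date. The column name PrevRowFreq is an assumption introduced for this example.

```r
# Made-up data for one Id; 1980-01-03 is missing on purpose.
g <- data.frame(Date = as.Date(c("1980-01-01", "1980-01-02", "1980-01-04")),
                Freq = c(2, 4, 7))

# Positional: value from the previous row, whatever its date is.
g$PrevRowFreq <- c(NA, head(g$Freq, -1))

# Calendar-based: value from the row dated exactly one day earlier, else NA.
g$YestFreq <- g$Freq[match(g$Date - 1, g$Date)]
g
```

For 1980-01-04 the positional version carries over the value from 1980-01-02, while the date lookup yields NA. If the real data has a row for every Id on every date, the two agree.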
Answer 2 (score: 2)
For a fast solution, use the data.table package, sort the data, and derive a column for each group from the Freq value of the previous row:
library(data.table)
x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))
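# The real solution starts here (could even be done in one line):
y <- setDT(x) # convert to data.table
setkey(y,Id,Date) # "sort" the data
# head(Freq, -1L) drops the last value; unlike Freq[1:(.N-1)] it also works for groups with a single row
y[, .(Date, Freq, YestFreq=c(NA, head(Freq, -1L))), by=.(Id)]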
Output (in a different order -> by Id):
Id Date Freq YestFreq
1: 1 1980-01-01 2 NA
2: 1 1980-01-02 4 2
3: 2 1980-01-01 3 NA
4: 2 1980-01-02 5 3
5: 3 1980-01-01 1 NA
6: 3 1980-01-02 2 1
7: 4 1980-01-01 2 NA
8: 4 1980-01-02 3 2
Edit 1:
You can do it in one line (and sort the result as requested):
library(data.table)
x <- data.frame(Id = rep(c(1,2,3,4),2), Date = c(rep("1980-01-01",4), rep("1980-01-02",4)), Freq = c(2,3,1,2,4,5,2,3))
setDT(x, key=c("Id", "Date"))[, YestFreq := c(NA, head(Freq, -1L)), by=Id][order(Date, Id)]
Result:
Id Date Freq YestFreq
1: 1 1980-01-01 2 NA
2: 2 1980-01-01 3 NA
3: 3 1980-01-01 1 NA
4: 4 1980-01-01 2 NA
5: 1 1980-01-02 4 2
6: 2 1980-01-02 5 3
7: 3 1980-01-02 2 1
8: 4 1980-01-02 3 2
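As a possible alternative to building the lagged vector by hand, data.table provides shift(), which by default returns the previous value and fills the first position with NA. The sketch below is not part of the original answer. Like the code above it is positional, so it assumes every Id has a row for every date.

```r
library(data.table)

x <- data.table(Id = rep(c(1, 2, 3, 4), 2),
                Date = as.IDate(c(rep("1980-01-01", 4), rep("1980-01-02", 4))),
                Freq = c(2, 3, 1, 2, 4, 5, 2, 3))

setkey(x, Id, Date)                    # sort by Id, then Date
x[, YestFreq := shift(Freq), by = Id]  # previous row's Freq within each Id
res <- x[order(Date, Id)]              # back to the order used in the question
res
```

shift() also handles groups with a single row, returning NA for them.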