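I have a daily time series and I want the minimum for each month of each year, but I want to make sure the resulting minima are at least 10 days apart. To be more specific, consider the following example data frame.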
>Data
Years Months Days Date A B
1 2003 December 1 2003-12-01 10 10
2 2003 December 2 2003-12-02 10 10
3 2003 December 3 2003-12-03 10 10
4 2003 December 4 2003-12-04 10 10
5 2003 December 5 2003-12-05 10 10
6 2003 December 6 2003-12-06 10 10
7 2003 December 7 2003-12-07 10 10
8 2003 December 8 2003-12-08 3 10
9 2003 December 9 2003-12-09 10 10
10 2003 December 10 2003-12-10 10 10
11 2003 December 11 2003-12-11 10 10
12 2003 December 12 2003-12-12 10 4
13 2003 December 13 2003-12-13 10 10
14 2003 December 14 2003-12-14 10 10
15 2003 December 15 2003-12-15 10 10
16 2003 December 16 2003-12-16 10 10
17 2003 December 17 2003-12-17 10 10
18 2003 December 18 2003-12-18 10 10
19 2003 December 19 2003-12-19 10 10
20 2003 December 20 2003-12-20 10 10
21 2003 December 21 2003-12-21 10 10
22 2003 December 22 2003-12-22 10 10
23 2003 December 23 2003-12-23 10 10
24 2003 December 24 2003-12-24 10 10
25 2003 December 25 2003-12-25 10 10
26 2003 December 26 2003-12-26 10 10
27 2003 December 27 2003-12-27 10 10
28 2003 December 28 2003-12-28 10 10
29 2003 December 29 2003-12-29 10 10
30 2003 December 30 2003-12-30 10 10
31 2003 December 31 2003-12-31 10 10
32 2004 January 1 2004-01-01 10 10
33 2004 January 2 2004-01-02 10 10
34 2004 January 3 2004-01-03 10 10
35 2004 January 4 2004-01-04 10 10
36 2004 January 5 2004-01-05 10 10
37 2004 January 6 2004-01-06 10 10
38 2004 January 7 2004-01-07 10 10
39 2004 January 8 2004-01-08 10 10
40 2004 January 9 2004-01-09 10 10
41 2004 January 10 2004-01-10 10 10
42 2004 January 11 2004-01-11 10 10
43 2004 January 12 2004-01-12 10 10
44 2004 January 13 2004-01-13 10 10
45 2004 January 14 2004-01-14 10 10
46 2004 January 15 2004-01-15 10 10
47 2004 January 16 2004-01-16 10 10
48 2004 January 17 2004-01-17 10 10
49 2004 January 18 2004-01-18 10 10
50 2004 January 19 2004-01-19 10 10
51 2004 January 20 2004-01-20 10 10
52 2004 January 21 2004-01-21 10 10
53 2004 January 22 2004-01-22 10 10
54 2004 January 23 2004-01-23 10 10
55 2004 January 24 2004-01-24 10 10
56 2004 January 25 2004-01-25 5 4
57 2004 January 26 2004-01-26 10 10
58 2004 January 27 2004-01-27 10 10
59 2004 January 28 2004-01-28 10 10
60 2004 January 29 2004-01-29 10 10
61 2004 January 30 2004-01-30 10 10
62 2004 January 31 2004-01-31 10 10
63 2004 February 1 2004-02-01 10 10
64 2004 February 2 2004-02-02 5 4
65 2004 February 3 2004-02-03 10 10
66 2004 February 4 2004-02-04 10 10
67 2004 February 5 2004-02-05 10 10
68 2004 February 6 2004-02-06 10 10
69 2004 February 7 2004-02-07 10 10
70 2004 February 8 2004-02-08 10 10
71 2004 February 9 2004-02-09 7 6
72 2004 February 10 2004-02-10 10 10
73 2004 February 11 2004-02-11 10 10
74 2004 February 12 2004-02-12 10 10
75 2004 February 13 2004-02-13 10 10
76 2004 February 14 2004-02-14 10 10
77 2004 February 15 2004-02-15 10 10
78 2004 February 16 2004-02-16 10 10
79 2004 February 17 2004-02-17 10 10
80 2004 February 18 2004-02-18 10 10
81 2004 February 19 2004-02-19 10 10
82 2004 February 20 2004-02-20 10 10
83 2004 February 21 2004-02-21 10 10
84 2004 February 22 2004-02-22 10 10
85 2004 February 23 2004-02-23 10 10
86 2004 February 24 2004-02-24 10 10
87 2004 February 25 2004-02-25 10 10
88 2004 February 26 2004-02-26 10 10
89 2004 February 27 2004-02-27 10 10
90 2004 February 28 2004-02-28 10 10
91 2004 February 29 2004-02-29 10 10
What I want to do is almost what aggregate() does:
min <- aggregate(Data[5:6], by= list(Data$Months, Data$Years), FUN = min)
Group.1 Group.2 A B
December 2003 3 4
January 2004 5 4
February 2004 5 4
Instead, I want each monthly minimum of A and of B to be at least 10 days after the previous month's minimum.
So I would like to get:
Group.1 Group.2 A B
December 2003 3 4
January 2004 5 4
February 2004 7 6
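Any ideas?
Answer 0 (score: 2)
Well, if you're interested, I have a messy solution :)
First, let's make sure the months are ordered properly and create a factor for the month/year combination (the data frame is named data here)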
data$Months<-factor(data$Months, levels=month.name)
data$MY<-interaction(data$Months, data$Years, drop=T)
Now I'll define a few helper functions
getpaddoff<-function(n) {
function(x) {
a<-which.min(x)+n-length(x);
ifelse(a>0,a,0)
}
}
rollright<-function(x, add=0) {
n<-names(x)
x<-head(c(add,x), -1)
names(x)<-n;
x
}
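The getpaddoff function returns a function that computes how many days at the start of the following month must be excluded to avoid overlap, and rollright lets me shift those values from one month to the next. As written, getpaddoff requires a data entry for every day of every month.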
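To make the two helpers concrete, here is a small sketch that applies them to toy vectors shaped like the sample months. The vectors jan and dec and the named vector passed to rollright are made up for illustration; the helper definitions are repeated so the snippet runs on its own.

```r
# Same helper as above: returns a function that computes how many days at the
# start of the NEXT month fall within n days of this month's minimum
getpaddoff <- function(n) {
  function(x) {
    a <- which.min(x) + n - length(x)
    ifelse(a > 0, a, 0)
  }
}
padoff <- getpaddoff(10)

# A 31-day month whose minimum falls on day 25 (like January, column A)
jan <- rep(10, 31); jan[25] <- 5
padoff(jan)   # 25 + 10 - 31 = 4

# A month whose minimum falls early (day 8, like December) blocks nothing
dec <- rep(10, 31); dec[8] <- 3
padoff(dec)   # 8 + 10 - 31 = -13, clamped to 0

# rollright shifts each month's value onto the following month
rollright <- function(x, add = 0) {
  n <- names(x)
  x <- head(c(add, x), -1)
  names(x) <- n
  x
}
shifted <- rollright(c(Dec = 0, Jan = 4, Feb = 0))
shifted
# Dec Jan Feb
#   0   0   4
```

So a January minimum on day 25 means the first 4 days of February are masked, which is why the value on February 2 is not picked.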
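OK, now we apply these to the data. We get a function that enforces a 10-day gap. Then we split the data by month/year. Then we compute how many days must be dropped at the start of each month because the previous month's minimum fell too close to the end of that month.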
padoff <- getpaddoff(10)
ss <- split(data[c("A","B")], data$MY)
offsets <- rollright(lapply(ss, function(x) sapply(x, padoff)),
add=list(c(A=0, B=0)))
Once we have these values, we can find the non-overlapping minimum for each month.
rr<-Map(function(d,off) {
d<-as.matrix(d)
stopifnot(ncol(d)==length(off))
for(i in seq_along(off)) {
if(off[i]>0)
d[1:off[i],i]<-Inf
}
apply(d,2,min)
}, ss, offsets)
do.call(rbind,rr)
Here is the result
A B
December.2003 3 4
January.2004 5 4
February.2004 7 6
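I'm not sure exactly how you need the result formatted, but this at least extracts the values you want.
Answer 1 (score: 2)
This solution is only a dozen lines. We first split the input data frame into a list of data frames, ym, each of which represents one year/month. Then we sapply over the columns for which we want to compute minimums. For each column we iterate through the components of ym so that for each component, i.e. for each data frame, we subset it to s, a data frame of those rows that are at least 10 days after minDate, find the row of the minimum, ix, update minDate and return the result: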
ym <- split(DF, format(DF$Date, "%Y-%m"))
sapply(c("A", "B"), function(col) {
minDate <- min(DF$Date) - 10
result <- vector(length = length(ym))
for(i in seq_along(ym)) {
s <- subset(ym[[i]], Date >= minDate + 10)
ix <- which.min(s[[col]])
minDate <- s$Date[ix]
result[i] <- min(s[[col]][ix])
}
setNames(result, names(ym))
})
This gives:
A B
2003-12 3 4
2004-01 5 4
2004-02 7 6
(We only use the "Date", "A" and "B" columns of DF, so DF could first be reduced to just those columns.)
Note: We assume this data frame as input:
DF
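The definition of DF is not reproduced above. A minimal sketch of one way to rebuild it, assuming it is simply the question's sample data with Date stored as class Date (the index positions below are read off the question's table, and only the three columns the answer uses are kept):

```r
# Rebuild the sample data: 91 consecutive days, all values 10 except a few dips
dates <- seq(as.Date("2003-12-01"), as.Date("2004-02-29"), by = "day")

A <- rep(10, length(dates))
A[c(8, 56, 64, 71)] <- c(3, 5, 5, 7)    # Dec 8, Jan 25, Feb 2, Feb 9

B <- rep(10, length(dates))
B[c(12, 56, 64, 71)] <- c(4, 4, 4, 6)   # Dec 12, Jan 25, Feb 2, Feb 9

DF <- data.frame(Date = dates, A = A, B = B)

# Sanity check: show only the rows that differ from the baseline value of 10
DF[DF$A < 10 | DF$B < 10, ]
```

Because Date is a real Date column, format(DF$Date, "%Y-%m") and the comparison Date >= minDate + 10 in the code above work as intended.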
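Answer 2 (score: 1)
Here is a different strategy that creates a nonoverlapmin function. It assumes the data is already sorted properly within each group. First I'll make sure the months are ordered correctly and create a combined factor that tracks month/year in a single variable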
data$Months <- factor(data$Months, levels=month.name)
data$MY <- interaction(data$Months, data$Years, drop=T)
Here is the main function
nonoverlapmin <- function(vals, groups, dist) {
stopifnot(length(vals)==length(groups))
groups<-ordered(groups)
r <- numeric(nlevels(groups))
names(r) <- levels(groups)
for (v in levels(groups)) {
i <- which.min(vals[groups<=v])
r[v] <- vals[i]
vals[ 1:min(max(i+dist, max(which(groups==v))),length(vals))]<-Inf
}
r
}
We can use it by calling
nonoverlapmin(data$A, data$MY, 10)
# December.2003 January.2004 February.2004
# 3 5 7
nonoverlapmin(data$B, data$MY, 10)
# December.2003 January.2004 February.2004
# 4 4 6
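The method uses a loop to find each minimum in turn, then replaces the values up to and including the next dist positions with Inf so that they cannot be selected as a later minimum. The loop works through the values one group at a time.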
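The masking idea can be seen in isolation on a short made-up vector. This is a simplified sketch: it ignores the grouping and only shows how overwriting a prefix with Inf forces the next which.min to look further along.

```r
vals <- c(5, 3, 9, 2, 8, 7)

# First pick: the overall minimum
i <- which.min(vals)      # position 4 (value 2)

# Block everything up to position i, plus the next `dist` positions
dist <- 1
vals[1:min(i + dist, length(vals))] <- Inf

# The next pick can only come from the unmasked tail
j <- which.min(vals)      # position 6 (value 7)
vals[j]
```

In nonoverlapmin the masked range additionally extends to the end of the current group, so a later month can never pick a value from an earlier one.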
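Answer 3 (score: 1)
I think it is sometimes best to go back to basics rather than hunt for the most efficient vectorized implementation. Remember, developer time is more valuable than CPU time :P
A simple for loop does the trick.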
read.table("Data.txt", header=T, sep="\t", stringsAsFactors=F) -> Data
result = matrix(ncol=4, nrow=0)
min_indA = -100; min_indB = -100; minA = 100; minB = 100
curMonth = "December"
curYear = 2003
for(i in 1:nrow(Data)) {
if(curMonth == Data[i,"Months"] & curYear == Data[i,"Years"]) {
if(Data[i,"A"] < minA & i - min_indA >= 10) {
minA = Data[i,"A"]
cur_indA = i
}
if(Data[i,"B"] < minB & i - min_indB >= 10) {
minB = Data[i,"B"]
cur_indB = i
}
} else {
result = rbind(result, c(curYear, curMonth, minA, minB))
minA = Data[i,"A"]; minB = Data[i,"B"]; min_indA = cur_indA; min_indB = cur_indB;
curMonth = Data[i,"Months"]
curYear = Data[i,"Years"]
}
}
result = rbind(result, c(curYear, curMonth, minA, minB))
print(result)
[,1] [,2] [,3] [,4]
[1,] "2003" "December" "3" "4"
[2,] "2004" "January" "5" "4"
[3,] "2004" "February" "7" "6"
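Note that the loop measures the gap in rows (i - min_indA), which matches the gap in days only when there is exactly one row per consecutive day, as in the sample data. A small sketch of the difference, using made-up dates with a missing stretch:

```r
# Hypothetical series with missing days between the 2nd and 3rd observation
dates <- as.Date(c("2004-01-25", "2004-01-26", "2004-02-09"))

row_gap <- diff(seq_along(dates))    # gap measured in rows
day_gap <- as.numeric(diff(dates))   # gap measured in calendar days

row_gap   # 1 1
day_gap   # 1 14
```

If the series can have gaps, comparing differences of the Date column instead of row indices would be the safer check.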
Answer 4 (score: 1)
I think the most efficient approach to this problem is a recursive function...
#Load data
require("data.table")
Data <- fread("min10.csv")
Data <- data.table(Data)
Data[,Date:=as.Date(Date)]
Here is the function..
#Build recursive function
findmin10 <- function(Data,Var){
Data$Var1 <- get(Var,Data)
#Find min date for value A
Data[,minVar:=min(Var1),by=c("Years","Months")]
Data[,minVarDate:=(Var1==minVar)*1]
Summ <- Data[minVarDate==1][,ord:=.I]
Summ[,Date.Diff:=c(NA,head(as.numeric(Date[ord+1]-Date[ord]),-1))]
To.Delete.Date <- Summ[Date.Diff<10]$Date
#Utilize recursion until 10 day spacing requirement is met
if (length(To.Delete.Date)!=0){
Data <- Data[!Date%in%To.Delete.Date]
findmin10(Data,Var=Var)
} else {
return(Summ[,list(Years,Months,Var1,VarName=Var)])
}
}
Use lapply to retrieve the results for multiple variables
#Run through multiple variables you want to find the min 10 for
outtable <- rbindlist(lapply(c("A","B"),FUN=function(x) findmin10(Data=Data,Var=x)))
Cast the results into the desired format.
#Cast it out to make it look like desired result
library("reshape2")
dcast.data.table(outtable,Years+Months~VarName,value.var="Var1")
# Years Months A B
# 1: 2003 December 3 4
# 2: 2004 February 7 6
# 3: 2004 January 5 4
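The core of each recursive step is the spacing check between consecutive monthly minima. Here is a base R sketch of that single step, without data.table, using the dates of column A's plain monthly minima from the sample data. The names summ and to_delete are illustrative and not part of the answer's code.

```r
# One row per month: date and value of the plain monthly minimum of A
summ <- data.frame(
  Date = as.Date(c("2003-12-08", "2004-01-25", "2004-02-02")),
  A    = c(3, 5, 5)
)

# Days since the previous month's minimum (NA for the first month)
summ$Date.Diff <- c(NA, as.numeric(diff(summ$Date)))

# Dates violating the 10-day rule; these rows are dropped before recursing
to_delete <- summ$Date[!is.na(summ$Date.Diff) & summ$Date.Diff < 10]
to_delete
```

Here only February 2 is flagged (8 days after January 25). Once that row is removed from the data, the next pass finds February 9 as the February minimum, and the recursion stops because no gap is shorter than 10 days.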