How to return monthly minimums for each year that are at least 10 days apart

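Time: 2014-06-02 20:08:36

Tags: r

I have a daily time series, and I want to get the minimum for each month of each year, but I want to ensure that the results are at least 10 days apart. To be more specific, consider the following sample data frame.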

>Data
        Years   Months     Days Date        A   B
    1   2003    December    1   2003-12-01  10  10
    2   2003    December    2   2003-12-02  10  10
    3   2003    December    3   2003-12-03  10  10
    4   2003    December    4   2003-12-04  10  10
    5   2003    December    5   2003-12-05  10  10
    6   2003    December    6   2003-12-06  10  10
    7   2003    December    7   2003-12-07  10  10
    8   2003    December    8   2003-12-08  3   10
    9   2003    December    9   2003-12-09  10  10
    10  2003    December    10  2003-12-10  10  10
    11  2003    December    11  2003-12-11  10  10
    12  2003    December    12  2003-12-12  10  4
    13  2003    December    13  2003-12-13  10  10
    14  2003    December    14  2003-12-14  10  10
    15  2003    December    15  2003-12-15  10  10
    16  2003    December    16  2003-12-16  10  10
    17  2003    December    17  2003-12-17  10  10
    18  2003    December    18  2003-12-18  10  10
    19  2003    December    19  2003-12-19  10  10
    20  2003    December    20  2003-12-20  10  10
    21  2003    December    21  2003-12-21  10  10
    22  2003    December    22  2003-12-22  10  10
    23  2003    December    23  2003-12-23  10  10
    24  2003    December    24  2003-12-24  10  10
    25  2003    December    25  2003-12-25  10  10
    26  2003    December    26  2003-12-26  10  10
    27  2003    December    27  2003-12-27  10  10
    28  2003    December    28  2003-12-28  10  10
    29  2003    December    29  2003-12-29  10  10
    30  2003    December    30  2003-12-30  10  10
    31  2003    December    31  2003-12-31  10  10
    32  2004    January     1   2004-01-01  10  10
    33  2004    January     2   2004-01-02  10  10
    34  2004    January     3   2004-01-03  10  10
    35  2004    January     4   2004-01-04  10  10
    36  2004    January     5   2004-01-05  10  10
    37  2004    January     6   2004-01-06  10  10
    38  2004    January     7   2004-01-07  10  10
    39  2004    January     8   2004-01-08  10  10
    40  2004    January     9   2004-01-09  10  10
    41  2004    January     10  2004-01-10  10  10
    42  2004    January     11  2004-01-11  10  10
    43  2004    January     12  2004-01-12  10  10
    44  2004    January     13  2004-01-13  10  10
    45  2004    January     14  2004-01-14  10  10
    46  2004    January     15  2004-01-15  10  10
    47  2004    January     16  2004-01-16  10  10
    48  2004    January     17  2004-01-17  10  10
    49  2004    January     18  2004-01-18  10  10
    50  2004    January     19  2004-01-19  10  10
    51  2004    January     20  2004-01-20  10  10
    52  2004    January     21  2004-01-21  10  10
    53  2004    January     22  2004-01-22  10  10
    54  2004    January     23  2004-01-23  10  10
    55  2004    January     24  2004-01-24  10  10
    56  2004    January     25  2004-01-25  5   4
    57  2004    January     26  2004-01-26  10  10
    58  2004    January     27  2004-01-27  10  10
    59  2004    January     28  2004-01-28  10  10
    60  2004    January     29  2004-01-29  10  10
    61  2004    January     30  2004-01-30  10  10
    62  2004    January     31  2004-01-31  10  10
    63  2004    February    1   2004-02-01  10  10
    64  2004    February    2   2004-02-02  5   4
    65  2004    February    3   2004-02-03  10  10
    66  2004    February    4   2004-02-04  10  10
    67  2004    February    5   2004-02-05  10  10
    68  2004    February    6   2004-02-06  10  10
    69  2004    February    7   2004-02-07  10  10
    70  2004    February    8   2004-02-08  10  10
    71  2004    February    9   2004-02-09  7   6
    72  2004    February    10  2004-02-10  10  10
    73  2004    February    11  2004-02-11  10  10
    74  2004    February    12  2004-02-12  10  10
    75  2004    February    13  2004-02-13  10  10
    76  2004    February    14  2004-02-14  10  10
    77  2004    February    15  2004-02-15  10  10
    78  2004    February    16  2004-02-16  10  10
    79  2004    February    17  2004-02-17  10  10
    80  2004    February    18  2004-02-18  10  10
    81  2004    February    19  2004-02-19  10  10
    82  2004    February    20  2004-02-20  10  10
    83  2004    February    21  2004-02-21  10  10
    84  2004    February    22  2004-02-22  10  10
    85  2004    February    23  2004-02-23  10  10
    86  2004    February    24  2004-02-24  10  10
    87  2004    February    25  2004-02-25  10  10
    88  2004    February    26  2004-02-26  10  10
    89  2004    February    27  2004-02-27  10  10
    90  2004    February    28  2004-02-28  10  10
    91  2004    February    29  2004-02-29  10  10
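
For anyone who wants to reproduce the table above, here is a minimal sketch that rebuilds it. It assumes Date is stored as class Date and Months as a character column; those types are assumptions, since the printout does not show them. The row numbers used to overwrite values are the ones shown in the table.

```r
# Daily dates covering December 2003 through February 2004 (91 rows)
Date <- seq(as.Date("2003-12-01"), as.Date("2004-02-29"), by = "day")

Data <- data.frame(
  Years  = as.integer(format(Date, "%Y")),
  Months = month.name[as.integer(format(Date, "%m"))],  # English month names
  Days   = as.integer(format(Date, "%d")),
  Date   = Date,
  A      = 10,
  B      = 10,
  stringsAsFactors = FALSE
)

# Overwrite the few values that differ from 10 (row numbers from the table)
Data$A[c(8, 56, 64, 71)]  <- c(3, 5, 5, 7)
Data$B[c(12, 56, 64, 71)] <- c(4, 4, 4, 6)
```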

What I want to do is almost what aggregate() does

min <- aggregate(Data[5:6], by= list(Data$Months, Data$Years), FUN = min)


Group.1     Group.2 A   B
December    2003    3   4
January     2004    5   4
February    2004    5   4

Instead, I want the minimum of each of A and B to be at least 10 days apart from the previous month's minimum.

So I would like to get:

Group.1     Group.2 A   B
December    2003    3   4
January     2004    5   4
February    2004    7   6
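
As a quick check of why the February row changes, here is a small sketch using only dates from the sample table (the variable names are illustrative). The plain February minimum falls on 2004-02-02, which is too close to the January minimum on 2004-01-25, so the next low value on 2004-02-09 is the one wanted.

```r
jan_min       <- as.Date("2004-01-25")  # January minimum (A = 5, B = 4)
feb_first_low <- as.Date("2004-02-02")  # plain February minimum (A = 5, B = 4)
feb_next_low  <- as.Date("2004-02-09")  # next lowest February value (A = 7, B = 6)

# Gaps in days relative to the January minimum
gap_first <- as.numeric(feb_first_low - jan_min)  # fewer than 10 days, rejected
gap_next  <- as.numeric(feb_next_low - jan_min)   # at least 10 days, accepted
c(gap_first = gap_first, gap_next = gap_next)
```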

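Any ideas?

5 Answers:

Answer 0: (score: 2)

Well, if you're interested, I have a messy solution :)

First, let's make sure the months are ordered correctly and create a factor for the month/year combination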

data$Months<-factor(data$Months, levels=month.name)
data$MY<-interaction(data$Months, data$Years, drop=T)

Now I'll define some helper functions

getpaddoff<-function(n) {
    function(x) {
        a<-which.min(x)+n-length(x); 
        ifelse(a>0,a,0) 
    }
}
rollright<-function(x, add=0) {
  n<-names(x)
  x<-head(c(add,x), -1)
  names(x)<-n;
  x
}

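The getpaddoff function builds a function that returns the number of days at the start of the following month that must be excluded to keep the minimums from overlapping. rollright lets me shift those returned values from one month to the next. As written, getpaddoff requires a data entry for every day of every month.

OK, now we apply these to the data. We get a function that enforces the 10-day gap. Then we split the data by month/year. Then we compute how many days must be dropped from the start of each month because the previous month's minimum fell too close to the end of that month.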
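
To make the two helpers concrete, here is a small sketch on toy inputs. The helper definitions are repeated from above so the block runs on its own; the vectors jan_A and dec_A and the names Dec/Jan/Feb are made up for illustration and mimic column A of the sample data.

```r
getpaddoff <- function(n) {
  function(x) {
    a <- which.min(x) + n - length(x)
    ifelse(a > 0, a, 0)
  }
}
rollright <- function(x, add = 0) {
  n <- names(x)
  x <- head(c(add, x), -1)
  names(x) <- n
  x
}

pad10 <- getpaddoff(10)

# A 31-day month with its minimum on day 25: 25 + 10 - 31 = 4 days spill over
jan_A <- rep(10, 31); jan_A[25] <- 5
pad_jan <- pad10(jan_A)

# A 31-day month with its minimum on day 8: 8 + 10 - 31 < 0, so nothing spills over
dec_A <- rep(10, 31); dec_A[8] <- 3
pad_dec <- pad10(dec_A)

# Shift each month's padding onto the following month; the first month gets `add`
shifted <- rollright(c(Dec = pad_dec, Jan = pad_jan, Feb = 0))
shifted
```

Here February inherits January's padding of 4, which is why the first four February days are ignored in the next step.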

padoff <- getpaddoff(10)
ss <- split(data[c("A","B")], data$MY)
offsets <- rollright(lapply(ss, function(x) sapply(x, padoff)), 
    add=list(c(A=0, B=0)))

Once we have these values, we can find the non-overlapping minimum for each month.

rr<-Map(function(d,off) {
    d<-as.matrix(d)
    stopifnot(ncol(d)==length(off))
    for(i in seq_along(off)) {
         if(off[i]>0)
             d[1:off[i],i]<-Inf
    }
    apply(d,2,min)
}, ss, offsets)
do.call(rbind,rr)

Here are the results

              A B
December.2003 3 4
January.2004  5 4
February.2004 7 6

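I'm not sure exactly how you need the results formatted, but this at least extracts the values you want.

Answer 1: (score: 2)

This solution is only about a dozen lines. We first split the input data frame into a list of data frames, ym, one per year/month. Then we sapply over the columns whose minimums we want. For each column we iterate over the components of ym; for each component, i.e. each data frame, we subset it to s, the rows that are at least 10 days after minDate, find the row ix at which the minimum occurs, update minDate, and store the result.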

ym <- split(DF, format(DF$Date, "%Y-%m"))
sapply(c("A", "B"), function(col) {
   minDate <- min(DF$Date) - 10
   result <- vector(length = length(ym)) 
   for(i in seq_along(ym)) {
       s <- subset(ym[[i]], Date >= minDate + 10)
       ix <- which.min(s[[col]])
       minDate <- s$Date[ix]
       result[i] <- min(s[[col]][ix])
   }
   setNames(result, names(ym))
})

This gives:

        A B
2003-12 3 4
2004-01 5 4
2004-02 7 6

(We only use the "Date", "A" and "B" columns of DF, so we could reduce DF to those columns first.)

Note: We assume this data frame as input:

DF

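Answer 2: (score: 1)

Here is a different strategy that creates a nonoverlapmin function. It assumes the data are already sorted correctly within each group. I'll make sure the months are ordered correctly and create a combined factor that tracks month/year in a single variable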

data$Months <- factor(data$Months, levels=month.name)
data$MY <- interaction(data$Months, data$Years, drop=T)

Here is the main function

nonoverlapmin <- function(vals, groups, dist) {
    stopifnot(length(vals)==length(groups))
    groups<-ordered(groups)
    r <- numeric(nlevels(groups))
    names(r) <- levels(groups)
    for (v in levels(groups)) {
        i <- which.min(vals[groups<=v])
        r[v] <- vals[i]
        vals[ 1:min(max(i+dist, max(which(groups==v))),length(vals))]<-Inf
    }
    r
}

We can use it by calling
nonoverlapmin(data$A, data$MY, 10)
# December.2003  January.2004 February.2004 
#             3             5             7 

nonoverlapmin(data$B, data$MY, 10)
# December.2003  January.2004 February.2004 
#             4             4             6 

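This method uses a loop to find the minimums step by step; after each one it replaces the next dist values with Inf so that they cannot be chosen as a minimum. The loop works through the values one group at a time.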
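
To see the masking at work on something small, here is a sketch with made-up toy data (two groups of three values and dist = 2; none of these numbers come from the question). The function is repeated from above so the block runs on its own.

```r
nonoverlapmin <- function(vals, groups, dist) {
    stopifnot(length(vals) == length(groups))
    groups <- ordered(groups)
    r <- numeric(nlevels(groups))
    names(r) <- levels(groups)
    for (v in levels(groups)) {
        i <- which.min(vals[groups <= v])
        r[v] <- vals[i]
        vals[1:min(max(i + dist, max(which(groups == v))), length(vals))] <- Inf
    }
    r
}

# Toy input: the plain per-group minimums would be g1 = 1 and g2 = 0
toy_vals   <- c(5, 4, 1, 0, 3, 2)
toy_groups <- c("g1", "g1", "g1", "g2", "g2", "g2")

# g1's minimum sits at position 3, so positions 1 to 5 are masked with Inf
# and only position 6 (value 2) is left for g2
toy_res <- nonoverlapmin(toy_vals, toy_groups, 2)
toy_res
```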

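Answer 3: (score: 1)

I think it is sometimes best to go back to basics rather than hunt for the most efficient vectorized implementation. Remember, developer time matters more than CPU time :P

A simple for loop does the trick.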

read.table("Data.txt", header=T, sep="\t", stringsAsFactors=F) -> Data
result = matrix(ncol=4, nrow=0)
min_indA = -100; min_indB = -100; minA = 100; minB = 100
curMonth = "December"
curYear = 2003
for(i in 1:nrow(Data)) {
    if(curMonth == Data[i,"Months"] & curYear == Data[i,"Years"]) {
        if(Data[i,"A"] < minA & i - min_indA >= 10) {
            minA = Data[i,"A"]
            cur_indA = i
        }
        if(Data[i,"B"] < minB & i - min_indB >= 10) {
            minB = Data[i,"B"]
            cur_indB = i
        }
    } else {
        result = rbind(result, c(curYear, curMonth, minA, minB))
        minA = Data[i,"A"]; minB = Data[i,"B"]; min_indA = cur_indA; min_indB = cur_indB;
        curMonth = Data[i,"Months"]
        curYear = Data[i,"Years"]
    }
}
result = rbind(result, c(curYear, curMonth, minA, minB))
  

print(result)

     [,1]   [,2]       [,3] [,4]
[1,] "2003" "December" "3"  "4" 
[2,] "2004" "January"  "5"  "4" 
[3,] "2004" "February" "7"  "6"

Answer 4: (score: 1)

I think the most efficient route for this problem is a recursive function...

#Load data
require("data.table")
Data <- fread("min10.csv")
Data <- data.table(Data)
Data[,Date:=as.Date(Date)]

Here is the function..

#Build recursive function
findmin10 <- function(Data,Var){

  Data$Var1 <- get(Var,Data)

  #Find min date for value A
  Data[,minVar:=min(Var1),by=c("Years","Months")]
  Data[,minVarDate:=(Var1==minVar)*1]
  Summ <- Data[minVarDate==1][,ord:=.I]
  Summ[,Date.Diff:=c(NA,head(as.numeric(Date[ord+1]-Date[ord]),-1))]
  To.Delete.Date <- Summ[Date.Diff<10]$Date

  #Utilize recursion until 10 day spacing requirement is met
  if (length(To.Delete.Date)!=0){
    Data <- Data[!Date%in%To.Delete.Date]
    findmin10(Data,Var=Var)
  } else {
    return(Summ[,list(Years,Months,Var1,VarName=Var)])
  }
}

Use lapply to retrieve the results for multiple variables

#Run through multiple variables you want to find the min 10 for
outtable <- rbindlist(lapply(c("A","B"),FUN=function(x) findmin10(Data=Data,Var=x)))

Cast the results into the desired format.

#Cast it out to make it look like desired result
library("reshape2")
dcast.data.table(outtable,Years+Months~VarName,value.var="Var1")

#    Years   Months A B
# 1:  2003 December 3 4
# 2:  2004 February 7 6
# 3:  2004  January 5 4