Growing-window prediction with machine learning models using caret createTimeSlices

Time: 2016-07-12 21:23:21

Tags: r machine-learning time-series r-caret

I created the dataset like this, because I want to work with a time-series-based dataset:

library(quantmod)

getSymbols("^GSPC")
DF=data.frame(GSPC,DATE=time(GSPC))
PriceChange=(DF$GSPC.Close-DF$GSPC.Open)
DF$Class<-as.factor(ifelse(PriceChange>0,"UP","DOWN"))
DF$year = as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH = as.numeric(format(DF$DATE, format = "%m"))

So the dataset now has additional columns:

GSPC.Open GSPC.High GSPC.Low GSPC.Close GSPC.Volume Class year MONTH
1418.03   1429.42   1407.86  1416.60    3429160000  DOWN  2007 1
1416.60   1421.84   1408.43  1418.34    3004460000  UP    2007 1

Then I replaced the year with a number that can be added to the month to keep a running month index (I guess there are smarter ways):

library(data.table)
DF=data.table(DF)

DF[year==2007,year:=0]
DF[year==2008,year:=12]
DF[year==2009,year:=24]
DF[year==2010,year:=36]
DF[year==2011,year:=48]
DF[year==2012,year:=60]
DF[year==2013,year:=72]
DF[year==2014,year:=84]
DF[year==2015,year:=96]
DF[year==2016,year:=108]

DF$Month_Index=(DF$year+DF$MONTH)
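
The year-by-year replacement above can probably be collapsed into one arithmetic expression. The following is only a minimal sketch, assuming the earliest year in the data is the base year; the `dates` vector is made up for illustration and stands in for DF$DATE.

```r
# Hypothetical sample dates standing in for DF$DATE
dates <- as.Date(c("2007-01-03", "2007-02-01", "2008-01-02", "2016-07-12"))

year  <- as.numeric(format(dates, "%Y"))
month <- as.numeric(format(dates, "%m"))

# Months elapsed since January of the first year, counting from 1
Month_Index <- (year - min(year)) * 12 + month
Month_Index
# 1 2 13 115
```

This gives the same values as the manual mapping (2007 becomes 0, 2008 becomes 12, and so on) without one line per year.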

Then I used Month_Index (running from Month_Index=1 to Month_Index=115) with createTimeSlices from caret to make growing-window predictions.

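Now I want to save the predictions of each step together with their correct index, as well as the accuracy. My question is how to do that.

2 answers:

Answer 0: (score: 2)

One way to achieve this: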

library(quantmod)
library(data.table)
library(caret)

getSymbols("^GSPC")
DF <- data.frame(GSPC,DATE=time(GSPC))
PriceChange <- (DF$GSPC.Close-DF$GSPC.Open)
DF$Class <- as.factor(ifelse(PriceChange>0,"UP","DOWN"))

You can create the month index in either of two ways:

# 1
DF$yearMon <- zoo::as.yearmon(DF$DATE)
DF <- data.table(DF)
DF[,  Month_Index:= .GRP, by = yearMon]

# 2
DF$year <- as.numeric(format(DF$DATE, format = "%Y"))
DF$MONTH <- as.numeric(format(DF$DATE, format = "%m"))
DF[, Month_Index2 := .GRP, by = .(year, MONTH)]

identical(DF$Month_Index, DF$Month_Index2)
[1] TRUE


Month_Index <- length(unique(DF$Month_Index))

TimeSlices <- createTimeSlices(1:Month_Index, 5, horizon = 2,
                            fixedWindow = FALSE, skip = 0)
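
To see what the growing window looks like, here is a small sketch on a toy series of 10 points. The toy length is an assumption chosen for readability; the arguments otherwise mirror the call above.

```r
library(caret)

# Toy example: 10 time points, initial window of 5, forecast horizon of 2
slices <- createTimeSlices(1:10, initialWindow = 5, horizon = 2,
                           fixedWindow = FALSE, skip = 0)

# Number of slices: 10 - 5 - 2 + 1
length(slices$train)

# With fixedWindow = FALSE every training set starts at 1 and grows by one
slices$train[[1]]  # first training window
slices$test[[1]]   # the 2 points right after it
slices$train[[2]]  # one point longer than the first
slices$test[[2]]
```

By the same arithmetic, 115 months with an initial window of 5 and a horizon of 2 would give 115 - 5 - 2 + 1 = 109 slices.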

Create three empty lists to hold the results:

totalSlices <- length(TimeSlices$train)

plsFitTime <- vector("list", totalSlices)
Prediction <- vector("list", totalSlices)
Accuracy   <- vector("list", totalSlices)

Save all results into these lists:

k <- 1:totalSlices

for(i in seq_along(k))
{

    plsFitTime[[i]] <- train(Class~.,
                             data = DF[TimeSlices$train[[i]],],
                             method = "pls")

    Prediction[[i]] <- predict(plsFitTime[[i]], 
                              DF[TimeSlices$test[[i]],])

    Accuracy[[i]] <- confusionMatrix(Prediction[[i]], 
                                     DF[TimeSlices$test[[i]],]$Class)$overall[1]

}

All models are saved in plsFitTime, the predictions in Prediction, and the accuracies in Accuracy.
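
One thing to check: the slices were built over 1:Month_Index, so each element is a month number, while DF[TimeSlices$train[[i]], ] selects by row number. If the intent is to train on whole months, one option is to select rows by Month_Index and keep that index next to each prediction. The following is only a sketch under stated assumptions: `DF_demo` is synthetic data, the slice lists are written out by hand, and a majority-class rule stands in for train()/predict() so the example runs without fitting a model.

```r
# Synthetic daily data: 8 months with 20 rows each (stands in for DF)
set.seed(1)
DF_demo <- data.frame(
  Month_Index = rep(1:8, each = 20),
  Class = factor(sample(c("DOWN", "UP"), 160, replace = TRUE),
                 levels = c("DOWN", "UP"))
)

# Growing-window slices over months, written out by hand; same shape as
# createTimeSlices(1:8, 5, horizon = 2, fixedWindow = FALSE)
train_months <- list(1:5, 1:6)
test_months  <- list(6:7, 7:8)

results <- lapply(seq_along(train_months), function(i) {
  # Select rows by month, not by row number
  train_rows <- DF_demo[DF_demo$Month_Index %in% train_months[[i]], ]
  test_rows  <- DF_demo[DF_demo$Month_Index %in% test_months[[i]], ]

  # Stand-in for train()/predict(): always predict the majority training class
  majority <- names(which.max(table(train_rows$Class)))
  pred <- factor(rep(majority, nrow(test_rows)), levels = levels(DF_demo$Class))

  # Keep the slice number and month index next to every prediction
  data.frame(slice = i,
             Month_Index = test_rows$Month_Index,
             observed = test_rows$Class,
             predicted = pred)
})
predictions <- do.call(rbind, results)

# Accuracy per slice: share of predictions that match the observed class
predictions$correct <- predictions$observed == predictions$predicted
accuracy <- aggregate(correct ~ slice, data = predictions, FUN = mean)
```

Here `predictions` holds one row per test observation with its slice and month index, and `accuracy` holds one row per slice.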

Update

A more robust approach is to use the purrr package.

After creating the time slices, you can use:

library(purrr)

customFunction <- function(x, y) {
    model <- train(Class~.,
                   data = DF[x],
                   method = "pls")

    prediction <- predict(model, DF[y])

    accuracy <- confusionMatrix(prediction, 
                                DF[y]$Class)$overall[1]

    return(list(prediction, accuracy))
}

results <- map2_df(TimeSlices$train, TimeSlices$test, customFunction)

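map2_df is a function that takes two lists, .x and .y, as arguments, applies the function .f to each pair of corresponding elements, and returns the results as a data frame.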
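
To make the pairing concrete, here is a small sketch using map2 and map2_dbl from the same family on toy index lists. The lists `train_idx` and `test_idx` are made up for illustration and stand in for TimeSlices$train and TimeSlices$test.

```r
library(purrr)

# Hypothetical index lists standing in for TimeSlices$train / TimeSlices$test
train_idx <- list(1:5, 1:6, 1:7)
test_idx  <- list(6:7, 7:8, 8:9)

# map2() calls the function once per pair: (train_idx[[i]], test_idx[[i]])
sizes <- map2(train_idx, test_idx, function(x, y) {
  list(n_train = length(x), n_test = length(y), first_test = min(y))
})

# map2_dbl() does the same but returns a plain numeric vector
n_train <- map2_dbl(train_idx, test_idx, function(x, y) length(x))

sizes[[2]]$n_train
n_train
```

Returning a named list from the function, as in this sketch, tends to make the collected results easier to address later.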

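You could define the function on the fly (as with lapply), but I created customFunction in the global environment just to keep the code clean.

Inside the function, DF[x] is equivalent to DF[TimeSlices$train[[n]]] and DF[y] to DF[TimeSlices$test[[n]]].

map2_df now does everything the for loop above did, and returns the predictions and accuracies of all models in a single data frame.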

class(results)
[1] "tbl_df"     "tbl"        "data.frame" 

dim(results)
[1] 2 109

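Each column of results is a list. The 109 columns are the results of the 109 models.

To access the results of each model (in this case the prediction and the accuracy), use results$columnName or results[[columnNumber]].

If you also want to store the models, just change the return statement in customFunction to include the model: return(list(model, prediction, accuracy))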

Answer 1: (score: 1)

You can use plyr to collect the results in a list:

results <- plyr::llply(1:length(TimeSlices$train), function(i){
  plsFitTime <- train(Class~.,
                      data = DF[TimeSlices$train[[i]],],
                      method = "pls")

  testData <- DF[TimeSlices$test[[i]],]
  Prediction <- predict(plsFitTime, testData)

  list(index = i, model = plsFitTime, prediction = Prediction)
})

# The model created for slice no. 3
results[[3]]$model

# ... and its predictions
results[[3]]$prediction

If needed, you can also compute the accuracy inside the function passed to llply.
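
For example, accuracy can be computed as the share of predictions that match the observed class, which is the quantity confusionMatrix(...)$overall["Accuracy"] reports. A minimal sketch with made-up vectors; `observed` and `predicted` stand in for testData$Class and Prediction, and the index value is arbitrary:

```r
# Hypothetical observed and predicted classes for one slice
observed  <- factor(c("UP", "DOWN", "UP", "UP"),   levels = c("DOWN", "UP"))
predicted <- factor(c("UP", "UP",   "UP", "DOWN"), levels = c("DOWN", "UP"))

# Share of matching predictions
accuracy <- mean(predicted == observed)

# The element returned for this slice could then carry the accuracy as well
slice_result <- list(index = 3, prediction = predicted, accuracy = accuracy)
slice_result$accuracy
# 0.5
```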