How to implement a support vector machine in R

Asked: 2015-12-30 11:55:36

Tags: r machine-learning svm

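I am new to machine learning (and not a mathematician), and I am learning ML from videos and books. I have a basic understanding of algorithms such as naive Bayes, SVM, and decision trees, and I use ML to model daily stock market returns. I want to use a non-linear regression algorithm, so I chose support vector regression because it is popular. I use the trading day and the EMA difference as the feature vector (X) and the price change as the label (Y). Here is my code: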

library("quantmod")
#Adding libraries
library("lubridate")
#Makes it easier to work with the dates 
library("e1071")
#Gives us access to the svm
stockData <- new.env()
tickers <- 'AAPL'
startDate = as.Date("2015-11-01")
# The beginning of the date range we want to look at 


symbol = getSymbols(tickers,from=startDate, auto.assign=F)
# Retrieving Apple’s daily OHLCV from Yahoo Finance 
DayofWeek<-wday(symbol, label=TRUE)
#Find the day of the week 
Class<- Cl(symbol) - Op(symbol)
#price change
EMA5<-EMA(Cl(symbol),n = 5)
#We are calculating a 5-period EMA off the close price

EMA10<-EMA(Cl(symbol),n = 10)
#Then the 10-period EMA, also off the close price 
EMACross <- EMA5 - EMA10
#Positive values correspond to the 5-period EMA being above the 10-period EMA 

EMACross<-round(EMACross,2)


DataSet2<-data.frame(DayofWeek,EMACross, Class)
DataSet2<-DataSet2[-c(1:10),]
#We need to remove the instances where the 10-period moving average is still being calculated
m<-nrow(DataSet2)
n<-round((nrow(DataSet2)*2)/3)
TrainingSet<-DataSet2[1:n,]
#We will use ⅔ of the data to train the model
TestSet<-DataSet2[(n+1):m,]
#And ⅓ to test it on unseen data 
EMACrossModel<-svm( Cl(symbol) ~ ., data=TrainingSet) 
summary(EMACrossModel)
pred<-predict(EMACrossModel,TestSet[,-3])

When I run the code above, I get this error:

> EMACrossModel<-svm( Cl(symbol) ~ ., data=TrainingSet) 
Error in model.frame.default(formula = Cl(symbol) ~ ., data = TrainingSet,  : 
  variable lengths differ (found for 'DayofWeek')

So my questions are (forgive me, I have more than one):

1) How can I solve the problem above?

2) Can I use qualitative (e.g. Mon, Tue, Wed) and quantitative (e.g. 1.0, 0.1, 100) data together in SVM regression?

3) How can I plot the results above with the SVM decision boundaries?

EDITED

DataSet2

          DayofWeek   EMA AAPL.Close
2015-11-16       Mon -2.77   2.800003
2015-11-17      Tues -2.51  -1.229996
2015-11-18       Wed -1.67   1.529999
2015-11-19     Thurs -0.89   1.140000
2015-11-20       Fri -0.32   0.100006
2015-11-23       Mon -0.23  -1.519997
2015-11-24      Tues  0.00   1.549995
2015-11-25       Wed  0.00  -1.180000
2015-11-27       Fri -0.03  -0.480003
2015-11-30       Mon  0.02   0.310005
2015-12-01      Tues -0.09  -1.410004
2015-12-02       Wed -0.31  -1.059997
2015-12-03     Thurs -0.57  -1.350006
2015-12-04       Fri -0.10   3.739998
2015-12-07       Mon  0.05  -0.700004
2015-12-08      Tues  0.12   0.710006
2015-12-09       Wed -0.24  -2.019996
2015-12-10     Thurs -0.35   0.129997
2015-12-11       Fri -0.83  -2.010002
2015-12-14       Mon -1.15   0.300003
2015-12-15      Tues -1.56  -1.450004
2015-12-16       Wed -1.56   0.269996
2015-12-17     Thurs -1.82  -3.039994
2015-12-18       Fri -2.30  -2.880005
2015-12-21       Mon -2.23   0.050003
2015-12-22      Tues -2.07  -0.169999
2015-12-23       Wed -1.64   1.340004
2015-12-24     Thurs -1.40  -0.970001
2015-12-28       Mon -1.37  -0.769996
2015-12-29      Tues -0.98   1.779999
2015-12-30       Wed -0.92  -1.260002

The modified code below runs, but gives a different answer from the one I expected.

These are the modifications:

EMACrossModel<-ksvm(  Cl(symbol[1:n]) ~ ., data=TrainingSet,kernel="rbfdot",C=10) #kernlab libraries

pred<-predict(EMACrossModel,TestSet)

Result:

> EMACrossModel
Support Vector Machine object of class "ksvm" 

SV type: eps-svr  (regression) 
 parameter : epsilon = 0.1  cost C = 10 

Gaussian Radial Basis kernel function. 
 Hyperparameter : sigma =  0.294836572886287 

Number of Support Vectors : 17 

Objective Function Value : -49.1082 
Training error : 0.138329 

> pred
          [,1]
 [1,] 119.7267
 [2,] 119.9733
 [3,] 120.7236
 [4,] 121.8324
 [5,] 121.5632
 [6,] 121.4652
 [7,] 119.6438
 [8,] 119.6962
 [9,] 119.0775
[10,] 116.4956

I expected the predictions to look like this:

     [,1]
-1.327996
1.229939
-1.130000
0.100006
-1.519997
-0.480003
 1.310005
-1.410004
-1.059997
1.350006
-2.739998
1.700004

My guess is that my current code treats the stock price, rather than the price change, as Y and uses it to fit EMACrossModel. Am I right? If so, how can I fix this?

1 answer:

Answer 0 (score: 2)

Regarding question one: you formed your TrainingSet by removing some of the data. However, you did not restrict symbol in the same way:

 EMACrossModel<-svm( Cl(symbol[1:n]) ~ ., data=TrainingSet)

I just realized that what you more likely want is:

 EMACrossModel<-svm( AAPL.Close ~ ., data=TrainingSet) 

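In general, the formula, here Cl(symbol[1:n]) ~ ., defines what is learned: the left-hand side is the response. At the moment that is "symbol". However, I assume you want to predict the AAPL.Close column. Formulas are a general concept in R (https://stat.ethz.ch/R-manual/R-devel/library/stats/html/formula.html), and it is worth investing a little time to understand them.

Edit: Based on your comment above, this seems to be confirmed. The result is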

-0.1926745  
0.3578645  
0.1830046  
0.6362871 
-0.3760084 
-0.1443156  
0.2615674  
0.2589130 
-0.4779677 
-0.5928780 

End of edit
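The point about formulas can be reproduced with base R alone. This is a minimal sketch, not the asker's actual data: the data frame full, the vector outside (standing in for Cl(symbol)) and the subset train (standing in for TrainingSet) are made-up names and values. It shows that a response taken from an object outside the data keeps its full length and triggers the error, while a response named as a column does not.

```r
# Synthetic stand-in for DataSet2 (all values are made up for illustration)
set.seed(1)
full <- data.frame(
  DayofWeek  = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), times = 6)),
  EMA        = round(rnorm(30), 2),
  AAPL.Close = rnorm(30)
)
outside <- full$AAPL.Close  # plays the role of Cl(symbol): 30 values
train   <- full[1:20, ]     # plays the role of TrainingSet: 20 rows

# Response looked up outside the data: 30 values against 20-row predictors
bad <- tryCatch(
  model.frame(outside ~ ., data = train),
  error = function(e) conditionMessage(e)
)
print(bad)

# Response named as a column of the data: every variable has 20 rows
good <- model.frame(AAPL.Close ~ ., data = train)
print(dim(good))
```

Naming the column also makes sure the response is the price change stored in the data frame rather than the raw closing price.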

Regarding question two: it depends on the implementation (and the kernel), but it appears that you can.
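As a hedged illustration of mixing both kinds of predictors: with the formula interface of e1071's svm, a factor column is expanded into indicator columns before fitting, so it can sit next to a numeric column. The data frame toy, the column Change and the coefficients used to generate it are invented for this sketch and have nothing to do with real AAPL prices.

```r
library(e1071)

# Synthetic data: one qualitative and one quantitative predictor
set.seed(42)
toy <- data.frame(
  DayofWeek = factor(rep(c("Mon", "Tues", "Wed", "Thurs", "Fri"), times = 12)),
  EMA       = round(rnorm(60), 2)
)
# Made-up response: depends on EMA, with a shift on Mondays, plus noise
toy$Change <- 0.5 * toy$EMA + ifelse(toy$DayofWeek == "Mon", 1, 0) +
  rnorm(60, sd = 0.2)

# A numeric response makes svm() fit a regression by default
fit <- svm(Change ~ DayofWeek + EMA, data = toy)

# New observations must use the same factor levels as the training data
new_days <- data.frame(
  DayofWeek = factor(c("Mon", "Fri"), levels = levels(toy$DayofWeek)),
  EMA       = c(0.10, -0.30)
)
p <- predict(fit, new_days)
print(p)
```

Whether treating weekdays as unordered categories helps the model is a separate question; the sketch only shows that the mixed input is accepted.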

Regarding your third question: the e1071 package includes an example:

data(cats, package = "MASS")
m <- svm(Sex~., data = cats)
plot(m, cats)

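Edit: I just realized that this plot function only works for classifiers, not for regression. However, you can easily build your own plotting function. For simplicity, I first convert the day of the week to a number.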

  DataSet2$DayofWeek <- as.numeric(DataSet2$DayofWeek)

and rebuild the model. After that, you can display its predictions with:

### plot the results of the support vector machine by
# first generating a grid covering the data range

# generate a sequence of 100 numbers between the minimum and maximum of DataSet2$EMA
plot.ema.vec <- seq(min(DataSet2$EMA), max(DataSet2$EMA), length.out = 100)
# generate a "grid" of artificial data points; 1:7 are the weekdays
# (can be replaced by c("Mon",...,"Sun") if DayofWeek is kept as a factor)
datagrid <- expand.grid(1:7, plot.ema.vec)
# name the grid columns like the dataset so that the model can use the grid as input
names(datagrid) <- names(DataSet2)[1:2]
# calculate the model's predictions on the grid
grid.pred <- predict(EMACrossModel, datagrid)
# normalise the predictions to the [0,1] range to use them as colours
cols <- (grid.pred - min(grid.pred)) / (max(grid.pred) - min(grid.pred))
# plot the grid, coloured from red (lowest prediction) to blue (highest)
plot(datagrid$DayofWeek, datagrid$EMA, col = rgb(blue = cols, red = 1 - cols, green = 0))