Implementing XGBoost on the Methyl450k dataset in R

Asked: 2019-03-17 21:45:29

Tags: r classification xgboost

I am trying to implement XGBoost on the Methyl450k dataset. The data contains roughly 480,000+ specific CpG sites, each with a corresponding beta value between 0 and 1. Here is a sample of the data (10 columns, with the response):

   cg13869341 cg14008030 cg12045430 cg20826792 cg00381604 cg20253340 cg21870274 cg03130891 cg24335620 cg16162899 response
1   0.8612869  0.6958909 0.07918330 0.10816711 0.03484078  0.4875475  0.7475878 0.11051578  0.7120003  0.8453396        0
2   0.8337106  0.6276754 0.09811698 0.08934333 0.03348864  0.6300766  0.7753453 0.08652890  0.6465146  0.8137132        0
3   0.8516102  0.6575332 0.13310207 0.07990076 0.04195286  0.4325115  0.7257208 0.14334007  0.7384455  0.8054013        0
4   0.8970384  0.6955810 0.08134887 0.08950676 0.03578006  0.4711689  0.7214661 0.08299838  0.7718571  0.8151683        0
5   0.8562323  0.7204416 0.08078766 0.14902533 0.04274820  0.4769631  0.8034706 0.16473891  0.7143823  0.8475410        0
6   0.8613325  0.6527599 0.10158672 0.15459204 0.04839691  0.4805285  0.8004808 0.12598627  0.8218743  0.8222552        0
7   0.9168869  0.5963966 0.11457045 0.13245761 0.03720798  0.5067649  0.6806004 0.13601034  0.7063457  0.8509160        0
8   0.9002366  0.6898320 0.07029171 0.07158694 0.03875135  0.7065322  0.8167016 0.15394095  0.7226098  0.8310477        0
9   0.8876504  0.6172154 0.13511072 0.15276686 0.06149520  0.5642073  0.7177438 0.14752285  0.6846876  0.8360360        0
10  0.8992898  0.6361644 0.15423780 0.19111275 0.05700406  0.4941239  0.7819968 0.10109936  0.6680640  0.8504023        0
11  0.8997905  0.5906462 0.10411472 0.15006796 0.04157008  0.4931531  0.7857664 0.13430963  0.6946644  0.8326747        0
12  0.9009607  0.6721858 0.09081460 0.11057752 0.05824153  0.4683763  0.7655608 0.01755990  0.7113345  0.8346149        0
13  0.9036750  0.6313643 0.07477824 0.12089404 0.04738597  0.5502747  0.7520128 0.16332395  0.7036665  0.8564414        0
14  0.8420276  0.6265071 0.15351674 0.13647090 0.04901864  0.5037902  0.7446693 0.10534171  0.7727812  0.8317943        0
15  0.8995276  0.6515500 0.09214429 0.08973162 0.04231420  0.5071999  0.7484940 0.21822470  0.6859165  0.7775508        0
16  0.9071643  0.7945852 0.15809474 0.11264440 0.04793316  0.5256078  0.8425513 0.17150603  0.7581367  0.8271037        0
17  0.8691358  0.6206902 0.11868549 0.15944891 0.03523320  0.4581166  0.8058461 0.11557264  0.6960848  0.8579109        1
18  0.8330247  0.7030860 0.12832663 0.12936172 0.03534059  0.4687507  0.7630222 0.12176819  0.7179690  0.8775521        1
19  0.9015574  0.6592869 0.12693119 0.14671845 0.03819418  0.4395692  0.7420882 0.10293369  0.7047038  0.8435531        1
20  0.8568249  0.6762936 0.18220218 0.10123198 0.04963466  0.5781550  0.6324743 0.06676272  0.6805745  0.8291353        1
21  0.8799152  0.6736554 0.15056617 0.16070673 0.04944037  0.4015415  0.4587438 0.10392791  0.7467060  0.7396137        1
22  0.8730770  0.6663321 0.10802390 0.14481460 0.04448009  0.5177664  0.6682854 0.16747621  0.7161234  0.8309462        1
23  0.9359656  0.7401368 0.16730300 0.11842173 0.03388908  0.4906018  0.5730439 0.15970761  0.7904663  0.8136450        1
24  0.9320397  0.6978085 0.10474803 0.10607080 0.03268366  0.5362214  0.7832729 0.15564091  0.7171350  0.8511477        1
25  0.8444256  0.7516799 0.16767449 0.12025258 0.04426417  0.5040725  0.6950104 0.16010829  0.7026808  0.8800469        1
26  0.8692707  0.7016945 0.10123979 0.09430876 0.04037325  0.4877716  0.7053603 0.09539885  0.8316933  0.8165352        1
27  0.8738410  0.6230674 0.12793232 0.14837137 0.04878595  0.4335648  0.6547601 0.13714725  0.6944921  0.8788708        1
28  0.9041870  0.6201079 0.12490195 0.16227251 0.04812720  0.4845896  0.6619842 0.13093443  0.7415606  0.8479339        1
29  0.8618622  0.7060291 0.09453812 0.14068246 0.04799782  0.5474036  0.6088231 0.23338428  0.6772588  0.7795908        1
30  0.8776350  0.7132561 0.12100425 0.17367148 0.04399987  0.5661632  0.6905305 0.12971867  0.6788903  0.8198201        1
31  0.9134456  0.7249370 0.07144695 0.08759897 0.04864476  0.6682650  0.7445900 0.16374150  0.7322691  0.8071598        1
32  0.8706637  0.6743936 0.15291891 0.11422262 0.04284591  0.5268217  0.7207478 0.14296945  0.7574967  0.8609048        1
33  0.8821504  0.6845216 0.12004074 0.14009196 0.05527732  0.5677475  0.6379840 0.14122421  0.7090634  0.8386022        1
34  0.9061180  0.5989445 0.09160787 0.14325261 0.05142950  0.5399465  0.6718870 0.08454002  0.6709083  0.8264233        1
35  0.8453511  0.6759766 0.13345672 0.16310764 0.05107034  0.4666146  0.7343603 0.12733287  0.7062292  0.8471812        1
36  0.9004188  0.6114532 0.11837118 0.14667433 0.05050403  0.4975502  0.7258132 0.14894363  0.7195090  0.8382364        1
37  0.9051729  0.6652954 0.15153241 0.14571184 0.05026702  0.4855397  0.7226850 0.12179138  0.7430388  0.8342340        1
38  0.9112012  0.6314450 0.12681305 0.16328649 0.04076789  0.5382251  0.7404122 0.13971506  0.6607798  0.8657917        1
39  0.8407927  0.7148585 0.12792107 0.15447060 0.05287096  0.6798039  0.7182050 0.06549068  0.7433669  0.7948445        1
40  0.8554747  0.7356683 0.22698080 0.21692162 0.05365043  0.4496654  0.7353112 0.13341649  0.8032266  0.7883068        1
41  0.8535359  0.5729331 0.14392737 0.16612463 0.04651752  0.5228045  0.7397588 0.09967424  0.7906682  0.8384434        1
42  0.8059968  0.7148594 0.16774123 0.19006840 0.04990847  0.5929818  0.7011064 0.17921090  0.8121909  0.8481069        1
43  0.8856906  0.6987405 0.19262137 0.18327412 0.04816967  0.4340002  0.6569263 0.13724290  0.7600389  0.7788117        1
44  0.8888717  0.6760166 0.17025712 0.21906969 0.04812641  0.4173613  0.7927178 0.17458413  0.6806101  0.8297604        1
45  0.8691575  0.6682723 0.11932277 0.13669098 0.04014911  0.4680455  0.6186511 0.10002737  0.8012731  0.7177891        1
46  0.9148742  0.7797494 0.13313955 0.15166151 0.03934042  0.4818276  0.7484973 0.16354624  0.6979735  0.8164431        1
47  0.9226736  0.7211714 0.08036409 0.10395457 0.04063595  0.4014187  0.8026643 0.17762644  0.7194800  0.8156545        1

I am trying to implement the algorithm in R, but I keep running into errors.

Attempt:

> train <- beta_values1_updated[training1, ]
> test <- beta_values1_updated[-training1, ]
> labels <- train$response
> ts_label <- test$response
> new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
Error in `[.data.frame`(train, , -c("response"), with = F) : 
  unused argument (with = F)
> new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
Error in `[.data.frame`(test, , -c("response"), with = F) : 
  unused argument (with = F)
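(For context: the `with = F` argument only exists for data.table objects, which is why a plain data.frame raises the "unused argument" error above. On a base-R data.frame the response column can be dropped by name instead. A minimal sketch on made-up stand-in data, since the real `beta_values1_updated` object isn't shown here:)

```r
# Base-R alternative: drop the "response" column by name before model.matrix()
train_df <- data.frame(cg13869341 = c(0.86, 0.83, 0.85),
                       cg14008030 = c(0.70, 0.63, 0.66),
                       response   = c(0, 0, 1))

# setdiff() keeps every column except "response"
new_tr <- model.matrix(~ . + 0, data = train_df[, setdiff(names(train_df), "response")])
dim(new_tr)  # one row per sample, one column per predictor
```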

I am trying to follow the tutorial here:

https://www.hackerearth.com/practice/machine-learning/machine-learning-algorithms/beginners-tutorial-on-xgboost-parameter-tuning-r/tutorial/

Any insight into how to implement the XGBoost algorithm correctly would be greatly appreciated.

EDIT:

I am adding more code to show the problems I ran into while following the tutorial:

train<-data.table(train)
test<-data.table(test)
new_tr <- model.matrix(~.+0,data = train[,-c("response"),with=F])
new_ts <- model.matrix(~.+0,data = test[,-c("response"),with=F])
#convert factor to numeric 
labels <- as.numeric(labels)-1
ts_label <- as.numeric(ts_label)-1
#preparing matrix 
dtrain <- xgb.DMatrix(data = new_tr,label = labels)
dtest <- xgb.DMatrix(data = new_ts,label=ts_label)
params <- list(booster = "gbtree", objective = "binary:logistic", eta=0.3, gamma=0, max_depth=6, min_child_weight=1, subsample=1, colsample_bytree=1)
xgbcv <- xgb.cv( params = params, data = dtrain, nrounds = 100, nfold = 5, showsd = T, stratified = T, print.every.n = 10, early.stop.round = 20, maximize = F)
[1] train-error:0.000000+0.000000   test-error:0.000000+0.000000 
Multiple eval metrics are present. Will use test_error for early stopping.
Will train until test_error hasn't improved in 20 rounds.

[11]    train-error:0.000000+0.000000   test-error:0.000000+0.000000 
[21]    train-error:0.000000+0.000000   test-error:0.000000+0.000000 
Stopping. Best iteration:
[1] train-error:0.000000+0.000000   test-error:0.000000+0.000000

Warning messages:
1: 'print.every.n' is deprecated.
Use 'print_every_n' instead.
See help("Deprecated") and help("xgboost-deprecated"). 
2: 'early.stop.round' is deprecated.
Use 'early_stopping_rounds' instead.
See help("Deprecated") and help("xgboost-deprecated").
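(As the warnings say, those two argument names were renamed in newer xgboost releases. A sketch of the same cross-validation call with the current names, using a tiny synthetic stand-in for `dtrain` so the snippet is self-contained; the real call would use the DMatrix built from `new_tr` above:)

```r
library(xgboost)  # assumes the xgboost package is installed

# Tiny synthetic stand-in for dtrain, just to make the call runnable
set.seed(1)
x <- matrix(runif(200), nrow = 20)
y <- rep(0:1, each = 10)
dtrain <- xgb.DMatrix(data = x, label = y)

params <- list(booster = "gbtree", objective = "binary:logistic",
               eta = 0.3, max_depth = 6)

# Same xgb.cv call as in the tutorial, with the renamed arguments
xgbcv <- xgb.cv(params = params, data = dtrain,
                nrounds = 10, nfold = 5,
                showsd = TRUE, stratified = TRUE,
                print_every_n = 5,          # was print.every.n
                early_stopping_rounds = 5,  # was early.stop.round
                maximize = FALSE)
```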

1 Answer:

Answer 0 (score: 0)

The author of the tutorial is using the data.table package. As you can read here, `with = F` is sometimes used to select individual columns. Make sure data.table and the other required packages are installed and loaded in order to follow the tutorial. Also, make sure your dataset is a data.table object.
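Concretely, wrapping the split data in `data.table()` before the subsetting step makes the tutorial's `with = F` call valid. A rough sketch on toy stand-in data (the real `train` would come from splitting `beta_values1_updated`):

```r
library(data.table)  # assumes data.table is installed

# Toy stand-in for the training split
set.seed(1)
train <- data.frame(cg13869341 = runif(5), response = rbinom(5, 1, 0.5))

train <- data.table(train)  # now the [, -c("response"), with = FALSE] syntax works
new_tr <- model.matrix(~ . + 0, data = train[, -c("response"), with = FALSE])
```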