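I am trying to run a linear regression on a data set of books, using all of the attributes to predict the rating. Below is how I formatted my data in Excel before exporting the file to CSV and loading it into WEKA: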
Book Author Genre Publisher Year Rating
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
1 1 5 1 2008 5
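I did this for a list of 25 books, for a total of 2431 instances. In WEKA, I converted the first four attributes with the NumericToNominal filter and then chose the LinearRegression function. Here are my results: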
Scheme:weka.classifiers.functions.LinearRegression -S 0 -R 1.0E-8
Relation: Books WEKA-weka.filters.unsupervised.attribute.NumericToNominal-Rfirst-4
Instances: 2430
Attributes: 6
Book
Author
Genre
Publisher
Year
Rating
Test mode:10-fold cross-validation
=== Classifier model (full training set) ===
Linear Regression Model
Rating =
0.2267 * Book=18,15,25,13,8,24,20,17,16,19,11,4,6,21,3,7,23,12,1,9,10,14,2 +
0.4458 * Book=8,24,20,17,16,19,11,4,6,21,3,7,23,12,1,9,10,14,2 +
-0.1527 * Book=24,20,17,16,19,11,4,6,21,3,7,23,12,1,9,10,14,2 +
-0.314 * Book=20,17,16,19,11,4,6,21,3,7,23,12,1,9,10,14,2 +
0.6751 * Book=19,11,4,6,21,3,7,23,12,1,9,10,14,2 +
0.475 * Book=4,6,21,3,7,23,12,1,9,10,14,2 +
-0.4018 * Book=3,7,23,12,1,9,10,14,2 +
0.2522 * Book=7,23,12,1,9,10,14,2 +
-0.4505 * Book=23,12,1,9,10,14,2 +
-0.2583 * Book=12,1,9,10,14,2 +
0.4949 * Book=10,14,2 +
-0.3875 * Author=1,6,2,4,11,12,9,3,13,10,15 +
-0.7318 * Author=6,2,4,11,12,9,3,13,10,15 +
0.594 * Author=2,4,11,12,9,3,13,10,15 +
0.379 * Author=4,11,12,9,3,13,10,15 +
0.6818 * Author=11,12,9,3,13,10,15 +
0.4396 * Author=12,9,3,13,10,15 +
1.0057 * Author=9,3,13,10,15 +
-1.4347 * Author=3,13,10,15 +
-0.4547 * Author=13,10,15 +
0.3638 * Author=10,15 +
-0.4921 * Author=15 +
0.2706 * Genre=7,5,2,1,6,4,8 +
-0.4036 * Genre=5,2,1,6,4,8 +
-0.7927 * Genre=2,1,6,4,8 +
-0.4448 * Genre=1,6,4,8 +
0.5731 * Genre=6,4,8 +
0.5519 * Genre=8 +
0.4517 * Publisher=21,9,8,2,20,10,3,22,5,11,1,18 +
-0.4474 * Publisher=2,20,10,3,22,5,11,1,18 +
-0.3018 * Publisher=10,3,22,5,11,1,18 +
0.474 * Publisher=5,11,1,18 +
0.6567 * Publisher=1,18 +
-0.492 * Publisher=18 +
3.5816
Time taken to build model: 0.28 seconds
=== Cross-validation ===
=== Summary ===
Correlation coefficient 0.2415
Mean absolute error 0.7883
Root mean squared error 0.9772
Relative absolute error 98.4114 %
Root relative squared error 97.0741 %
Total Number of Instances 2430
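Instead of showing a single term per attribute, the model shows several terms for each one, and as you can see, the error rates are very high. Is there something wrong with the way I am supplying the data that is causing this?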
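To make concrete what I mean by "several terms per attribute": the terms in the output look like nested sets of values (each `Book=...` list is a suffix of the one before it). That pattern would be consistent with a nominal attribute being expanded into binary indicators, with its values ordered by their average rating. The following is only a minimal Python sketch of that idea under my own assumptions, not WEKA's actual code; the function name `cumulative_indicators` and the toy genre/rating values are hypothetical.

```python
from collections import defaultdict

def cumulative_indicators(values, targets):
    """Encode a nominal column as k-1 binary columns.

    Values are ordered by their mean target; column i is 1 when the
    row's value belongs to the suffix ordered[i:] of that ordering.
    (Assumed scheme for illustration only.)
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for v, t in zip(values, targets):
        sums[v] += t
        counts[v] += 1
    # Sort the distinct nominal values by ascending mean target.
    ordered = sorted(sums, key=lambda v: sums[v] / counts[v])
    columns = {}
    for i in range(1, len(ordered)):
        members = ordered[i:]
        name = ",".join(str(m) for m in members)
        columns[name] = [1 if v in members else 0 for v in values]
    return ordered, columns

# Hypothetical toy data: three genres, six ratings.
genres = [5, 5, 2, 2, 7, 7]
ratings = [5, 4, 3, 2, 1, 1]
ordered, columns = cumulative_indicators(genres, ratings)
for name, col in columns.items():
    print(f"Genre={name}", col)
```

If that reading is right, an attribute with k distinct values can contribute up to k-1 terms, which would explain why `Book`, `Author`, `Genre`, and `Publisher` each appear on many lines while the model has no separate single coefficient per attribute.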