I train myself with exercises and examples from classic statistics books, and I port what I learn to Apache Spark to make sure I can reproduce the results.
In one chapter of such a book, the author runs a PCA on two variables, the HiCi and SCi values from a ranking of American universities, and extracts their principal component.
The example starts from these values, with a hundred individuals:
| University | X1 (HiCi) | X2 (SCi) |
|------------|-----------|----------|
| Harvard    | 100       | 100      |
| Stanford   | 86.1      | 70.3     |
| Berkeley   | 67.9      | 69.2     |
| Cambridge  | 54.0      | 65.4     |
| M.I.T.     | 65.9      | 61.7     |
| ...        | ...       | ...      |
Then the author standardizes these values:
| University | X1* (HiCi) | X2* (SCi) |
|------------|------------|-----------|
| Harvard    | 3.70       | 3.19      |
| Stanford   | 2.74       | 0.81      |
| Berkeley   | 1.48       | 0.73      |
| Cambridge  | 0.51       | 0.42      |
| M.I.T.     | 1.34       | 0.13      |
| ...        | ...        | ...       |
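Standardizing here means replacing each value x by its z-score, (x − mean) / std. As a plain-Java illustration of that step (my own sketch, assuming the sample standard deviation with the n − 1 denominator, which is also what Spark's StandardScaler uses):

import java.util.Arrays;

/** Standardize a column: z = (x - mean) / std, with std using the n - 1 denominator. */
static double[] zScores(double[] column) {
   double mean = Arrays.stream(column).average().orElse(0.0);
   double sumOfSquares = Arrays.stream(column).map(x -> (x - mean) * (x - mean)).sum();
   double std = Math.sqrt(sumOfSquares / (column.length - 1));
   double[] z = new double[column.length];

   for (int i = 0; i < column.length; i++) {
      z[i] = (column[i] - mean) / std;
   }

   return z;
}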
and then runs the PCA, finding these values for the first principal component:
| University | X1* (HiCi) | X2* (SCi) | Principal component |
|------------|------------|-----------|---------------------|
| Harvard    | 3.70       | 3.19      | 4.87                |
| Stanford   | 2.74       | 0.81      | 2.51                |
| Berkeley   | 1.48       | 0.73      | 1.56                |
| Cambridge  | 0.51       | 0.42      | 0.66                |
| M.I.T.     | 1.34       | 0.13      | 1.04                |
| ...        | ...        | ...       | ...                 |
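With only two standardized variables, the first principal component has a closed form: the correlation matrix is [[1, r], [r, 1]], whose leading eigenvector (for r > 0) is (1, 1)/√2, so PC1 = (x1* + x2*)/√2. A quick check of the book's column against that formula (my own verification snippet, not from the book):

// First principal component of two standardized variables: (x1* + x2*) / sqrt(2).
double[][] standardized = { {3.70, 3.19}, {2.74, 0.81}, {1.48, 0.73}, {0.51, 0.42}, {1.34, 0.13} };

for (double[] row : standardized) {
   double pc1 = (row[0] + row[1]) / Math.sqrt(2);
   System.out.printf("%.2f%n", pc1);   // prints 4.87, 2.51, 1.56, 0.66, 1.04, matching the book
}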
To reproduce this with Apache Spark 2.4.4, I wrote this test:
import java.util.ArrayList;
import java.util.List;

import org.apache.spark.ml.feature.PCA;
import org.apache.spark.ml.feature.StandardScaler;
import org.apache.spark.ml.linalg.Vector;
import org.apache.spark.ml.linalg.VectorUDT;
import org.apache.spark.ml.linalg.Vectors;
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.RowFactory;
import org.apache.spark.sql.types.Metadata;
import org.apache.spark.sql.types.StructField;
import org.apache.spark.sql.types.StructType;
import org.junit.jupiter.api.DisplayName;
import org.junit.jupiter.api.Test;

/**
 * Extraction of a principal component.
 */
@Test
@DisplayName("PCA: the ranking of American universities")
public void composantePrincipale() {
   // The ranking (HiCi and SCi variables of the American universities).
   Dataset<Row> dsHiCiSCi = denseRowWithVectors(
      d(100, 100), d(86.1, 70.3), d(67.9, 69.2), d(54.0, 65.4), d(65.9, 61.7),
      d(58.4, 50.3), d(56.5, 69.6), d(59.3, 46.5), d(50.8, 54.1), d(46.3, 65.4),
      d(57.9, 63.2), d(54.5, 65.1), d(57.4, 75.9), d(59.3, 64.6), d(59.9, 70.8),
      d(52.4, 74.1), d(52.9, 67.2), d(54.0, 59.8), d(41.3, 67.9), d(41.9, 80.9),
      d(60.7, 77.1), d(35.1, 68.6), d(40.6, 62.2), d(39.2, 77.6), d(38.5, 63.2),
      d(44.5, 57.6), d(35.5, 38.4), d(39.2, 53.4), d(46.9, 57.0), d(41.3, 53.9),
      d(27.7, 23.2), d(46.9, 62.0), d(48.6, 67.0), d(39.9, 45.7), d(42.6, 42.7),
      d(31.4, 63.1), d(40.6, 53.3), d(46.9, 54.8), d(23.4, 54.2), d(30.6, 38.0),
      d(31.4, 51.0), d(27.7, 56.6), d(45.1, 58.0), d(46.9, 64.2), d(35.5, 48.9),
      d(25.7, 51.7), d(39.9, 44.8), d(24.6, 56.9), d(39.9, 65.6), d(37.1, 52.7));

   // Show the universities' values.
   dsHiCiSCi.show(100, false);

   // Center and scale the values.
   StandardScaler scaler = new StandardScaler()
      .setInputCol("features").setOutputCol("scaledFeatures")
      .setWithStd(true).setWithMean(true);
   Dataset<Row> transformedDF = scaler.fit(dsHiCiSCi).transform(dsHiCiSCi);
   Dataset<Row> centreReduit = transformedDF.select("scaledFeatures");

   // From our two variables, we want to extract a single principal component.
   PCA acp = new PCA().setInputCol("scaledFeatures").setK(1);
   Dataset<Row> resultat = acp.fit(centreReduit).transform(centreReduit);
   resultat.show(100, false);
}
/**
 * Create a Dataset of Rows of dense vectors built from tuples of values.
 * @param listeUppletsValeurs list of n-tuples.
 * @return dense Dataset.
 */
protected Dataset<Row> denseRowWithVectors(double[]... listeUppletsValeurs) {
   List<Row> data = new ArrayList<>();

   for (double[] upplet : listeUppletsValeurs) {
      Vector vecteur = Vectors.dense(upplet);
      data.add(RowFactory.create(vecteur));
   }

   StructType schema = new StructType(new StructField[] {
      new StructField("features", new VectorUDT(), false, Metadata.empty()),
   });
   return this.session.createDataFrame(data, schema);
}

/** double... => double[] **/
protected double[] d(double... valeurs) {
   return valeurs;
}
The values come out as expected:
+-------------+
|features |
+-------------+
|[100.0,100.0]|
|[86.1,70.3] |
|[67.9,69.2] |
|[54.0,65.4] |
|[65.9,61.7] |
...
The standardization works, but the PCA is not what I expected: it comes out with the opposite sign.
+-------------------------------------------+------------------------+
|scaledFeatures |pca_a632d8c48010__output|
+-------------------------------------------+------------------------+
|[3.6421736835169782,3.156450874585141] |[-4.807353527775403] |
|[2.693904143528853,0.8064410737434008] |[-2.4751178396271087] |
|[1.4522850336163453,0.7194036737122256] |[-1.5356158115782794] |
|[0.5040154936282204,0.4187290190590739] |[-0.6524789022238617] |
|[1.3158433731863994,0.12596685531784674] |[-1.0195137897594777] |
...
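(For what it's worth, keeping the fitted model instead of chaining fit().transform() would let me print the loading vector Spark chose; `pc()` is the standard accessor on PCAModel:)

// Requires: import org.apache.spark.ml.feature.PCAModel;
// Variation on the last lines of the test: keep the model to inspect its loadings.
PCAModel modeleAcp = acp.fit(centreReduit);
System.out.println(modeleAcp.pc());   // the k = 1 column of loadings that PC1 is built from
Dataset<Row> resultat = modeleAcp.transform(centreReduit);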
Why this behavior?