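Spark 1.6 VectorAssembler unexpected results

Time: 2017-03-10 13:03:58

Tags: scala apache-spark apache-spark-mllib

I am trying to build label/features rows from a Spark DataFrame using VectorAssembler.

According to the Spark documentation, it should be as simple as this: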

    import org.apache.spark.ml.feature.VectorAssembler

    val incidentDF = sqlContext.sql("select `is_similar`, `cosine_similarity`,..... from some.table")

    // VectorAssembler: pack all relevant columns into a single vector column
    val assembler = new VectorAssembler()
    assembler.setInputCols(Array("cosine_similarity" /* , ..... */))
    assembler.setOutputCol("features")

    // Keep only the label and the assembled features
    val output = assembler.transform(incidentDF).select("is_similar", "features").withColumnRenamed("is_similar", "label")

However, I get unexpected results.

This:

+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+
|0         |0.21437323142813602  |0.08703882797784893         |0.23570226039551587   |0.10050378152592121          |0.10206207261596577    |0.0                           |1                   |1            |1               |1            |1            |1                  |1       |0.26373626373626374|0.012967453461681464       |0.007624195465949381              |0.014425347541872306        |0.008896738386617248               |0.022695267556861232         |0.0                                 |1                   |0.16838138468917166                       |0.15434287415564008                |0.3922322702763681                  |0.34874291623145787          |
|1         |0.5303300858899107   |0.5017452060042545          |0.5303300858899107    |0.5017452060042545           |0.5303300858899107     |0.5017452060042545            |1                   |1            |1               |1            |1            |1                  |1       |0.6870229007633588 |0.3534850108895589         |0.5857224407945156                |0.36079979664267925         |0.5853463384675868                 |0.36971703925333405          |0.5814734067275937                  |0                   |1.0                                       |0.9999999999999998                 |1.0                                 |0.9999999999999998           |
|0         |0.31754264805429416  |0.30151134457776363         |0.33541019662496846   |0.3344968040028363           |0.2867696673382022     |0.26111648393354675           |1                   |1            |0               |1            |1            |1                  |1       |0.41600000000000004|0.10867521883199269        |0.1920005048084368                |0.1322792942407786          |0.2477844869237889                 |0.11802058757911914          |0.16554971608261862                 |1                   |0.0                                       |0.01605611773109364                |0.0                                 |0.16666666666666666          |
|0         |0.16169041669088866  |0.0                         |0.1666666666666667    |0.0                          |0.09622504486493764    |0.0                           |1                   |1            |1               |1            |1            |1                  |1       |0.26666666666666666|0.012517205514308224       |0.0                               |0.012752837227090714        |0.0                                |0.021516657911501622         |0.0                                 |1                   |0.16838138468917166                       |0.15434287415564008                |0.3922322702763681                  |0.34874291623145787          |
|0         |0.2750456656690116   |0.1860521018838127          |0.2858309752375147    |0.19611613513818402          |0.223606797749979      |0.1386750490563073            |1                   |1            |1               |1            |1            |1                  |1       |0.34862385321100914|0.06278282792172384        |0.09178430436891666               |0.06694373400084344         |0.08253907697526759                |0.07508140721703477          |0.10856631569349082                 |1                   |0.3014783135305502                        |0.25688979598845174                |0.5590169943749475                  |0.47628967220784013          |
|0         |0.2449489742783178   |0.19810721293758182         |0.26352313834736496   |0.2307692307692308           |0.21629522817435007    |0.16012815380508716           |1                   |1            |0               |1            |1            |1                  |1       |0.4838709677419355 |0.12209521675839743        |0.19126420671254496               |0.1475066405521753          |0.2459312750965279                 |0.1242978535834829           |0.1886519686826469                  |1                   |0.0                                       |0.01605611773109364                |0.0                                 |0.16666666666666666          |
|0         |0.08320502943378437  |0.09642365197998375         |0.11952286093343938   |0.13912166872805048          |0.0                    |0.0                           |0                   |0            |0               |1            |0            |0                  |1       |0.12               |0.04035362208133099        |0.04456121367953338               |0.04819698770773715         |0.0538656145326838                 |0.0                          |0.0                                 |8                   |0.05825659037076343                       |0.05246835256923818                |0.112089707663561                   |0.11278230910134424          |
|0         |0.20784609690826525  |0.1846372364689991          |0.26111648393354675   |0.24806946917841688          |0.0                    |0.0                           |0                   |0            |0               |1            |0            |1                  |1       |0.0                |0.07233915683015167        |0.0716540790026919                |0.08229370516713722         |0.08299754342027771                |0.0                          |0.0                                 |6                   |0.04977054860197747                       |0.06558734556106822                |0.09607689228305229                 |0.21759706994462227          |
|1         |0.8926577981869824   |0.9066143160193102          |0.914335372996105     |0.9226517385233938           |0.5477225575051661     |0.6324555320336759            |0                   |0            |0               |0            |0            |0                  |1       |0.5309734513274337 |0.8734996606615234         |0.8946928809168011                |0.8791317315987442          |0.8973856295754765                 |0.3496004425218079           |0.48223175160299564                 |0                   |0.0                                       |0.0                                |0.0                                 |0.0                          |
|1         |0.5185629788417315   |0.8432740427115678          |0.5118906968889915    |0.8819171036881969           |0.24253562503633297    |0.3333333333333333            |1                   |1            |0               |1            |1            |1                  |1       |0.09375            |0.18908955158360016        |0.8022196858263557                |0.17544355300115252         |0.8474955187144462                 |0.13927839835275616          |0.2838123484309787                  |6                   |0.0                                       |0.0                                |0.0                                 |0.0                          |
|1         |0.0                  |0.0                         |0.0                   |0.0                          |0.0                    |0.0                           |0                   |0            |1               |1            |0            |0                  |1       |0.14814814814814814|0.0                        |0.0                               |0.0                         |0.0                                |0.0                          |0.0                                 |1                   |0.02170244443925667                       |0.020410228072244255               |0.15062893357603016                 |0.28922903686544305          |
|0         |0.26860765467512676  |0.06271815075053182         |0.29515063885057      |0.07485976927589244          |0.0                    |0.0                           |0                   |0            |1               |1            |0            |0                  |1       |0.08               |0.04804110216570731        |0.03027143543580809               |0.05341183077151175         |0.03431607006581793                |0.0                          |0.0                                 |1                   |0.0                                       |0.022192268824097448               |0.0                                 |0.24019223070763074          |
|1         |0.33333333333333337  |0.40824829046386296         |0.33333333333333337   |0.40824829046386296          |0.33333333333333337    |0.40824829046386296           |0                   |0            |0               |1            |0            |1                  |1       |0.4516129032258064 |0.3310013083604027         |0.3537516145932176                |0.3444032278588375          |0.3667764454925114                 |0.3042153384207993           |0.3408010155297054                  |6                   |0.28297384452448776                       |0.23615630148525626                |0.2182178902359924                  |0.19245008972987526          |
|0         |0.0519174131651165   |0.0                         |0.0917662935482247    |0.0                          |0.0                    |0.0                           |0                   |0            |1               |1            |0            |0                  |1       |0.0967741935483871 |0.03050544547960052        |0.0                               |0.0490339271669166          |0.0                                |0.0                          |0.0                                 |5                   |0.0                                       |0.0                                |0.0                                 |0.0                          |
|0         |0.049160514400834666 |0.0                         |0.02627034687463669   |0.0                          |0.0                    |0.0                           |0                   |0            |0               |0            |0            |0                  |1       |0.1282051282051282 |0.006316709944109247       |0.0                               |0.003132143258557757        |0.0                                |0.0                          |0.0                                 |3                   |0.0                                       |0.019794166951004794               |0.0                                 |0.15638581054280606          |
|0         |0.07082882469748285  |0.0                         |0.08494119857293758   |0.0                          |0.0                    |0.0                           |0                   |0            |0               |1            |0            |1                  |1       |0.06060606060606061|0.004924318378089263       |0.0                               |0.005845759285912874        |0.0                                |0.0                          |0.0                                 |4                   |0.023119472246583003                      |0.010659666129102227               |0.03210289415620512                 |0.04420122177473814          |
|0         |0.1924976258772545   |0.038014296063485276        |0.19149207069693872   |0.02521364528296496          |0.0                    |0.0                           |0                   |0            |0               |1            |0            |1                  |1       |0.125              |0.020931167922971575       |0.00448818821863432               |0.02118543184402528         |0.0026553570889578286              |0.0                          |0.0                                 |5                   |0.02336541089352552                       |0.02401310014140845                |0.11919975664202526                 |0.10760330515353056          |
|1         |0.17095921484405754  |0.08434614994311695         |0.20073126386549828   |0.10085458113185984          |0.0                    |0.0                           |0                   |0            |1               |0            |0            |1                  |1       |0.07407407407407407|0.09182827200781651        |0.05443489342945772               |0.10010815165693956         |0.05842165588249673                |0.0                          |0.0                                 |8                   |0.2973721930047951                        |0.168690765981807                  |0.5637584095764486                  |0.48478000681923245          |
|0         |0.1405456737852613   |0.049147318718299055        |0.11846977555181847   |0.08333333333333333          |0.22360679774997896    |0.0                           |1                   |1            |1               |1            |1            |1                  |1       |0.08333333333333331|0.01937969263670974        |0.003427781939920998              |0.022922840542318093        |0.006443992956721386               |0.03572605281706383          |0.0                                 |5                   |0.26345546669165004                       |0.2557786050767472                 |0.405007416909787                   |0.45121260440202404          |
|1         |0.6793662204867575   |0.753778361444409           |0.5773502691896258    |0.6396021490668313           |0.5773502691896258     |0.8164965809277259            |0                   |0            |1               |1            |0            |0                  |1       |0.6875             |0.7466360531069871         |0.8217912018147824                |0.7034677645212848          |0.6620051533994062                 |0.469853400225108            |0.9321213932723664                  |6                   |0.0                                       |0.011793139853629018               |0.0                                 |0.14433756729740643          |
+----------+---------------------+----------------------------+----------------------+-----------------------------+-----------------------+------------------------------+--------------------+-------------+----------------+-------------+-------------+-------------------+--------+-------------------+---------------------------+----------------------------------+----------------------------+-----------------------------------+-----------------------------+------------------------------------+--------------------+------------------------------------------+-----------------------------------+------------------------------------+-----------------------------+

becomes this:

+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|label|features                                                                                                                                                                                                                                                                                                                                                                     |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|0    |[0.21437323142813602,0.08703882797784893,0.23570226039551587,0.10050378152592121,0.10206207261596577,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26373626373626374,0.012967453461681464,0.007624195465949381,0.014425347541872306,0.008896738386617248,0.022695267556861232,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787]                    |
|1    |[0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,0.5303300858899107,0.5017452060042545,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.6870229007633588,0.3534850108895589,0.5857224407945156,0.36079979664267925,0.5853463384675868,0.36971703925333405,0.5814734067275937,0.0,1.0,0.9999999999999998,1.0,0.9999999999999998]                                     |
|0    |[0.31754264805429416,0.30151134457776363,0.33541019662496846,0.3344968040028363,0.2867696673382022,0.26111648393354675,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.41600000000000004,0.10867521883199269,0.1920005048084368,0.1322792942407786,0.2477844869237889,0.11802058757911914,0.16554971608261862,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666]                             |
|0    |[0.16169041669088866,0.0,0.1666666666666667,0.0,0.09622504486493764,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.26666666666666666,0.012517205514308224,0.0,0.012752837227090714,0.0,0.021516657911501622,0.0,1.0,0.16838138468917166,0.15434287415564008,0.3922322702763681,0.34874291623145787]                                                                                       |
|0    |[0.2750456656690116,0.1860521018838127,0.2858309752375147,0.19611613513818402,0.223606797749979,0.1386750490563073,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.34862385321100914,0.06278282792172384,0.09178430436891666,0.06694373400084344,0.08253907697526759,0.07508140721703477,0.10856631569349082,1.0,0.3014783135305502,0.25688979598845174,0.5590169943749475,0.47628967220784013]|
|0    |[0.2449489742783178,0.19810721293758182,0.26352313834736496,0.2307692307692308,0.21629522817435007,0.16012815380508716,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.4838709677419355,0.12209521675839743,0.19126420671254496,0.1475066405521753,0.2459312750965279,0.1242978535834829,0.1886519686826469,1.0,0.0,0.01605611773109364,0.0,0.16666666666666666]                               |
|0    |[0.08320502943378437,0.09642365197998375,0.11952286093343938,0.13912166872805048,0.0,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,0.12,0.04035362208133099,0.04456121367953338,0.04819698770773715,0.0538656145326838,0.0,0.0,8.0,0.05825659037076343,0.05246835256923818,0.112089707663561,0.11278230910134424]                                                                          |
|0    |[0.20784609690826525,0.1846372364689991,0.26111648393354675,0.24806946917841688,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.0,0.07233915683015167,0.0716540790026919,0.08229370516713722,0.08299754342027771,0.0,0.0,6.0,0.04977054860197747,0.06558734556106822,0.09607689228305229,0.21759706994462227]                                                                          |
|1    |(25,[0,1,2,3,4,5,12,13,14,15,16,17,18,19],[0.8926577981869824,0.9066143160193102,0.914335372996105,0.9226517385233938,0.5477225575051661,0.6324555320336759,1.0,0.5309734513274337,0.8734996606615234,0.8946928809168011,0.8791317315987442,0.8973856295754765,0.3496004425218079,0.48223175160299564])                                                                      |
|1    |[0.5185629788417315,0.8432740427115678,0.5118906968889915,0.8819171036881969,0.24253562503633297,0.3333333333333333,1.0,1.0,0.0,1.0,1.0,1.0,1.0,0.09375,0.18908955158360016,0.8022196858263557,0.17544355300115252,0.8474955187144462,0.13927839835275616,0.2838123484309787,6.0,0.0,0.0,0.0,0.0]                                                                            |
|1    |(25,[8,9,12,13,20,21,22,23,24],[1.0,1.0,1.0,0.14814814814814814,1.0,0.02170244443925667,0.020410228072244255,0.15062893357603016,0.28922903686544305])                                                                                                                                                                                                                       |
|0    |(25,[0,1,2,3,8,9,12,13,14,15,16,17,20,22,24],[0.26860765467512676,0.06271815075053182,0.29515063885057,0.07485976927589244,1.0,1.0,1.0,0.08,0.04804110216570731,0.03027143543580809,0.05341183077151175,0.03431607006581793,1.0,0.022192268824097448,0.24019223070763074])                                                                                                   |
|1    |[0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.33333333333333337,0.40824829046386296,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.4516129032258064,0.3310013083604027,0.3537516145932176,0.3444032278588375,0.3667764454925114,0.3042153384207993,0.3408010155297054,6.0,0.28297384452448776,0.23615630148525626,0.2182178902359924,0.19245008972987526]|
|0    |(25,[0,2,8,9,12,13,14,16,20],[0.0519174131651165,0.0917662935482247,1.0,1.0,1.0,0.0967741935483871,0.03050544547960052,0.0490339271669166,5.0])                                                                                                                                                                                                                              |
|0    |(25,[0,2,12,13,14,16,20,22,24],[0.049160514400834666,0.02627034687463669,1.0,0.1282051282051282,0.006316709944109247,0.003132143258557757,3.0,0.019794166951004794,0.15638581054280606])                                                                                                                                                                                     |
|0    |(25,[0,2,9,11,12,13,14,16,20,21,22,23,24],[0.07082882469748285,0.08494119857293758,1.0,1.0,1.0,0.06060606060606061,0.004924318378089263,0.005845759285912874,4.0,0.023119472246583003,0.010659666129102227,0.03210289415620512,0.04420122177473814])                                                                                                                         |
|0    |[0.1924976258772545,0.038014296063485276,0.19149207069693872,0.02521364528296496,0.0,0.0,0.0,0.0,0.0,1.0,0.0,1.0,1.0,0.125,0.020931167922971575,0.00448818821863432,0.02118543184402528,0.0026553570889578286,0.0,0.0,5.0,0.02336541089352552,0.02401310014140845,0.11919975664202526,0.10760330515353056]                                                                   |
|1    |[0.17095921484405754,0.08434614994311695,0.20073126386549828,0.10085458113185984,0.0,0.0,0.0,0.0,1.0,0.0,0.0,1.0,1.0,0.07407407407407407,0.09182827200781651,0.05443489342945772,0.10010815165693956,0.05842165588249673,0.0,0.0,8.0,0.2973721930047951,0.168690765981807,0.5637584095764486,0.48478000681923245]                                                            |
|0    |[0.1405456737852613,0.049147318718299055,0.11846977555181847,0.08333333333333333,0.22360679774997896,0.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,0.08333333333333331,0.01937969263670974,0.003427781939920998,0.022922840542318093,0.006443992956721386,0.03572605281706383,0.0,5.0,0.26345546669165004,0.2557786050767472,0.405007416909787,0.45121260440202404]                        |
|1    |[0.6793662204867575,0.753778361444409,0.5773502691896258,0.6396021490668313,0.5773502691896258,0.8164965809277259,0.0,0.0,1.0,1.0,0.0,0.0,1.0,0.6875,0.7466360531069871,0.8217912018147824,0.7034677645212848,0.6620051533994062,0.469853400225108,0.9321213932723664,6.0,0.0,0.011793139853629018,0.0,0.14433756729740643]                                                  |
+-----+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+

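As you can see, the features column contains two different kinds of output rather than one uniform vector format. Is this a bug in CDH's Spark (1.6), or am I missing something?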

1 answer:

Answer 0 (score: 1)

TL;DR: This is normal behavior.

Your data contains many sparse rows. After assembly, they are converted to SparseVector and shown in the output as

(size, [idx1, idx2, ..., idxm], [val1, val2, ..., valm])

where idx1 .. idxm are the zero-based positions of the nonzero values and val1 .. valm are the corresponding values. So the following

(25,[8,9,12,13, ...],[1.0,1.0,1.0,0.14814814814814814, ...])

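is a SparseVector of size 25 in which index 8 (the 9th position) equals 1.0, index 13 (the 14th position) equals 0.14814814814814814, and so on. Every position that is not listed is 0.0.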
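To make the notation concrete, here is a minimal plain-Scala sketch that expands a `(size, indices, values)` triple into the full array it stands for. The helper `expandSparse` is hypothetical and written only for illustration; it is not part of Spark. The numbers are copied from one of the sparse rows in the question's output.

```scala
// Hypothetical helper (not a Spark API): expands (size, indices, values)
// into the full array that the sparse notation stands for.
def expandSparse(size: Int, indices: Array[Int], values: Array[Double]): Array[Double] = {
  require(indices.length == values.length, "indices and values must align")
  val full = Array.fill(size)(0.0) // every position defaults to zero
  indices.zip(values).foreach { case (i, v) => full(i) = v }
  full
}

// One sparse row from the question's output, written out in full.
val row = expandSparse(
  25,
  Array(8, 9, 12, 13, 20, 21, 22, 23, 24),
  Array(1.0, 1.0, 1.0, 0.14814814814814814, 1.0, 0.02170244443925667,
    0.020410228072244255, 0.15062893357603016, 0.28922903686544305))

println(row.mkString("[", ",", "]"))
```

The printed array has 25 entries, with zeros in every position that the sparse form leaves out.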

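If a row is dense (it has few zero values), it is kept as a DenseVector instead. The assembler keeps whichever representation needs less storage; in the output above, rows with 15 or fewer nonzero values out of 25 are sparse, and rows with 16 or more are dense. A DenseVector is represented in the output as: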

[val0, val1, ..., valn]

Both representations are perfectly valid and are accepted by most tools.
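If one uniform representation is preferred, for example to make the output easier to read, the two forms can be converted into each other. The sketch below assumes Spark 1.6 with `spark-mllib` on the classpath and uses `Vectors.sparse` and `toDense` from `org.apache.spark.mllib.linalg`. The vector itself is a small made-up example, not a row from the question.

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}

// Illustrative sparse vector of size 5 with nonzero entries at indices 1 and 3
val sparse: Vector = Vectors.sparse(5, Array(1, 3), Array(1.0, 0.148))

// Same content, stored with every position written out
val dense: Vector = sparse.toDense

println(sparse) // printed in (size,[indices],[values]) notation
println(dense)  // printed in [v0,v1,...] notation

// Both forms hold the same values position by position
assert(sparse.toArray.sameElements(dense.toArray))
```

Converting only changes how the values are stored and printed; the contents of the vector stay the same.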