Performing PCA on a PySpark DataFrame

Time: 2017-11-15 10:38:56

Tags: python apache-spark pyspark

I built a PySpark DataFrame as follows:

data = sqlContext.read.load('data.csv' , format='com.databricks.spark.csv', delimiter = ',' ,header='true',inferSchema='true') 

I want to run PCA on this DataFrame. Its schema is:

>>> data
DataFrame[col0: double, col1: double, col2: double, col3: double, col4: double]

>>> data.show()
+---------------+---------------+---------------+---------------+---------------+
|           col0|           col1|           col2|           col3|           col4|
+---------------+---------------+---------------+---------------+---------------+
|   -8.801490628| -1.68848604044|  6.29108688718|  1.68614762629| -2.78418041902|
|  6.99040350558| -2.79455708195| -5.57115314522|  4.22337477957|-0.366589003047|
|   6.8950808389|  7.65514024658|   8.0214838208| -5.12100927058|  3.17467779733|
|  6.74150161414|  1.19627062139| 0.821181991602|  5.12589137044| -3.86248588187|
|  9.15545404244|  7.80553468656|  -8.1232517076|   2.6242726214| -7.59049824307|
|   -6.014643738|-0.470165781449|-0.226389435704| -2.55837378209| -2.06405566854|
| -9.49629160445| -9.85331556717| -7.44474566663|  6.48359295657|  9.75680835864|
| 0.450876020546| -3.55454445478| -2.82100689682|  5.15104966779| -7.70810268078|
| -7.21960567005| 0.102168086158| -1.46779736909| -3.87897074493| -3.17592118456|
| -8.75820987524| -8.63519048007| -4.20447284625|-0.394878764685| -5.79070138764|
|  9.47825273869|  6.02827892008|  -9.7181540689|  -9.0341215112|  5.96203870171|
| -1.56616611175|  1.64353225245|  9.20883287312|-0.158689954569|  4.92646032432|
|-0.952144934546|  -2.9114138684|  2.99204980215| -4.64479019591| -5.99952901402|
|  3.55670956201|-0.812146671595| -1.81243042667|  -1.0765836636|   4.9669633757|
| -2.28427448245| 0.982018554172|   2.2453332695|  1.02432988704| -7.42272905399|
|   5.5901346625|   9.7266134961| 0.372411854139|  4.62762920616| -7.39599025974|
|  9.54828822231| -2.99982461624|  2.17542923571|  6.98459564802|  4.17077742377|
| -6.93309333389|  6.54244346903| 0.783827506295|  4.51631424946|  5.14605443379|
| -1.39844067044|  5.94842772889| 0.270728638304|  4.71245951003|  7.60767471606|
| -7.45885401935| -2.17059549479|  9.13976371571| -7.59189334493|  -2.3924001937|
+---------------+---------------+---------------+---------------+---------------+

To do this I have to use pyspark.ml.feature, and this is how I tried it:

dataPCA = PCA(k=2, inputCol=str(data.columns), outputCol="pcaFeatures")
model = dataPCA.fit(data)

I get this error:

pyspark.sql.utils.IllegalArgumentException: u'Field "[\'col0\', \'col1\', \'col2\', \'col3\', \'col4\']" does not exist.'

What is wrong, and how can I fix it?

1 Answer:

Answer 0 (score: 2):

As mentioned by mkaran, PCA requires a Vector column as input. You have to assemble your data first, for example using VectorAssembler or RFormula.
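To see why the error names a field that looks like a whole list, here is a plain-Python sketch (no Spark needed) of what the inputCol argument in the question evaluates to. The `columns` list is a stand-in for `data.columns`.

```python
# Stand-in for data.columns from the question.
columns = ["col0", "col1", "col2", "col3", "col4"]

# str() on a list yields one string, and inputCol treats it as a single column name.
input_col = str(columns)
print(input_col)

# No column in the DataFrame carries that name, hence "does not exist".
print(input_col in columns)
```

So inputCol has to name one existing column, and for PCA that column must hold Vectors.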

See the examples in Encode and assemble multiple features in PySpark for details.

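from pyspark.ml.feature import RFormula
# RFormula packs every column into one Vector column named "features"; PCA then reads it.
data = RFormula(formula=" ~ {0}".format(" + ".join(data.columns))).fit(data).transform(data)
dataPCA.setInputCol("features").fit(data).transform(data)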