I built a PySpark DataFrame using:
data = sqlContext.read.load('data.csv', format='com.databricks.spark.csv', delimiter=',', header='true', inferSchema='true')
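On Spark 2.x the CSV source is built in, so the external spark-csv package is not required there; assuming a SparkSession named spark (not shown in the question), an equivalent load would be:

data = spark.read.csv('data.csv', header=True, inferSchema=True)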
I want to perform PCA on this DataFrame. Its schema is:
>>> data
DataFrame[col0: double, col1: double, col2: double, col3: double, col4: double]
>>> data.show()
+---------------+---------------+---------------+---------------+---------------+
| col0| col1| col2| col3| col4|
+---------------+---------------+---------------+---------------+---------------+
| -8.801490628| -1.68848604044| 6.29108688718| 1.68614762629| -2.78418041902|
| 6.99040350558| -2.79455708195| -5.57115314522| 4.22337477957|-0.366589003047|
| 6.8950808389| 7.65514024658| 8.0214838208| -5.12100927058| 3.17467779733|
| 6.74150161414| 1.19627062139| 0.821181991602| 5.12589137044| -3.86248588187|
| 9.15545404244| 7.80553468656| -8.1232517076| 2.6242726214| -7.59049824307|
| -6.014643738|-0.470165781449|-0.226389435704| -2.55837378209| -2.06405566854|
| -9.49629160445| -9.85331556717| -7.44474566663| 6.48359295657| 9.75680835864|
| 0.450876020546| -3.55454445478| -2.82100689682| 5.15104966779| -7.70810268078|
| -7.21960567005| 0.102168086158| -1.46779736909| -3.87897074493| -3.17592118456|
| -8.75820987524| -8.63519048007| -4.20447284625|-0.394878764685| -5.79070138764|
| 9.47825273869| 6.02827892008| -9.7181540689| -9.0341215112| 5.96203870171|
| -1.56616611175| 1.64353225245| 9.20883287312|-0.158689954569| 4.92646032432|
|-0.952144934546| -2.9114138684| 2.99204980215| -4.64479019591| -5.99952901402|
| 3.55670956201|-0.812146671595| -1.81243042667| -1.0765836636| 4.9669633757|
| -2.28427448245| 0.982018554172| 2.2453332695| 1.02432988704| -7.42272905399|
| 5.5901346625| 9.7266134961| 0.372411854139| 4.62762920616| -7.39599025974|
| 9.54828822231| -2.99982461624| 2.17542923571| 6.98459564802| 4.17077742377|
| -6.93309333389| 6.54244346903| 0.783827506295| 4.51631424946| 5.14605443379|
| -1.39844067044| 5.94842772889| 0.270728638304| 4.71245951003| 7.60767471606|
| -7.45885401935| -2.17059549479| 9.13976371571| -7.59189334493| -2.3924001937|
+---------------+---------------+---------------+---------------+---------------+
To do this I work with pyspark.ml.feature, and this is how I went about it:
dataPCA = PCA(k=2, inputCol=str(data.columns), outputCol="pcaFeatures")
model = dataPCA.fit(data)
I get this error:
pyspark.sql.utils.IllegalArgumentException: u'Field "[\'col0\', \'col1\', \'col2\', \'col3\', \'col4\']" does not exist.'
What went wrong, and how do I fix it?
Answer 0 (score: 2)
PCA takes a Vector column as input, but here inputCol is set to the string representation of a Python list, and no column with that name exists, hence the error. You have to assemble your data into a single vector column first, for example with VectorAssembler or RFormula.
See the examples in Encode and assemble multiple features in PySpark for details. With RFormula it can look like this:
from pyspark.ml.feature import RFormula

# "~ col0 + col1 + ..." assembles every column into a single "features" vector
data = RFormula(formula=" ~ {0}".format(" + ".join(data.columns))).fit(data).transform(data)
result = dataPCA.setInputCol("features").fit(data).transform(data)
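Alternatively, here is a minimal sketch with VectorAssembler, the other option mentioned above, starting again from the original data DataFrame (the variable names assembler, assembled, pca, and result are illustrative; the rest matches the question):

from pyspark.ml.feature import PCA, VectorAssembler

# Combine the five numeric columns into a single vector column
assembler = VectorAssembler(inputCols=data.columns, outputCol="features")
assembled = assembler.transform(data)

# Project onto the top two principal components
pca = PCA(k=2, inputCol="features", outputCol="pcaFeatures")
result = pca.fit(assembled).transform(assembled)
result.select("pcaFeatures").show(truncate=False)

On Spark 2.0+ the fitted PCAModel also exposes explainedVariance, if you want to check how much variance the two components capture.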