I have a table that looks like this:
+-------+------+-------+-------+----
|movieId|Action| Comedy|Fantasy| ...
+-------+------+-------+-------+----
|   1001|     1|      1|      0| ...
|   1011|     0|      1|      1| ...
+-------+------+-------+-------+----
How can I convert each of its rows into an IndexedRow, so that I end up with something like this:
+-------+----------------+
|movieId| Features |
+-------+----------------+
| 1001 | [1, 1, 0, ...] |
| 1011 | [0, 1, 1, ...] |
+-------+----------------+
Answer 0 (score: 0)
If you need the output as an array type, you can use the array() function:
from pyspark.sql import functions as F

tst = spark.createDataFrame([(1, 7, 80), (1, 8, 40), (1, 5, 100), (5, 8, 90), (7, 6, 50), (0, 3, 60)], schema=['col1', 'col2', 'col3'])
tst_arr = tst.withColumn("Features", F.array(tst.columns))
tst_arr.show()
+----+----+----+-----------+
|col1|col2|col3| Features|
+----+----+----+-----------+
| 1| 7| 80| [1, 7, 80]|
| 1| 8| 40| [1, 8, 40]|
| 1| 5| 100|[1, 5, 100]|
| 5| 8| 90| [5, 8, 90]|
| 7| 6| 50| [7, 6, 50]|
| 0| 3| 60| [0, 3, 60]|
+----+----+----+-----------+
If you are doing this for ML operations, it is better to use VectorAssembler: http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/ml/feature.html#VectorAssembler