How to convert DataFrame rows to IndexedRow in pyspark?

Asked: 2020-08-01 02:53:40

Tags: python apache-spark apache-spark-mllib

I have a table that looks like this:

+-------+------+-------+-------+----
|movieId|Action| Comedy|Fantasy| ...
+-------+------+-------+-------+----
|  1001 |  1   |   1   |   0   | ...
|  1011 |  0   |   1   |   1   | ...
+-------+------+-------+-------+----

How do I convert each of its rows into an IndexedRow, so that I end up with something like this:

+-------+----------------+
|movieId|    Features    |
+-------+----------------+
|  1001 | [1, 1, 0, ...] | 
|  1011 | [0, 1, 1, ...] |
+-------+----------------+

1 answer:

Answer 0 (score: 0)

If you need the output as an array type, you can use the array() function.

from pyspark.sql import functions as F

tst = spark.createDataFrame([(1,7,80),(1,8,40),(1,5,100),(5,8,90),(7,6,50),(0,3,60)],schema=['col1','col2','col3'])
# Collect all columns of each row into a single array column
tst_arr = tst.withColumn("Features", F.array(tst.columns))

tst_arr.show()
+----+----+----+-----------+
|col1|col2|col3|   Features|
+----+----+----+-----------+
|   1|   7|  80| [1, 7, 80]|
|   1|   8|  40| [1, 8, 40]|
|   1|   5| 100|[1, 5, 100]|
|   5|   8|  90| [5, 8, 90]|
|   7|   6|  50| [7, 6, 50]|
|   0|   3|  60| [0, 3, 60]|
+----+----+----+-----------+

If you are doing this as a step toward ML operations, it is better to use a VectorAssembler: http://spark.apache.org/docs/2.4.0/api/python/_modules/pyspark/ml/feature.html#VectorAssembler
