How to convert a Spark DataFrame array of strings to a Vector in Python

Time: 2017-07-26 19:50:12

Tags: arrays apache-spark dataframe vector user-defined-functions

I have a table test_tbl:

+-----------------+--------------+--------------+--+
| test_tbl.label  | test_tbl.f1  | test_tbl.f2  |
+-----------------+--------------+--------------+--+
| 0               | a            | b            |
| 1               | c            | d            |
+-----------------+--------------+--------------+--+

I want to combine columns f1 and f2 into a single vector with the following PySpark code:

from pyspark.sql.functions import udf
from pyspark.ml.linalg import Vectors, VectorUDT

arr_to_vector = udf(lambda a: Vectors.dense(a), VectorUDT())
df = sqlContext.sql("""SELECT label, array(f1, f2) AS features
                       FROM test_tbl""")
df_vector = df.select(df["label"],
                      arr_to_vector(df["features"]).alias("features"))
df_vector.show()

Then I got this error: ValueError: setting an array element with a sequence.

However, if I change the values of f1 and f2 in the table to numbers (even though the column data type is still defined as string), like:

+-----------------+--------------+--------------+--+
| test_tbl.label  | test_tbl.f1  | test_tbl.f2  |
+-----------------+--------------+--------------+--+
| 0               | 0.1          | 0.2          |
| 1               | 0.3          | 0.4          |
+-----------------+--------------+--------------+--+

the error goes away and the UDF works fine.
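The reason is that `Vectors.dense` builds a NumPy float array under the hood, so every element of the input must be coercible to float. Numeric strings like "0.1" convert cleanly; arbitrary strings like "a" do not. A minimal sketch using NumPy directly (which is an assumption about the mechanism, but reproduces the same success/failure split without needing Spark):

```python
import numpy as np

# Numeric strings are coerced to floats without complaint:
ok = np.array(["0.1", "0.2"], dtype=np.float64)
print(ok)  # [0.1 0.2]

# Non-numeric strings cannot be converted, so the same call raises ValueError,
# which is what surfaces from inside the UDF:
try:
    np.array(["a", "b"], dtype=np.float64)
except ValueError as e:
    print("ValueError:", e)
```

The exact error message varies by NumPy version, but the failure mode is the same: a dense vector cannot be built from values that have no float representation.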

Can anyone help with this?

1 Answer:

Answer 0 (score: 0)

You could consider using StringIndexer to convert the categorical variables to floats.

https://spark.apache.org/docs/2.2.0/ml-features.html#stringindexer

from pyspark.ml.feature import StringIndexer

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()