PySpark DataFrame - How to convert one column from categorical values to int?

Time: 2017-08-04 13:13:05

Tags: python pyspark spark-dataframe

I have a PySpark DataFrame and I want to convert one of its columns from string to int. Example:

Table 1:

+------------+-----+
|categories  |value|
+------------+-----+
|         red| 0.23|
|       green| 0.34|
|      yellow| 0.56|
|       black| 0.11|
|         red| 0.67|
|         red| 0.34|
|       green| 0.45|
+------------+-----+

Table 2:

+------------+-----+
|categ_num   |value|
+------------+-----+
|           1| 0.23|
|           2| 0.34|
|           3| 0.56|
|           4| 0.11|
|           1| 0.67|
|           1| 0.34|
|           2| 0.45|
+------------+-----+

So, in that case: [red=1, green=2, yellow=3 and black=4].

But I don't know all the colors in advance, so I can't assign the numbers manually. I need a way to do the mapping automatically.

Could anyone help me, please?

2 answers:

Answer 0: (score: 0)

Spark ML has a StringIndexer.

Answer 1: (score: 0)

This code works for me:

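from pyspark.ml.feature import StringIndexer
from pyspark.sql import SparkSession

# Reuse the active session, or create one when running outside the pyspark shell.
spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(0, "a"), (1, "b"), (2, "c"), (3, "a"), (4, "a"), (5, "c")],
    ["id", "category"])

# Map each distinct string to a numeric index (0.0 for the most frequent label).
indexer = StringIndexer(inputCol="category", outputCol="categoryIndex")
indexed = indexer.fit(df).transform(df)
indexed.show()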

https://spark.apache.org/docs/latest/ml-features.html#stringindexer