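PySpark -> StringIndexer: "None" values are replaced by a number

Time: 2018-04-29 12:16:59

Tags: apache-spark pyspark

My DataFrame has several None values. After converting a string column to a float column with StringIndexer, the None values are replaced by a number.

Question: how can I convert a string column to a float column while keeping the None values as None?

Thanks.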

# Transform nominal attributes to numeric attributes
for columnName, columnType in self.rawData.dtypes:
    if columnType == "string":
        self.rawData = PreProcess.TransformNominalToNumeric(self.rawData, columnName)



from pyspark.ml.feature import StringIndexer


class PreProcess:
    @staticmethod
    def TransformNominalToNumeric(dataFrame, inputColumn):
        """Transformation of nominal attributes into numeric"""
        outputColumn = inputColumn + "_index"
        # handleInvalid="keep" assigns invalid values (including None) an extra index
        indexer = StringIndexer(inputCol=inputColumn, outputCol=outputColumn, handleInvalid="keep")
        model = indexer.fit(dataFrame)
        dataFrame = model.transform(dataFrame)
        # Replace the original string column with its indexed version
        dataFrame = dataFrame.drop(inputColumn)
        dataFrame = dataFrame.withColumnRenamed(outputColumn, inputColumn)
        return dataFrame

1 answer:

Answer 0: (score: 2)

Since "keep" puts invalid data in a special additional bucket, at index numLabels, you can replace the values manually after transform:

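from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col, when

# Assumes an active SparkSession named spark (as in the pyspark shell)
dataFrame = spark.createDataFrame(["a", None, "b"], "string").toDF("value")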

inputColumn = "value"
outputColumn = inputColumn + "_index"

indexer = StringIndexer(
     inputCol=inputColumn, outputCol=outputColumn, handleInvalid="keep"
).fit(dataFrame)

(indexer
   .transform(dataFrame)
   .withColumn(
       outputColumn,
       # The extra "keep" bucket has index len(indexer.labels); map it back to null
       when(col(outputColumn) == len(indexer.labels), None).otherwise(col(outputColumn)))
   .show())
# +-----+-----------+
# |value|value_index|
# +-----+-----------+
# |    a|        0.0|
# | null|       null|
# |    b|        1.0|
# +-----+-----------+
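
To see why comparing against len(indexer.labels) picks out the null row, here is a minimal pure-Python sketch of the indexing rule described above. It is not Spark's implementation: the function names fit_labels, index_with_keep and restore_nulls are made up for illustration, and breaking frequency ties alphabetically is an assumption.

```python
from collections import Counter


def fit_labels(values):
    """Distinct non-null labels, most frequent first (ties alphabetical: an assumption)."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_with_keep(values, labels):
    """Mimic handleInvalid="keep": nulls and unseen labels get index len(labels)."""
    lookup = {label: float(i) for i, label in enumerate(labels)}
    extra_bucket = float(len(labels))
    return [lookup.get(v, extra_bucket) for v in values]


def restore_nulls(indices, labels):
    """Turn the extra bucket back into None, like the when/otherwise step."""
    extra_bucket = float(len(labels))
    return [None if i == extra_bucket else i for i in indices]


values = ["a", None, "b"]
labels = fit_labels(values)
indexed = index_with_keep(values, labels)
restored = restore_nulls(indexed, labels)
print(labels, indexed, restored)
```

One consequence worth noting: the extra bucket also collects labels that were not seen during fit, so on new data the same replacement turns those into None as well.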

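However, restoring the nulls is of no value if you plan to use pyspark.ml afterwards. No pyspark.ml algorithm accepts NULLs, so before you can continue you will have to impute, drop, or encode them (as is done here), depending on the type and requirements.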
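
The three options can be sketched on a plain Python list. This only illustrates the choices and is not PySpark code; in PySpark the rough counterparts would be DataFrame.na.drop and DataFrame.na.fill for the first two, and handleInvalid="keep" for the third. The helper names and the "__missing__" placeholder are assumptions made for this example.

```python
from collections import Counter


def drop_nulls(values):
    """Drop: discard the entries that have no value."""
    return [v for v in values if v is not None]


def impute_most_frequent(values):
    """Impute: fill nulls with the most frequent non-null value."""
    counts = Counter(v for v in values if v is not None)
    if not counts:
        raise ValueError("cannot impute: every value is null")
    fill = counts.most_common(1)[0][0]
    return [fill if v is None else v for v in values]


def encode_as_category(values, placeholder="__missing__"):
    """Encode: treat null as a category of its own."""
    return [placeholder if v is None else v for v in values]


values = ["a", None, "b", "a"]
dropped = drop_nulls(values)
imputed = impute_most_frequent(values)
encoded = encode_as_category(values)
print(dropped, imputed, encoded)
```

Which one fits depends on the column: dropping loses rows, imputing invents a value, and encoding keeps the fact that the value was missing as information in its own right.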