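My DataFrame has several None values. After I convert a string column to a float column with StringIndexer, the None values are replaced with numbers.
Question: how can I convert string columns to float columns while keeping the None values as None?
Thanks.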
# Transform nominal attributes to numeric attributes
for columnName, columnType in self.rawData.dtypes:
    if columnType == "string":
        self.rawData = PreProcess.TransformNominalToNumeric(self.rawData, columnName)
from pyspark.ml.feature import StringIndexer


class PreProcess:
    @staticmethod
    def TransformNominalToNumeric(dataFrame, inputColumn):
        """Transformation of nominal attributes into numeric"""
        outputColumn = inputColumn + "_index"
        # handleInvalid="keep" sends NULLs to an extra index instead of raising an error
        indexer = StringIndexer(inputCol=inputColumn, outputCol=outputColumn, handleInvalid="keep")
        model = indexer.fit(dataFrame)
        dataFrame = model.transform(dataFrame)
        # Replace the original string column with its numeric index
        dataFrame = dataFrame.drop(inputColumn)
        dataFrame = dataFrame.withColumnRenamed(outputColumn, inputColumn)
        return dataFrame
Answer 0 (score: 2)
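With handleInvalid="keep", StringIndexer puts invalid data (NULLs included) in a special additional bucket, at index numLabels.
You can map that index back to NULL after transform: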
from pyspark.ml.feature import StringIndexer
from pyspark.sql.functions import col, when

# Assumes an active SparkSession named `spark`
dataFrame = spark.createDataFrame(["a", None, "b"], "string").toDF("value")
inputColumn = "value"
outputColumn = inputColumn + "_index"
indexer = StringIndexer(
    inputCol=inputColumn, outputCol=outputColumn, handleInvalid="keep"
).fit(dataFrame)
# The extra "keep" bucket sits at index len(indexer.labels); turn it back into NULL
(indexer
    .transform(dataFrame)
    .withColumn(outputColumn, when(col(outputColumn) == len(indexer.labels), None).otherwise(col(outputColumn)))
    .show())
# +-----+-----------+
# |value|value_index|
# +-----+-----------+
# |    a|        0.0|
# | null|       null|
# |    b|        1.0|
# +-----+-----------+
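To see why comparing against len(indexer.labels) works, here is a rough plain-Python model of the "keep" bucket. It is only a sketch, not Spark's implementation: the helper names fit_labels, index_keep and restore_nulls are made up for illustration, and it assumes the default frequency-descending label order with ties broken alphabetically.

```python
from collections import Counter


def fit_labels(values):
    """Order distinct non-null labels by descending frequency, ties alphabetically (assumed)."""
    counts = Counter(v for v in values if v is not None)
    return sorted(counts, key=lambda label: (-counts[label], label))


def index_keep(values, labels):
    """Map each value to its label index; None and unknown labels go to index len(labels)."""
    positions = {label: float(i) for i, label in enumerate(labels)}
    extra_bucket = float(len(labels))
    return [positions.get(v, extra_bucket) for v in values]


def restore_nulls(indices, labels):
    """Turn the extra-bucket index back into None, leaving real label indices untouched."""
    extra_bucket = float(len(labels))
    return [None if i == extra_bucket else i for i in indices]


values = ["a", None, "b"]
labels = fit_labels(values)                # two labels, so the extra bucket is index 2.0
indexed = index_keep(values, labels)       # None lands in the extra bucket
restored = restore_nulls(indexed, labels)  # the extra bucket becomes None again
print(labels, indexed, restored)
```

One consequence of this design: a label that was not seen during fit also lands in the extra bucket, so after the replacement it is indistinguishable from a value that was NULL to begin with.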
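However, mapping the index back to NULL gains you nothing if you plan to use pyspark.ml afterward. pyspark.ml algorithms generally do not accept NULLs, so you will have to impute, drop, or encode the missing values (as here) first, depending on the column type and your requirements, before you can proceed.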