pyspark ml error - u'requirement failed: Cannot have an empty string for name.'

Time: 2016-11-21 00:42:50

Tags: apache-spark pyspark

I am trying to create a Spark ML KMeans model with the code below and pass it a DataFrame to get the clusters:

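from pyspark.ml import Pipeline
from pyspark.ml.clustering import KMeans
from pyspark.ml.feature import OneHotEncoder, StringIndexer, VectorAssembler
from pyspark.sql.functions import col

def pre_process_data_for_kmean(dataframe):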
    train_data = dataframe.select(col("custid"),col("amount").cast("double").alias("amnt"),col("trantype"),((col("trantime"))).cast("double").alias("date_time"))
    cat1Indexer = StringIndexer(inputCol="custid", outputCol="indexedCat1", handleInvalid="skip")
    cat2Indexer = StringIndexer(inputCol="trantype", outputCol="indexedCat2", handleInvalid="skip")
    cat1Encoder = OneHotEncoder(inputCol="indexedCat1", outputCol="CatVector1")
    cat2Encoder = OneHotEncoder(inputCol="indexedCat2", outputCol="CatVector2")
    cat3Encoder = OneHotEncoder(inputCol="date_time",outputCol="CatVector3")
    fAssembler = VectorAssembler(
    inputCols=["CatVector1","CatVector2","CatVector3","amnt"],
    outputCol="C5")
    cluster_model = KMeans(k=10, seed=1,featuresCol="C5")
    cluster_pipeline = Pipeline(stages=[cat1Indexer, cat1Encoder,cat2Indexer,cat2Encoder,cat3Encoder,fAssembler])
    cluster_model = cluster_pipeline.fit(train_data)
    return cluster_model

I build the DataFrame and pass it in as follows:

  train_df = raw_train_df.select(col("dSc").alias("custid"),col("TranAmount").alias("amount"),col("TranDescription").alias("trantype"),func.dayofmonth(col("BusinessDate")).alias("trantime")).na.fill({'trantype':'new_tran_type','custid':'-99999','amount':0,'trantime':1}).dropna()

  cluster_model = pre_process_data_for_kmean(train_df)

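I understand that OneHotEncoder does not accept empty strings, and as you can see I have already taken steps to guard against them (na.fill followed by dropna). But I am still getting this error.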

Please help.

1 Answer:

Answer 0: (score: 2)

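An empty string is literally an empty string, not NULL, so neither na.fill nor dropna will help here. You can use na.replace, but as far as I know it has no column-wise equivalent, so you have to call it once per column: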

replacements = {
  'some_col': 'some_replacement', 'another_col': 'another_replacement',
  'numeric_column_wont_be_replaced': 1.0
}

for k, v in replacements.items():
    # We can replace string only if target is string
    # In Python 2 str -> basestring
    if isinstance(v, str):
        df = df.na.replace("", v, [k])
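
The loop above can be wrapped in a small helper so it is easy to apply before fitting the pipeline. This is only a sketch: the function name replace_empty_strings is hypothetical (not part of PySpark), it assumes Python 3 (str rather than basestring), and it assumes df is a pyspark.sql.DataFrame whose na.replace(to_replace, value, subset) is called exactly as in the answer.

```python
def replace_empty_strings(df, replacements):
    """Replace "" with a per-column value, one na.replace call per column.

    Hypothetical helper, not a PySpark API. `df` is assumed to be a
    pyspark.sql.DataFrame; `replacements` maps column name -> replacement.
    """
    for column, value in replacements.items():
        # As in the answer: only string replacements are applied, because
        # the value being replaced ("") is itself a string.
        if isinstance(value, str):
            df = df.na.replace("", value, [column])
    return df
```

With the question's columns, a call might look like train_df = replace_empty_strings(train_df, {'custid': '-99999', 'trantype': 'new_tran_type'}), placed before pre_process_data_for_kmean(train_df). The replacement values here are simply the ones the question already passes to na.fill.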