I built a prepared DataFrame and transformed it with VectorAssembler so that it can be used with MLlib:
from pyspark.ml import Pipeline
from pyspark.ml.feature import VectorAssembler, StringIndexer
from pyspark.ml.classification import DecisionTreeClassifier
target_index = StringIndexer(inputCol="target", outputCol="target_idx").fit(df)
assembler = VectorAssembler(
    inputCols=[
        x for x in df.columns if x not in ['target', 'ident_1', 'id_l', 'target_idx']
    ],
    outputCol='features'
)
cl = DecisionTreeClassifier(labelCol='target_idx', featuresCol='features')
pipe = Pipeline(stages=[target_index, assembler, cl])
model = pipe.fit(df_train)
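# model.stages[1] is the VectorAssembler stage itself; call transform() to get the assembled DataFrame
df_transformed = model.stages[1].transform(df_train)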
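Now I want to write the transformed dataset to an ARFF file. Is there a way to write a PySpark DataFrame that has already been transformed by VectorAssembler to ARFF format?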
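As far as I know, Spark does not ship an ARFF writer, so one option is to write the file by hand. The following is only a minimal sketch, not an established API: `write_arff` and its parameters (`feature_names`, `class_labels`, `relation`, `label_name`) are hypothetical names introduced here. It assumes every feature is numeric, that attribute names contain no spaces or special characters, and that the data is small enough to stream through the driver.

```python
import io


def write_arff(rows, feature_names, fileobj, class_labels,
               relation="dataset", label_name="target_idx"):
    """Write (features, label) pairs to fileobj in dense ARFF format.

    rows          -- iterable of (sequence of numbers, label) pairs
    feature_names -- one attribute name per feature position
    class_labels  -- all possible label values, declared as a nominal attribute
    """
    # Header: relation name, one NUMERIC attribute per feature, nominal class
    fileobj.write("@RELATION {}\n\n".format(relation))
    for name in feature_names:
        fileobj.write("@ATTRIBUTE {} NUMERIC\n".format(name))
    fileobj.write("@ATTRIBUTE {} {{{}}}\n".format(
        label_name, ",".join(str(c) for c in class_labels)))
    fileobj.write("\n@DATA\n")

    # Data section: one comma-separated line per row, label last
    for features, label in rows:
        values = [str(float(v)) for v in features] + [str(label)]
        fileobj.write(",".join(values) + "\n")


# Small in-memory demo with made-up values (no Spark needed)
demo = io.StringIO()
write_arff([([1, 2.5], 0.0), ([3, 4], 1.0)], ["a", "b"], demo,
           class_labels=[0.0, 1.0])
print(demo.getvalue())
```

With Spark, the rows could come from a DataFrame that has both a `features` and a `target_idx` column (for example the output of `model.transform(df_train)`): iterate over `select('features', 'target_idx').toLocalIterator()`, pass `(row['features'].toArray(), row['target_idx'])` for each row, and use `assembler.getInputCols()` as `feature_names`. That mapping only holds if each input column is a scalar; vector-valued input columns would need one name per element.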