I'm using the bank data from UCI just to template out a project. I'm following the PySpark tutorial on their documentation site (sorry, I can't find the link). I keep getting an error when I run the data through the pipeline. I've loaded the data, converted the feature types, and set up pipeline stages for the categorical and numeric features. I'd welcome feedback on any part of the code, but especially at the point where the error occurs, so I can keep moving on this build. Thanks in advance!
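For reference, the load step was roughly this (reconstructed from memory, so treat the file name and options as placeholders from my environment):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("bank-marketing").getOrCreate()

# Read the UCI bank CSV; with schema inference off, every column arrives as a
# string, which is why the explicit casts in the next step are needed.
df = spark.read.csv("bank.csv", header=True)
df.show(5)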
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| id|age| job|marital|education|default|balance|housing|loan|contact|day|month|duration|campaign|pdays|previous|poutcome|deposit|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
| 1| 59| admin.|married|secondary| no| 2343| yes| no|unknown| 5| may| 1042| 1| -1| 0| unknown| yes|
| 2| 56| admin.|married|secondary| no| 45| no| no|unknown| 5| may| 1467| 1| -1| 0| unknown| yes|
| 3| 41|technician|married|secondary| no| 1270| yes| no|unknown| 5| may| 1389| 1| -1| 0| unknown| yes|
| 4| 55| services|married|secondary| no| 2476| yes| no|unknown| 5| may| 579| 1| -1| 0| unknown| yes|
| 5| 54| admin.|married| tertiary| no| 184| no| no|unknown| 5| may| 673| 2| -1| 0| unknown| yes|
+---+---+----------+-------+---------+-------+-------+-------+----+-------+---+-----+--------+--------+-----+--------+--------+-------+
only showing top 5 rows
# Convert Feature Types
df.createOrReplaceTempView("df")
df2 = spark.sql("""
    select
        cast(id as int) as id,
        cast(age as int) as age,
        cast(job as string) as job,
        cast(marital as string) as marital,
        cast(education as string) as education,
        cast(default as string) as default,
        cast(balance as int) as balance,
        cast(housing as string) as housing,
        cast(loan as string) as loan,
        cast(contact as string) as contact,
        cast(day as int) as day,
        cast(month as string) as month,
        cast(duration as int) as duration,
        cast(campaign as int) as campaign,
        cast(pdays as int) as pdays,
        cast(previous as int) as previous,
        cast(poutcome as string) as poutcome,
        cast(deposit as string) as deposit
    from df
""")
# Data Types
df2.dtypes
[('id', 'int'),
('age', 'int'),
('job', 'string'),
('marital', 'string'),
('education', 'string'),
('default', 'string'),
('balance', 'int'),
('housing', 'string'),
('loan', 'string'),
('contact', 'string'),
('day', 'int'),
('month', 'string'),
('duration', 'int'),
('campaign', 'int'),
('pdays', 'int'),
('previous', 'int'),
('poutcome', 'string'),
('deposit', 'string')]
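(As an aside, the same casting can be done without SQL via the DataFrame API. This is just an equivalent sketch, not what the tutorial shows:)

from pyspark.sql.functions import col

# Columns that should be integers; everything else is already a string.
int_cols = ["id", "age", "balance", "day", "duration", "campaign", "pdays", "previous"]
df2 = df
for c in int_cols:
    df2 = df2.withColumn(c, col(c).cast("int"))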
# Build Pipeline (Error is Here)
from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categorical_cols = ["job","marital","education","default","housing","loan","contact","month","poutcome"]
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays","previous"]
stages = []
stringIndexer = StringIndexer(inputCol=[cols for cols in categorical_cols],
                              outputCol=[cols + "_index" for cols in categorical_cols])
encoder = OneHotEncoderEstimator(inputCols=[cols + "_index" for cols in categorical_cols],
                                 outputCols=[cols + "_classVec" for cols in categorical_cols])
stages += [stringIndexer, encoder]
label_string_id = StringIndexer(inputCol="deposit", outputCol="label")
stages += [label_string_id]
assembler_inputs = [cols + "_classVec" for cols in categorical_cols] + numeric_cols
assembler = VectorAssembler(inputCols=assembler_inputs, outputCol="features")
stages += [assembler]
# Run Data Through Pipeline
pipeline = Pipeline().setStages(stages)
pipeline_model = pipeline.fit(df2)
prepped_df = pipeline_model.transform(df2)
TypeError: Invalid param value given for param "inputCols". Could not convert job_index to list of strings
Answer (score: 1):
This is because OneHotEncoderEstimator (unlike the legacy OneHotEncoder) takes multiple columns and yields multiple columns (note that both parameters are plural: Cols, not Col). So you should use a list, either one column at a time:
for cols in categorical_cols:
    ...
    encoder = OneHotEncoderEstimator(
        inputCols=[cols + "_index"], outputCols=[cols + "_classVec"]
    )
    ...
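Filled in, that per-column loop might look like the following sketch (my own illustration, not the answerer's code; it also gives each column its own StringIndexer, since StringIndexer.inputCol takes a single string):

stages = []
for c in categorical_cols:
    # Index one categorical column at a time (inputCol is singular)...
    indexer = StringIndexer(inputCol=c, outputCol=c + "_index")
    # ...and encode it by wrapping the name in a one-element list (inputCols is plural).
    encoder = OneHotEncoderEstimator(
        inputCols=[c + "_index"], outputCols=[c + "_classVec"]
    )
    stages += [indexer, encoder]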
or, preferably, pass all the columns at once, outside the for loop:
encoder = OneHotEncoderEstimator(
    inputCols=[col + "_index" for col in categorical_cols],
    outputCols=[col + "_classVec" for col in categorical_cols]
)
stages += [encoder]
If you are not sure what the expected input/output is, you can always inspect the corresponding Param:
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer
OneHotEncoderEstimator.inputCols.typeConverter
## <function pyspark.ml.param.TypeConverters.toListString(value)>
StringIndexer.inputCol.typeConverter
## <function pyspark.ml.param.TypeConverters.toString(value)>
As you can see, the former requires an object coercible to a list of strings, while the latter requires just a string.
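For completeness, here is one way the question's pipeline block could look after applying the fix (a sketch assuming Spark 2.3/2.4, where OneHotEncoderEstimator exists; in Spark 3.x it was renamed back to OneHotEncoder). Note that the question's StringIndexer call has the mirror-image problem: inputCol expects a single string, so it is built once per column here:

from pyspark.ml import Pipeline
from pyspark.ml.feature import OneHotEncoderEstimator, StringIndexer, VectorAssembler

categorical_cols = ["job", "marital", "education", "default", "housing",
                    "loan", "contact", "month", "poutcome"]
numeric_cols = ["age", "balance", "day", "duration", "campaign", "pdays", "previous"]

stages = []
# StringIndexer.inputCol takes a single string, so build one indexer per column.
for c in categorical_cols:
    stages.append(StringIndexer(inputCol=c, outputCol=c + "_index"))

# OneHotEncoderEstimator.inputCols/outputCols take lists, so one encoder
# handles all of the indexed columns at once.
stages.append(OneHotEncoderEstimator(
    inputCols=[c + "_index" for c in categorical_cols],
    outputCols=[c + "_classVec" for c in categorical_cols],
))

# Index the target, then assemble one-hot vectors plus numerics into "features".
stages.append(StringIndexer(inputCol="deposit", outputCol="label"))
stages.append(VectorAssembler(
    inputCols=[c + "_classVec" for c in categorical_cols] + numeric_cols,
    outputCol="features",
))

pipeline_model = Pipeline(stages=stages).fit(df2)
prepped_df = pipeline_model.transform(df2)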