When I create a pipeline with pyspark.ml.Pipeline, I get the following error:

File "/opt/module/spark-2.4.3-bin-hadoop2.7/Pipeline.py", line 18
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(), outputCol="features")
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
Exception ignored in:
Traceback (most recent call last):
File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 40, in __del__
AttributeError: 'HashingTF' object has no attribute '_java_obj'

I suspect the API has changed, but I'm not sure.
# Build a machine learning pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml import Pipeline
# Create a SparkSession object
spark = SparkSession.builder.master("local").appName("WorldCount").getOrCreate()
# 1. Prepare training documents from a list of (id, text, label) tuples
training = spark.createDataFrame([
(0, 'a b c d e spark', 1.0),
(1, 'b d', 0.0),
(2, 'spark f g h', 1.0),
(3, 'hadoop mapreduce', 0.0)
],['id','text','label'])
# 2. Define each PipelineStage of the pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(),outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)
# 3. Arrange the PipelineStages in processing order and create a Pipeline.
pipeline = Pipeline(stages=[tokenizer,hashingTF,lr])
# 4. Train the model
model = pipeline.fit(training)
# 5. Build the test data
test = spark.createDataFrame([
(4, 'spark i j k'),
(5, 'i m n'),
(6, 'spark hadoop spark'),
(7, 'apache hadoop')
],['id', 'text'])
# 6. Call transform() on the trained PipelineModel so that the test data
# passes through the fitted pipeline stages in order and yields predictions
prediction = model.transform(test)
selected = prediction.select('id','text','probability','prediction')
for row in selected.collect():
    rid, text, prob, prediction = row
    print('({},{}) -> prob = {}, prediction={}'.format(rid, text, str(prob), prediction))
(4, spark i j k) --> prob=[0.155543713844,0.844456286156], prediction=1.000000
(5, l m n) --> prob=[0.830707735211,0.169292264789], prediction=0.000000
(6, spark hadoop spark) --> prob=[0.0696218406195,0.93037815938], prediction=1.000000
(7, apache hadoop) --> prob=[0.981518350351,0.018481649649], prediction=0.000000
Answer 0 (score: 0)
You misspelled input in the keyword argument name (ipnutCol instead of inputCol):
TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
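
With the spelling fixed, the call becomes HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features"). The second error in the traceback (the AttributeError about '_java_obj' raised from __del__) is most likely a side effect of the first one rather than a separate problem: the object was already allocated when __init__ rejected the keyword, so its finalizer still ran on an object that was never fully set up. The sketch below reproduces that behaviour in plain Python without Spark. FakeHashingTF, try_build and finalizer_saw_java_obj are hypothetical names made up for illustration, not pyspark code, and the sketch assumes CPython's immediate reference-count cleanup.

```python
# Plain-Python sketch of the failure mode; nothing here is pyspark code.

# Records, for each finalized object, whether _java_obj had been set.
finalizer_saw_java_obj = []


class FakeHashingTF:
    """Hypothetical stand-in for a wrapper class with keyword-only arguments."""

    def __init__(self, *, inputCol=None, outputCol=None):
        self.inputCol = inputCol
        self.outputCol = outputCol
        # Only reached when the keyword arguments were accepted.
        self._java_obj = object()

    def __del__(self):
        # The finalizer runs even if __init__ raised, because the object
        # had already been allocated before __init__ was called.
        finalizer_saw_java_obj.append(hasattr(self, "_java_obj"))


def try_build(**kwargs):
    """Return the TypeError message for a rejected keyword, or None on success."""
    try:
        FakeHashingTF(**kwargs)
    except TypeError as exc:
        return str(exc)
    return None


misspelled = try_build(ipnutCol="words", outputCol="features")
corrected = try_build(inputCol="words", outputCol="features")

print(misspelled)              # message names the unexpected keyword 'ipnutCol'
print(corrected)               # None: the correctly spelled keyword is accepted
print(finalizer_saw_java_obj)  # first entry is False: finalizer ran without _java_obj
```

In the real traceback the finalizer accesses the missing attribute directly, which is why Python reports it under "Exception ignored in". Fixing the keyword spelling should make both messages disappear.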