AttributeError: 'HashingTF' object has no attribute '_java_obj'

Asked: 2019-08-24 03:03:36

Tags: pyspark

When I create a pipeline with pyspark.ml.Pipeline, the following error occurs:

    File "/opt/module/spark-2.4.3-bin-hadoop2.7/Pipeline.py", line 18
      hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(), outputCol="features")
    File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/__init__.py", line 110, in wrapper
    TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
    Exception ignored in:
    Traceback (most recent call last):
      File "/opt/module/spark-2.4.3-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/ml/wrapper.py", line 40, in __del__
    AttributeError: 'HashingTF' object has no attribute '_java_obj'

I guess the API has changed, but I'm not sure.

# Build a machine learning pipeline
from pyspark.sql import SparkSession
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import HashingTF, Tokenizer
from pyspark.ml import Pipeline
# Create a SparkSession object
spark = SparkSession.builder.master("local").appName("WorldCount").getOrCreate()

# 1. prepare training documents from a list of (id, text, label) tuples
training = spark.createDataFrame([
    (0, 'a b c d e spark', 1.0),
    (1, 'b d', 0.0),
    (2, 'spark f g h', 1.0),
    (3, 'hadoop mapreduce', 0.0)
],['id','text','label'])
# 2. Define each PipelineStage of the pipeline.
tokenizer = Tokenizer(inputCol="text", outputCol="words")
hashingTF = HashingTF(ipnutCol=tokenizer.getOutputCol(), outputCol="features")
lr = LogisticRegression(maxIter=10, regParam=0.001)

# 3. Arrange the PipelineStages in processing order and create a Pipeline.
pipeline = Pipeline(stages=[tokenizer,hashingTF,lr])

# 4. Train the model
model = pipeline.fit(training)

# 5. Build the test data
test = spark.createDataFrame([
    (4, 'spark i j k'),
    (5, 'i m n'),
    (6, 'spark hadoop spark'),
    (7, 'apache hadoop')
],['id', 'text'])

# 6. Call transform() on the fitted PipelineModel so the test data
# passes through the pipeline stages in order, producing predictions
prediction = model.transform(test)
selected = prediction.select('id','text','probability','prediction')
for row in selected.collect():
    rid, text, prob, prediction = row
    print('({},{}) -> prob = {}, prediction={}'.format(rid, text, str(prob),prediction))

(4, spark i j k) -> prob = [0.155543713844,0.844456286156], prediction=1.000000
(5, l m n) -> prob = [0.830707735211,0.169292264789], prediction=0.000000
(6, spark hadoop spark) -> prob = [0.0696218406195,0.93037815938], prediction=1.000000
(7, apache hadoop) -> prob = [0.981518350351,0.018481649649], prediction=0.000000

1 Answer:

Answer 0 (score: 0)

You have a typo in inputCol (you wrote ipnutCol):

TypeError: __init__() got an unexpected keyword argument 'ipnutCol'
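Changing ipnutCol= to inputCol= fixes both errors (the AttributeError is just fallout: __init__ aborted before _java_obj was set, so __del__ fails too). The mechanism can be reproduced without Spark; a minimal sketch using a hypothetical stand-in class (HashingTFDemo is not a pyspark class, it only mimics the keyword-only constructor):

```python
# Stand-in for pyspark's HashingTF to show why the misspelled keyword
# argument is rejected: Python matches keyword arguments by exact name.
class HashingTFDemo:
    def __init__(self, inputCol=None, outputCol=None):
        self.inputCol = inputCol
        self.outputCol = outputCol

try:
    # The same typo as in the question: 'ipnutCol' instead of 'inputCol'
    HashingTFDemo(ipnutCol="words", outputCol="features")
except TypeError as err:
    msg = str(err)
    print(msg)  # message names the unexpected keyword 'ipnutCol'

# The correct spelling works as expected
tf = HashingTFDemo(inputCol="words", outputCol="features")
print(tf.inputCol)
```

The corrected pyspark line in the original script would then read: hashingTF = HashingTF(inputCol=tokenizer.getOutputCol(), outputCol="features").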