How do I score my deployed Apache Spark ML model for NLP?

Asked: 2019-06-24 22:03:14

Tags: java python nlp apache-spark-mllib ibm-watson

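I built a bag-of-words model with NLTK on a Spark DataFrame of consumer reviews. My final dataset has three columns: Sentiment, text, and bagofwords. The schema looks like this: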

    StructType(List(StructField(Sentiment,StringType,true),StructField(text,StringType,true),StructField(bagofwords,ArrayType(StringType,true),true)))

Each record in the bagofwords column is a list of words with punctuation and stop words removed. I suspect this column is what causes the problem.
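For illustration, here is a minimal sketch of how one such record could be produced in plain Python. The stop-word set below is a small hypothetical stand-in (the actual pipeline used NLTK), and `bag_of_words` is a made-up helper name, so the tokens will not match the real dataset exactly.

```python
import string

# Hypothetical stop-word set for illustration only; the real pipeline used NLTK.
STOP_WORDS = {"i", "are", "very", "a", "the", "is"}

def bag_of_words(text):
    """Lowercase the text, strip punctuation, split on whitespace, drop stop words."""
    cleaned = text.lower().translate(str.maketrans("", "", string.punctuation))
    return [word for word in cleaned.split() if word not in STOP_WORDS]

tokens = bag_of_words("I hate this place, they are very incompetent")
print(tokens)
```

The point is that each record is a real list of strings, which matches `ArrayType(StringType)` in the schema.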

I want to score my deployed Spark ML model by passing a JSON payload like this:

scoring_payload = {"fields": ["text", "bagofwords"], "values": ["I hate this place, they are very incompetent", "['this', 'place', 'hate', 'they', 'incompetent']"]}

But I keep getting an error message like this:

Status code: 400, body: {
  "trace": "ff8e614b33c635684e648e2c6705d9eb",
  "errors": [{
    "code": "invalid_payload",
    "message": "Input Json parsing failed with error: java.lang.ClassCastException"
  }]
}

I am not yet familiar with Java or Scala, but from what I can infer so far, the problem has to do with converting between an array/list and a string.
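To make the suspected mismatch concrete, the following sketch uses only the standard library to compare the quoted value from the payload with an actual list. It shows how each one serializes to JSON; it does not reproduce the server-side error itself.

```python
import ast
import json

# The value as written in the payload: a single string that merely looks like a list.
as_string = "['this', 'place', 'hate', 'they', 'incompetent']"

# Parse the Python-literal string into a real list of strings.
as_list = ast.literal_eval(as_string)

print(type(as_string).__name__, type(as_list).__name__)

# The string serializes to a JSON string, the list to a JSON array.
print(json.dumps({"bagofwords": as_string}))
print(json.dumps({"bagofwords": as_list}))
```

A column declared as `ArrayType(StringType)` presumably needs the JSON array form, so a JSON string in that position is a plausible source of a cast error.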

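I tried adjusting the payload by dumping it to JSON, but that also raises an error. I also followed the steps shown in the notebook at the following link: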

https://dataplatform.cloud.ibm.com/analytics/notebooks/1fed143e-1877-42bd-b927-7d366e73745b/view?access_token=4b39718f9e1f1de55e6e67e8dcbb5f0cac848f390d73478d0dea9c1a8af24550

final_dataset1 = spark.read.parquet('final_sparkml_dataset_pq')
final_dataset1.show()

+---------+--------------------+--------------------+
|Sentiment|                text|          bagofwords|
+---------+--------------------+--------------------+
| negative|You need to doubl...|[something, cold,...|
| negative|Now first off I a...|[out, actually, f...|
| negative|I should have bee...|[we, was, gel, my...|
| negative|We stayed at the ...|[out, ball, cater...|
| negative|I figured I would...|[, respond, compa...|
| negative|Asked for blonde,...|[absolutely, awfu...|
| negative|There are places ...|[grumble, envisio...|
| negative|This place is ter...|[was, for, plotti...|
| negative|I had went here a...|[popped, circumst...|


from watson_machine_learning_client import WatsonMachineLearningAPIClient
wml_credentials = {
  "apikey": "***",
  "iam_apikey_description": "Auto-generated for key ***",
  "iam_apikey_name": "wdp-writer",
  "iam_role_crn": "crn:v1:bluemix:public:iam::::serviceRole:Writer",
  "iam_serviceid_crn": "***",
  "instance_id": "***",
  "password": "***",
  "url": "https://us-south.ml.cloud.ibm.com",
  "username": "**"
}
client = WatsonMachineLearningAPIClient(wml_credentials)
created_deployment = client.deployments.create(published_model_uid, name="Sentiment Predictor SparkML")
scoring_endpoint = client.deployments.get_scoring_url(created_deployment)
scoring_payload = {"fields": ["text", "bagofwords"], "values": ["I hate this place, they are very incompetent", "['this', 'place', 'hate', 'they', 'incompetent']"]}
deploy_model_pred = client.deployments.score(scoring_endpoint, scoring_payload)

I keep getting a cast error:

Status code: 400, body: {
  "trace": "ff8e614b33c635684e648e2c6705d9eb",
  "errors": [{
    "code": "invalid_payload",
    "message": "Input Json parsing failed with error: java.lang.ClassCastException"
  }]
}

I expect the output to contain "rawprediction", "probability", "predictionlabel", and other results, similar to what I normally get when running the transform method on test data.

Any ideas about what I am doing wrong? How can I make the payload valid in this case?
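One payload shape that may be worth trying is sketched below. It assumes the scoring service expects `values` to be a list of rows, with each row holding one entry per name in `fields`, and that the bagofwords entry must be a real list rather than a quoted string. `build_scoring_payload` is a hypothetical helper, not part of the Watson Machine Learning client, and this has not been verified against the deployment above.

```python
import json

def build_scoring_payload(rows):
    """Build a payload from (text, bagofwords) pairs, one row per pair.

    Hypothetical helper: it only checks Python types locally and does not
    call any scoring service.
    """
    values = []
    for text, words in rows:
        if not isinstance(text, str):
            raise TypeError("text must be a string")
        if not isinstance(words, list) or not all(isinstance(w, str) for w in words):
            raise TypeError("bagofwords must be a list of strings")
        values.append([text, words])
    return {"fields": ["text", "bagofwords"], "values": values}

scoring_payload = build_scoring_payload([
    ("I hate this place, they are very incompetent",
     ["this", "place", "hate", "they", "incompetent"]),
])
print(json.dumps(scoring_payload, indent=2))
```

The resulting dictionary could then be passed to `client.deployments.score(scoring_endpoint, scoring_payload)` as in the code above. Whether this removes the `ClassCastException` depends on how the deployed pipeline declares its input columns.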

0 answers:

No answers yet.