How to perform lemmatization with POS tagging on a PySpark dataframe (without using pandas)
I am new to PySpark and am trying to lemmatize text using part-of-speech tags. My data is in tabular form, with one column containing text. I have already cleaned the text, but I cannot lemmatize the tokens using their POS tags.
Answer 0 (score: 0)
You may want to use the Spark NLP library, which ships with many pretrained models. For English, for example, you can do the following:
import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # start a Spark session with Spark NLP available

pipeline = PretrainedPipeline('explain_document_dl', lang='en')
annotations = pipeline.fullAnnotate("""French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.""")[0]
annotations.keys()
This performs several NLP tasks, including lemmatization and POS tagging (note the misspelled words in the sample text, which the spell-checking stage corrects):
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
| text| document| sentence| token| spell| lemmas| stems| pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|French author who...|[[document, 0, 23...|[[document, 0, 57...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, fr...|[[pos, 0, 5, JJ, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
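To run this over a PySpark dataframe column rather than a single string, you can call the pipeline's transform method on the dataframe itself. A minimal sketch, assuming your cleaned dataframe is named df and its text column is called "text" (the pretrained pipeline expects an input column with that name):

from pyspark.sql import functions as F

# df is assumed to be your cleaned DataFrame with a string column named "text"
result = pipeline.transform(df)

# Each annotator output is an array of annotation structs; the ".result" field holds the strings
result.select(
    F.col("token.result").alias("tokens"),
    F.col("lemmas.result").alias("lemmas"),
    F.col("pos.result").alias("pos_tags"),
).show(truncate=False)

The tokens, lemmas, and POS tags come back as parallel arrays per row, so you can zip them together downstream if you need (token, POS, lemma) triples.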