How to perform lemmatization with POS tagging in a PySpark DataFrame (without using pandas)

Asked: 2019-06-06 13:17:34

Tags: pyspark nltk lemmatization


I am new to PySpark and am trying to lemmatize text using part-of-speech (POS) tags. My data is in tabular format, with one column containing text. I have already cleaned the text, but I cannot work out how to lemmatize the tokens using their POS tags.

1 Answer:

Answer 0 (score: 0)

You may want to use the Spark NLP library, which ships with many pretrained models. For English, for example, you can do the following:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # start a Spark session with Spark NLP loaded

pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Misspellings such as "pioner" and "wrate" are deliberate: the pipeline's
# spell-checking stage corrects them.
annotations = pipeline.fullAnnotate("""French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.""")[0]

annotations.keys()

This performs several NLP tasks in one pass, including lemmatization and POS tagging:

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               spell|              lemmas|               stems|                 pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|French author who...|[[document, 0, 23...|[[document, 0, 57...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, fr...|[[pos, 0, 5, JJ, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
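If you prefer to stay with NLTK (as the question's tags suggest), a common alternative is to run `nltk.pos_tag` and `WordNetLemmatizer` inside a PySpark UDF. The lemmatizer expects WordNet's single-letter POS codes rather than the Penn Treebank tags that `nltk.pos_tag` returns, so a small mapping helper is needed. A minimal sketch (the helper name `penn_to_wordnet` is my own; the tag prefixes follow the standard Penn Treebank conventions):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag (as returned by nltk.pos_tag)
    to the single-letter POS code WordNetLemmatizer expects."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also the safe default

# Inside a PySpark UDF you would then call, per (word, tag) pair:
#   WordNetLemmatizer().lemmatize(word, penn_to_wordnet(tag))
```

Note that any NLTK corpora (wordnet, the POS tagger models) must be available on every Spark executor, which is one reason the pretrained Spark NLP pipeline above is often the simpler route.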