How to perform lemmatization with POS tagging in a PySpark DataFrame (without using pandas)

Asked: 2019-06-06 13:17:34

Tags: pyspark nltk lemmatization


I am new to PySpark and am trying to lemmatize text using part-of-speech (POS) tags. My data is in tabular format, with one column containing text. I have already cleaned the text, but I cannot work out how to lemmatize the tokens using their POS tags.

1 Answer:

Answer 0 (score: 0)

You may want to use the Spark NLP library, which ships with many pretrained models. For English, for example, you can do the following:

import sparknlp
from sparknlp.pretrained import PretrainedPipeline

spark = sparknlp.start()  # start a Spark session with Spark NLP loaded

pipeline = PretrainedPipeline('explain_document_dl', lang='en')

# Misspellings such as "pioner" and "wrate" are deliberate: the pipeline's
# spell-checking stage corrects them.
annotations = pipeline.fullAnnotate("""French author who helped pioner the science-fiction genre. Verne wrate about space, air, and underwater travel before navigable aircrast and practical submarines were invented, and before any means of space travel had been devised.""")[0]

annotations.keys()

This performs several NLP tasks in one pass, including lemmatization and POS tagging:

+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|                text|            document|            sentence|               token|               spell|              lemmas|               stems|                 pos|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
|French author who...|[[document, 0, 23...|[[document, 0, 57...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, Fr...|[[token, 0, 5, fr...|[[pos, 0, 5, JJ, ...|
+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+--------------------+
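If you prefer to stay with NLTK (as the question's tags suggest), a common alternative is to run `nltk.pos_tag` and `WordNetLemmatizer` inside a PySpark UDF. The lemmatizer expects WordNet's single-letter POS codes rather than the Penn Treebank tags that `nltk.pos_tag` returns, so a small mapping helper is needed. A minimal sketch (the helper name `penn_to_wordnet` is my own; the tag prefixes follow the standard Penn Treebank conventions):

```python
def penn_to_wordnet(tag):
    """Map a Penn Treebank POS tag (as returned by nltk.pos_tag)
    to the single-letter POS code WordNetLemmatizer expects."""
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun, also the safe default

# Inside a PySpark UDF you would then call, per (word, tag) pair:
#   WordNetLemmatizer().lemmatize(word, penn_to_wordnet(tag))
```

Note that any NLTK corpora (wordnet, the POS tagger models) must be available on every Spark executor, which is one reason the pretrained Spark NLP pipeline above is often the simpler route.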