Question

I am trying to write a simple Python script that imports a *.txt file and tokenizes it using the NLTK module.

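The challenge is that the full corpus must be tokenized, but each token must be at most 200 characters long. Is there a native function in the NLTK toolbox that achieves this?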

An example: tokenizing the first couple of paragraphs of "War and Peace" produces the following token, which is 303 characters long:

token = ["But I warn you, if you don't tell me that this means war, if you still try to defend the infamies and horrors perpetrated by that Antichrist--I really believe he is Antichrist--I will have nothing more to do with you and you are no longer my friend, no longer my 'faithful slave,' as you call yourself"]

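It still contains punctuation (commas, hyphens), and I could write a function that breaks the sentence up at these kinds of breakpoints. My question is whether NLTK (or another language parser?) already has native functionality that does this and handles the corner cases effectively.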

Answer 1

I'm not sure what you are trying to do, but if you only want to keep word tokens of at most 200 characters:

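import nltk

# Assumes the NLTK tokenizer data (e.g. "punkt") has already been downloaded.
with open('somefile.txt', 'r') as fp:
    # Keep only tokens of at most 200 characters; longer ones are dropped.
    tokenized_text = [word for word in nltk.tokenize.word_tokenize(fp.read()) if len(word) <= 200]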

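This keeps only the tokens that are 200 characters or shorter and discards the rest. If you need finer-grained control, you may want to look at regular expressions.

P.S.: Sorry if I misunderstood your question.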
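If what you actually need is to split long sentence tokens rather than drop them, here is a rough sketch of the "break at punctuation" idea from the question. It uses only the standard library (`re` and `textwrap`), not an NLTK feature. The function name `split_to_max_length` and the choice of breakpoints (commas, semicolons, and colons followed by whitespace) are my own assumptions, so adjust them to your corpus.

```python
import re
import textwrap


def split_to_max_length(text, max_len=200):
    """Split text into chunks of at most max_len characters.

    Hypothetical helper: prefers breaks after commas, semicolons, and colons,
    and falls back to plain whitespace wrapping when a piece is still too long.
    """
    # Break after punctuation that is followed by whitespace; the punctuation
    # stays attached to the preceding piece.
    pieces = re.split(r"(?<=[,;:])\s+", text.strip())

    chunks = []
    current = ""
    for piece in pieces:
        if not piece:
            continue
        candidate = f"{current} {piece}" if current else piece
        if len(candidate) <= max_len:
            # Greedily pack pieces together while they still fit.
            current = candidate
            continue
        if current:
            chunks.append(current)
            current = ""
        if len(piece) <= max_len:
            current = piece
        else:
            # No punctuation breakpoint helps here, so wrap on whitespace.
            # textwrap also breaks single words longer than max_len.
            wrapped = textwrap.wrap(piece, width=max_len, break_on_hyphens=False)
            chunks.extend(wrapped[:-1])
            current = wrapped[-1]
    if current:
        chunks.append(current)
    return chunks


sentence = (
    "But I warn you, if you don't tell me that this means war, if you still "
    "try to defend the infamies and horrors perpetrated by that "
    "Antichrist--I really believe he is Antichrist--I will have nothing more "
    "to do with you and you are no longer my friend, no longer my "
    "'faithful slave,' as you call yourself"
)

for chunk in split_to_max_length(sentence, max_len=200):
    print(len(chunk), chunk)
```

You could run this on each sentence returned by your tokenizer. Note that the whitespace fallback normalizes runs of whitespace to single spaces, which may or may not matter for your use case.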

NLTK / Python: Tokenizing text into tokens of a fixed maximum length
