Question

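I have developed a simple way to extract a passport number from spoken words (for example, the input 'one hundred thirty five thirty five zero zero' gives the output 1353500).

But how do I filter out irrelevant words such as "ok", "mhm", and so on?

For example, a human can say "ok it is 1353500", and the bot then extracts some meaningless number from "ok", "it", and "is", which is bad. The question is how to ignore those non-number words.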

Answer 1

These are basically stop words. To remove them, you need to download the NLTK stop-word corpus, which contains the English stop words:

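import nltk
from nltk.corpus import stopwords

nltk.download('stopwords')  # one-time download of the stop-word corpus
stop_words = set(stopwords.words('english'))
# Let's say data is a string that holds your sentence
data = 'ok it is 1353500'
# Keep only whole tokens that are not stop words; matching whole words
# avoids stripping "is" out of a longer word such as "this"
data = ' '.join(word for word in data.split() if word.lower() not in stop_words)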
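
A general-purpose stop-word list may not include fillers such as "ok" or "mhm", so another option is to do the opposite and keep only the tokens that look like numbers. The following is a minimal sketch of that idea, not part of NLTK: the `NUMBER_WORDS` vocabulary and the `keep_number_tokens` helper are hypothetical names, and the word list is an assumption that you would adjust to whatever your number parser understands.

```python
import re

# Hypothetical vocabulary of English number words (an assumption for this
# sketch); extend it to match the words your own parser accepts.
NUMBER_WORDS = {
    "zero", "one", "two", "three", "four", "five", "six", "seven", "eight",
    "nine", "ten", "eleven", "twelve", "thirteen", "fourteen", "fifteen",
    "sixteen", "seventeen", "eighteen", "nineteen", "twenty", "thirty",
    "forty", "fifty", "sixty", "seventy", "eighty", "ninety", "hundred",
    "thousand",
}


def keep_number_tokens(text):
    """Return only the tokens that are digit strings or known number words."""
    # Split into runs of letters or runs of digits, ignoring punctuation.
    tokens = re.findall(r"[a-z]+|\d+", text.lower())
    return [t for t in tokens if t.isdigit() or t in NUMBER_WORDS]
```

Calling `keep_number_tokens("ok it is 1353500")` leaves just the digit string, which can then be passed on to the existing number-extraction step. A whitelist like this drops any word it does not recognise, so it also discards misspelled or unexpected number words.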

NLP: ignoring irrelevant words
