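I am reading a news article and POS-tagging it with nltk. I want to remove the lines that do not contain a given POS tag, such as CD (cardinal number).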
import io
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk import pos_tag
stop_words = set(stopwords.words('english'))
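# Read the whole article into a single string; the file is closed automatically
with open("etorg.txt") as file1:
    line = file1.read()
print(line)
# Split on whitespace and tag each token with its part of speech
words = line.split()
tokens = nltk.pos_tag(words)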
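How can I remove all sentences that do not contain a CD tag?
Answer 0 (score: 0)
Just use [word for word in tokens if word[1] != 'CD'] to drop the tokens tagged CD from the tagged list.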
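To make clear what that comprehension does, here is a tiny sketch on hand-written (word, tag) pairs. The tags are typed in by hand for illustration and are an assumption, not real nltk.pos_tag output. Note that it filters individual tokens, not whole sentences.

```python
# Hand-written (word, tag) pairs standing in for nltk.pos_tag output;
# the tags are illustrative, not produced by the tagger.
tokens = [('MNC', 'NNP'), ('claims', 'VBZ'), ('21', 'CD'),
          ('million', 'CD'), ('sales', 'NNS')]

# Keep every pair whose tag is not CD
without_numbers = [word for word in tokens if word[1] != 'CD']
print(without_numbers)
```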
Edit: to get the sentences that contain no numbers, use the following code:
def has_number(sentence):
    # Returns '' if any token is tagged CD, otherwise the sentence unchanged
    for i in nltk.pos_tag(sentence.split()):
        if i[1] == 'CD':
            return ''
    return sentence
line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '
''.join([has_number(x) for x in line.split('.')])
> ' However, industry sources do not confirm this data '
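The question as written asks for the opposite selection (keep only the sentences that do contain a CD tag), so here is a rough sketch that handles both directions. The function name filter_sentences and its keep_with_cd and tagger parameters are my own, not part of nltk. Like the code above, it splits naively on '.', so it would break on decimals such as 3.5; that is a limitation of the sketch, not a recommendation.

```python
def filter_sentences(text, keep_with_cd=True, tagger=None):
    """Return the sentences of `text` that do (or do not) contain a CD token.

    `tagger` is a hypothetical hook: any callable that maps a list of words
    to (word, tag) pairs. It defaults to nltk.pos_tag.
    """
    if tagger is None:
        # Assumes nltk and its tagger data are installed
        import nltk
        tagger = nltk.pos_tag

    kept = []
    for sentence in text.split('.'):  # naive split, same as the answer above
        sentence = sentence.strip()
        if not sentence:
            continue
        has_cd = any(tag == 'CD' for _, tag in tagger(sentence.split()))
        if has_cd == keep_with_cd:
            kept.append(sentence + '.')
    return ' '.join(kept)
```

Calling filter_sentences(line) should keep only the sentences with a number in them, and filter_sentences(line, keep_with_cd=False) should keep only the ones without. Passing your own tagger is mainly useful for trying the logic without downloading the nltk models.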