Question

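I am reading news articles and using nltk for POS tagging. I want to remove the lines that do not contain a particular POS tag, such as CD (cardinal number).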

import nltk
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))

# Read the whole article into a single string
with open("etorg.txt") as file1:
    line = file1.read()
print(line)

# Split on whitespace and tag each token with its part of speech
words = line.split()
tokens = nltk.pos_tag(words)

How can I remove all sentences that do not contain a CD tag?

Answer 1

Just use [word for word in tokens if word[1] != 'CD'], which drops every (word, tag) pair tagged CD.
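
For illustration, here is a minimal sketch of that filter on a small list of (word, tag) pairs. The pairs are written by hand to mimic the shape of what nltk.pos_tag returns; they are an assumption for the example, not real tagger output.

```python
# Hand-written (word, tag) pairs in the same shape nltk.pos_tag returns.
# The tags are assumed for illustration, not produced by the tagger.
tokens = [
    ('MNC', 'NNP'),
    ('claims', 'VBZ'),
    ('21', 'CD'),
    ('million', 'CD'),
    ('sales', 'NNS'),
]

# Keep every pair whose tag is not CD
filtered = [word for word in tokens if word[1] != 'CD']
print(filtered)
```

Note that this removes individual tokens, not whole sentences.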

Edit: to get only the sentences that contain no numbers, use the following code:

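def has_number(sentence):
    # Return '' if any token in the sentence is tagged CD (a number),
    # otherwise return the sentence unchanged.
    for word, tag in nltk.pos_tag(sentence.split()):
        if tag == 'CD':
            return ''
    return sentence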

line = 'MNC claims 21 million sales in September. However, industry sources do not confirm this data. It is estimated that the reported sales could be in the range of fifteen to 18 million. '

''.join([has_number(x) for x in line.split('.')])

> ' However, industry sources do not confirm this data '
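
The question as asked wants the opposite filter: keep the sentences that do contain a CD tag and drop the rest. A rough sketch of that, under some assumptions: keep_sentences_with_tag and its tagger parameter are hypothetical names introduced here, sentences are split naively on '.', and digit_tagger is a toy stand-in so the example runs without the NLTK models. Leaving tagger unset falls back to nltk.pos_tag.

```python
def keep_sentences_with_tag(text, tag='CD', tagger=None):
    """Keep only the sentences in which at least one token carries `tag`."""
    if tagger is None:
        # Fall back to NLTK's tagger; this needs the tagger model downloaded.
        import nltk
        tagger = nltk.pos_tag
    # Naive sentence split on '.', dropping empty pieces
    sentences = [s.strip() for s in text.split('.') if s.strip()]
    kept = [s for s in sentences
            if any(t == tag for _, t in tagger(s.split()))]
    return ' '.join(s + '.' for s in kept)


def digit_tagger(words):
    # Toy stand-in (an assumption for this demo): digit-only tokens are CD
    return [(w, 'CD' if w.isdigit() else 'NN') for w in words]


text = ('MNC claims 21 million sales in September. '
        'However, industry sources do not confirm this data. '
        'Sales could be in the range of fifteen to 18 million.')
result = keep_sentences_with_tag(text, tagger=digit_tagger)
print(result)
```

With the real tagger, call keep_sentences_with_tag(text) instead. Splitting on '.' will break on abbreviations and decimals; nltk.sent_tokenize is probably a safer way to find sentence boundaries.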
