I want to extract all the fish mentioned on a Wikipedia page and print them (I copied the page content into a text file). I POS-tag the text and then use a chunker to extract the fish, but my output contains other, unwanted data. Here is the code I have implemented:
import nltk
from nltk.corpus import stopwords
from nltk.chunk.regexp import RegexpParser
# open the file and read its contents
fp = open('C:\\Temp\\fishdata.txt','r')
text = fp.read()
lemmatizer = nltk.WordNetLemmatizer()
stemmer = nltk.stem.porter.PorterStemmer()
sentence_re = r'''(?x) # set flag to allow verbose regexps
([A-Z])(\.[A-Z])+\.? # abbreviations, e.g. U.S.A.
| \w+(-\w+)* # words with optional internal hyphens
| \$?\d+(\.\d+)?%? # currency and percentages, e.g. $12.40, 82%
| \.\.\. # ellipsis
| [][.,;"'?():-_`] # these are separate tokens
'''
chunker = RegexpParser(r'''
NP:
{<NNP><'fish'>}
''')
stpwords = stopwords.words('english')
toks = nltk.regexp_tokenize(text, sentence_re)
postoks = nltk.tag.pos_tag(toks)
sent=chunker.parse(postoks)
print(sent)
The output I get:
wikipedia
armored
fish
ray-finned
fish
jelly
fish
constucutive
then
oragn
The output I need:
armored
fish
jelly
fish
bony
fish
The above is only a small part of the output, but I need output like the second listing. The input is the Wikipedia page - http://en.wikipedia.org/wiki/Fish, which I copied into a text file.
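One note on the grammar above: RegexpParser patterns match POS tags only, not literal words, so a rule like {<NNP><'fish'>} never matches the way you intend. A literal-word filter over the tagged pairs can be sketched in plain Python instead (the sample pairs below are illustrative, not taken from the article):

```python
def noun_before_fish(tagged):
    """Collect 'X fish' phrases where X is tagged as an adjective or noun."""
    phrases = []
    for (w1, t1), (w2, _t2) in zip(tagged, tagged[1:]):
        if w2.lower() == "fish" and t1 in ("JJ", "NN", "NNP"):
            phrases.append(w1 + " " + w2)
    return phrases

# Illustrative pre-tagged tokens, as pos_tag would produce them.
tagged = [("armored", "JJ"), ("fish", "NN"), ("swim", "VB"),
          ("bony", "JJ"), ("fish", "NN")]
print(noun_before_fish(tagged))  # -> ['armored fish', 'bony fish']
```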
Answer 0 (score: 0)
from nltk.corpus import wordnet as wn
fish_words = set()
fish_types = set()
for synset in wn.all_synsets():
    # words that contain 'fish' anywhere
    x = [name for name in synset.lemma_names() if "fish" in name]
    fish_words.update(x)
    # words that end with 'fish'
    y = [name for name in synset.lemma_names() if name.endswith("fish")]
    fish_types.update(y)
print(fish_types)
print([name.replace("_", " ")[:-4].strip() for name in fish_types])
I'm not sure exactly which kinds of fish you are looking for, but as long as you rely on WordNet, the approach above should give you all the fish words you need.
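If you just want to scan the copied article text directly, a lighter-weight alternative that needs no corpus download is to look for tokens ending in "fish" plus two-word phrases like "armored fish". This is a sketch using only the standard library; the sample sentence is illustrative, not taken from the Wikipedia page:

```python
import re

def find_fish(text):
    """Return words ending in 'fish' and two-word '<modifier> fish' phrases."""
    tokens = re.findall(r"[A-Za-z-]+", text.lower())
    hits = []
    for prev, tok in zip([""] + tokens, tokens):
        if tok == "fish" and prev:
            hits.append(prev + " fish")   # e.g. "armored fish"
        elif tok.endswith("fish") and tok != "fish":
            hits.append(tok)              # e.g. "jellyfish"
    return hits

sample = "Armored fish and jellyfish differ from bony fish."
print(find_fish(sample))  # -> ['armored fish', 'jellyfish', 'bony fish']
```

Note this will also pick up noise like "and fish"; combining it with the WordNet list above, or with a POS check on the modifier, would tighten the results.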