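I have started using NLTK to find proper nouns. However, I'm having trouble finding proper nouns (names of people and organizations) that contain lowercase prepositions.
For example,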
The David Eccles School of Business at the University of Utah
becomes (using my NLTK POS tagger):
David Eccles School, Business, University
Another example:
The United Nations Economic and Social Council's Economic Commission for Africa
becomes
United Nations Economic, Social Council, Economic Commission, Africa
Any suggestions?
One thing I'm considering is capitalizing all prepositions before tagging. My current code:
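import nltk

# x is the input text (here, the first example above);
# count tracks how many texts have been processed so far
x = "The David Eccles School of Business at the University of Utah"
count = 0

tokens2 = nltk.word_tokenize(x)
tags = nltk.pos_tag(tokens2)
res = nltk.ne_chunk(tags)
tree = []
# Collect PERSON chunks (in NLTK 3, t.label() replaces the older t.node)
for subtree in res.subtrees(filter=lambda t: t.label() == 'PERSON'):
    subtree_l = []
    for leaf in subtree.leaves():
        subtree_l.append(leaf[0])  # leaf is a (word, tag) pair
    sub = ' '.join(subtree_l)
    tree.append(sub)
# Collect ORGANIZATION chunks the same way
for subtree in res.subtrees(filter=lambda t: t.label() == 'ORGANIZATION'):
    subtree_l = []
    for leaf in subtree.leaves():
        subtree_l.append(leaf[0])
    sub = ' '.join(subtree_l)
    tree.append(sub)
x = ', '.join(tree)
count = count + 1
print(x)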
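
To make the capitalization idea concrete, here is a rough sketch of the preprocessing step I have in mind. It is pure Python and does not call NLTK. The `CONNECTORS` word list and the `capitalize_connectors` name are my own placeholders, and I have not verified that the chunker actually groups the names correctly after this step:

```python
# Hypothetical word list: an assumption for illustration, not an exhaustive set.
CONNECTORS = {"of", "at", "for", "and", "in", "on", "the"}


def capitalize_connectors(tokens):
    """Capitalize lowercase connector words that sit directly between
    two capitalized tokens, e.g. 'School of Business' -> 'School Of Business'.
    """
    out = list(tokens)
    for i in range(1, len(tokens) - 1):
        word = tokens[i]
        prev_is_cap = tokens[i - 1][:1].isupper()
        next_is_cap = tokens[i + 1][:1].isupper()
        if word in CONNECTORS and prev_is_cap and next_is_cap:
            out[i] = word.capitalize()
    return out


text = "The David Eccles School of Business at the University of Utah"
print(" ".join(capitalize_connectors(text.split())))
```

Only connectors flanked by capitalized words on both sides are changed, so "at the" in the first example stays lowercase and still separates the two names. The changed words would need to be lowercased again after chunking to recover the original spelling.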