How to cluster named entities using Stanford NER in Python

Asked: 2018-06-07 08:47:05

Tags: python nlp nltk stanford-nlp named-entity-recognition

Stanford NER provides an NER jar for detecting POS tags and named entities. However, I ran into a problem when parsing one sentence. The sentence is:

Joseph E. Seagram & Sons, INC said on Thursday that it is merging its two United States based wine companies

Here is my code:

from nltk.tag import StanfordNERTagger
from nltk.tokenize import word_tokenize
from nltk.tree import Tree

# stanfordNE2tree is a helper that is not shown in the question; judging by
# its use below, it turns the tagger's (token, tag) pairs into an nltk Tree.
st = StanfordNERTagger('./stanford-ner/classifiers/english.all.3class.distsim.crf.ser.gz',
                       './stanford-ner/stanford-ner.jar',
                       encoding='utf-8')
ne_in_sent = []
with open("./CCAT/2551newsML.txt") as fd:
    lines = fd.readlines()
    for line in lines:
        print(line)
        tokenized_text = word_tokenize(line)
        classified_text = st.tag(tokenized_text)
        ne_tree = stanfordNE2tree(classified_text)
        for subtree in ne_tree:
            # A Tree node is a named-entity chunk, i.e. NE != "O"
            if isinstance(subtree, Tree):
                ne_label = subtree.label()
                ne_string = " ".join(token for token, pos in subtree.leaves())
                ne_in_sent.append((ne_string, ne_label))
                print(ne_in_sent)

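When I parse it, I get the following two entities tagged as organizations: (Joseph E. Seagram & Sons, Organization) and (Inc, Organization).

The same thing happens with other text in the file, such as: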

TransCo has a very big plane. Transco is moving south.

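Because the capitalization differs, the two mentions are treated as different organizations, so I get 2 entities: (TransCo, Organization) and (Transco, Organization).

Is it possible to merge these into a single entity?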

1 answer:

Answer 0 (score: 0)

Use a cosine similarity check to measure how similar the entity strings are.

ref: Calculate cosine similarity given 2 sentence strings
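The suggestion above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the helper names `char_ngrams`, `cosine_similarity`, and `cluster_entities`, the use of character bigrams, and the 0.8 threshold are all assumptions chosen for this example. Each entity string is case-folded and turned into a bag of character bigrams, and an entity joins an existing cluster when it has the same label and its cosine similarity to that cluster's first member reaches the threshold.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Count character n-grams of a case-folded, whitespace-normalized string."""
    normalized = " ".join(text.casefold().split())
    if len(normalized) < n:
        return Counter([normalized]) if normalized else Counter()
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def cosine_similarity(vec_a, vec_b):
    """Cosine similarity between two sparse count vectors (Counters)."""
    shared = set(vec_a) & set(vec_b)
    dot = sum(vec_a[g] * vec_b[g] for g in shared)
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def cluster_entities(entities, threshold=0.8):
    """Greedily group (string, label) pairs.

    An entity joins the first cluster whose first member has the same label
    and a similarity of at least `threshold`; otherwise it starts a new cluster.
    The threshold is an assumed value and would need tuning on real data.
    """
    clusters = []
    for name, label in entities:
        vec = char_ngrams(name)
        for cluster in clusters:
            rep_name, rep_label = cluster[0]
            if rep_label == label and cosine_similarity(vec, char_ngrams(rep_name)) >= threshold:
                cluster.append((name, label))
                break
        else:
            clusters.append([(name, label)])
    return clusters


if __name__ == "__main__":
    ne_in_sent = [
        ("TransCo", "ORGANIZATION"),
        ("Transco", "ORGANIZATION"),
        ("Joseph E. Seagram & Sons", "ORGANIZATION"),
        ("Inc", "ORGANIZATION"),
    ]
    for cluster in cluster_entities(ne_in_sent):
        print(cluster)
```

With this sketch, "TransCo" and "Transco" end up in one cluster because they are identical after case-folding. It does not rejoin "Joseph E. Seagram & Sons" with "Inc": those strings share no bigrams, so that split would have to be handled when the tagged tokens are chunked rather than by string similarity.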