How to merge multi-word NER tags?

Time: 2019-05-14 10:17:17

Tags: python-3.x ner natural-language-processing allennlp

I am currently using allennlp for NER tagging.

Code:

from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("...path to model...")
sentence = "Top Gun was inspired by a newspaper article."
result = predictor.predict(sentence)
lang = {}
for word, tag in zip(result["words"], result["tags"]):
  if tag != "O":
    lang[word] = tag

Is there a parser that can merge the output below so that it returns "Top Gun" tagged as "WORK_OF_ART"?

{'Top': 'B-WORK_OF_ART', 'Gun': 'L-WORK_OF_ART'}

2 Answers:

Answer 0: (score: 1)

You can change the model path and try the following:

from allennlp.predictors.predictor import Predictor
predictor = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/ner-model-2018.12.18.tar.gz") # change model path
sentence = "Did Uriah honestly think he could beat The Legend of Zelda in under three hours?"
result = predictor.predict(sentence)

lang = {}

completeWord = ""

for word, tag in zip(result["words"], result["tags"]):
    if(tag.startswith("B")):
        completeWord = completeWord + " " +word
    elif(tag.startswith("I")):  # inner token of a multi-word entity
        completeWord = completeWord + " " +word
    elif(tag.startswith("L")):
        completeWord = completeWord + " " +word
        lang[completeWord] = tag.split("-")[1]
        completeWord = ""
    else:
        lang[word] = tag

print(lang)

>>>{' The Legend of Zelda': 'MISC',
 '?': 'O',
 'Did': 'O',
 'Uriah': 'U-PER',
 'beat': 'O',
 'could': 'O',
 'he': 'O',
 'honestly': 'O',
 'hours': 'O',
 'in': 'O',
 'think': 'O',
 'three': 'O',
 'under': 'O'}
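The output above keeps a leading space on the merged name and stores the single-token entity with its raw tag (`'Uriah': 'U-PER'`). Below is a minimal sketch of a stricter merge, assuming BILOU tags. `merge_bilou` is a hypothetical helper, not an AllenNLP function. The `words` and `tags` lists are hand-written stand-ins for `result["words"]` and `result["tags"]` (the `I-MISC` tags are assumed), so it runs without downloading a model.

```python
def merge_bilou(words, tags):
    """Merge BILOU-tagged tokens into {entity text: entity type}."""
    entities = {}
    current = []  # tokens of the multi-word entity being built
    for word, tag in zip(words, tags):
        if tag == "O":
            current = []
            continue
        position, label = tag.split("-", 1)
        if position == "U":    # single-token entity
            entities[word] = label
            current = []
        elif position == "B":  # first token: start a new entity
            current = [word]
        elif position == "I":  # inner token: extend the entity
            current.append(word)
        elif position == "L":  # last token: close and store the entity
            current.append(word)
            entities[" ".join(current)] = label
            current = []
    return entities


# Hand-written stand-in for predictor output (assumed tags).
words = ["Did", "Uriah", "honestly", "think", "he", "could", "beat",
         "The", "Legend", "of", "Zelda", "in", "under", "three", "hours", "?"]
tags = ["O", "U-PER", "O", "O", "O", "O", "O",
        "B-MISC", "I-MISC", "I-MISC", "L-MISC", "O", "O", "O", "O", "O"]

merged = merge_bilou(words, tags)
print(merged)
```

Because the result is a dict keyed by entity text, a name that occurs twice is stored once; use a list instead if every mention matters.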

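If this helps, please mark it as accepted.

Answer 1: (score: 0)

  1. This repository lists the download paths for all AllenNLP models; you can download whichever you need. Click here

  2. Download the pre-trained AllenNLP NER model from the following path. Click here

  3. Install allennlp and allennlp-models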

    pip install allennlp

    pip install allennlp-models

  4. Import the required AllenNLP modules

    import allennlp

    from allennlp.predictors.predictor import Predictor

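  5. Load the predictor. The archive below is named as an SRL model, while the tagged output further down is NER output, so substitute the NER model path from step 2:

    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")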

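  6. Predictor.predict takes a piece of text, finds the named entities in it (for example a person's name, a location, or a landmark), and assigns each to a predefined category. The result is a dictionary with the keys words, tags, mask, and logits, so it can be consumed directly from Python code.

  7. The BILOU scheme (I expect AllenNLP uses the BILOU scheme)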

    | Tag   | Meaning                              |
    |-------|--------------------------------------|
    | BEGIN | The first token of a multi-token entity |
    | IN    | An inner token of a multi-token entity  |
    | LAST  | The final token of a multi-token entity |
    | UNIT  | A single-token entity                |
    | OUT   | A non-entity token                   |
    

Click here
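To make the scheme concrete, here is a small sketch that splits a tag string into its position and entity type. `describe_tag` and `POSITIONS` are hypothetical names, not part of AllenNLP; the `<position>-<type>` tag format is taken from the outputs shown in this thread.

```python
# Maps the one-letter BILOU prefix to the names used in the table.
POSITIONS = {"B": "BEGIN", "I": "IN", "L": "LAST", "U": "UNIT", "O": "OUT"}


def describe_tag(tag):
    """Split a BILOU tag such as 'B-LOC' into (position name, entity type)."""
    if tag == "O":
        return ("OUT", None)  # non-entity tokens carry no type
    prefix, entity_type = tag.split("-", 1)
    return (POSITIONS[prefix], entity_type)


sample_tags = ["B-LOC", "I-LOC", "L-LOC", "O", "U-PER"]
described = [describe_tag(t) for t in sample_tags]
print(described)
```

Splitting with `split("-", 1)` keeps the entity type whole even if a label itself contains a hyphen.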

Input

Import the required packages

    import allennlp
    from allennlp.predictors.predictor import Predictor
    predictor = Predictor.from_path("https://storage.googleapis.com/allennlp-public-models/bert-base-srl-2020.09.03.tar.gz")
      

    document = """The U.S. is a country of 50 states covering a vast swath of North America, with Alaska in the northwest and Hawaii extending the nation’s presence into the Pacific Ocean. Major Atlantic Coast cities are New York, a global finance and culture center, and capital Washington, DC. Midwestern metropolis Chicago is known for influential architecture and on the west coast, Los Angeles' Hollywood is famed for filmmaking"""


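    ####### Convert Entities ##########
    def convert_results(allen_results):
        """Collect (text, type) pairs from BILOU-tagged tokens."""
        ents = set()
        w = ""  # text of the multi-token entity being built
        for word, tag in zip(allen_results["words"], allen_results["tags"]):
            if tag != "O":
                # "B-LOC" -> position "B", type "LOC"
                ent_position, ent_type = tag.split("-", 1)
                if ent_position == "U":
                    # single-token entity
                    ents.add((word, ent_type))
                else:
                    if ent_position == "B":
                        w = word
                    elif ent_position == "I":
                        w += " " + word
                    elif ent_position == "L":
                        w += " " + word
                    # runs on every B/I/L token, so prefixes such as
                    # ("New", "LOC") are stored next to ("New York", "LOC")
                    ents.add((w, ent_type))
        return ents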
    

    def allennlp_ner(document):
        return convert_results(predictor.predict(sentence=document))

    results = predictor.predict(sentence=document)
    
    [tuple(i) for i in zip(results["words"],results["tags"])]

    ##Output##
    [('The', 'O'),
    ('U.S.', 'U-LOC'),
    ('is', 'O'),
    ('a', 'O'),
    ('country', 'O'),
    ('of', 'O'),
    ('50', 'O'),
    ('states', 'O'),
    ('covering', 'O'),
    ('a', 'O'),
    ('vast', 'O'),
    ('swath', 'O'),
    ('of', 'O'),
    ('North', 'B-LOC'),
    ('America', 'L-LOC'),
    (',', 'O'),
    ('with', 'O'),
    ('Alaska', 'U-LOC'),
    ('in', 'O'),
    ('the', 'O'),
    ('northwest', 'O'),
    ('and', 'O'),
    ('Hawaii', 'U-LOC'),
    ('extending', 'O'),
    ('the', 'O'),
    ('nation', 'O'),
    ('’s', 'O'),
    ('presence', 'O'),
    ('into', 'O'),
    ('the', 'O'),
    ('Pacific', 'B-LOC'),
    ('Ocean', 'L-LOC'),
    ('.', 'O'),
    ('Major', 'B-LOC'),
    ('Atlantic', 'I-LOC'),
    ('Coast', 'L-LOC'),
    ('cities', 'O'),
    ('are', 'O'),
    ('New', 'B-LOC'),
    ('York', 'L-LOC'),
    (',', 'O'),
    ('a', 'O'),
    ('global', 'O'),
    ('finance', 'O'),
    ('and', 'O'),
    ('culture', 'O'),
    ('center', 'O'),
    (',', 'O'),
    ('and', 'O'),
    ('capital', 'O'),
    ('Washington', 'U-LOC'),
    (',', 'O'),
    ('DC', 'U-LOC'),
    ('.', 'O'),
    ('Midwestern', 'U-MISC'),
    ('metropolis', 'O'),
    ('Chicago', 'U-LOC'),
    ('is', 'O'),
    ('known', 'O'),
    ('for', 'O'),
    ('influential', 'O'),
    ('architecture', 'O'),
    ('and', 'O'),
    ('on', 'O'),
    ('the', 'O'),
    ('west', 'O'),
    ('coast', 'O'),
    (',', 'O'),
    ('Los', 'B-LOC'),
    ('Angeles', 'L-LOC'),
    ("'", 'O'),
    ('Hollywood', 'U-LOC'),
    ('is', 'O'),
    ('famed', 'O'),
    ('for', 'O'),
    ('filmmaking', 'O')]

    # Merging Multiword NER Tags using convert_results
    allennlp_ner(document)
    
    # The output looks like this:

    {('Alaska', 'LOC'),
    ('Chicago', 'LOC'),
    ('DC', 'LOC'),
    ('Hawaii', 'LOC'),
    ('Hollywood', 'LOC'),
    ('Los', 'LOC'),
    ('Los Angeles', 'LOC'),
    ('Major', 'LOC'),
    ('Major Atlantic', 'LOC'),
    ('Major Atlantic Coast', 'LOC'),
    ('Midwestern', 'MISC'),
    ('New', 'LOC'),
    ('New York', 'LOC'),
    ('North', 'LOC'),
    ('North America', 'LOC'),
    ('Pacific', 'LOC'),
    ('Pacific Ocean', 'LOC'),
    ('U.S.', 'LOC'),
    ('Washington', 'LOC')}
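Note that the set above also contains prefixes such as `('Los', 'LOC')` and `('Major Atlantic', 'LOC')`, because `convert_results` adds the running text on every B/I/L token. If only complete entities are wanted, one possible variant records a span only when the L (or U) tag is reached. This is a sketch, not the answer's original code: `complete_entities` is a hypothetical helper, and `sample` is a hand-written dict built from tokens in the output above, so it runs without the model.

```python
def complete_entities(result):
    """Return (text, type, start, end) for each complete BILOU entity.

    `result` is a dict with parallel "words" and "tags" lists.
    `start` and `end` are token indices; `end` is inclusive.
    """
    spans = []
    start = None  # index of the open B- token, if any
    for i, tag in enumerate(result["tags"]):
        if tag == "O":
            start = None
            continue
        position, ent_type = tag.split("-", 1)
        if position == "U":
            spans.append((result["words"][i], ent_type, i, i))
            start = None
        elif position == "B":
            start = i
        elif position == "L" and start is not None:
            text = " ".join(result["words"][start:i + 1])
            spans.append((text, ent_type, start, i))
            start = None
        # "I" tokens need no action: they lie between the B and L tokens
    return spans


# Hand-written sample (assumption), not real predictor output.
sample = {
    "words": ["Major", "Atlantic", "Coast", "cities", "are", "New", "York", ",", "DC"],
    "tags": ["B-LOC", "I-LOC", "L-LOC", "O", "O", "B-LOC", "L-LOC", "O", "U-LOC"],
}
spans = complete_entities(sample)
print(spans)
```

Returning a list with token indices, rather than a set, keeps the entities in document order and preserves repeated mentions.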