Question

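I want to extract all mentions of countries and nationalities from a text using NLTK. I used POS tagging and kept every token labeled GPE, but the results are not satisfactory.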

import nltk

abstract = "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital inflammation that can lead to disfigurement and blindness. Multiple genetic loci have been associated with Graves' disease, but the genetic basis for TO is largely unknown. This study aimed to identify loci associated with TO in individuals with Graves' disease, using a genome-wide association scan (GWAS) for the first time to our knowledge in TO.Genome-wide association scan was performed on pooled DNA from an Australian Caucasian discovery cohort of 265 participants with Graves' disease and TO (cases) and 147 patients with Graves' disease without TO (controls). "

# Tokenize, POS-tag, then run NLTK's named-entity chunker
sent = nltk.tokenize.wordpunct_tokenize(abstract)
pos_tag = nltk.pos_tag(sent)
nes = nltk.ne_chunk(pos_tag)
places = []
for ne in nes:
    # Named entities come back as subtrees; plain tokens are (word, tag) tuples
    if type(ne) is nltk.tree.Tree:
        if ne.label() == 'GPE':
            places.append(u' '.join([i[0] for i in ne.leaves()]))
# Fall back to a placeholder only after the whole text has been scanned
if len(places) == 0:
    places.append("N/A")

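The result I get is:

['Thyroid', 'Australian', 'Caucasian', 'Graves']

Some of these are nationalities, but others are just nouns.

So what am I doing wrong, or is there another way to extract this information?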

Answer 1

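So, after the fruitful comments, I dug into different NER tools to find the one that best recognizes mentions of nationalities and countries, and found that spaCy has a NORP entity type that extracts nationalities efficiently. https://spacy.io/docs/usage/entity-recognition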
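
A minimal sketch of that approach, assuming spaCy and its en_core_web_sm model are installed (the answer does not say which model was used). The helper names filter_entities and extract_with_spacy are made up for illustration. In spaCy's English models, NORP covers nationalities and religious or political groups, and GPE covers countries, cities and states.

```python
# Entity labels of interest: nationalities/groups and geopolitical entities
TARGET_LABELS = frozenset({"NORP", "GPE"})


def filter_entities(ents, labels=TARGET_LABELS):
    """Return (text, label) pairs for entities whose label is in `labels`.

    `ents` is any iterable of objects with .text and .label_ attributes,
    such as the `doc.ents` of a spaCy Doc.
    """
    return [(ent.text, ent.label_) for ent in ents if ent.label_ in labels]


def extract_with_spacy(text, model="en_core_web_sm"):
    """Run a spaCy pipeline over `text` and keep only NORP/GPE entities."""
    import spacy  # imported lazily so filter_entities works without spaCy

    nlp = spacy.load(model)  # assumes the model has been downloaded
    return filter_entities(nlp(text).ents)
```

Calling extract_with_spacy(abstract) on the abstract from the question returns only the entities the model tagged as NORP or GPE. What it finds depends on the model version, so no output is shown here.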

Answer 2

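If you want to extract country names, what you need is an NER tagger, not a POS tagger.

Named-entity recognition (NER) is a subtask of information extraction that seeks to locate and classify elements in text into predefined categories such as persons, organizations, locations, time expressions, quantities, monetary values, percentages, etc.

Check out the Stanford NER tagger!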

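# NERTagger was renamed StanfordNERTagger in NLTK 3.x
from nltk.tag.stanford import StanfordNERTagger

# Paths to the Stanford NER model and jar (adjust to your install); needs Java
st = StanfordNERTagger('../ner-model.ser.gz', '../stanford-ner.jar')
tagging = st.tag(text.split())  # text is the input string to tag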
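
The tagger returns a flat list of (token, tag) pairs, so multi-word names still have to be stitched together. Below is a rough sketch of one way to do that, assuming a model that emits LOCATION tags (as the 3-class English model does). group_entities is a hypothetical helper, and the sample list is hand-written to mimic the shape of st.tag() output; it is not real tagger output.

```python
def group_entities(tagged, label="LOCATION"):
    """Merge runs of consecutive tokens tagged `label` into phrases."""
    phrases, current = [], []
    for token, tag in tagged:
        if tag == label:
            current.append(token)
        elif current:
            # A non-matching tag closes the phrase collected so far
            phrases.append(" ".join(current))
            current = []
    if current:
        phrases.append(" ".join(current))
    return phrases


# Hand-written sample in the (token, tag) shape that st.tag() returns
sample = [("She", "O"), ("moved", "O"), ("from", "O"),
          ("New", "LOCATION"), ("Zealand", "LOCATION"),
          ("to", "O"), ("Australia", "LOCATION")]
print(group_entities(sample))  # ['New Zealand', 'Australia']
```

Note that LOCATION covers places in general, so nationalities such as "Australian" would likely need a different label or a separate lookup.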

Answer 3

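Here is geograpy, which uses NLTK to perform the entity extraction. It stores all places and locations as a gazetteer, then looks the extracted entities up in that gazetteer to get the relevant places and locations. See the docs for more usage details -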

from geograpy import extraction

# Adjacent string literals are concatenated, so the abstract can span lines
e = extraction.Extractor(text=(
    "Thyroid-associated orbitopathy (TO) is an autoimmune-mediated orbital "
    "inflammation that can lead to disfigurement and blindness. Multiple "
    "genetic loci have been associated with Graves' disease, but the genetic "
    "basis for TO is largely unknown. This study aimed to identify loci "
    "associated with TO in individuals with Graves' disease, using a "
    "genome-wide association scan (GWAS) for the first time to our knowledge "
    "in TO.Genome-wide association scan was performed on pooled DNA from an "
    "Australian Caucasian discovery cohort of 265 participants with Graves' "
    "disease and TO (cases) and 147 patients with Graves' disease without TO "
    "(controls)."
))

e.find_entities()
print(e.places)  # places is a list attribute filled by find_entities()
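
To make the gazetteer idea concrete, here is a toy sketch of a lookup step. It is not geograpy's implementation: the COUNTRIES and NATIONALITIES tables are tiny hypothetical samples, lookup_places is a made-up name, and only single-word mentions are handled.

```python
import re

# Tiny hypothetical gazetteer; a real one would list every country
# together with its demonym(s).
COUNTRIES = {"australia", "france", "japan"}
NATIONALITIES = {"australian": "Australia", "french": "France",
                 "japanese": "Japan"}


def lookup_places(text):
    """Return (mention, kind, country) for each gazetteer hit, in text order."""
    hits = []
    for word in re.findall(r"[A-Za-z]+", text):
        key = word.lower()
        if key in COUNTRIES:
            hits.append((word, "country", key.title()))
        elif key in NATIONALITIES:
            hits.append((word, "nationality", NATIONALITIES[key]))
    return hits


print(lookup_places("an Australian Caucasian discovery cohort"))
# [('Australian', 'nationality', 'Australia')]
```

A lookup like this cannot mislabel "Thyroid" or "Graves" as places, but it also misses anything absent from the table, which is why it is usually combined with an NER step rather than used alone.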

Answer 4

You can use spaCy for NER. It gives better results than NLTK.

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u"Apple is opening its first big office in San Francisco and California.")
print([(ent.text, ent.label_) for ent in doc.ents])
