I have a function that uses Stanford NER to return the named entities in a given body of text.
def get_named_entities(text):
    load_ner_files()
    print text[:100]  # to show that the text is fine
    text_split = text.split()
    print text_split  # to show the split is working fine
    result = "named entities = ", st.tag(text_split)
    return result
I am loading the text from a URL with the newspaper Python package.
import unicodedata

from newspaper import Article

def get_page_text():
    url = "https://aeon.co/essays/elon-musk-puts-his-case-for-a-multi-planet-civilisation"
    page = Article(url)
    page.download()
    page.parse()
    # Flatten the text down to plain ASCII.
    return unicodedata.normalize('NFKD', page.text).encode('ascii', 'ignore')
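
The two functions are presumably combined along these lines (the call site is not shown above, so this is an assumed sketch):

text = get_page_text()
print get_named_entities(text)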
However, when I run the function, I get the following output:
['Fuck', 'Earth!', 'Elon', 'Musk', 'said', 'to', 'me,', 'laughing.', 'Who', 'cares', 'about', 'Earth?'......... (continued)
named entities = [('Fuck', 'O'), ('Earth', 'O'), ('!', 'O')]
So my question is: why are only the first three words being tagged?
Answer 0 (score: 1)

TL;DR:

pip install -U nltk

or

conda update nltk

Assuming NLTK v3.2 is set up correctly, and after setting up NLTK and the Stanford tools (remember to set the environment variables):
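
NLTK's Stanford wrappers locate the jar and the model files through the CLASSPATH and STANFORD_MODELS environment variables. A minimal sketch of setting them from Python; the paths are assumptions, point them at wherever you unpacked the Stanford NER distribution:

import os

# Assumed install locations -- not from the original answer; adjust to
# your own unpacked stanford-ner directory.
os.environ['CLASSPATH'] = '/usr/local/stanford-ner/stanford-ner.jar'
os.environ['STANFORD_MODELS'] = '/usr/local/stanford-ner/classifiers'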
import time
import urllib.request
from itertools import chain

from bs4 import BeautifulSoup

from nltk import word_tokenize, sent_tokenize
from nltk.tag import StanfordNERTagger


class Article:
    def __init__(self, url, encoding='utf8'):
        self.url = url
        self.encoding = encoding
        self.text = self.fetch_url_text()
        self.process_text()

    def fetch_url_text(self):
        # Fetch the page and keep only the <p> paragraphs.
        response = urllib.request.urlopen(self.url)
        self.data = response.read().decode(self.encoding)
        self.bsoup = BeautifulSoup(self.data, 'html.parser')
        return '\n'.join([paragraph.text for paragraph
                          in self.bsoup.find_all('p')])

    def process_text(self):
        # Paragraphs -> sentences -> word tokens.
        self.paragraphs = [sent_tokenize(p.strip())
                           for p in self.text.split('\n') if p]
        _sents = list(chain(*self.paragraphs))
        self.sents = [word_tokenize(sent) for sent in _sents]
        self.words = list(chain(*self.sents))


url = 'https://aeon.co/essays/elon-musk-puts-his-case-for-a-multi-planet-civilisation'
a1 = Article(url)
three_sentences = a1.sents[20:23]

st = StanfordNERTagger('english.all.3class.distsim.crf.ser.gz')

# Tag multiple sentences at one go.
start = time.time()
tagged_sents = st.tag_sents(three_sentences)
print("Tagging took:", time.time() - start)
print(tagged_sents, end="\n\n")
for sent in tagged_sents:
    print(sent)
    print()

# (Much slower) Tagging sentences one at a time, so
# Stanford NER is refired every time.
start = time.time()
tagged_sents = [st.tag(sent) for sent in three_sentences]
print("Tagging took:", time.time() - start)
for sent in tagged_sents:
    print(sent)
    print()
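
Coming back to the question's original goal of returning just the named entities: a small follow-up sketch (not part of the original answer) that drops every token tagged 'O', i.e. outside any entity:

named_entities = [(token, tag)
                  for sent in tagged_sents
                  for token, tag in sent
                  if tag != 'O']
print(named_entities)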