Encoding problem when using spaCy

Time: 2017-05-05 16:31:08

Tags: python python-2.7 beautifulsoup spacy

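I have code that uses BeautifulSoup to extract the text from a URL and then uses spaCy to extract all person names. The code works well until it meets characters such as £, which spaCy treats as people. For example, this line: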

  

"Laying out his party's Brexit position, he claimed that leaving the EU would damage the UK economy by £59bn."

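gives me '\xc2\xa359bn' as a name. I tried to fix the encoding using various suggestions found here, without success (I have left those attempts commented out in the code). Any help would be much appreciated!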
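For reference, the value reported above is the UTF-8 byte sequence for the text £59bn, which is how Python 2 displays a byte string (`str`) when it is printed inside a list. A minimal sketch that does not depend on spaCy, only meant to illustrate where the escape sequence comes from:

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

# The "name" reported in the question, written as a byte string.
raw = b'\xc2\xa359bn'

# 0xC2 0xA3 is the UTF-8 encoding of U+00A3 (POUND SIGN),
# so decoding as UTF-8 recovers the original text.
decoded = raw.decode('utf-8')

print(decoded == '\u00a359bn')  # True
print(len(raw), len(decoded))   # 6 bytes versus 5 characters
```

This suggests the text itself is not corrupted: the token is intact and only its byte representation is being shown.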

My code:

# __future__ imports must be the first statement in a module
from __future__ import unicode_literals
import urllib
import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English
nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
#    soup = BeautifulSoup(r.content.decode('utf-8','ignore'))

    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()
    article_soup = [s.get_text(separator="\n", strip=True) for s in soup.find_all( 'div', {'class': 'story-body__inner'})]
    text = ''.join(article_soup)
    return text

def get_person(all_tags):
    person_list=[]
    for ent in all_tags.ents:
        if ent.label_=="PERSON":
            person_list.append(str(ent))
    return person_list

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text=get_text(url)
    text=u"{}".format(text)
#    text = text.decode('cp1251')  # decode from cp1251 byte (str) string to unicode string
#    text = text.encode('utf-8')
    print text
    all_tags = nlp(text)
    names = get_person(all_tags)
    print names      

if __name__ == '__main__':
    main()
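One possible workaround, offered as an assumption and not a confirmed fix: in Python 2, `str(ent)` appears to produce a UTF-8 byte string, so keeping the entity as unicode (for example via spaCy's `ent.text`) should avoid the escaped bytes in the printed list. Separately, PERSON candidates that contain digits or currency symbols could be discarded after the fact. The sketch below shows only that second step on plain unicode strings. The helper names `looks_like_person_name` and `filter_person_names` are hypothetical, the sample candidates are invented for illustration, and the filter is a heuristic that will not catch every false positive.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function, unicode_literals

import unicodedata


def looks_like_person_name(text):
    """Heuristic: reject candidates containing digits or currency symbols.

    Expects a unicode string (not a byte string).
    """
    for ch in text:
        # 'Nd' = decimal digit, 'Sc' = currency symbol (e.g. the pound sign).
        if unicodedata.category(ch) in ('Nd', 'Sc'):
            return False
    # Require at least one letter so pure punctuation is rejected too.
    return any(unicodedata.category(ch).startswith('L') for ch in text)


def filter_person_names(candidates):
    """Keep only the candidates that pass the heuristic above."""
    return [c for c in candidates if looks_like_person_name(c)]


# Hypothetical candidates standing in for the text of PERSON entities.
candidates = ['Jane Doe', '\u00a359bn', 'John Smith']
names = filter_person_names(candidates)
print(names)
```

In the question's code, such a filter would be applied to the list built in `get_person`, assuming that list holds unicode strings.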

0 answers:

No answers yet