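I have code that uses BeautifulSoup to extract the text from a URL and then uses spaCy to extract all person names. The code works well until it meets characters such as £, which spaCy takes to be a person. For example, this line:
"Setting out his party's Brexit position, he claimed that leaving the EU would damage the UK economy by £59bn."
gives me '\xc2\xa359bn'
as a name. I tried the different suggestions found here for fixing the encoding, but without success (I have left them ## commented out in the code). Any help is much appreciated!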
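For reference, '\xc2\xa359bn' appears to be the UTF-8 byte encoding of '£59bn' shown in escaped form, which is how Python 2 prints byte strings inside a list. The sketch below uses Python 3 syntax (the code in this question is Python 2), and the variable names are made up for illustration:

```python
# '\xc2\xa3' is the two-byte UTF-8 encoding of the pound sign (U+00A3).
pound = "\u00a3"
entity_text = pound + "59bn"               # the readable text, i.e. the amount in the quoted sentence

raw_bytes = entity_text.encode("utf-8")    # the bytes behind the escaped output
print(raw_bytes)                           # escaped byte representation
print(raw_bytes.decode("utf-8"))           # decoding gives the readable text back
```

So the escaped output is probably a display issue on top of the separate problem that the amount is tagged as PERSON in the first place.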
My code:
# __future__ imports must come before any other statement in the file
from __future__ import unicode_literals

import urllib

import requests
from bs4 import BeautifulSoup
import spacy
from spacy.en import English

nlp_toolkit = English()
nlp = spacy.load('en')

def get_text(url):
    r = requests.get(url)
    soup = BeautifulSoup(r.content, "lxml")
    # soup = BeautifulSoup(r.content.decode('utf-8','ignore'))
    # delete unwanted tags:
    for s in soup(['figure', 'script', 'style']):
        s.decompose()
    # collect the text of every article body container
    article_soup = [s.get_text(separator="\n", strip=True)
                    for s in soup.find_all('div', {'class': 'story-body__inner'})]
    text = ''.join(article_soup)
    return text

def get_person(all_tags):
    # keep only the entities that spaCy labels as PERSON
    person_list = []
    for ent in all_tags.ents:
        if ent.label_ == "PERSON":
            person_list.append(str(ent))
    return person_list

def main():
    url = "http://www.bbc.co.uk/news/uk-politics-39784164"
    text = get_text(url)
    text = u"{}".format(text)
    # text = text.decode('cp1251')  # decode from cp1251 byte (str) string to unicode string
    # text = text.encode('utf-8')
    print text
    all_tags = nlp(text)
    names = get_person(all_tags)
    print names


if __name__ == '__main__':
    main()
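
One possible workaround, independent of the encoding question, is to post-filter the PERSON candidates and drop any string that contains a digit or a currency symbol. This is only a sketch in Python 3 syntax: `looks_like_name` is a hypothetical helper, not part of spaCy, and the sample strings are invented rather than taken from the article.

```python
import unicodedata


def looks_like_name(candidate):
    """Hypothetical post-filter: reject strings with digits or currency symbols."""
    for ch in candidate:
        # Unicode category "Sc" covers currency symbols such as the pound and dollar signs.
        if ch.isdigit() or unicodedata.category(ch) == "Sc":
            return False
    # Require at least one letter so empty or punctuation-only strings are rejected.
    return any(ch.isalpha() for ch in candidate)


# Invented sample strings standing in for the list returned by get_person().
candidates = ["Jane Doe", "\u00a359bn", "John Smith", "$5"]
names = [c for c in candidates if looks_like_name(c)]
print(names)
```

A filter like this does not stop the model from mislabelling amounts; it only hides those results, so it may also discard unusual but legitimate names that contain digits.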