I'm curious about accurate ways of pulling people's and organizations' names out of text. I want to map alliance networks based on partnerships and the like mentioned in the text.
I've tried a couple of approaches: • Using the NLTK POS tagger, which worked but was far too slow, so I gave up on it. • Using a regex that matches runs of consecutive words starting with a capital letter. However, this led to a lot of exceptions and captures, many of which were not very relevant (e.g. when someone randomly capitalizes "Social Innovation Award"). It also misses single-word names. A sketch of that regex approach is shown below.
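For reference, a rough sketch of the kind of regex I mean (the pattern and sample output here are illustrative, not the exact code I ran):

import re

# Capture runs of two or more consecutive capitalized words.
# This is exactly what over-captures phrases like "Social Innovation Award"
# and misses single-word names such as "BCG".
CAP_RUN = re.compile(r'\b[A-Z][\w&.-]*(?:\s+[A-Z][\w&.-]*)+\b')

sample = ("Hosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), "
          "the Centre for Social Innovation and the Toronto City Summit Alliance")
print(CAP_RUN.findall(sample))
# ['MaRS Centre', 'The Boston Consulting Group', 'Social Innovation', 'Toronto City Summit Alliance']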
Does anyone have any other ideas?
Example text:
obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha
Piper\r\n\r\nThe award was presented during the closing dinner of the Social
Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event
gathered\r\nover 250 business, academic and social thought leaders from the
social\r\nentrepreneurship sector in Canada who had convened for a full day of
inspiration\r\nand engagement on ways to address some of the most pressing issues of our
times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead
an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas,
products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by
MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit
Answer (score: 3)
First, clean up your data:
>>> text = """obin Cardozo\r\n\r\nEd Greenspon\r\n\r\nFarouk Jiwa\r\n\r\nDavid Pecaut\r\n\r\nMartha Piper\r\n\r\nThe award was presented during the closing dinner of the Social Entrepreneurship\r\nSummit held at MaRS Centre for Social Innovation in Toronto. The event gathered\r\nover 250 business, academic and social thought leaders from the social\r\nentrepreneurship sector in Canada who had convened for a full day of inspiration\r\nand engagement on ways to address some of the most pressing issues of our times.\r\n\r\nAn often under-recognized community, social entrepreneurs create and lead an\r\norganization that are aimed at catalyzing systemic social change through new\r\nideas, products, services, methodologies and changes in attitude.\r\n\r\nHosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), the\r\nCentre for Social Innovation and the Toronto City Summit Alliance, the Social\r\nEntrepreneurship Summit"""
>>> text = [i.replace('\r\n','').strip() for i in text.split('\r\n\r')]
>>> text
['obin Cardozo', 'Ed Greenspon', 'Farouk Jiwa', 'David Pecaut', 'Martha Piper', 'The award was presented during the closing dinner of the Social EntrepreneurshipSummit held at MaRS Centre for Social Innovation in Toronto. The event gatheredover 250 business, academic and social thought leaders from the socialentrepreneurship sector in Canada who had convened for a full day of inspirationand engagement on ways to address some of the most pressing issues of our times.', 'An often under-recognized community, social entrepreneurs create and lead anorganization that are aimed at catalyzing systemic social change through newideas, products, services, methodologies and changes in attitude.', 'Hosted in partnership by MaRS Centre, The Boston Consulting Group (BCG), theCentre for Social Innovation and the Toronto City Summit Alliance, the SocialEntrepreneurship Summit']
Then you need a full Named Entity Recognizer. Try NLTK's ne_chunk as a starting point, then move on to more "state-of-the-art" NER tools:
from nltk import sent_tokenize, word_tokenize, pos_tag
from nltk.tree import Tree
from nltk import batch_ne_chunk as bnc  # renamed to ne_chunk_sents in NLTK 3+

# tokenize, POS-tag and NE-chunk every sentence of every cleaned chunk
chunked_text = [bnc([pos_tag(word_tokenize(j)) for j in sent_tokenize(i)]) for i in text]
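From there, a minimal sketch of pulling the entities out of the chunked sentences (assuming NLTK's default NE chunker, whose subtrees carry labels like PERSON, ORGANIZATION, GPE):

# chunked_text is a list (per cleaned chunk) of lists (per sentence) of trees;
# named entities are the subtrees, everything else stays a (word, tag) tuple
entities = []
for sentences in chunked_text:
    for sentence in sentences:
        for node in sentence:
            if isinstance(node, Tree) and node.label() in ('PERSON', 'ORGANIZATION'):
                # .label() is NLTK 3; on NLTK 2 use node.node instead
                entities.append((node.label(), ' '.join(word for word, tag in node.leaves())))

print(entities)
# e.g. [('PERSON', 'Ed Greenspon'), ('ORGANIZATION', 'Boston Consulting Group'), ...]

Entities that show up in the same sentence (or in the same "Hosted in partnership by ..." chunk) can then become the nodes and edges of your alliance network.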