How to extract names from a string using nltk

Time: 2017-11-05 15:09:59

Tags: python nlp nltk stanford-nlp

I am trying to extract names (Indian names) from an unstructured string.

Here is my code:

text = "Balaji Chandrasekaran Bangalore |  Senior Business Analyst/ Lead Business Analyst An accomplished Senior Business Analyst with a track record of handling complex projects in given period of time, exceeding above the expectation. Successful at developing product road maps and leading cross-functional software teams from prototype to release. Professional Competencies Systems Development Life Cycle (SDLC) Agile methodologies Business process improvement Requirements gathering & Analysis Project Management UML Specification UI & UX (Wireframe Designing) Functional Specification Test Scenario Creation SharePoint Admin Work History Senior Business Analyst (Aug 2012 Current) YouBox Technology pvt ltd, Chennai Translating business goals, feature concepts and customer needs into prioritized product requirements and use cases. Expertized in designing innovative wireframes combining user experience analysis and technology models. Extensive Experience in implementing soft wares for Shipping/Logistics firms to handle CRM, Finance, Logistics, Operations, Intermodal, and documentation. Strong interpersonal skills, highly adept at diplomatically facilitating discussions and negotiations with stakeholders. Education Bachelor of Engineering: Electronics & Communication, 2011 CES Tech Hosur Accomplishment Successful onsite implementation at various locations around the globe for Europe Shipping Company. - (Pre Study, General Design, and Functional Specification) Organized Business Analyst Forum and conducted various activities to develop skill sets of Business Analysts."
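import nltk  # word_tokenize and pos_tag also need NLTK's tokenizer and tagger data

pronouns = []  # collects every token the tagger marks as a proper noun (NNP)
if text != "":
    # Chunk each single NNP token as its own "PERSON" chunk
    grammar = """PERSON: {<NNP>}"""
    chunkParser = nltk.RegexpParser(grammar)
    tagged = nltk.pos_tag(nltk.word_tokenize(text))
    tree = chunkParser.parse(tagged)

    for subtree in tree.subtrees():
        if subtree.label() == "PERSON":
            pronouns.append(' '.join([c[0] for c in subtree]))

    print(pronouns)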
  

['Balaji', 'Chandrasekaran', 'Bangalore', '|', 'Senior', 'Business',
 'Analyst/', 'Lead', 'Business', 'Analyst', 'Senior', 'Business',
 'Analyst', 'Successful', 'Development', 'Life', 'Cycle', 'SDLC',
 'Agile', 'Business', 'Requirements', 'Analysis', 'Project',
 'Management', 'UML', 'Specification', 'UI', 'UX', 'Wireframe',
 'Designing', 'Functional', 'Specification', 'Test', 'Scenario',
 'Creation', 'SharePoint', 'Admin', 'Work', 'History', 'Senior',
 'Business', 'Analyst', 'Aug', 'Current', 'Technology', 'Chennai',
 'Translating', 'CRM', 'Finance', 'Logistics', 'Operations',
 'Intermodal', 'Education', 'Bachelor', 'Engineering', 'Electronics',
 'Communication', 'Accomplishment', 'Successful', 'Mediterranean',
 'Ship', 'Company', 'MSC', 'Georgia', 'MSC', 'Cambodia', 'MSC', 'MSC',
 'South', 'Successful', 'Stake', 'MSC', 'Geneva', 'Switzerland', 'Pre',
 'Study', 'General', 'Design', 'Functional', 'Specification', 'O',
 'Business', 'Analyst', 'Forum', 'Business']

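But what I actually need is just Balaji Chandrasekaran. I even tried the Stanford NER library, which also failed to pick out Balaji Chandrasekaran.

Can anyone help me extract names from an unstructured string, or point me to a good tutorial?

Thanks in advance.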

2 answers:

Answer 0: (score: 1)

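As I said in the comments, you will have to create your own corpus of Indian names and test your text against it. The NLTK Book shows how to do this in Chapter 2 (section 1.9, to be precise).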
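As a rough illustration of testing text against a name list, here is a minimal sketch in plain Python. The KNOWN_NAMES set and the find_known_names helper are hypothetical and invented for this example; in practice the list would be much larger and could be loaded from your own files, for instance with NLTK's PlaintextCorpusReader.

```python
import re

# Hypothetical, tiny name list for illustration only; a real one would hold
# thousands of entries loaded from your own corpus files.
KNOWN_NAMES = {"balaji", "chandrasekaran", "priya", "ramesh"}

def find_known_names(text, known_names=KNOWN_NAMES):
    """Return maximal runs of consecutive tokens that all appear in known_names."""
    # Split into alphabetic words and single non-space symbols.
    tokens = re.findall(r"[A-Za-z]+|[^\sA-Za-z]", text)
    names, current = [], []
    for token in tokens:
        if token.lower() in known_names:
            current.append(token)
        else:
            if current:
                names.append(" ".join(current))
            current = []
    if current:
        names.append(" ".join(current))
    return names

sample = "Balaji Chandrasekaran Bangalore | Senior Business Analyst"
print(find_known_names(sample))
```

A lookup like this only finds names that are already in the list, so its coverage depends entirely on how complete that list is.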

See also: Creating a new corpus with NLTK

Answer 1: (score: 0)

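Named entity recognition is not just a matter of looking up known names; a recognizer uses a combination of cues, including the form of the word and the structure of the text. The name it failed to recognize appears in a heading, not in running text, so nltk's recognizer (which is not that good anyway) cannot find it. See what happens when the name is used in a sentence: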

>>> text = "Balaji Chandrasekaran is a senior business analyst and lives in Bangalore."
>>> words = nltk.word_tokenize(text)
>>> print(nltk.ne_chunk(nltk.pos_tag(words)))
(S
  (PERSON Balaji/NNP)
  Chandrasekaran/NNP
  is/VBZ
  a/DT
  senior/JJ
  business/NN
  analyst/NN
  and/CC
  lives/NNS
  in/IN
  (GPE Bangalore/NNP)
  ./.)

It missed the last name (as I said, the recognizer is not that good), but it was able to figure out that there is a name here.

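In other words: your problem is that you are not mining running text, you are mining résumés. The only good solution is to build and train your own recognizer, using annotated résumés in the same format as the ones you want to process. This is not trivial: you need to annotate your training corpus and work out which useful features (cues from word form and document structure) your "feature extraction function" will put in its dictionary. Everything you need is described in various sections of chapters 6 and 7 of the nltk book.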
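To make the "feature extraction function" idea a little more concrete, here is a minimal sketch in plain Python. The function name token_features and the particular features are assumptions chosen for illustration, not a recommended feature set; which cues actually help has to be found by experimenting on annotated résumés.

```python
def token_features(tokens, i):
    """Build a feature dictionary for tokens[i] from word form and position.

    Hypothetical example features; a real recognizer would need tuning.
    """
    token = tokens[i]
    return {
        # Cues from the form of the word
        "is_capitalized": token[:1].isupper(),
        "is_all_caps": token.isupper(),
        "suffix3": token[-3:].lower(),
        # Cues from document structure: names tend to open a résumé
        "in_first_five": i < 5,
        "prev_token": tokens[i - 1].lower() if i > 0 else "<START>",
        "next_is_capitalized": i + 1 < len(tokens) and tokens[i + 1][:1].isupper(),
    }

tokens = "Balaji Chandrasekaran Bangalore | Senior Business Analyst".split()
for i, tok in enumerate(tokens):
    print(tok, token_features(tokens, i))
```

Dictionaries like these, paired with labels from your annotated corpus, are the kind of input an NLTK classifier such as nltk.NaiveBayesClassifier is trained on.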