Extracting a person's age from unstructured text in Python

Time: 2019-08-07 13:03:49

Tags: python nlp pattern-matching text-mining

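I have a dataset of administrative filings that contain short biographies. I am trying to extract people's ages using Python and some pattern matching. Some example sentences are:

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age:32,"

These are some of the patterns I have identified in the dataset. I should add that there are other patterns, but I have not run into them yet and I am not sure how to find them. I wrote the following code, which works fairly well but is very inefficient, so running it on the whole dataset would take too much time.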

#Create a search list of expressions that might come right before an age instance
age_search_list = [" " + last_name.lower().strip() + ", age ",
" " + clean_sec_last_name.lower().strip() + " age ",
last_name.lower().strip() + " age ",
full_name.lower().strip() + ", age ",
full_name.lower().strip() + ", ",
" " + last_name.lower() + ", ",
" " + last_name.lower().strip()  + " \(",
" " + last_name.lower().strip()  + " is "]

#for each element in our search list
for element in age_search_list:
    print("Searching: ",element)

    # retrieve all the instances where we might have an age
    for age_biography_instance in re.finditer(element,souptext.lower()):

        #extract the next four characters
        age_biography_start = int(age_biography_instance.start())
        age_instance_start = age_biography_start + len(element)
        age_instance_end = age_instance_start + 4
        age_string = souptext[age_instance_start:age_instance_end]

        #extract what should be the age
        potential_age = age_string[:-2]

        #extract the next two characters as a security check (i.e. age should be followed by comma, or dot, etc.)
        age_security_check = age_string[-2:]
        age_security_check_list = [", ",". ",") "," y"]

        if age_security_check in age_security_check_list:
            print("Potential age instance found for ",full_name,": ",potential_age)

            #check that what we extracted is an age, convert it to birth year
            try:
                potential_age = int(potential_age)
                print("Potential age detected: ",potential_age)
                if 18 < int(potential_age) < 100:
                    sec_birth_year = int(filing_year) - int(potential_age)
                    print("Filing year was: ",filing_year)
                    print("Estimated birth year for ",clean_sec_full_name,": ",sec_birth_year)
                    #Now, we save it in the main dataframe
                    new_sec_parser = pd.DataFrame([[clean_sec_full_name,"0","0",sec_birth_year,""]],columns = ['Name','Male','Female','Birth','Suffix'])
                    df_sec_parser = pd.concat([df_sec_parser,new_sec_parser])

            except ValueError:
                print("Problem with extracted age ",potential_age)

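I have a few questions:

  • Is there a more efficient way to extract this information?
  • Should I use a regular expression instead?
  • My text documents are very long and I have a lot of them. Can I search for all the items in a single pass?
  • What would be a strategy for detecting other patterns in the dataset?

Some sentences extracted from the dataset:

  • "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"
  • "George F. Rubin(14)(15) Age 68 Trustee since: 1997."
  • "INDRA K. NOOYI, 56, has been PepsiCo's Chief Executive Officer (CEO) since 2006"
  • "Mr. Lovallo, 47, was appointed Treasurer in 2011."
  • "Mr. Charles Baker, 79, is a business advisor to biotechnology companies."
  • "Mr. Botein, age 43, has been a member of our Board since our formation."

5 answers: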

Answer 0 (score: 1)

import re 

x =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,"]

[re.findall(r'\d{1,3}', i)[0] for i in x] # ['67', '34', '45', '46', '32']
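
As a quick check, here is a minimal sketch that applies the same pattern to two of the additional sentences from the question. The helper name `first_number` is made up for illustration. It suggests that the first run of digits is not always the age, so this approach only fits sentences where the age is the first number.

```python
import re

def first_number(sentence):
    """Return the first run of 1-3 digits in the sentence, or None if there is none."""
    found = re.findall(r'\d{1,3}', sentence)
    return found[0] if found else None

# A footnote marker comes before the age, so it is picked up instead.
print(first_number("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # 14

# No age here at all: the first three digits of the year are returned.
print(first_number("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # 201
```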

Answer 1 (score: 1)

This works for all the cases you provided: https://repl.it/repls/NotableAncientBackground

import re 

input =["Mr Bond, 67, is an engineer in the UK"
,"Amanda B. Bynes, 34, is an actress"
,"Peter Parker (45) will be our next administrator"
,"Mr. Dylan is 46 years old."
,"Steve Jones, Age:32,", "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
"George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
"INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006",
"Mr. Lovallo, 47, was appointed Treasurer in 2011.",
"Mr. Charles Baker, 79, is a business advisor to biotechnology companies.",
"Mr. Botein, age 43, has been a member of our Board since our formation."]
for i in input:
  age = re.findall(r'Age[\:\s](\d{1,3})', i)
  age.extend(re.findall(r' (\d{1,3}),? ', i))
  if len(age) == 0:
    age = re.findall(r'\((\d{1,3})\)', i)
  print(i+ " --- AGE: "+ str(set(age)))

Output:

Mr Bond, 67, is an engineer in the UK --- AGE: {'67'}
Amanda B. Bynes, 34, is an actress --- AGE: {'34'}
Peter Parker (45) will be our next administrator --- AGE: {'45'}
Mr. Dylan is 46 years old. --- AGE: {'46'}
Steve Jones, Age:32, --- AGE: {'32'}
Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation --- AGE: set()
George F. Rubin(14)(15) Age 68 Trustee since: 1997. --- AGE: {'68'}
INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006 --- AGE: {'56'}
Mr. Lovallo, 47, was appointed Treasurer in 2011. --- AGE: {'47'}
Mr. Charles Baker, 79, is a business advisor to biotechnology companies. --- AGE: {'79'}
Mr. Botein, age 43, has been a member of our Board since our formation. --- AGE: {'43'}
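
To run the same logic over many long documents, it may help to wrap it in a function with the patterns compiled once. The following is a sketch under a few assumptions, not the answer's exact code: the names `AGE_WORD`, `BARE_NUMBER`, `IN_PARENS` and `extract_ages` are hypothetical, "Age" is matched case-insensitively, and lookarounds replace the literal spaces so that adjacent numbers do not consume each other's delimiters.

```python
import re

# Compiled once at import time and reused for every document.
AGE_WORD = re.compile(r"\bage[:\s]\s*(\d{1,3})", re.IGNORECASE)  # "Age 68", "Age:32", "age 43"
BARE_NUMBER = re.compile(r"(?<=\s)(\d{1,3}),?(?=\s)")            # " 67, " or " 46 "
IN_PARENS = re.compile(r"\((\d{1,3})\)")                         # "(45)"

def extract_ages(text):
    """Return the set of candidate ages found in text, as integers."""
    ages = set(AGE_WORD.findall(text))
    ages.update(BARE_NUMBER.findall(text))
    # Parenthesised numbers are often footnote markers, so use them only as a fallback.
    if not ages:
        ages.update(IN_PARENS.findall(text))
    return {int(a) for a in ages}

documents = [
    "George F. Rubin(14)(15) Age 68 Trustee since: 1997.",
    "Peter Parker (45) will be our next administrator",
    "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation",
]
for doc in documents:
    print(extract_ages(doc))
```

Every document is scanned by the same three patterns, so the cost grows with the total text length rather than with the number of names being searched for.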

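Answer 2 (score: 1)

Since you have to deal with text, and not only match patterns, the right approach is to use one of the many NLP tools available.

What you are after is Named Entity Recognition (NER), which is usually done with machine learning models. NER tries to identify a predefined set of entity types in text, for example: locations, dates, organizations and person names.

Although not 100% accurate, this is more accurate than simple pattern matching (especially for English), because it relies on information beyond the pattern itself, such as part-of-speech (POS) tags, dependency parsing, and so on.

Look at the results I got for the phrases you provided using the Allen NLP Online Tool (with the fine-grained NER model):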

  • "Mr Bond, 67, is an engineer in the UK"
  • "Amanda B. Bynes, 34, is an actress"
  • "Peter Parker (45) will be our next administrator"
  • "Mr. Dylan is 46 years old."
  • "Steve Jones, Age: 32,"

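Note that the last one is wrong. As I said, it is not 100% accurate, but it is easy to use.

The big advantage of this approach: you don't have to write a special pattern for each of the millions of possibilities.

Best of all, you can integrate it into your Python code: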

pip install allennlp

and:

from allennlp.predictors import Predictor
al = Predictor.from_path("https://s3-us-west-2.amazonaws.com/allennlp/models/fine-grained-ner-model-elmo-2018.12.21.tar.gz")
al.predict("Your sentence with date here")

Then look in the resulting dictionary for the "Date" entities.

The same goes for spaCy:

!python3 -m spacy download en_core_web_lg
import spacy
sp_lg = spacy.load('en_core_web_lg')
{(ent.text.strip(), ent.label_) for ent in sp_lg("Your sentence with date here").ents}

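(However, I got some bad predictions there, even though it is supposed to be better.)

For more information, read this interesting article on Medium: https://medium.com/@b.terryjack/nlp-pretrained-named-entity-recognition-7caa5cd28d7b

Answer 3 (score: 0)

A simple way to find a person's age in a sentence is to extract a two-digit number: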

import re

sentence = 'Steve Jones, Age: 32,'
print(re.findall(r"\b\d{2}\b", sentence)[0])

# output: 32

If you want to skip numbers that are immediately followed by %, and require the number to start at a word boundary, you could do:

sentence = 'Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation'

match = re.findall(r"\b\d{2}(?!%)[^\d]", sentence)

if match:
    # each match includes the trailing non-digit character, so keep only the two digits
    print(match[0][:2])
else:
    print('no match')

# output: no match

This also works for the previous sentences.
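
Footnote markers such as "(14)" are also two-digit numbers, so a range check may help. Here is a minimal sketch that borrows the `18 < age < 100` bounds from the question's code. The names `TWO_DIGITS` and `plausible_ages` are hypothetical, and the pattern is a slight variant that also accepts a number at the very end of the string.

```python
import re

# Exactly two digits on word boundaries, not immediately followed by '%'.
TWO_DIGITS = re.compile(r"\b\d{2}\b(?!%)")

def plausible_ages(sentence, low=18, high=100):
    """Return the two-digit numbers in sentence that satisfy low < n < high."""
    candidates = [int(n) for n in TWO_DIGITS.findall(sentence)]
    return [n for n in candidates if low < n < high]

print(plausible_ages("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # [68]
print(plausible_ages("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # []
```

The bounds are only a heuristic: a footnote marker such as "(25)" would still pass the check.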

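Answer 4 (score: 0)

Judging from the examples you provided, this is the strategy I would suggest:

Step 1:

Check whether the sentence contains "Age". Regex: (?i)(Age).*?(\d+)

This handles examples such as:

- George F. Rubin(14)(15) Age 68 Trustee since: 1997.

- Steve Jones, Age: 32

Step 2:

- Check whether the sentence contains a "%" sign; if it does, remove the number attached to the sign

- If the sentence does not contain "Age", write a regex that removes all 4-digit numbers. Example regex: \b\d{4}\b

- Then see whether any digits are left in the sentence; that is your age

Examples covered:

- "Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation" - no numbers are left

- "INDRA K. NOOYI, 56, has been PepsiCos Chief Executive Officer (CEO) since 2006" - only 56 is left

- "Mr. Lovallo, 47, was appointed Treasurer in 2011." - only 47 is left

This may not be a complete answer, since there may be other patterns. But since you asked for a strategy, it works for all the examples you posted.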
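
The two steps could be sketched roughly as follows. The function and constant names are made up for illustration, and the pattern used to strip percentages is an assumption, since the answer does not give one.

```python
import re

AGE_AFTER_WORD = re.compile(r"(?i)(Age).*?(\d+)")  # step 1 regex, as given above
PERCENT_NUMBER = re.compile(r"\d+(?:\.\d+)?\s*%")  # assumed pattern for "48%" or "4.5 %"
FOUR_DIGITS = re.compile(r"\b\d{4}\b")             # years such as 2010
ANY_NUMBER = re.compile(r"\d+")

def age_by_strategy(sentence):
    """Return the age as an int, or None if no candidate number is left."""
    # Step 1: a number that follows the word "Age" wins outright.
    m = AGE_AFTER_WORD.search(sentence)
    if m:
        return int(m.group(2))
    # Step 2: drop percentages and 4-digit numbers, then take what is left.
    cleaned = PERCENT_NUMBER.sub(" ", sentence)
    cleaned = FOUR_DIGITS.sub(" ", cleaned)
    remaining = ANY_NUMBER.findall(cleaned)
    return int(remaining[0]) if remaining else None

print(age_by_strategy("George F. Rubin(14)(15) Age 68 Trustee since: 1997."))  # 68
print(age_by_strategy("Mr. Lovallo, 47, was appointed Treasurer in 2011."))    # 47
print(age_by_strategy("Equity awards granted to Mr. Love in 2010 represented 48% of his total compensation"))  # None
```

One caveat with the step 1 regex as written: it also matches "age" inside longer words such as "manager", so a sentence like "Mr. Smith, 50, has been manager since 2011" would yield 2011. Anchoring the word with `\b` (for example `(?i)\bage\b`) is one possible way to tighten it.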