Extracting names and ages from a text file

Time: 2018-06-11 16:43:57

Tags: regex python-3.x nlp data-extraction google-natural-language

I have a .txt file from which I need to extract names and ages. The file looks like this:

Age: 71 . John is 47 years old. Sam; Born: 05/04/1989(29).
Kenner is a patient Age: 36 yrs    Height: 5 feet 1 inch; weight is 56 kgs. 
This medical record is 10 years old. 

Output 1: John, Sam, Kenner
Output 2: 47, 29, 36

I am using regular expressions to extract the data. For example, for ages I use the following patterns:

re.compile(r'age:\s*\d{1,3}',re.I)

re.compile(r'(age:|is|age|a|) \s*\d{1,3}(\s|y)',re.I)

re.compile(r'.* Age\s*:*\s*[0-9]+.*',re.I)

re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*',re.I)

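I then apply a further regex to the output of these patterns to extract the numbers. The problem is that these patterns also match data I do not want. For example: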

This medical record is 10 years old.

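From the sentence above I get '10', which I do not want. I only want to extract people's names and their ages. What approach should I take? I would appreciate any help.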
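
A minimal sketch that reproduces the false positive with the fourth pattern listed above. The second-stage pattern `\d{1,3}` and the function name `extract_ages` are assumptions, since the question does not show the code for that step.

```python
import re

# Fourth pattern from the question: a number followed by a year unit.
year_pattern = re.compile(r'.* [0-9]+ (?:year|years|yrs|yr) \s*', re.I)

# Assumed second-stage pattern; the question does not show the one actually used.
number_pattern = re.compile(r'\d{1,3}')


def extract_ages(text):
    """Run the two-step extraction line by line and collect every number found."""
    ages = []
    for line in text.splitlines():
        match = year_pattern.search(line)
        if match:
            # Second step: pull the digits out of the matched fragment.
            ages.extend(number_pattern.findall(match.group(0)))
    return ages


result = extract_ages("This medical record is 10 years old.")
print(result)
```

The pattern only checks that a number is followed by a year unit, so it has no way to tell whether the subject of the sentence is a person or a document.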

1 answer:

Answer 0 (score: 0)

Take a look at the Cloud Data Loss Prevention API. Here is a GitHub repo with samples. This is probably what you want:

def inspect_string(project, content_string, info_types,
                   min_likelihood=None, max_findings=None, include_quote=True):
    """Uses the Data Loss Prevention API to analyze strings for protected data.
    Args:
        project: The Google Cloud project id to use as a parent resource.
        content_string: The string to inspect.
        info_types: A list of strings representing info types to look for.
            A full list of info type categories can be fetched from the API.
        min_likelihood: A string representing the minimum likelihood threshold
            that constitutes a match. One of: 'LIKELIHOOD_UNSPECIFIED',
            'VERY_UNLIKELY', 'UNLIKELY', 'POSSIBLE', 'LIKELY', 'VERY_LIKELY'.
        max_findings: The maximum number of findings to report; 0 = no maximum.
        include_quote: Boolean for whether to display a quote of the detected
            information in the results.
    Returns:
        None; the response from the API is printed to the terminal.
    """

    # Import the client library.
    import google.cloud.dlp

    # Instantiate a client.
    dlp = google.cloud.dlp.DlpServiceClient()

    # Prepare info_types by converting the list of strings into a list of
    # dictionaries (protos are also accepted).
    info_types = [{'name': info_type} for info_type in info_types]

    # Construct the configuration dictionary. Keys which are None may
    # optionally be omitted entirely.
    inspect_config = {
        'info_types': info_types,
        'min_likelihood': min_likelihood,
        'include_quote': include_quote,
        'limits': {'max_findings_per_request': max_findings},
    }

    # Construct the `item`.
    item = {'value': content_string}

    # Convert the project id into a full resource id.
    parent = dlp.project_path(project)

    # Call the API.
    response = dlp.inspect_content(parent, inspect_config, item)

    # Print out the results.
    if response.result.findings:
        for finding in response.result.findings:
            try:
                if finding.quote:
                    print('Quote: {}'.format(finding.quote))
            except AttributeError:
                pass
            print('Info type: {}'.format(finding.info_type.name))
            print('Likelihood: {}'.format(finding.likelihood))
    else:
        print('No findings.')
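
For the question's use case, the configuration passed to the API could be built as in the sketch below. `build_inspect_config` is a hypothetical helper, not part of the client library, and the info type names `PERSON_NAME` and `AGE` are assumed to be the relevant detectors; check the API's info type list before relying on them. The helper drops keys whose value is None, as the comment in the function above permits.

```python
def build_inspect_config(info_types, min_likelihood=None,
                         max_findings=None, include_quote=True):
    """Hypothetical helper: build the inspect_config dict, omitting None values."""
    config = {
        # Convert the list of strings into a list of dictionaries.
        'info_types': [{'name': name} for name in info_types],
        'include_quote': include_quote,
    }
    if min_likelihood is not None:
        config['min_likelihood'] = min_likelihood
    if max_findings is not None:
        config['limits'] = {'max_findings_per_request': max_findings}
    return config


# Assumed info types for names and ages.
config = build_inspect_config(['PERSON_NAME', 'AGE'], min_likelihood='POSSIBLE')
print(config)
```

The same list of info types can be passed directly as the `info_types` argument of `inspect_string`, which then prints the quoted text, info type, and likelihood for each finding.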