How to apply a regex function to a DataFrame column to return a value

Time: 2019-11-21 23:24:05

Tags: python regex

I am trying to apply a regex function to one column of a DataFrame to determine gender pronouns. This is what my DataFrame looks like:

    name                                            Descrip
0  Sarah           she doesn't like this because her mum...
1  David                 he does like it because his dad...
2    Sam  they generally don't like it because their par...

This is the code I ran to build that DataFrame:

import pandas as pd

list_label = ["Sarah", "David", "Sam"]
list_descriptions = ["she doesn't like this because her mum...", "he does like it because his dad...", "they generally don't like it because their parent..."]

data3 = {'name':list_label, 'Descrip':list_descriptions}
test_df = pd.DataFrame(data3)

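I am trying to determine each person's gender by applying a regex function to the "Descrip" column. Specifically, these are the patterns I want to implement: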

"male":"(he |his |him )",
"female":"(she |her |hers )",
"plural, or singular non-binary":"(they |them |their )"

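The full code I wrote is below.

This function tries to match each pattern and returns the name of the gender pronoun group mentioned most often in the row's description. Each gender has several keywords in its pattern string (e.g. he, she, they). The idea is to determine max_gender, the gender associated with the pattern group mentioned most often in the Descrip value. So max_gender can take one of three values: male | female | plural, or singular non-binary. If none of the patterns is found anywhere in the row's Descrip value, "unknown" is returned.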

import re
def get_pronouns(text):
    patterns = {
        "male":"(he |his |him )",
        "female":"(she |her |hers )",
        "plural, or singular non-binary":"(they |them |their )"
    }
    max_gender = "unknown"
    max_gender_count = 0
    for gender in patterns:
        pattern = re.compile(gender)
        mentions = re.findall(pattern, text)
        count_mentions = len(mentions)
        if count_mentions > max_gender_count:
            max_gender_count = count_mentions
            max_gender = gender
    return max_gender

test_df["pronoun"] = test_df.loc[:, "Descrip"].apply(get_pronouns)
print(test_df)

However, when I run the code, it clearly fails to determine the gender pronouns, as the following output shows:

    name                                            Descrip  pronoun
0  Sarah           she doesn't like this because her mum...  unknown
1  David                 he does like it because his dad...  unknown
2    Sam  they generally don't like it because their par...  unknown

Does anyone know what is wrong with my code?

1 answer:

Answer 0: (score: 2)

If you want to find out why your code doesn't work, you can add a print statement to your function like this:

    for gender in patterns:
        print(gender)
        pattern = re.compile(gender)
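
As a minimal sketch of what that print statement reveals: iterating over a dict yields its keys, so re.compile(gender) compiles the label (for example "male") rather than the pattern string. The names key_counts and value_counts are made up for this illustration, and looping over patterns.items() is just one possible way to get at the pattern strings.

```python
import re

patterns = {
    "male": "(he |his |him )",
    "female": "(she |her |hers )",
    "plural, or singular non-binary": "(they |them |their )",
}

text = "she doesn't like this because her mum..."

# Iterating over a dict yields its keys, so this searches for the literal
# labels "male", "female", ... which never occur in the description.
key_counts = {gender: len(re.findall(gender, text)) for gender in patterns}

# Iterating over .items() yields both the label and its pattern string.
value_counts = {
    gender: len(re.findall(pattern, text))
    for gender, pattern in patterns.items()
}

print(key_counts)
print(value_counts)
```

Note that even when the pattern strings are used, the space-suffixed patterns count the "he " inside "she " and the "his " inside "this " as male pronouns for this description, so the patterns themselves still need work.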

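Your regex also needs a little tweaking. For example, in the first line of Pink Floyd's song Breathe, "Breathe, breathe in the air", your regex finds two male pronouns.

I'm not sure whether there are any other problems.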


Here is a solution that is very similar to yours. The regex is fixed, the dictionary is replaced by a list of tuples, and so on.


Solution code

import pandas as pd
import numpy as np
import re
import operator as op

names_list = ['Sarah', 'David', 'Sam']
descs_list = ["she doesn't like this because her mum...", 'he does like it because his dad...',
              "they generally don't like it because their parent..."]

df_1 = pd.DataFrame(data=zip(names_list, descs_list), columns=['Name', 'Desc'])

pronoun_re_list = [('male', re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)),
                   ('female', re.compile(r"\b(?:she|her|hers)\b", re.IGNORECASE)),
                   ('plural/nb', re.compile(r"\b(?:they|them|their)\b", re.IGNORECASE))]


def detect_pronouns(str_in: str) -> str:
    match_results = ((curr_pron, len(curr_patt.findall(str_in))) for curr_pron, curr_patt in pronoun_re_list)
    max_pron, max_counts = max(match_results, key=op.itemgetter(1))
    if max_counts == 0:
        return np.nan  # np.NaN was removed in NumPy 2.0; np.nan works everywhere
    else:
        return max_pron


df_1['Pronouns'] = df_1['Desc'].map(detect_pronouns)

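Explanation

Code

match_results is a generator expression, where curr_pron stands for "current pronoun" and curr_patt stands for "current pattern". It might make things clearer if I rewrite it as a for loop that builds a list: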

    match_results = []
    for curr_pron, curr_patt in pronoun_re_list:
        match_counts = len(curr_patt.findall(str_in))
        match_results.append((curr_pron, match_counts))

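for curr_pron, curr_patt in ... makes use of something that goes by a few different names, usually multiple assignment or tuple unpacking. You can find a nice article on it here. In this case, it is just a different way of writing: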

    for curr_tuple in pronoun_re_list:
        curr_pron = curr_tuple[0]
        curr_patt = curr_tuple[1]
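
To show how those (pronoun, count) tuples are then reduced to a single answer, here is a small sketch of the max call with operator.itemgetter from the solution code. The counts in the two lists are invented for illustration; they are not results from the DataFrame.

```python
import operator as op

# Hypothetical (pronoun label, match count) pairs, shaped like match_results.
match_results = [("male", 1), ("female", 3), ("plural/nb", 0)]

# itemgetter(1) makes max compare the tuples by their second element.
max_pron, max_counts = max(match_results, key=op.itemgetter(1))
print(max_pron, max_counts)  # female 3

# On a tie, max returns the first maximal item it encounters.
tied = [("male", 2), ("female", 2)]
tie_winner = max(tied, key=op.itemgetter(1))
print(tie_winner)  # ('male', 2)
```

Because of that tie behaviour, the order of the entries in pronoun_re_list decides which label wins when two pronoun groups are mentioned equally often.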

RegEx

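Time for everyone's favourite topic: regular expressions! I used a great website called RegEx101, where you can mess around with patterns, which makes things much easier to understand. I've set up a page with some test data and the regexes, which I'll walk through below: https://regex101.com/r/Y1onRC/2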

Now, let's look at the regex I used: \b(?:he|his|him)\b

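The he|his|him part is exactly the same as yours; it matches the words "he", "his", or "him". In your regex this is wrapped in parentheses, and in mine the opening parenthesis is also followed by ?:. A plain (pattern stuff) is a capturing group, which, as the name suggests, captures whatever it matches. Since we don't actually care what was matched here, only whether there is a match, we add ?: to create a non-capturing group, which does not capture (or save) the content.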
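
The difference between the two kinds of group can be seen in a short sketch. The sample sentence and the variable names below are made up for illustration.

```python
import re

text = "he said his dog likes him"

capturing = re.compile(r"\b(he|his|him)\b")
non_capturing = re.compile(r"\b(?:he|his|him)\b")

# With one capturing group, findall returns the contents of that group.
print(capturing.findall(text))      # ['he', 'his', 'him']
# With no capturing group, findall returns the full matches.
print(non_capturing.findall(text))  # ['he', 'his', 'him']

# The difference is visible on the match object: only one pattern saves a group.
print(capturing.search(text).groups())      # ('he',)
print(non_capturing.search(text).groups())  # ()
```

For counting purposes both give the same number of matches here, so the non-capturing group is mainly a way of saying that the matched text itself is not needed.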

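I said that the he|his|him part of the regex is the same as yours, but that's not quite true. You include a space after each pronoun so that it doesn't match the letters he in the middle of a word. Unfortunately, as I mentioned above, it still finds two matches in the sentence "Breathe, breathe in the air". Our saviour is \b, which matches word boundaries. This means our pattern captures the he in a sentence that ends with "he." (followed by a full stop), whereas (he |his |him ) does not.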
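
A quick way to check this is to run both patterns on the same strings. This is only a sketch: the lyric string assumes the line "Breathe, breathe in the air", and the second sentence is invented to show a pronoun followed by punctuation.

```python
import re

spaced = re.compile(r"(he |his |him )")
bounded = re.compile(r"\b(?:he|his|him)\b")

lyric = "Breathe, breathe in the air"
# The trailing space does not stop matches at the END of longer words.
print(spaced.findall(lyric))   # ['he ', 'he ']  (from "breathe " and "the ")
# \b requires a word boundary on both sides, so nothing matches.
print(bounded.findall(lyric))  # []

sentence = "It was he."
# No space follows the pronoun, so the space-suffixed pattern misses it.
print(spaced.findall(sentence))   # []
print(bounded.findall(sentence))  # ['he']
```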

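Finally, we compile the pattern with the re.IGNORECASE flag, which I don't think needs much explanation, though let me know if I'm wrong.

Here is how I would describe the two patterns in plain English: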

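  • (he |his |him ) matches the letters he followed by a space, his followed by a space, or him followed by a space, and returns the full match and one group.
  • \b(?:he|his|him)\b with the re.IGNORECASE flag matches the words he, his, or him, regardless of case, and returns the full match.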
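
For the case-insensitivity part, a minimal sketch comparing the word-boundary pattern with and without the flag (the sample text is invented for illustration):

```python
import re

text = "He left. His sister thanked him."

case_sensitive = re.compile(r"\b(?:he|his|him)\b")
case_insensitive = re.compile(r"\b(?:he|his|him)\b", re.IGNORECASE)

# Without the flag, capitalised pronouns at the start of a sentence are missed.
print(case_sensitive.findall(text))    # ['him']
# With re.IGNORECASE, all three pronouns are found.
print(case_insensitive.findall(text))  # ['He', 'His', 'him']
```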

Hopefully that's clear, let me know!


Resulting output

    Name    Desc                                                  Pronouns
--  ------  ----------------------------------------------------  ----------
 0  Sarah   she doesn't like this because her mum...              female
 1  David   he does like it because his dad...                    male
 2  Sam     they generally don't like it because their parent...  plural/nb

Let me know if you have any questions :)