How can I extract age and gender from Reddit post titles?

Time: 2019-06-30 13:01:30

Tags: regex python-3.x nlp reddit

I am trying to scrape posts from a subreddit, and the titles come in many forms:

s1 = "I [22M] and my partner (21F) are foo and bar"

s2 = "My (22m) and my partner (21m) are bar and foo"

I want to write a function that parses each string and returns the age/gender pairs. For example:

def parse(s1):
    ...  # parsing logic goes here
    return [(22, "male"), (21, "female")]

Basically, each age/gender tag is a two-digit number followed by f, F, m, or M.

3 answers:

Answer 0 (score: 0)

We can try re.findall here, with one capture group for the age and one for the gender letter:

import re

s1 = "I [22m] and my partner (21F) are foo and bar"
# Two capture groups, so findall returns (age, gender) tuples
matches = re.findall(r'[\[(](\d+)([MF])[\])]', s1, re.IGNORECASE)
print(matches)

[('22', 'm'), ('21', 'F')]
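To get from these string tuples to the output format the question asks for, one option is a small wrapper like the one below. This is only a sketch: the names `GENDERS` and `TAG` are made up for illustration, and the pattern assumes exactly two digits, as the question states.

```python
import re

# Hypothetical mapping from the single-letter tag to a full word.
GENDERS = {"m": "male", "f": "female"}

# Two digits followed by M or F (any case), wrapped in [] or ().
TAG = re.compile(r"[\[(](\d{2})([MF])[\])]", re.IGNORECASE)

def parse(title):
    """Return a list of (age, gender) pairs found in a post title."""
    return [(int(age), GENDERS[g.lower()]) for age, g in TAG.findall(title)]

print(parse("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'male'), (21, 'female')]
print(parse("My (22m) and my partner (21m) are bar and foo"))
# [(22, 'male'), (21, 'male')]
```

Note that this pattern does not check that the brackets match, so a mixed tag such as `[22M)` would also be accepted.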

Answer 1 (score: 0)

You can try this regular expression to extract the matches:

(?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i


On the Python side, I suggest using re's findall method:

import re

def parse(title):
    return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)

title = 'I [22M] and my partner (21F) are foo and bar'
matches = parse(title)

print(matches)


Edit:

You can try this modified regex to cover the new requirement you mentioned in the comments:

(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
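A rough Python sketch of how this extended pattern might be used follows. The names `EXTENDED_TAG` and `parse_extended` are hypothetical, and normalizing every match to "male" or "female" is an assumption about what the asker wants.

```python
import re

# Assumed tag shape: 1-2 digits, an optional whitespace character,
# then m/f or male/female (any case), wrapped in [] or ().
EXTENDED_TAG = re.compile(
    r"[\[(](\d{1,2})\s?(male|female|[MF])[\])]", re.IGNORECASE
)

def parse_extended(title):
    """Return (age, gender) pairs, normalizing the gender to a full word."""
    pairs = []
    for age, gender in EXTENDED_TAG.findall(title):
        # "M", "m", "male" and "Male" all start with "m"; likewise for "f".
        word = "male" if gender[0].lower() == "m" else "female"
        pairs.append((int(age), word))
    return pairs

print(parse_extended("Me (22 male) and my sister [19F] are foo"))
# [(22, 'male'), (19, 'female')]
```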


Answer 2 (score: 0)

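You can do this in two steps with the re module: first pull out the bracketed text, then extract the age and gender from it: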

>>> import re
>>> s1 = "I [22M] and my partner (21F) are foo and bar"
>>> re.findall(r'(?<=\[|\()[^\)\]]+', s1)  # find text within () or []
['22M', '21F']
>>> re.findall(r'\d+', '22M') # find age
['22']
>>> re.findall(r'[fFmM]+', '22M') # find gender
['M']
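The two steps above could be combined into a single function along these lines. This is a sketch rather than the answerer's own code: `parse_two_step` is a hypothetical name, and it uses `re.search` in place of `re.findall` for the second step so that each tag yields at most one age and one gender letter.

```python
import re

def parse_two_step(title):
    """Return (age, gender_letter) pairs using a two-step extraction."""
    pairs = []
    # Step 1: grab whatever sits between an opening ( or [ and the next ) or ].
    for tag in re.findall(r"(?<=[\[(])[^)\]]+", title):
        # Step 2: pull the first number and the first gender letter out of the tag.
        age = re.search(r"\d+", tag)
        gender = re.search(r"[fFmM]", tag)
        if age and gender:  # skip bracketed text with no digits or no m/f letter
            pairs.append((int(age.group()), gender.group().upper()))
    return pairs

print(parse_two_step("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'M'), (21, 'F')]
```

Because step 2 only looks for any digit and any m or f letter, unrelated bracketed text such as `(update 2 from me)` would also be picked up. A single combined pattern is stricter in that respect.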

This site is great for learning and practicing regular expressions: Demo