I am trying to parse the titles of Reddit subreddit posts, which come in many forms:
s1 = "I [22M] and my partner (21F) are foo and bar"
s2 = "My (22m) and my partner (21m) are bar and foo"
I want to create a function that parses each string and returns the age and gender pairs. So:
def parse(s1):
    ...
    return [(22, "male"), (21, "female")]
Basically, every age/gender tag is a two-digit number followed by f, F, m, or M.
Answer 0 (score: 0)
We can try using re.findall here:
import re
s1 = "I [22m] and my partner (21F) are foo and bar"
# Capture the age and the gender letter in separate groups so findall returns tuples
matches = re.findall(r'[\[(](\d+)([MF])[\])]', s1, re.IGNORECASE)
print(matches)
[('22', 'm'), ('21', 'F')]
Answer 1 (score: 0)
You can try using this regex to extract the matches:
(?:[\[\(])(\d{1,2})([MF])(?:[\]\)]) /i
On the Python side, I suggest using the findall method of the re module:
import re
def parse(title):
    return re.findall(r'(?:\[|\()(\d{1,2})([MF])(?:\]|\))', title, re.IGNORECASE)
title = 'I [22M] and my partner (21F) are foo and bar'
matches = parse(title)
print(matches)
Edit:
You can try this modified regex to accommodate the new requirement you mentioned in the comments:
(?:[\[\(])(\d{1,2})\s?([MF]|male|female)(?:[\]\)]) /i
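Neither regex above returns the exact shape asked for in the question, i.e. integer ages and the words "male"/"female". A minimal sketch of that last step, assuming the extended pattern from the edit; the names TAG_RE and parse are only illustrative:

```python
import re

# Illustrative pattern based on the extended regex above:
# "(" or "[", a 1-2 digit age, an optional space, a gender token, then ")" or "]".
TAG_RE = re.compile(r'[\[(](\d{1,2})\s?(male|female|[MF])[\])]', re.IGNORECASE)

def parse(title):
    """Return (age, gender) pairs such as (22, "male") found in a title."""
    pairs = []
    for age, gender in TAG_RE.findall(title):
        # The first letter is enough to tell "m"/"male" from "f"/"female".
        label = "male" if gender[0].lower() == "m" else "female"
        pairs.append((int(age), label))
    return pairs

print(parse("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'male'), (21, 'female')]
```

Titles without any tag simply give an empty list, so the caller can decide how to treat them.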
Answer 2 (score: 0)
You can use regex with the re module:
>>> import re
>>> s1 = "I [22M] and my partner (21F) are foo and bar"
>>> re.findall(r'(?<=\[|\()[^\)\]]+', s1) # find text within () or []
['22M', '21F']
>>> re.findall(r'\d+', '22M') # find age
['22']
>>> re.findall(r'[fFmM]+', '22M') # find gender
['M']
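The three steps above can be combined into one function. This is only a sketch: the name parse_two_step is made up here, and it assumes each bracketed tag holds one age and one gender letter.

```python
import re

def parse_two_step(title):
    """Find bracketed text first, then pull the age and gender letter out of each piece."""
    # Step 1: text that follows "[" or "(" up to the closing bracket.
    tags = re.findall(r'(?<=[\[(])[^)\]]+', title)
    pairs = []
    for tag in tags:
        ages = re.findall(r'\d+', tag)        # Step 2: digits are the age
        genders = re.findall(r'[fFmM]', tag)  # Step 3: f/F/m/M is the gender
        if ages and genders:                  # skip brackets that are not age/gender tags
            pairs.append((int(ages[0]), genders[0].upper()))
    return pairs

print(parse_two_step("I [22M] and my partner (21F) are foo and bar"))
# [(22, 'M'), (21, 'F')]
```

Because the second and third steps look anywhere inside the brackets, any bracketed text that happens to contain a digit and one of the letters f or m will also be reported, so the single-regex approach is stricter.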
This site is great for learning and practicing regular expressions: Demo