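I am working on a personal project and am stuck on extracting the text that follows each month abbreviation.
The sample input text has this format: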
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
I would like the output to have this form:
[ ("apr25, 2016\nblah blah\npow\n"), ("may22, 2017\nasdf rtys\nqwer\n"), ("jan9, 2018\npoiu\nlkjhj yertt") ]
I tried a simple regular expression, but it does not give the right result:
import re
# Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*)|(may[\w\W]*)|(jan[\w\W]*)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt', '', '')]
# Non-Greedy version
REGEX_MONTHS_TEXT = re.compile(r'(apr[\w\W]*?)|(may[\w\W]*?)|(jan[\w\W]*?)')
REGEX_MONTHS_TEXT.findall(text)
# output: [('apr', '', ''), ('', 'may', ''), ('', '', 'jan')]
Can you help me produce the desired output with a Python 3 regular expression?
Or do I need to write custom Python 3 code to generate it?
Answer 0 (score: 1)
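The problem was getting my regex to stop at the next month abbreviation once it had matched one.
I referred to "Python RegEx Stop before a Word" and used the tempered greedy token solution mentioned there.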
import re
REGEX_MONTHS_TEXT = re.compile(r'(apr|may|jan)((?:(?!apr|may|jan)[\w\W])+)')
text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"
arr = REGEX_MONTHS_TEXT.findall(text)
# arr = [ ('apr', '25, 2016\nblah blah\npow\n'), ('may', '22, 2017\nasdf rtys\nqwer\n'), ('jan', '9, 2018\npoiu\nlkjhj yertt')]
# The above arr can be combined using list comprehension to form
# list of singleton tuples as expected in the original question
output = [ (x + y,) for (x, y) in arr ]
# output = [('apr25, 2016\nblah blah\npow\n',), ('may22, 2017\nasdf rtys\nqwer\n',), ('jan9, 2018\npoiu\nlkjhj yertt',)]
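As a possible alternative to the tempered greedy token, a lazy quantifier followed by a lookahead should return each block as a single string, so the prefix and body do not have to be joined afterwards. This is only a sketch under the assumption that the month abbreviations never occur inside the body text; the names BLOCK_RE and blocks are made up for this illustration.

```python
import re

text = "apr25, 2016\nblah blah\npow\nmay22, 2017\nasdf rtys\nqwer\njan9, 2018\npoiu\nlkjhj yertt"

# Lazy match from one month abbreviation up to (but not including)
# the next abbreviation or the end of the string.
# Assumption: 'apr', 'may' and 'jan' never appear inside the body text.
BLOCK_RE = re.compile(r'(?:apr|may|jan)[\w\W]*?(?=apr|may|jan|\Z)')

# With no capturing groups, findall returns the whole match for each block.
blocks = BLOCK_RE.findall(text)

# Wrap each block in a singleton tuple, as in the question's expected output.
output = [(block,) for block in blocks]
```

The lookahead checks for the next abbreviation without consuming it, so the following match can start right there.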
Additional resource on the tempered greedy token: Tempered Greedy Token - What is different about placing the dot before the negative lookahead
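One caveat with patterns like the one above: they also stop at a month abbreviation that appears inside an ordinary word, such as "maybe" or "janitor". A hedged sketch of one way around this, assuming every date header starts a line and has a digit right after the abbreviation (as in the sample input), is to anchor the pattern with re.MULTILINE. The sample string and the names HEADER, ANCHORED_RE and sample are invented for this illustration.

```python
import re

# Hypothetical input whose body lines contain 'may' and 'jan' inside words.
sample = "apr25, 2016\nmaybe later\njanitor\nmay22, 2017\nend"

# Assumption: a real header is a month abbreviation at the start of a line,
# immediately followed by a digit.
HEADER = r'^(?:apr|may|jan)\d'

# re.MULTILINE makes ^ match at the start of every line; \Z still matches
# only at the very end of the string.
ANCHORED_RE = re.compile(HEADER + r'[\w\W]*?(?=' + HEADER + r'|\Z)', re.MULTILINE)

blocks = ANCHORED_RE.findall(sample)
```

Here "maybe" and "janitor" stay inside the first block because neither is followed by a digit.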