Matching groups in a Python regex lookahead

Time: 2015-06-14 01:30:31

Tags: python regex

I have a raw text dump downloaded from a Wordpress blog, structured like this:

POST_ID_1 TITLE_1 DATE_1

This is the text from the first post ..

POST_ID_2 TITLE_2 DATE_2

This is the text from the second post ..

I wrote a regular expression to capture POST_ID, TITLE and DATE. My goal is to create a Python dictionary structured like this:

posts = {'DATE_1': {'post_id': POST_ID_1,
                    'title': TITLE_1,
                    'text': 'This is the text from the first post ..'
                    }
        }

The regular expression that captures the header (POST_ID, TITLE, DATE) is:

header_regex_raw = r"""(\d+)\s(.*(?=January|February|March|April|May|June|July|August|September|October|November|December))(January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b)"""

My idea was to do something like re.findall(header_regex_raw + "(.*(?={}))".format(header_regex_raw), data), but unfortunately that did not work as planned.

How can I capture multiple groups inside a lookahead? And is there a better way to build the dictionary above?

1 answer:

Answer 0: (score: 1)

I found a clean function for this in Python's re module: re.split

import re

header_regex_raw = r"""(\d+)\s(.+?(?=January|February|March|April|May|June|July|August|September|October|November|December))((January|February|March|April|May|June|July|August|September|October|November|December)(\s\d+\,\s\d{4}\b))"""
header_text_header = re.compile(header_regex_raw)
# data is the raw export text; split() keeps the captured header groups in its result
ret = header_text_header.split(data.strip())

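This does exactly what I want. Because the pattern contains capturing groups, re.split includes the text of each group in the list it returns: the header elements come back as separate items, then the post text as the next item, then the elements of the following header, and so on.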
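
To get from that flat list to the dictionary in the question, the items can be walked in fixed-size steps. The following is only a sketch under some assumptions: the sample data, the MONTHS constant, the parse_posts helper and the simplified pattern (three capturing groups instead of five, with the month list in a non-capturing group) are all made up for illustration and are not part of the original code.

```python
import re

# Assumption: month names factored out so the list is written only once.
MONTHS = ("January|February|March|April|May|June|July|"
          "August|September|October|November|December")

# Simplified header pattern with exactly three capturing groups:
# post id, title, and the full date. (?:...) keeps the month list
# from adding an extra item to the split result.
header_re = re.compile(
    r"(\d+)\s(.+?)\s*((?:" + MONTHS + r")\s\d+,\s\d{4}\b)"
)


def parse_posts(data):
    """Build {date: {'post_id', 'title', 'text'}} from the raw dump."""
    pieces = header_re.split(data.strip())
    # pieces[0] is whatever precedes the first header (normally empty).
    # After that the items repeat as: post_id, title, date, text.
    posts = {}
    for i in range(1, len(pieces), 4):
        post_id, title, date, text = pieces[i:i + 4]
        posts[date] = {
            "post_id": post_id,   # kept as a string here
            "title": title,
            "text": text.strip(),
        }
    return posts


# Hypothetical sample in the same shape as the export.
data = """\
101 My first post June 13, 2015

This is the text from the first post ..

102 Another post June 14, 2015

This is the text from the second post ..
"""

posts = parse_posts(data)
print(posts["June 13, 2015"]["title"])
```

With the five-group pattern from the answer, the same loop would need a step of 6 rather than 4, since each header then contributes five items before the text. Note also that keying by date, as the question asks, means two posts with the same date would overwrite each other, and a post body that itself contains a number followed later on the same line by a date would be mistaken for a header.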