Question

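I am trying to use a regex to identify the posts of different students.

The posts always have the following form:

"U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"

So a student ID can be 7-8 digits long.
Students can post anything: words, digits, punctuation, etc.
We don't know in advance how many posts there will be, or how many students will post.

How can I use a regex to create a list whose elements are each student's post, in posting order?

Students can post anything, so I'm using [\s\S]+ to capture the text. My attempt was: re.findall('(U\d+\n[\s\S]+?)',text). However, this only returns the student IDs (plus a single character), not their text: ['U3951583\n ', 'U39501492\n ', 'U5235098\n ']

How can I use regex matching in this case?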

Answer 1

You can use the re.findall method:

import re
txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
print(re.findall(r'\bU\d{7,8}\b.*?(?=\bU\d{7,8}\b|\Z)', txt, re.S))
# => ["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]

See the Python demo.

A variation that gets the names and contents separately:

for name, content in re.findall(r'\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)', txt, re.S):
    print("{}:{}".format(name.strip(), content.strip()))

Output:

U3951583:Hi there my name is Harry. Check out http://www.harryresume.com. That's my website.
U39501492:That's a cool website.
U5235098:I'll have a look too

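See this Python demo.

The regex used is

\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)

See the regex demo.

Details

\b - a word boundary (no letter, digit, or _ may appear immediately to the left of the current location)
(U\d{7,8}) - Group 1: U and 7 or 8 digits
\b - a word boundary
(.*?) - Group 2: any 0 or more characters, as few as possible (with re.S, . also matches newlines)
(?=\bU\d{7,8}\b|\Z) - a positive lookahead that requires either the pattern above (the name pattern) or the end of the string (\Z) immediately to the right of the current location.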
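The breakdown above can also be written directly into the pattern. The following is only a sketch: it uses the same regex in verbose mode (re.X) so each part carries its own comment. The names POST_RE and parse_posts are made up for this illustration and are not part of the original answer.

```python
import re

# Same pattern as above, in verbose mode so each part can carry a comment.
POST_RE = re.compile(
    r"""
    \b(U\d{7,8})\b          # group 1: student ID, "U" plus 7 or 8 digits
    (.*?)                   # group 2: post body, as few characters as possible
    (?=\bU\d{7,8}\b|\Z)     # stop right before the next ID or at end of string
    """,
    re.S | re.X,            # re.S: "." also matches newlines; re.X: verbose mode
)


def parse_posts(text):
    """Return (student_id, body) tuples in posting order (hypothetical helper)."""
    return [(m.group(1), m.group(2).strip()) for m in POST_RE.finditer(text)]


txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"

for student_id, body in parse_posts(txt):
    print(student_id, "->", body)
```

Compiling the pattern once is just a convenience if the same text format is parsed repeatedly; the matching behaviour is the same as in the re.findall calls above.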

Python 3.7+

In recent Python versions, re.split can split on a pattern that matches an empty string:

>>> import re
>>> txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
>>> print(re.split(r'(?!^)(?=\bU\d{7,8}\b)', txt))
["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]

So, if you do not need to get the names and contents separately, this may be a simpler approach.
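If the names and contents are needed separately after all, re.split can still do the job, because it includes the text of capturing groups in its result. The sketch below is one possible variant, not the approach shown above: it splits on the ID itself (a non-empty match, so it does not rely on the Python 3.7 behaviour) and then pairs each ID with the text that follows it. The variable names are illustrative.

```python
import re

txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"

# Splitting on a capturing group keeps the captured IDs in the result list:
# [text before first ID, ID, body, ID, body, ...]
parts = re.split(r"\b(U\d{7,8})\b", txt)

ids = parts[1::2]                          # every odd element is an ID
bodies = [b.strip() for b in parts[2::2]]  # every even element after index 0 is a body
posts = list(zip(ids, bodies))

print(posts)
```

Here parts[0] is whatever precedes the first ID (an empty string for this sample), which is why the slices start at index 1 and 2.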

Answer 2

You can match U and 7-8 digits, then match all following lines that do not start with the same pattern.

\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*

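Explanation

\bU\d{7,8} Word boundary, match U followed by 7 to 8 digits
(?: Non-capturing group
- \r?\n Match a newline
- (?! Negative lookahead, assert that what is on the right is not
  - [ ]*U\d{7} Match 0+ spaces, followed by U and 7 digits
- ).* Close the negative lookahead and match any char except a newline 0+ times
)* Close the non-capturing group and repeat 0+ times to match all following lines

For example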

import re

s = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
regex = r"\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*"

print(re.findall(regex, s))

Result

["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. ", "U39501492\n That's a cool website. ", "U5235098\n I'll have a look too"]

Regex demo | Python demo
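One practical difference of this line-based pattern, sketched under the assumption that an ID which starts a post always sits at the beginning of a line (optionally after spaces): an ID that is merely mentioned in the middle of a post does not start a new match. The sample string below is hypothetical and only serves to contrast the pattern with a lookahead pattern that treats every ID-like token as a boundary.

```python
import re

# Hypothetical sample: the first post mentions another student's ID mid-line.
s = "U1234567\nThanks U7654321 for the tip\nU7654321\nNo problem"

# Pattern from this answer: only an ID at the start of a line ends the previous post.
line_based = r"\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*"

# Lookahead pattern that stops at any ID-like token, wherever it appears.
lookahead_based = r"\bU\d{7,8}\b.*?(?=\bU\d{7,8}\b|\Z)"

by_line = re.findall(line_based, s)
by_lookahead = re.findall(lookahead_based, s, re.S)

print(by_line)       # two posts; the mid-line mention stays inside the first post
print(by_lookahead)  # three pieces; the first post is cut at the mention
```

Which behaviour is preferable depends on the data: if students never quote IDs in their text, both patterns give the same split.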

Answer 3

Try this regex:

\d{7,8}

Here Is Demo

Good luck!
