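I am trying to use a regular expression to identify posts made by different students.
Posts always take the following form:
"U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
How can I use a regex to build a list whose elements are each student's post, in the order they were posted?
Students can post anything, so I'm using [\s\S]+ to capture it. My attempt is: re.findall('(U\d+\n[\s\S]+?)',text)
However, this returns only the student IDs and not their text: ['U3951583\n ', 'U39501492\n ', 'U5235098\n ']
How should I write the regex match in this case?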
Answer 0 (score: 4)
You can use the re.findall method:
import re
txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
print(re.findall(r'\bU\d{7,8}\b.*?(?=\bU\d{7,8}\b|\Z)', txt, re.S))
# => ["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]
See the Python demo.
A variation that gets the name and the content separately:
for name, content in re.findall(r'\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)', txt, re.S):
    print("{}:{}".format(name.strip(), content.strip()))
Output:
U3951583:Hi there my name is Harry. Check out http://www.harryresume.com. That's my website.
U39501492:That's a cool website.
U5235098:I'll have a look too
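If the pairs are needed later rather than printed, the same pattern can collect them into a list. This is only a minimal sketch: `POST_RE` and `parse_posts` are hypothetical names introduced here for illustration, and the 7-to-8-digit ID format is taken from the sample text.

```python
import re

# Same pattern as in the loop above, compiled once; re.S lets "." match newlines.
POST_RE = re.compile(r'\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)', re.S)


def parse_posts(text):
    """Return (student_id, content) tuples in posting order (hypothetical helper)."""
    return [(sid, body.strip()) for sid, body in POST_RE.findall(text)]


txt = (
    "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. "
    "That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
)

posts = parse_posts(txt)
for sid, body in posts:
    print("{}:{}".format(sid, body))
```

With two capturing groups, re.findall yields one tuple per match, so the list keeps the posting order without extra bookkeeping.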
The regex used is
\b(U\d{7,8})\b(.*?)(?=\bU\d{7,8}\b|\Z)
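See the regex demo.
Details
\b - a word boundary (no letter, digit or _ may appear immediately to the left of the current position)
(U\d{7,8}) - Group 1: U and 7 or 8 digits
\b - a word boundary
(.*?) - Group 2: any 0 or more characters, as few as possible
(?=\bU\d{7,8}\b|\Z) - a positive lookahead that requires the pattern above (the name pattern) or (|) the end of the string (\Z) immediately to the right of the current position.
Python 3.7+
In recent Python versions, re.split can split on a pattern that matches an empty string: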
>>> import re
>>> txt = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
>>> print(re.split(r'(?!^)(?=\bU\d{7,8}\b)', txt))
["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n ", "U39501492\n That's a cool website. \n ", "U5235098\n I'll have a look too"]
So, if you do not need the name and the content separately, this may be the simpler approach.
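It may help to see why the attempt in the question returns only the IDs. The lazy [\s\S]+? is the last item in that pattern, so nothing forces it to expand and it stops after a single character. The sketch below runs the question's pattern as posted and then the same pattern with the lookahead from this answer appended; the variable names `attempt` and `fixed` are just illustrative.

```python
import re

txt = (
    "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. "
    "That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
)

# Pattern from the question: nothing follows the lazy [\s\S]+?, so it is
# satisfied by one character and each match ends right after the ID line.
attempt = re.findall(r'(U\d+\n[\s\S]+?)', txt)

# Same pattern with a lookahead appended: the lazy part now has to keep
# expanding until the next ID or the end of the string is directly ahead.
fixed = re.findall(r'U\d+\n[\s\S]+?(?=\bU\d{7,8}\b|\Z)', txt)

print(attempt)
print(fixed)
```

The first list reproduces the output shown in the question; the second contains one element per post, ID included.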
Answer 1 (score: 2)
You can match U and 7-8 digits, and then match all the following lines that do not start with the same pattern.
\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*
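Explanation
\bU\d{7,8} - a word boundary, then U followed by 7 to 8 digits
(?: - start of a non-capturing group
\r?\n - match a newline
(?! - negative lookahead, asserting that what is directly to the right is not
[ ]*U\d{7} - 0+ spaces followed by U and 7 digits
).* - close the negative lookahead and match any character except a newline 0+ times
)* - close the non-capturing group and repeat it 0+ times to match all the following lines
For example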
import re
s = "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
regex = r"\bU\d{7,8}(?:\r?\n(?![ ]*U\d{7}).*)*"
print(re.findall(regex, s))
Result
["U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. That's my website. ", "U39501492\n That's a cool website. ", "U5235098\n I'll have a look too"]
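The same line-based pattern can also return the ID and the text separately by wrapping each part in a capturing group. This is a sketch under that assumption and not part of the original answer; `LINE_RE` and `pairs` are hypothetical names.

```python
import re

# Group 1: the ID. Group 2: every following line that does not start
# (after optional spaces) with another ID.
LINE_RE = re.compile(r"\b(U\d{7,8})((?:\r?\n(?![ ]*U\d{7}).*)*)")

s = (
    "U3951583\n Hi there my name is Harry. Check out http://www.harryresume.com. "
    "That's my website. \n U39501492\n That's a cool website. \n U5235098\n I'll have a look too"
)

# strip() removes the leading newline and the padding spaces around each post.
pairs = [(m.group(1), m.group(2).strip()) for m in LINE_RE.finditer(s)]

for student_id, post in pairs:
    print("{}:{}".format(student_id, post))
```

Because "." does not match a newline here (no re.S flag), each repetition of the inner group consumes exactly one line, which is what lets the lookahead check every line start.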
Answer 2 (score: 0)