Extract integers with a specific length between separators

Time: 2019-03-11 07:16:39

Tags: python regex string findall

Given the following list of strings:

L = ['1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4',
     '1356@0.4950@26.7300@2.9700',
     '1354@1.78@35.244@3.916#1101@2@40@0#1108@2@30@0',
     '1430@1@19.35@2.15#1431@3@245.62@60.29#1074@12@385.2@58.8#1109',
     '1809@8@75.34@292.66#1816@4@24.56@95.44#1076@47@510.89@1110.61']

I need to extract all integers of length 4 between the separators # or @, and also the first and last integers. No floats.

My solution is a bit overcomplicated: replace the separators with spaces, then apply this solution:

import re

pat = r'(?<!\S)\d{4}(?!\S)'
out = [re.findall(pat, re.sub('[#@]', ' ', x)) for x in L]
print (out)
"""
[['1759', '1362', '1094'], ['1356'], ['1354', '1101', '1108'], ['1430', '1431', '1074', '1109'], ['1809', '1816', '1076']]
"""

Is it possible to change the regex so that re.sub is not needed for the replacement? Is there another solution with better performance?

3 answers:

Answer 0: (score: 5)

To allow the first and last occurrences, which have no leading or trailing separator, you can use negative lookarounds:

(?<![^#])\d{4}(?![^@])

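(?<![^#]) is a near synonym for (?:^|#): the digits must be preceded either by the start of the string or by #. The same goes for the negative lookahead, which requires either @ or the end of the string.

See the live demo here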
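As a quick check, the pattern from this answer can be applied directly to the list from the question, without any re.sub step. The sketch below only assumes the sample data L; the variable names are illustrative.

```python
import re

# Sample data copied from the question.
L = ['1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4',
     '1356@0.4950@26.7300@2.9700',
     '1354@1.78@35.244@3.916#1101@2@40@0#1108@2@30@0',
     '1430@1@19.35@2.15#1431@3@245.62@60.29#1074@12@385.2@58.8#1109',
     '1809@8@75.34@292.66#1816@4@24.56@95.44#1076@47@510.89@1110.61']

# Four digits preceded by '#' or the start of the string,
# and followed by '@' or the end of the string.
pat = re.compile(r'(?<![^#])\d{4}(?![^@])')

out = [pat.findall(x) for x in L]
print(out)
```

Note that this pattern is tailored to the sample layout: a four-digit integer sitting between two @ characters would not be matched, because the lookbehind only accepts # or the start of the string.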

Answer 1: (score: 3)

Interesting question!

This can be solved easily with lookahead and lookbehind assertions.

Input

import re

# Raw string, so that \. and \d reach the regex engine unchanged.
pattern = r"(?<!\.)(?<=[#@])\d{4}|(?<!\.)\d{4}(?=[@#])"
out = [re.findall(pattern, x) for x in L]
print (out)

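Output

[['1759', '1362', '1094'],
 ['1356'],
 ['1354', '1101', '1108'],
 ['1430', '1431', '1074', '1109'],
 ['1809', '1816', '1076', '1110']]

With the L from the question, the last list also contains '1110': the first alternative does not check what follows the four digits, so it matches the integer part of the float 1110.61.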

Explanation

The pattern above is a combination of two separate patterns joined by | (the OR operator).

pattern_1 = r"(?<!\.)(?<=[#@])\d{4}"
\d{4}     --- Extract exactly 4 digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?<=[#@]) --- The 4 digits must be preceded by a hash (#) or at sign (@) POSITIVE LOOKBEHIND

pattern_2 = r"(?<!\.)\d{4}(?=[@#])"
\d{4}     --- Extract exactly 4 digits
(?<!\.)   --- The 4 digits must not be preceded by a period (.) NEGATIVE LOOKBEHIND
(?=[@#])  --- The 4 digits must be followed by a hash (#) or at sign (@) POSITIVE LOOKAHEAD

For a better understanding of these concepts, click here
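Because the first alternative constrains only the left side of the digits, it picks '1110' out of the float 1110.61 in the last string. One possible tightening, offered here as an assumption rather than as part of the original answer, is to require a separator or a string edge on both sides. The names loose and strict are illustrative.

```python
import re

# Sample data copied from the question.
L = ['1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4',
     '1356@0.4950@26.7300@2.9700',
     '1354@1.78@35.244@3.916#1101@2@40@0#1108@2@30@0',
     '1430@1@19.35@2.15#1431@3@245.62@60.29#1074@12@385.2@58.8#1109',
     '1809@8@75.34@292.66#1816@4@24.56@95.44#1076@47@510.89@1110.61']

# Pattern from the answer: the first alternative checks only the left neighbour.
loose = re.compile(r"(?<!\.)(?<=[#@])\d{4}|(?<!\.)\d{4}(?=[@#])")

# Assumed stricter variant: each neighbour must be '#', '@', or a string edge.
strict = re.compile(r"(?<![^#@])\d{4}(?![^#@])")

loose_out = [loose.findall(x) for x in L]
strict_out = [strict.findall(x) for x in L]

print(loose_out[-1])   # ['1809', '1816', '1076', '1110']
print(strict_out[-1])  # ['1809', '1816', '1076']
```

The stricter variant also rejects partial matches inside longer digit runs, since a digit on either side fails the corresponding lookaround.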

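Answer 2: (score: 1)

If you also want to handle 4-digit integers that have no leading # or trailing @, here is a convoluted list comprehension that does not use regex: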

[[n for o in p for n in o] for p in [[[m for m in k.split("@") if m.isdigit() and str(int(m))==m and len(m) ==4] for k in j.split("#")] for j in L]]

Output

[['1759', '1362', '1094'], ['1356'], ['1354', '1101', '1108'], ['1430', '1431', '1074', '1109'], ['1809', '1816', '1076']]
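The same split-based idea may be easier to read as a small helper. This is a minimal sketch, not the answer's own code: the function name four_digit_ints is hypothetical, and it treats # and @ as interchangeable separators instead of splitting in two nested passes.

```python
# Sample data copied from the question.
L = ['1759@1@83@0#1362@0.2600@25.7400@2.8600#1094@1@129.6@14.4',
     '1356@0.4950@26.7300@2.9700',
     '1354@1.78@35.244@3.916#1101@2@40@0#1108@2@30@0',
     '1430@1@19.35@2.15#1431@3@245.62@60.29#1074@12@385.2@58.8#1109',
     '1809@8@75.34@292.66#1816@4@24.56@95.44#1076@47@510.89@1110.61']


def four_digit_ints(s):
    """Return the tokens of s that are 4-digit integers without a leading zero."""
    # Normalise both separators to '@', then split once.
    tokens = s.replace("#", "@").split("@")
    return [t for t in tokens
            if len(t) == 4 and t.isascii() and t.isdigit() and t[0] != "0"]


out = [four_digit_ints(s) for s in L]
print(out)
```

Floats such as 0.2600 are dropped because the whole token contains a period and therefore fails isdigit(). The leading-zero check mirrors the str(int(m)) == m test in the comprehension above.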