How to find duplicate regex-matched substrings in a string?

Asked: 2017-01-27 15:36:48

Tags: python regex string python-3.x substring

I want to use a regular expression to find doubled (duplicated) substrings in a string.

Example:

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"

myregex = r"\d\d/\d\d/\d\d\d\d"

I know how to check whether the regex matches the string:

import re
mymatches = re.findall(myregex, line)
len(mymatches)

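This returns the length of the list of matches. If the length is greater than 1, the string contains more than one match ("doubles").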

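What I don't know is how to find the doubles of the same string (in the case above, 04/22/2014 and 04/22/2014) and put them in a nested list.

Example output: [['04/22/2014','04/22/2014'],['02/23/2014', '02/23/2014']]

How do I find the doubles of the same regex-matched string?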

3 Answers:

Answer 0: (score: 2)

First, we find all matches of the pattern in the line. Then we sort them and group identical ones together.

import re
import itertools 

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"

pat = r'\d\d/\d\d/\d\d\d\d'
reg = re.compile(pat)
print([list(g) for k, g in itertools.groupby(sorted(reg.findall(line)))])

Output:

[['02/23/2014', '02/23/2014'], ['04/22/2014', '04/22/2014'], ['04/22/2015']]

Edit: if you only want the strings that occur twice or more, you can do something more like:

[g for g in map(lambda x: list(x[1]), itertools.groupby(sorted(reg.findall(line)))) if len(g) > 1]
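
Note that sorting changes the order of the groups relative to the order requested in the question. As a minimal sketch of an alternative, the version below collects matches in a dict, so groups come out in order of first appearance. The helper name `repeated_matches` is hypothetical, and the ordering relies on dicts preserving insertion order (guaranteed from Python 3.7).

```python
import re

def repeated_matches(pattern, text):
    """Group identical matches; keep only groups with two or more members."""
    groups = {}
    for match in re.findall(pattern, text):
        # Each distinct matched string gets its own list of occurrences.
        groups.setdefault(match, []).append(match)
    return [g for g in groups.values() if len(g) > 1]

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
print(repeated_matches(r"\d\d/\d\d/\d\d\d\d", line))
# [['04/22/2014', '04/22/2014'], ['02/23/2014', '02/23/2014']]
```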

Answer 1: (score: 2)

import re

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"

datefreq = {}

p = re.compile(r'(\d{2}/\d{2}/\d{4})')
for f in p.findall(line):
    datefreq[f] = datefreq.setdefault(f, 0) + 1
for key in sorted(datefreq.keys()):
    print("{0}, {1}".format(key, datefreq[key]))

Output:

02/23/2014, 2
04/22/2014, 2
04/22/2015, 1
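
The same counting idea can be sketched with `collections.Counter` from the standard library, and the counts can then be expanded into the nested list the question asks for. The variable names `counts` and `nested` are illustrative only.

```python
import re
from collections import Counter

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"

# Count how often each matched date occurs.
counts = Counter(re.findall(r"\d{2}/\d{2}/\d{4}", line))

# Repeat each date as many times as it was seen, skipping dates seen only once.
nested = [[date] * n for date, n in sorted(counts.items()) if n > 1]
print(nested)
# [['02/23/2014', '02/23/2014'], ['04/22/2014', '04/22/2014']]
```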

Answer 2: (score: 1)

You can use a backreference to do this; your regex would look like this:

myregex= r"(\d\d/\d\d/\d\d\d\d).*?\1"

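where \1 refers to the first group (the part between the parentheses) [source]

So you search for the pattern \d\d/\d\d/\d\d\d\d, then any number of characters, followed by exactly the same text that the group matched.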
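
As a small illustration of the backreference, here is a sketch that runs `re.search` on the line from the question. `group(1)` is the captured date and `group(0)` is the whole span up to and including its repetition.

```python
import re

line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"

m = re.search(r"(\d\d/\d\d/\d\d\d\d).*?\1", line)
print(m.group(1))  # the first date that occurs again later in the line
# 04/22/2014
print(m.group(0))  # everything from that date through its next occurrence
# 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014
```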

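But there is one problem: the matches returned by findall must not overlap. So for "04/22/2014 02/23/2014 04/22/2014 02/23/2014", findall reports only 04/22/2014, because the first match already consumes the first 02/23/2014. You can work around this with search: find the first match, take its start position, and then search for the next match beginning at start+1. Something like: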

import re

myregex= re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")
line = "04/22/2014 02/23/2014 14/5/1992 04/22/2014 02/23/2014"

pos = 0
result = []
while pos >= 0:
    srch=myregex.search(line,pos)
    if srch:
        result.append(srch.group(1))
        pos = srch.start()+1
    else:
        pos = -1

This gives:

$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23) 
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>> 
>>> myregex= re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")
>>> line = "04/22/2014 02/23/2014 14/5/1992 04/22/2014 02/23/2014"
>>> 
>>> pos = 0
>>> result = []
>>> while pos >= 0:
...     srch=myregex.search(line,pos)
...     if srch:
...         result.append(srch.group(1))
...         pos = srch.start()+1
...     else:
...         pos = -1
... 
>>> result
['04/22/2014', '02/23/2014']
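
One caveat with the loop as written: a date that occurs three or more times would be appended more than once, since each occurrence that has a later repeat produces its own match. The sketch below wraps the loop in a function (the name `repeated_dates` is hypothetical) that skips dates already recorded, and prints the plain findall result for comparison.

```python
import re

DUP = re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")

def repeated_dates(text):
    """Return each date that occurs again later in text, once, in order found."""
    found = []
    pos = 0
    while True:
        m = DUP.search(text, pos)
        if m is None:
            break
        if m.group(1) not in found:
            found.append(m.group(1))
        # Restart just after the start of this match so overlapping spans are seen.
        pos = m.start() + 1
    return found

line = "04/22/2014 02/23/2014 14/5/1992 04/22/2014 02/23/2014"
print(DUP.findall(line))     # non-overlapping only: ['04/22/2014']
print(repeated_dates(line))  # ['04/22/2014', '02/23/2014']
```

Because `.` does not match a newline by default, this only finds repeats within a single line of text.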