I want to use a regular expression to find duplicate substrings in a string.
Example:
line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
myregex = r"\d\d/\d\d/\d\d\d\d"
I know how to check whether the regex matches the string:
mymatches = re.findall(myregex, line)
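len(mymatches)
This returns the length of the list of matches. If that length is greater than 1, the pattern occurs more than once in the string.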
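What I don't know is how to find duplicates of the same matched string (in the case above, 04/22/2014 and 04/22/2014) and put them in a nested list.
Example output: [['04/22/2014','04/22/2014'],['02/23/2014', '02/23/2014']]
How do I find duplicates of the same regex match?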
Answer 0 (score: 2)
First, we find all matches of the pattern in the line. Then we sort them and group identical ones together.
import re
import itertools
line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
pat = r'\d\d/\d\d/\d\d\d\d'
reg = re.compile(pat)
print([list(g) for k, g in itertools.groupby(sorted(reg.findall(line)))])
Output:
[['02/23/2014', '02/23/2014'], ['04/22/2014', '04/22/2014'], ['04/22/2015']]
Edit: if you only want the strings that appear two or more times, you can do something like:
[g for g in map(lambda x: list(x[1]), itertools.groupby(sorted(reg.findall(line)))) if len(g) > 1]
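A self-contained variant of that filter, as a minimal sketch. It reuses the pattern and sample line from this answer; the helper name duplicate_groups is hypothetical and only there for illustration.

```python
import itertools
import re

DATE_RE = re.compile(r"\d\d/\d\d/\d\d\d\d")


def duplicate_groups(text):
    """Return one list per matched string that occurs at least twice."""
    # groupby only merges adjacent equal items, so sort the matches first
    matches = sorted(DATE_RE.findall(text))
    groups = (list(group) for _, group in itertools.groupby(matches))
    return [group for group in groups if len(group) > 1]


line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
print(duplicate_groups(line))
# [['02/23/2014', '02/23/2014'], ['04/22/2014', '04/22/2014']]
```

Because of the sort, the groups come out in string order rather than in order of first appearance.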
Answer 1 (score: 2)
import re
line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
datefreq = {}
p = re.compile(r'(\d{2}/\d{2}/\d{4})')
for f in p.findall(line):
    datefreq[f] = datefreq.setdefault(f, 0) + 1
for key in sorted(datefreq.keys()):
    print("{0}, {1}".format(key, datefreq[key]))
Output:
02/23/2014, 2
04/22/2014, 2
04/22/2015, 1
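The counts above can also be turned into the nested list the question asks for. The following is a minimal sketch that swaps the manual dictionary for collections.Counter, which is not what this answer uses; the helper name nested_duplicates is hypothetical.

```python
import re
from collections import Counter

DATE_RE = re.compile(r"\d{2}/\d{2}/\d{4}")


def nested_duplicates(text):
    """Build [[s, s, ...], ...] for every matched string seen more than once."""
    counts = Counter(DATE_RE.findall(text))
    # Counter keeps keys in first-seen order (dicts are ordered on Python 3.7+)
    return [[date] * n for date, n in counts.items() if n > 1]


line = "text 04/22/2014 text 04/22/2015 02/23/2014 more text 04/22/2014 more text 02/23/2014"
print(nested_duplicates(line))
# [['04/22/2014', '04/22/2014'], ['02/23/2014', '02/23/2014']]
```

Unlike the sort-and-group approach, this keeps the dates in the order they first appear in the string, which matches the example output in the question.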
Answer 2 (score: 1)
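You can do this with a backreference. Your regex would look like this:
myregex = r"(\d\d/\d\d/\d\d\d\d).*?\1"
where \1 refers to the first group (the part between parentheses) [source].
So you search for the pattern \d\d/\d\d/\d\d\d\d, then as few arbitrary characters as possible, followed by exactly the same text that the group matched.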
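To make the backreference concrete, here is a minimal sketch with two short made-up input strings (they are illustrative only, not taken from the question).

```python
import re

# Group 1 captures a date; \1 must then match that exact text again.
pattern = re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")

repeated = pattern.search("04/22/2014 text 04/22/2014")
different = pattern.search("04/22/2014 text 04/22/2015")

print(repeated.group(1))  # 04/22/2014
print(different)          # None
```

The second search fails because \1 repeats the captured text, not the pattern: a different date that merely has the same shape does not count.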
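There is one problem, though: the matches returned by findall cannot overlap. So for "04/22/2014 02/23/2014 04/22/2014 02/23/2014", findall reports only 04/22/2014, because the first match already consumes the first 02/23/2014. You can work around this with search: search for the first match, look at its start position, and then search again from start+1 (passed as the pos argument). Something like: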
import re
myregex = re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")
line = "04/22/2014 02/23/2014 14/5/1992 04/22/2014 02/23/2014"
pos = 0
result = []
while pos >= 0:
    # search from pos so that overlapping matches are found as well
    srch = myregex.search(line, pos)
    if srch:
        result.append(srch.group(1))
        # resume one character past the start of this match
        pos = srch.start() + 1
    else:
        pos = -1
This gives:
$ python3
Python 3.5.2 (default, Nov 17 2016, 17:05:23)
[GCC 5.4.0 20160609] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import re
>>>
>>> myregex= re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")
>>> line = "04/22/2014 02/23/2014 14/5/1992 04/22/2014 02/23/2014"
>>>
>>> pos = 0
>>> result = []
>>> while pos >= 0:
...     srch=myregex.search(line,pos)
...     if srch:
...         result.append(srch.group(1))
...         pos = srch.start()+1
...     else:
...         pos = -1
...
>>> result
['04/22/2014', '02/23/2014']
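The non-overlap limitation of findall described in this answer can be checked directly against the search loop. This is a minimal sketch: the function name repeated_dates is hypothetical, and the membership check before appending is an addition that is not in the answer. It keeps a date from being listed twice when it occurs three or more times.

```python
import re

PATTERN = re.compile(r"(\d\d/\d\d/\d\d\d\d).*?\1")


def repeated_dates(text):
    """Collect each date that occurs again later in text, in order of first appearance."""
    result = []
    pos = 0
    while True:
        match = PATTERN.search(text, pos)
        if match is None:
            break
        # assumption: report each repeated date once, even if it occurs 3+ times
        if match.group(1) not in result:
            result.append(match.group(1))
        # restart just past the start of this match so overlapping pairs are found
        pos = match.start() + 1
    return result


line = "04/22/2014 02/23/2014 04/22/2014 02/23/2014"
print(PATTERN.findall(line))  # ['04/22/2014']
print(repeated_dates(line))   # ['04/22/2014', '02/23/2014']
```

Note that . does not match a newline unless re.DOTALL is passed, so with this pattern two identical dates on different lines would not be paired.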