Python re.findall pattern for a varying number of matches

Date: 2016-02-24 21:08:48

Tags: python regex findall

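I have input like this:

<tr> 11:15 12:15 13:15 </tr> <tr> 18:15 19:15 20:15 </tr>

In this case the output should be: [ (11:15, 12:15, 13:15), (18:15, 19:15, 20:15) ]

My pattern:

(\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?(\d\d:\d\d)[\s\S]*?</tr>

works only when every tr tag contains exactly 3 hours.

However, it should also work when a tr tag contains 1 to 3 hours (same format, \d\d:\d\d). Another example:

<tr>12:00 13:00</tr> <tr>14:00 15:00 16:00</tr> <tr>12:00</tr>

Output should be: [ (12:00, 13:00, ), (14:00, 15:00, 16:00), (12:00, , ) ]

For this input my pattern no longer works.

One more thing: the hours are not always separated by plain whitespace. In the real file each hour sits inside a simple span or a longer piece of markup, which is why I used the following between the groups:

[\s\S]*? or  [\w\s<>="-/:;?|]*?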

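3 Answers:

Answer 0 (score: 1)

I would parse the HTML with an HTML parser, find all tr elements inside the table, and split the contents of each row with str.split(), which handles both spaces and newlines. Example using the BeautifulSoup parser: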

from bs4 import BeautifulSoup

data = """
<table>
    <tr>
    11:15
    12:15
    13:15
    </tr>

    <tr>
    18:15
    19:15
    20:15
    </tr>

    <tr>12:00 13:00</tr>
    <tr>14:00 15:00 16:00</tr>
    <tr>12:00</tr>
</table>"""

soup = BeautifulSoup(data, "html.parser")

result = [row.text.split() for row in soup.table.find_all("tr")]
print(result)

Prints:

[['11:15', '12:15', '13:15'], 
 ['18:15', '19:15', '20:15'], 
 ['12:00', '13:00'], 
 ['14:00', '15:00', '16:00'], 
 ['12:00']]
  

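"An hour is in a simple span or a longer form."

Even better, then: for each tr, find every inner text node that matches the specific pattern and take its text: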

import re
# string= replaces the deprecated text= argument (BeautifulSoup 4.4.0 and later)
[[elm.strip() for elm in row.find_all(string=re.compile(r"\d\d:\d\d"))]
 for row in soup.table.find_all("tr")]
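
If installing BeautifulSoup is not an option, the same parser-based idea can be sketched with the standard library's html.parser module. This is only a minimal sketch: the class name RowTimes, the helper times_per_row, and the span-wrapped sample markup are assumptions made for illustration (the question does not show the real file), and nested tables are not handled.

```python
from html.parser import HTMLParser
import re

TIME_RE = re.compile(r"\d\d:\d\d")


class RowTimes(HTMLParser):
    """Collect the HH:MM strings found inside each <tr> element."""

    def __init__(self):
        super().__init__()
        self.rows = []        # one list of times per <tr>
        self.in_row = False   # True while we are between <tr> and </tr>

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.in_row = True
            self.rows.append([])

    def handle_endtag(self, tag):
        if tag == "tr":
            self.in_row = False

    def handle_data(self, data):
        # Text nodes arrive here no matter how deeply they are nested in the row
        if self.in_row:
            self.rows[-1].extend(TIME_RE.findall(data))


def times_per_row(html):
    """Hypothetical helper: parse `html` and return a list of times per row."""
    parser = RowTimes()
    parser.feed(html)
    parser.close()
    return parser.rows


# Assumed sample: some hours wrapped in spans, some as bare text
sample = (
    '<table>'
    '<tr><td><span>12:00</span></td><td><span class="x">13:00</span></td></tr>'
    '<tr>14:00 15:00 16:00</tr>'
    '<tr><td><span>12:00</span></td></tr>'
    '</table>'
)
print(times_per_row(sample))
```

Because the times are collected from text nodes only, it does not matter which tags or attributes surround each hour.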

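Answer 1 (score: 0)

If you prefer a regex, you can use:

import re

found = []
# Match each row body lazily so that one match never spans several <tr> elements
for row in re.findall(r'<tr\b[^>]*>(.*?)</tr>', data, re.DOTALL):
    found.append(re.findall(r'\d\d:\d\d', row))
# e.g. for data = '<tr>12:00 13:00</tr> <tr>14:00 15:00 16:00</tr> <tr>12:00</tr>'
# found == [['12:00', '13:00'], ['14:00', '15:00', '16:00'], ['12:00']]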
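
The question asks for fixed-width tuples such as (12:00, , ) rather than lists of varying length. One way to get that shape, sketched here under assumptions, is to extract the times per row and then pad each row with empty strings. The function name padded_rows and the width parameter are made up for this example, and rows with more than `width` times are truncated.

```python
import re

# Lazy match of one row body at a time; DOTALL lets a row span several lines
ROW_RE = re.compile(r"<tr\b[^>]*>(.*?)</tr>", re.DOTALL | re.IGNORECASE)
TIME_RE = re.compile(r"\d\d:\d\d")


def padded_rows(html, width=3):
    """Return one tuple per <tr>, padded with '' up to `width` entries."""
    result = []
    for body in ROW_RE.findall(html):
        # Keep at most `width` times, then fill the remaining slots with ''
        times = TIME_RE.findall(body)[:width]
        result.append(tuple(times + [""] * (width - len(times))))
    return result


sample = "<tr>12:00 13:00</tr> <tr>14:00 15:00 16:00</tr> <tr>12:00</tr>"
print(padded_rows(sample))
# [('12:00', '13:00', ''), ('14:00', '15:00', '16:00'), ('12:00', '', '')]
```

Splitting the work into "find each row" and "find the times in that row" avoids having to encode the 1-to-3 repetition inside a single pattern.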

Answer 2 (score: 0)

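Try this solution using a regex:

import re

data = """
<tr>
11:15
12:15
13:15
</tr>

<tr>
18:15
19:15
20:15
</tr>

<tr>12:00 13:00</tr>
<tr>14:00 15:00 16:00</tr>
<tr>12:00</tr>
"""

# First grab the body of each <tr>, then pull the times out of that body
print([re.findall(r'(\d\d:\d\d)', tr) for tr in re.findall(r'<tr>([^<]*)</tr>', data)])

Output:

[['11:15', '12:15', '13:15'], ['18:15', '19:15', '20:15'], ['12:00', '13:00'], ['14:00', '15:00', '16:00'], ['12:00']]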