Question

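I know you generally shouldn't use regex on HTML. I'm using it as a one-off tool to quickly strip some data out of a file with a constant pattern, and it will never be used again. I want to do this task with regex.
No, I don't want to use an XML parser, BeautifulSoup, lxml, etc. Thanks. :)
I just want this to work once and be done with it for good.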

That said, the regex I wrote only matches the last "match" in the file, and I don't know why. The file has a fairly constant pattern:

<p someAttribute="yes"><b someOtherAttribute="no">My Title - </b> My Description</p>
<p someAttribute="yes"><b someOtherAttribute="no">My 2nd Title - </b> My 2nd Description</p>
<p someAttribute="yes"><b someOtherAttribute="no">My 3rd Title - </b> My 3rd Description</p>
<p class="normal" style="margin-left:1"><b style="font-weight:400">Another one </b>The cake is a lie</p>

I don't care about the attributes. I'm trying to capture what is inside the <b> tag and what follows it: the title and the description.

import re

def parseData(html):
    pattern = re.compile('.*<p.*><b.*>(.+)</b>(.+)</p>.*')

    matches = re.findall(pattern, str(html))

    for match in matches:
        print(match)

def main():
    htmlFile = "myFile.htm"

    browser = UrlBrowser()

    parseData(browser.getHTML(htmlFile))

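This pattern only matches the last available "match". I tried adding the .* at the front to see whether that was the problem, but it made no difference. What am I missing in my regex?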

Answer 1

This should do it. See the working demo.

matches = re.findall(r'<b[^>]+>(.*?)</b>(.*?)</p>', str(html))

Regular expression:

<b            match start of tag '<b'
 [^>]+        any character except: '>' (1 or more times)
 >            match enclosed '>'
 (            group and capture to \1:
  .*?         any character except \n (0 or more times)
 )            end of \1
 </b>         match '</b>'
 (            group and capture to \2:
  .*?         any character except \n (0 or more times)
 )            end of \2
 </p>         match '</p>'
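The breakdown above can also live inside the pattern itself. The following is a minimal sketch using `re.VERBOSE`, which ignores whitespace in the pattern and allows `#` comments. The name `TITLE_AND_DESCRIPTION` and the two-row `sample` string are assumptions made for illustration; the rows are copied from the question and joined on one line on purpose.

```python
import re

# Same pattern as the breakdown, written with re.VERBOSE so each piece
# carries its own comment. Whitespace inside the pattern is ignored.
TITLE_AND_DESCRIPTION = re.compile(r"""
    <b          # start of the opening tag
    [^>]+       # one or more characters that are not '>'
    >           # end of the opening tag
    (.*?)       # group 1: as few characters as possible (the title)
    </b>        # literal closing tag
    (.*?)       # group 2: as few characters as possible (the description)
    </p>        # literal end of the paragraph
""", re.VERBOSE)

# Two rows from the question's sample, with no newline between them.
sample = (
    '<p someAttribute="yes"><b someOtherAttribute="no">My Title - </b> My Description</p>'
    '<p class="normal" style="margin-left:1"><b style="font-weight:400">Another one </b>The cake is a lie</p>'
)

pairs = TITLE_AND_DESCRIPTION.findall(sample)
for title, description in pairs:
    print(title.strip(), "|", description.strip())
```

Because `[^>]+` needs at least one character, a bare `<b>` with no attributes would not match; `[^>]*` would accept it if the file ever contains one.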

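You are using a greedy .* (which matches as much as possible). Add ? after it to make it non-greedy (matching as little as possible).

Here is the explanation from the re documentation, which discusses the quantifiers *?, +?, ??:

The '*', '+', and '?' qualifiers are all greedy; they match as much text as possible. Sometimes this behaviour isn't desired; if the RE <.*> is matched against '<H1>title</H1>', it will match the entire string, and not just '<H1>'. Adding '?' after the qualifier makes it perform the match in non-greedy or minimal fashion; as few characters as possible will be matched. Using .*? in the previous expression will match only '<H1>'.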
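The documentation's example is easy to check directly. A small sketch, where the variable names `text`, `greedy`, and `lazy` are chosen here for illustration:

```python
import re

text = "<H1>title</H1>"

# Greedy: .* runs as far as it can, so the match ends at the last '>'.
greedy = re.match(r"<.*>", text).group()

# Non-greedy: .*? stops as soon as the rest of the pattern can match,
# so the match ends at the first '>'.
lazy = re.match(r"<.*?>", text).group()

print(greedy)  # <H1>title</H1>
print(lazy)    # <H1>
```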

Answer 2

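It's your leading .* that causes only the last match to be found. The * and + qualifiers match as much of the preceding item as possible while still producing a match.

Use the "non-greedy" *? in place of each * and +? in place of each + to get the shortest sequence that produces a match.

See: http://docs.python.org/3.3/library/re.html#regular-expression-syntax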
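To see the effect of the leading .*, here is a small sketch that assumes the rows reach `re.findall` as one line with no newlines between them. That assumption matters, because `.` does not match a newline by default. The names `rows`, `one_line`, `greedy`, and `lazy` are illustrative only.

```python
import re

# Three rows from the question, deliberately joined with no newline.
rows = [
    '<p someAttribute="yes"><b someOtherAttribute="no">My Title - </b> My Description</p>',
    '<p someAttribute="yes"><b someOtherAttribute="no">My 2nd Title - </b> My 2nd Description</p>',
    '<p someAttribute="yes"><b someOtherAttribute="no">My 3rd Title - </b> My 3rd Description</p>',
]
one_line = "".join(rows)

# The question's pattern: the leading .* swallows the earlier rows and
# only backtracks far enough to let the final row match.
greedy = re.findall(r'.*<p.*><b.*>(.+)</b>(.+)</p>.*', one_line)

# Same pattern with every * and + made non-greedy.
lazy = re.findall(r'.*?<p.*?><b.*?>(.+?)</b>(.+?)</p>.*?', one_line)

print(greedy)  # [('My 3rd Title - ', ' My 3rd Description')]
for pair in lazy:
    print(pair)  # one (title, description) tuple per row
```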

Answer 3

There's more going on here.

import re

data = """\
<p someAttribute="yes"><b someOtherAttribute="no">My Title - </b> My Description</p>
<p someAttribute="yes"><b someOtherAttribute="no">My 2nd Title - </b> My 2nd Description</p>
<p someAttribute="yes"><b someOtherAttribute="no">My 3rd Title - </b> My 3rd Description</p>
<p class="normal" style="margin-left:1"><b style="font-weight:400">Another one </b>The cake is a lie</p>"""

print(*re.findall('.*<p.*><b.*>(.+)</b>(.+)</p>.*', data), sep="\n")
#>>> ('My Title - ', ' My Description')
#>>> ('My 2nd Title - ', ' My 2nd Description')
#>>> ('My 3rd Title - ', ' My 3rd Description')
#>>> ('Another one ', 'The cake is a lie')

Note that you don't need the .* at the beginning and end:

print(*re.findall('<p.*><b.*>(.+)</b>(.+)</p>', data), sep="\n")
#>>> ('My Title - ', ' My Description')
#>>> ('My 2nd Title - ', ' My 2nd Description')
#>>> ('My 3rd Title - ', ' My 3rd Description')
#>>> ('Another one ', 'The cake is a lie')

because the regex already searches the whole string for matches.

You probably want non-greedy repetition as well, but I don't think that's the problem:

print(*re.findall('<p.*?><b.*?>(.+?)</b>(.+?)</p>', data), sep="\n")
#>>> ('My Title - ', ' My Description')
#>>> ('My 2nd Title - ', ' My 2nd Description')
#>>> ('My 3rd Title - ', ' My 3rd Description')
#>>> ('Another one ', 'The cake is a lie')
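The runs above succeed because `.` does not match a newline, so each row is matched on its own line. The symptom from the question would come back if the text handed to `re.findall` contained no real newlines. One way that could happen is sketched below. This is only a guess, since `UrlBrowser.getHTML` is not shown: if it returned `bytes`, then `str(html)` would produce the bytes repr, in which each newline becomes the two characters `\` and `n`. The `raw` value is a hypothetical stand-in for that return value.

```python
import re

# Hypothetical stand-in for what browser.getHTML() might return (bytes).
raw = (
    b'<p someAttribute="yes"><b someOtherAttribute="no">My Title - </b> My Description</p>\n'
    b'<p someAttribute="yes"><b someOtherAttribute="no">My 2nd Title - </b> My 2nd Description</p>\n'
)

pattern = r'.*<p.*><b.*>(.+)</b>(.+)</p>.*'

# str() on bytes gives the repr: one long line, newlines shown as backslash-n.
as_repr = str(raw)
# Decoding keeps the real newline characters.
as_text = raw.decode("utf-8")

from_repr = re.findall(pattern, as_repr)
from_text = re.findall(pattern, as_text)

print(from_repr)  # only the last row
print(from_text)  # both rows
```

If this is what is happening, decoding the bytes or switching to the non-greedy pattern would each be enough to get every row.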

Why does this regex only find the last available match?
