Question

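I have some content fetched with urllib, and I am trying to run a regex on it in Python 3 to extract some information. But I can't get the regex to look ahead far enough; I think the newlines in the document are getting in the way. I have tried a couple of variations on the pattern, such as (?<=showPollResponses\()(.*)(?=), but none of them worked for me.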

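I am trying to retrieve the text "The content was pretty nice and would participate again" from HTML like this: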

</thead>
<tr>
<td class="oddpoll" style="width:20%"><b><a href="#" onclick="showPollResponses(123456, 99, '1A2B3C4D5E6F7G8H9I0J1K2L3M4N5O6P', 123456, 123456, 99);return false;">The stuf (i</a></b>
<br>
</td><td class="oddpoll" style="width:35%">The content was pretty nice and would participate again&nbsp;</td><td class="oddpoll" style="width:45%"><b>123 Total</b>
<br>
</td>
</tr>
<tr>
<td class="oddpoll">&nbsp;</td>

Using the regex:

(?<=showPollResponses\()(.*)(?=width:45%)

The HTML above is a sample of the document, which I fetch with urllib.

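I have tried other variations of the pattern as well, but they returned nothing. My plan is to grab this large chunk of HTML first and then apply further regexes to extract the final text.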

Here is my attempt on regex101.com.

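Isn't there an easier way? In PHP I use a tool that scrapes data with CSS selectors, so I could easily retrieve the text that way. Or, in a Python context, is a regex the only way? Thanks for any help.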

Answer 1

Parsing HTML with regular expressions is a controversial practice, and it is only occasionally reasonable; see: RegEx match open tags except XHTML self-contained tags.

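A better approach is to use a specialized tool: an HTML parser such as BeautifulSoup. The idea is to locate the a element by a partial match on its onclick attribute, and then get the next td element after that a: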

from bs4 import BeautifulSoup

data = """
<table>
    </thead>
        <tr>
            <td class="oddpoll" style="width:20%"><b><a href="#" onclick="showPollResponses(123456, 99, '1A2B3C4D5E6F7G8H9I0J1K2L3M4N5O6P', 123456, 123456, 99);return false;">The stuf (i</a></b>
            <br>
            </td><td class="oddpoll" style="width:35%">The content was pretty nice and would participate again&nbsp;</td><td class="oddpoll" style="width:45%"><b>123 Total</b>
            <br>
            </td>
        </tr>
        <tr>
    </thead>
</table>"""

soup = BeautifulSoup(data, "html.parser")

print(soup.select_one("a[onclick*=showPollResponses]").find_next("td").get_text())

Prints:

The content was pretty nice and would participate again

Answer 2

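Your problem is with (.*). By default, . matches any character except a newline, so the match cannot cross line breaks. A fix is to use ([\s\S]*), which matches any character, newlines included. So, without modifying your regex too much: (?<=showPollResponses\()([\S\s]*)(?=width:45%).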
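As a minimal sketch of the point above, the snippet below compares the original pattern with the [\s\S] version. The `html` string is a shortened stand-in for the question's markup, not the real document. The `re.DOTALL` flag, which makes . match newlines as well, is included as an alternative that this answer does not mention.

```python
import re

# Shortened stand-in for the question's HTML (an assumption for illustration);
# what matters is that the target spans several lines.
html = (
    "onclick=\"showPollResponses(123456, 99);return false;\">The stuf (i</a></b>\n"
    "<br>\n"
    "</td><td style=\"width:35%\">The content was pretty nice</td>"
    "<td style=\"width:45%\"><b>123 Total</b>\n"
)

original = r"(?<=showPollResponses\()(.*)(?=width:45%)"
fixed = r"(?<=showPollResponses\()([\s\S]*)(?=width:45%)"

# With the default flags, . cannot cross the newline after the first line.
no_match = re.search(original, html)

# [\s\S] matches any character, so the capture can span lines.
class_match = re.search(fixed, html)

# re.DOTALL lets the unmodified pattern cross newlines too.
dotall_match = re.search(original, html, re.DOTALL)

print(no_match)
print(class_match is not None)
print(class_match.group(1) == dotall_match.group(1))
```

Either form works here; the flag keeps the pattern closer to what was originally written, while the character class needs no extra argument to `re.search`.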

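Edit: since your regex is matching past (?=width:45%), my educated guess is that width:45% appears again later in your document. Because ([\s\S]*) is greedy, it matches as much as possible and stops only at the last occurrence. To fix this, we can add ? to make the quantifier lazy, so that it stops at the first occurrence. The pattern is now (?<=showPollResponses\()([\S\s]*?)(?=width:45%).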
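The difference between the greedy and lazy forms can be sketched on a hypothetical two-row document in which width:45% occurs once per row. The `row` template and its contents are made up for illustration and are not taken from the real page.

```python
import re

# Hypothetical row template; each row contains one "width:45%".
row = (
    "onclick=\"showPollResponses({id}, 99);return false;\">Poll {id}</a>\n"
    "</td><td style=\"width:35%\">Answer {id}</td><td style=\"width:45%\">\n"
)
html = row.format(id=1) + row.format(id=2)

greedy = r"(?<=showPollResponses\()([\s\S]*)(?=width:45%)"
lazy = r"(?<=showPollResponses\()([\s\S]*?)(?=width:45%)"

# Greedy: a single match that runs from the first row into the second.
greedy_hit = re.search(greedy, html).group(1)

# Lazy: each match stops at the nearest "width:45%", giving one per row.
lazy_hits = re.findall(lazy, html)

print("Answer 2" in greedy_hit)
print(len(lazy_hits))
print("Answer 2" in lazy_hits[0])
```

With the lazy quantifier, `re.findall` returns one chunk per row, which fits the plan of extracting the final text from each chunk in a second step.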

Python3 regex - positive lookahead not working across multiple lines
