Question

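I'm new to regular expressions and Python, but I'm trying to extract revision numbers from an HTML page. I fetch the page through a proxy with urllib and store it in a string. Some of the text looks like this: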

<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
...</tr>...
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>

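I want to parse the text and extract the revision numbers from the lines displayed in red. In this example, that means 98543 and 98549.

I can match the revision numbers on all lines with: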

paragraphs = re.findall(r'r(\d*)<br>', str(html))

However, I'm stuck on how to match only the red lines. My current code also returns 72440. Any ideas on how to solve this? Thanks!

Answer 1

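You should use an HTML parser to select the tags that have the red color applied, and then run a regular expression on the contents of those tags: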

>>> from bs4 import BeautifulSoup
>>> html = ''' (your html here) '''
>>> parser = BeautifulSoup(html, 'html.parser')
>>> for span_tag in parser.find_all('span', style='color: rgb(255, 0, 0);'):
...  print(span_tag.text)

Random Text 4.23.6 r98543

You can then collect all of the text and run a regular expression over it to pull out the revision numbers:

>>> t = [i.text for i in parser.find_all('span', style='color: rgb(255, 0, 0);')]
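The regex step on the collected text is not spelled out above, so here is a minimal sketch of one way it might look. The list `t` is hard-coded as a stand-in for the list built from the span tags, so the sketch runs without BeautifulSoup. The pattern `\br(\d+)\b` and the names `REVISION` and `revisions` are assumptions for illustration, not part of the original answer.

```python
import re

# Stand-in for the list of span texts collected with BeautifulSoup;
# hard-coded here so the sketch runs on its own.
t = ["Random Text 4.23.6 r98543", "Random Text 4.33.6 r98549"]

# Assumed pattern: a standalone "r" followed by one or more digits.
REVISION = re.compile(r'\br(\d+)\b')

# Collect the digits captured after each "r" in every text fragment.
revisions = [m.group(1) for text in t for m in REVISION.finditer(text)]
print(revisions)
```

Using `\d+` rather than `\d*` means a bare "r" with no digits after it is not reported as an empty match.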

Answer 2

If you know you are only looking for lines that contain the pattern color: rgb(255, 0, 0), add that pattern to the regular expression:

paragraphs = re.findall(r'color: rgb\(255, 0, 0\).*r(\d*)<br>', str(html))
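As a rough check, the regex-only approach can be tried against the sample HTML from the question. The sketch below is a slight variation on the pattern above, not the answer's exact code: it assumes `\d+` (so at least one digit is required), a lazy `[^\n]*?` to stay on a single line, and a `\b` before the "r". The names `pattern` and `revisions` are illustrative.

```python
import re

# Sample markup copied from the question.
html = '''<p>Proxy 3.2.1 r72440<br>
SlotBios 11.00</p>
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.23.6 r98543<br>
...</tr>...
<p><strong><span style="color: rgb(255, 0, 0);">Random Text 4.33.6 r98549<br>
'''

# Match the red style, then anything on the same line, then "r<digits><br>".
pattern = re.compile(r'color: rgb\(255, 0, 0\)[^\n]*?\br(\d+)<br>')

revisions = pattern.findall(html)
print(revisions)  # the Proxy line (r72440) has no red style, so it is skipped
```

This only works while the style string and the revision number sit on the same line and the style is written exactly this way. If the markup varies, an HTML parser is likely the more robust choice.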

How can I use regular expressions in Python to find all integers that follow a known string, an unknown string, and another known string?