Question

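I need to satisfy two conditions when running a regular expression over scraped HTML. All of the sample data below are strings:

ex_string = '<p>40% flights: Private bookings 20-15% bonus: Private airfairs 10% Excellence: Public Vacation 5-0% persons: Public Sightseeing</p>'

I am using re.findall(r'\d+%', ex_string), which yields: ['40%', '15%', '10%', '0%']

But for a range such as 20-15%, I need '20-15%' in the output rather than '15%'.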

<table border="0" style="border-collapse: collapse; width: 100%;"> <tbody> <tr> <td style="width: 50%;">85%</td>

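Using re.findall(r'\d+%', ex_string) here gives ['100%', '50%', '85%'], but I only want the percentages that are not preceded by 'width: '.

The expected result for the second example is ['85%'].

What changes are needed to satisfy both requirements?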

Answer 1

Using an HTML parser would make this much simpler. If you want a regex solution, you can use negative lookbehinds:

import re

ex_string = """
<p>40% flights: Private bookings 20-15% bonus: Private airfairs 10% Excellence: Public Vacation 5-0% persons: Public Sightseeing</p>
<table border="0" style="border-collapse: collapse; width: 100%;">
<tbody>
<tr>
<td style="width: 50%;">85%</td>
"""

g = re.findall(r'(?<!width: )(?<!\d)(\d+%|\d+\-\d+%)', ex_string)
print(g)

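The lookbehinds (?<!width: ) and (?<!\d) mean that neither "width: " nor a digit may appear immediately before (\d+%|\d+\-\d+%). The digit check stops a match from starting in the middle of a number, for example at the second 0 of 100%.

Output: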

['40%', '20-15%', '10%', '5-0%', '85%']
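
For comparison, here is a minimal sketch of the HTML-parser route mentioned at the top of this answer. It assumes the standard library's html.parser is acceptable; the names TextCollector and find_percentages are made up for this illustration. Because only the text between tags is collected, attribute values such as style="width: 100%;" never reach the regex, so no lookbehind for "width: " is needed.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the text between tags, ignoring tag names and attributes."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called only for text nodes; attribute values are never passed here.
        self.chunks.append(data)


def find_percentages(html):
    """Return all percentages (single values or N-M ranges) in the text content."""
    parser = TextCollector()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    # One or more digits, optionally followed by "-digits", then a percent sign.
    return re.findall(r'\d+(?:-\d+)?%', text)


ex_string = """
<p>40% flights: Private bookings 20-15% bonus: Private airfairs 10% Excellence: Public Vacation 5-0% persons: Public Sightseeing</p>
<table border="0" style="border-collapse: collapse; width: 100%;">
<tbody>
<tr>
<td style="width: 50%;">85%</td>
"""

print(find_percentages(ex_string))
# ['40%', '20-15%', '10%', '5-0%', '85%']
```

This also skips percentages in any other attribute, not just width, which the lookbehind version would not do.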
