Question

I am scraping a web page with BeautifulSoup and Python. I have a BS element,

a = soup.find('div', class_='section lot-details')

which returns a series of list items like the ones below.

<li><strong>Location:</strong> WA - 222 Welshpool Road, Welshpool</li>
<li><strong>Deliver to:</strong> Pickup Only WA</li>

I want to return the text that follows each strong element:

WA - 222 Welshpool Road, Welshpool
Pickup Only WA

How do I get this information out of the BS object? I am unsure about regular expressions and how they interact with BeautifulSoup.

Answer 1

The regex (?:</strong>)(.*)(?:</li>) will do the job; capture group \1, the (.*) part, holds the text you want.

Python code example:

In [1]: import re
In [2]: test = re.compile(r'(?:</strong>)(.*)(?:</li>)')
In [3]: test.findall(input_string)  # input_string: the HTML from the question, e.g. str(a)
Out[3]: [' WA - 222 Welshpool Road, Welshpool', ' Pickup Only WA']

See it in action here: https://regex101.com/r/fD0fZ9/1
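
As a self-contained sketch of the same idea: the snippet below assumes the markup from the question is available as a string (for example via str(a)); the names input_string, pattern and values are only illustrative. It uses a non-greedy (.*?) instead of (.*), because the greedy form only works while every li sits on its own line (by default . does not match a newline). It also strips the leading space from each match.

```python
import re

# Sample markup copied from the question; in practice this could be str(a).
input_string = """<li><strong>Location:</strong> WA - 222 Welshpool Road, Welshpool</li>
<li><strong>Deliver to:</strong> Pickup Only WA</li>"""

# Non-greedy (.*?) stops at the first </li> after each </strong>,
# so it still works if several <li> items share one line.
pattern = re.compile(r'</strong>(.*?)</li>')

# findall returns the contents of the single capture group for each match.
values = [match.strip() for match in pattern.findall(input_string)]

for value in values:
    print(value)
# WA - 222 Welshpool Road, Welshpool
# Pickup Only WA
```

This is still plain string matching on HTML, so it will break if the markup changes shape (nested tags inside the li, attributes on the closing tags, and so on).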

Answer 2

You really don't need a regex here. If you have a list of li tags:

>>> for li in li_elems:
...     print(li.find('strong').next_sibling.strip())

WA - 222 Welshpool Road, Welshpool
Pickup Only WA

This assumes each li contains exactly one strong element, followed by the text.

Alternatively:

>>> for li in li_elems:
...     print(li.contents[1].strip())

WA - 222 Welshpool Road, Welshpool
Pickup Only WA
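
For completeness, here is a minimal end-to-end sketch of the next_sibling approach, starting from the div in the question. The surrounding ul, the html variable and the details dict are assumptions made for illustration; the real page may be structured differently. It also skips any li that does not have a strong element followed directly by text, rather than raising an error.

```python
from bs4 import BeautifulSoup, NavigableString

# Assumed markup: the question only shows the <li> items, so the
# wrapping <div> and <ul> here are a guess at the page structure.
html = """
<div class="section lot-details">
  <ul>
    <li><strong>Location:</strong> WA - 222 Welshpool Road, Welshpool</li>
    <li><strong>Deliver to:</strong> Pickup Only WA</li>
  </ul>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
a = soup.find("div", class_="section lot-details")

details = {}
for li in a.find_all("li"):
    strong = li.find("strong")
    if strong is None:
        continue  # no label in this item, nothing to extract
    sibling = strong.next_sibling
    if not isinstance(sibling, NavigableString):
        continue  # the label is not followed directly by text
    label = strong.get_text(strip=True).rstrip(":")
    details[label] = str(sibling).strip()

for label, value in details.items():
    print(label, "->", value)
# Location -> WA - 222 Welshpool Road, Welshpool
# Deliver to -> Pickup Only WA
```

Keying the values by their label is optional; collecting them in a plain list works just as well if only the text is needed.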

Python BeautifulSoup: match regex after a string
