Question

I'm trying to search an HTML document for specific attribute values, e.g.:

<html> 
  <h2 itemprop="prio1">  TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
</html>

I want to find all elements that have any attribute whose value starts with "prio".

I know I can do something like:

soup.find_all(itemprop=re.compile('prio.*'))

or

soup.find_all(id=re.compile('prio.*'))

But what I'm looking for is something like:

soup.find_all(*=re.compile('prio.*'))

Answer 1

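First off, your regex is wrong: if you only want strings that start with prio, you need a ^ prefix, because as written the pattern matches prio anywhere in the string. If you are going to search every attribute, you should just use str.startswith: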

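from bs4 import BeautifulSoup

h = """<html>
  <h2 itemprop="prio1">  TEXT PRIO 1 </h2>
  <span id="prio2"> TEXT PRIO 2 </span>
</html>"""

soup = BeautifulSoup(h, "lxml")

# Keep every tag that has at least one attribute value starting with "prio".
# This assumes each attribute value is a plain string.
tags = soup.find_all(lambda t: any(a.startswith("prio") for a in t.attrs.values()))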
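One caveat with the lambda above: when parsing HTML, BeautifulSoup by default returns multi-valued attributes such as class as a list of strings, and a list has no startswith method. A minimal sketch of a more defensive check follows. It works on a plain attribute mapping shaped like Tag.attrs, and the helper name has_value_with_prefix is hypothetical, not part of BeautifulSoup.

```python
def has_value_with_prefix(attrs, prefix):
    """Return True if any attribute value in `attrs` starts with `prefix`.

    `attrs` is a mapping shaped like BeautifulSoup's Tag.attrs: values are
    strings, or lists of strings for multi-valued attributes such as class.
    """
    for value in attrs.values():
        # Normalise to a list so both shapes are handled the same way.
        candidates = value if isinstance(value, list) else [value]
        if any(isinstance(c, str) and c.startswith(prefix) for c in candidates):
            return True
    return False


# Plain dicts stand in for Tag.attrs here.
print(has_value_with_prefix({"itemprop": "prio1"}, "prio"))        # True
print(has_value_with_prefix({"class": ["big", "prio3"]}, "prio"))  # True
print(has_value_with_prefix({"id": "low-prio"}, "prio"))           # False
```

With a parsed document, it could be passed to find_all as soup.find_all(lambda t: has_value_with_prefix(t.attrs, "prio")).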

If you only want to check certain attributes:

tags = soup.find_all(lambda t: t.get("id","").startswith("prio") or t.get("itemprop","").startswith("prio"))

But if you want a more efficient solution, you may want to look at lxml, which lets you use wildcards:

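from lxml import html

xml = html.fromstring(h)

# @* selects every attribute; the inner predicate tests each one in turn.
# (starts-with(@*, 'prio') would only look at the first attribute of an element.)
tags = xml.xpath("//*[@*[starts-with(., 'prio')]]")
print(tags)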

Or just id or itemprop:

tags = xml.xpath("//*[starts-with(@id,'prio') or starts-with(@itemprop, 'prio')]")

Answer 2

I don't know whether this is the best way, but it works:

>>> soup.find_all(lambda element: any(re.search('prio.*', attr) for attr in element.attrs.values()))
[<h2 itemprop="prio1">  TEXT PRIO 1 </h2>, <span id="prio2"> TEXT PRIO 2 </span>]

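Here the lambda receives each element as element. For every value in element.attrs.values(), we call re.search with the pattern 'prio.*'.

We then apply any() to the results to see whether the element has at least one attribute whose value matches the pattern. Note that re.search finds 'prio' anywhere in the value; use '^prio' if the value must start with it.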
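To make the matching behaviour concrete, here is a small sketch that uses only the re module, with a plain dict standing in for element.attrs (the sample values are made up for illustration). It suggests that re.search with 'prio.*' accepts a value containing prio anywhere, while an anchored pattern only accepts values that begin with it.

```python
import re

# Hypothetical attribute mapping standing in for element.attrs.
attrs = {"id": "low-prio", "itemprop": "prio1"}

unanchored = re.compile(r"prio.*")
anchored = re.compile(r"^prio")

# re.search scans the whole string, so "low-prio" also matches.
matches_unanchored = [v for v in attrs.values() if unanchored.search(v)]
# With ^ the match must begin at the start of the value.
matches_anchored = [v for v in attrs.values() if anchored.search(v)]

print(matches_unanchored)  # ['low-prio', 'prio1']
print(matches_anchored)    # ['prio1']

# any() collapses the per-value results into a single yes/no per element.
print(any(anchored.search(v) for v in attrs.values()))  # True
```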

You can also use str.startswith here instead of a regex, since you only want to check whether an attribute value starts with 'prio', like this:

soup.find_all(lambda element: any(attr.startswith('prio') for attr in element.attrs.values()))
