Question

If I have something like this:

<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>

How can I select the elements that have an attribute in the foo namespace?

E.g., I want the 2nd and 3rd p elements returned.

Answer 1

From the documentation:

Beautiful Soup provides a special argument called attrs which you can use in these situations. attrs is a dictionary that acts just like the keyword arguments:

soup.findAll(id=re.compile("para$"))
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

soup.findAll(attrs={'id' : re.compile("para$")})
# [<p id="firstpara" align="center">This is paragraph <b>one</b>.</p>,
#  <p id="secondpara" align="blah">This is paragraph <b>two</b>.</p>]

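You can use attrs if you need to put restrictions on attributes whose names are Python reserved words, like class, for, or import; or attributes whose names are non-keyword arguments to the Beautiful Soup search methods: name, recursive, limit, text, or attrs itself.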

from BeautifulSoup import BeautifulStoneSoup
xml = '<person name="Bob"><parent rel="mother" name="Alice">'
xmlSoup = BeautifulStoneSoup(xml)

xmlSoup.findAll(name="Alice")
# []

xmlSoup.findAll(attrs={"name" : "Alice"})
# [<parent rel="mother" name="Alice"></parent>]
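
To illustrate why the keyword form returns nothing while the attrs form works, here is a minimal sketch. The find_all function below is a hypothetical stand-in, not Beautiful Soup's implementation; it only assumes that the first parameter is called name and filters on the tag name, as the quoted documentation describes.

```python
def find_all(name=None, attrs=None, **kwargs):
    """Hypothetical stand-in for a search method.

    `name` is the tag-name filter; `attrs` and any other keyword
    arguments are treated as attribute filters.
    """
    attr_filters = dict(attrs or {})
    attr_filters.update(kwargs)
    return {"tag_name": name, "attr_filters": attr_filters}


# name="Alice" binds to the tag-name parameter, so this looks for <Alice> tags.
by_keyword = find_all(name="Alice")

# Going through attrs makes "name" an attribute filter instead.
by_attrs = find_all(attrs={"name": "Alice"})

# Reserved words such as class cannot be written as keywords at all,
# but they are ordinary strings inside the attrs dictionary.
by_class = find_all(attrs={"class": "x"})

print(by_keyword)
print(by_attrs)
print(by_class)
```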

So for the example you gave:

soup.findAll(attrs={ "foo" : re.compile(".*") })
# or
soup.findAll(attrs={ re.compile("foo:.*") : re.compile(".*") })

Answer 2

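BeautifulSoup (both version 3 and version 4) does not seem to treat namespace prefixes as anything special. It simply treats the namespace prefix and the namespaced attribute as an attribute that happens to have a colon in its name.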

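So to find <p> elements that have attributes in the foo namespace, you just have to loop through all the attribute keys and check whether attr.startswith('foo:'):

# Beautiful Soup 4; find_all is the bs4 spelling of findAll.
from bs4 import BeautifulSoup

content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>'''

soup = BeautifulSoup(content, 'html.parser')
for p in soup.find_all('p'):
    # Match on 'foo:' (with the colon) so a name such as 'foobar' is not picked up.
    for attr in p.attrs:
        if attr.startswith('foo:'):
            print(p)
            break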

yields

<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>
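
If installing a third-party package is not an option, the prefix test above can be sketched with the standard library's html.parser instead. This is only an illustration under a few assumptions: the class name PrefixedAttrFinder is made up for this example, the extra foobar paragraph is added to the sample to show why the colon matters, and the parser reports attribute names in lowercase.

```python
from html.parser import HTMLParser


class PrefixedAttrFinder(HTMLParser):
    """Collect start tags that have an attribute named '<prefix>:...'."""

    def __init__(self, prefix):
        super().__init__()
        self.needle = prefix + ":"  # include the colon in the test
        self.found = []             # list of (tag, attribute dict) pairs

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) tuples for this start tag.
        if any(name.startswith(self.needle) for name, _ in attrs):
            self.found.append((tag, dict(attrs)))


content = '''\
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p>
<p foobar="something">blah</p>'''

finder = PrefixedAttrFinder("foo")
finder.feed(content)
finder.close()

for tag, attrs in finder.found:
    print(tag, attrs)
```

Like the Beautiful Soup loop, this treats foo: purely as text in the attribute name; no namespace URI is resolved.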

With lxml, you can search by XPath, which has syntax support for searching attributes by namespace:

import lxml.etree as ET
content = '''\
<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

root = ET.XML(content)
for p in root.xpath('p[@foo:*]', namespaces={'foo':'bar'}):
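    # encoding='unicode' returns str rather than bytes on Python 3;
    # with_tail=False drops the newline that follows each element.
    print(ET.tostring(p, encoding='unicode', with_tail=False))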

yields

<p xmlns:foo="bar" foo:bar="something">blah</p>
<p xmlns:foo="bar" foo:xxx="something">blah</p>
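
For comparison, a rough standard-library equivalent can be sketched with xml.etree.ElementTree, which stores a namespaced attribute under a key of the form {uri}localname. The helper name elements_with_ns_attr is invented for this example, and the match is on the namespace URI ("bar" in the sample), not on the foo prefix.

```python
import xml.etree.ElementTree as ET

content = '''\
<root xmlns:foo="bar">
<p>blah</p>
<p foo:bar="something">blah</p>
<p foo:xxx="something">blah</p></root>'''

NS_URI = "bar"  # the URI bound to the foo prefix in this sample


def elements_with_ns_attr(root, uri):
    """Return elements carrying at least one attribute in namespace `uri`."""
    marker = "{" + uri + "}"
    return [
        el for el in root.iter()
        if any(name.startswith(marker) for name in el.attrib)
    ]


root = ET.fromstring(content)
matches = elements_with_ns_attr(root, NS_URI)
for el in matches:
    print(el.tag, el.attrib)
```

Because the comparison is on the URI, this keeps working if the document binds the same namespace to a different prefix.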

Find all elements with attributes in a given namespace
