Question

I am parsing many large XML files with Python and BeautifulSoup. I frequently run into the following task:

<Section1>
    <Report>
        <Matrix>...</Matrix>
        <Matrix>...</Matrix>
        <Matrix>...</Matrix>
        <Matrix>...</Matrix>
    </Report>
</Section1>

I am trying to collect and iterate over all the matrices. I use code like this:

from urllib.request import urlopen
from bs4 import BeautifulSoup

res = urlopen(url)
html = res.read()
soup = BeautifulSoup(html, 'xml')
matrices = soup.find("Section1").find_all("Matrix")
# Then I handle each matrix

Why can't I use a selector like this?

matrices = soup.find("Section1 Matrix")

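Is there a faster way? Sometimes I access nodes nested much deeper in the XML, and I need to make sure they are descendants, but not necessarily direct children, of several other nodes. The example provided is a simplification. Any help would be much appreciated.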

Answer 1

BeautifulSoup supports CSS selectors. You need to pass the selector to the .select method:

In [1]: from bs4 import BeautifulSoup as BS

In [2]: soup = BS("""<Section1>
   ...:     <Report>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:     </Report>
   ...: </Section1>""", "xml")

In [3]: soup.select("Section1 Matrix")
Out[3]: 
[<Matrix>...</Matrix>,
 <Matrix>...</Matrix>,
 <Matrix>...</Matrix>,
 <Matrix>...</Matrix>]
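To make the difference from the question's attempt concrete, here is a minimal sketch, assuming bs4 and lxml are installed. The shortened document and the "a"/"b" contents are placeholders of mine, not the asker's data. find() treats its argument as a tag name, so a string containing a space matches nothing, while select() parses it as a CSS selector:

```python
from bs4 import BeautifulSoup

# Shortened stand-in for the asker's document; "a" and "b" are placeholder contents.
XML_DOC = """<Section1>
    <Report>
        <Matrix>a</Matrix>
        <Matrix>b</Matrix>
    </Report>
</Section1>"""

soup = BeautifulSoup(XML_DOC, "xml")  # the "xml" parser needs lxml installed

# find() looks for a tag literally named "Section1 Matrix", so it returns None.
not_found = soup.find("Section1 Matrix")

# A space is the descendant combinator: Matrix at any depth below Section1.
descendants = soup.select("Section1 Matrix")

# ">" is the child combinator: only Matrix elements directly under Section1.
direct_children = soup.select("Section1 > Matrix")

print(not_found, len(descendants), len(direct_children))
```

The descendant form is the one that fits the "descendants, but not necessarily direct children" requirement. Chaining more ancestors, as in "Section1 Report Matrix", works the same way.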

If all you want is to get every Matrix node in the document, you can use the CSSSelector class from lxml.cssselect ¹.

In [1]: from lxml.cssselect import CSSSelector

In [2]: sel = CSSSelector("Matrix")  # assumed selector: matches every Matrix element

In [3]: from lxml.etree import fromstring

In [4]: xml_doc = '''<Section1>
   ...:     <Report>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:         <Matrix>...</Matrix>
   ...:     </Report>
   ...: </Section1>'''

In [5]: tree = fromstring(xml_doc)

In [6]: matrix = [el for el in sel(tree)]

In [7]: matrix
Out[7]: 
[<Element Matrix at 0x7f84b5b8f388>,
 <Element Matrix at 0x7f84b5b8fc48>,
 <Element Matrix at 0x7f84b5b8fd88>,
 <Element Matrix at 0x7f84b5b8fdc8>]

¹ You need to install cssselect if it is not already installed: pip install cssselect
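If installing cssselect is not an option, the same descendant lookup can be sketched with the ElementTree API. The example below uses the standard library's xml.etree.ElementTree so it runs without extra packages. lxml.etree is designed to be compatible with this API, so the findall and iter calls should carry over. The shortened document and its "a"/"b" contents are placeholders of mine, and whether this is faster on large files is something you would have to measure:

```python
import xml.etree.ElementTree as ET

# Shortened stand-in for the asker's document; "a" and "b" are placeholder contents.
XML_DOC = """<Section1>
    <Report>
        <Matrix>a</Matrix>
        <Matrix>b</Matrix>
    </Report>
</Section1>"""

root = ET.fromstring(XML_DOC)  # root is the Section1 element

# ".//Matrix" matches Matrix elements at any depth below root.
descendants = root.findall(".//Matrix")

# A bare tag name matches direct children of root only.
direct_children = root.findall("Matrix")

# iter() walks the subtree in document order and yields matching elements.
texts = [el.text for el in root.iter("Matrix")]

print(len(descendants), len(direct_children), texts)
```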
