Question

I'm trying to parse XML with BeautifulSoup, but I've hit a brick wall trying to use the "recursive" argument of findAll().

I have a rather odd XML format, shown below:

<?xml version="1.0"?>
<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <genre>Computer</genre>
      <price>44.95</price>
      <publish_date>2000-10-01</publish_date>
      <description>An in-depth look at creating applications 
      with XML.</description>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <genre>Fantasy</genre>
      <price>5.95</price>
      <publish_date>2000-12-16</publish_date>
      <description>A former architect battles corporate zombies, 
      an evil sorceress, and her own childhood to become queen 
      of the world.</description>
      <book>false</book>
   </book>
 </catalog>

As you can see, the book tag is repeated inside the book tag, which causes an error when I try the following:

from BeautifulSoup import BeautifulStoneSoup as BSS

catalog = "catalog.xml"


def open_rss():
    f = open(catalog, 'r')
    return f.read()

def rss_parser():
    rss_contents = open_rss()
    soup = BSS(rss_contents)
    items = soup.findAll('book', recursive=False)

    for item in items:
        print item.title.string

rss_parser()

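As you can see, I added recursive=False to my soup.findAll call, which in theory should not recurse into the items it finds but skip on to the next one.

This doesn't seem to work, as I always get the following error: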

  File "catalog.py", line 17, in rss_parser
    print item.title.string
AttributeError: 'NoneType' object has no attribute 'string'

I'm sure I'm doing something stupid here, and I'd appreciate it if someone could help me solve this.

Changing the XML structure is not an option, and this code needs to perform well because it may have to parse large XML files.

Answer 1

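soup.findAll('catalog', recursive=False) will return a list containing only your top-level "catalog" tag. Since that has no "title" child, item.title is None.

Try soup.findAll("book") or soup.find("catalog").findChildren() instead.

Edit: OK, the problem wasn't what I thought. Try this: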

BSS.NESTABLE_TAGS["book"] = []
soup = BSS(open("catalog.xml"))
soup.catalog.findChildren(recursive=False)

Answer 2

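The problem is the nested book tags. BeautifulSoup has a predefined set of tags that can be nested (BeautifulSoup.NESTABLE_TAGS), but it doesn't know that book can be nested, so it mis-parses them.

The "Customizing the parser" section of the documentation explains what is going on and how to subclass BeautifulStoneSoup to customize the nestable tags. Here is how we can use that to solve your problem: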

from BeautifulSoup import BeautifulStoneSoup

class BookSoup(BeautifulStoneSoup):
  NESTABLE_TAGS = {
      'book': ['book']
  }

soup = BookSoup(xml) # xml string omitted to keep this short
for book in soup.find('catalog').findAll('book', recursive=False):
  print book.title.string

If we run it, we get the following output:

XML Developer's Guide
Midnight Rain
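
For readers on Python 3, where the BeautifulSoup 3 package used above is not available, the same "direct children only" lookup can be sketched with the standard library's xml.etree.ElementTree. This is a swapped-in technique, not the subclassing approach from this answer. CATALOG_XML is an abbreviated copy of the question's file, and top_level_titles is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the catalog from the question (assumed to have the same shape).
CATALOG_XML = """<?xml version="1.0"?>
<catalog>
   <book>
      <author>Gambardella, Matthew</author>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <author>Ralls, Kim</author>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>
"""


def top_level_titles(xml_text):
    """Return the title of each <book> that is a direct child of the root."""
    root = ET.fromstring(xml_text)
    # A bare tag name in findall() matches direct children only, so the
    # inner <book>true</book> elements are never visited.
    return [book.findtext("title") for book in root.findall("book")]


titles = top_level_titles(CATALOG_XML)
print(titles)
```

Because a strict XML parser keeps the inner book inside the outer one, no nesting configuration is needed here; the trade-off is that, unlike BeautifulStoneSoup, it will reject input that is not well-formed.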

Answer 3

BeautifulSoup is slow and dead, use lxml instead :)

>>> from lxml import etree
>>> rss = open('/tmp/catalog.xml')
>>> items = etree.parse(rss).xpath('//book/title/text()')
>>> items
["XML Developer's Guide", 'Midnight Rain']
>>>
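
Since the question mentions large files, here is a rough sketch of a streaming variant. It uses the standard library's xml.etree.ElementTree.iterparse rather than lxml, so it is a substitute for the XPath call above, not the same thing. The depth counter, the SAMPLE data, and the name iter_top_level_titles are all assumptions made for illustration.

```python
import io
import xml.etree.ElementTree as ET

# Small stand-in for a large catalog file (assumed shape, taken from the question).
SAMPLE = b"""<catalog>
   <book>
      <title>XML Developer's Guide</title>
      <book>true</book>
   </book>
   <book>
      <title>Midnight Rain</title>
      <book>false</book>
   </book>
</catalog>
"""


def iter_top_level_titles(source):
    """Yield the <title> text of each <book> directly under the root element."""
    depth = 0
    for event, elem in ET.iterparse(source, events=("start", "end")):
        if event == "start":
            depth += 1
            continue
        # On an "end" event, depth is the depth of elem itself (root is 1).
        if depth == 2 and elem.tag == "book":
            yield elem.findtext("title")
            # Discard the finished book's children to limit memory use.
            elem.clear()
        depth -= 1


titles = list(iter_top_level_titles(io.BytesIO(SAMPLE)))
print(titles)
```

iterparse also accepts a filename, so iter_top_level_titles('/tmp/catalog.xml') should work the same way. Note that the emptied book elements stay attached to the root, so memory still grows slowly on very large inputs.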

BeautifulSoup nested tags