Question

I am using Beautiful Soup to parse the category list at http://rtw.ml.cmu.edu/rtw/kbbrowser/, and this is the HTML I get for the page:

<html>
    <head>
        <link href="../css/browser.css" rel="stylesheet" type="text/css"/>
        <script type="text/javascript">
            if (parent.location.href == self.location.href) {
                if (window.location.href.replace)
                    window.location.replace('index.php');
                else
                    // causes problems with back button, but works
                    window.location.href = 'index.php';
            }
        </script>
    </head>
    <body id="ontology">
    ...
    </body>
</html>

My code is very simple, but when I try to access the <body> element I get None:

import urllib
from BeautifulSoup import BeautifulSoup
from bs4 import BeautifulSoup
import mechanize
from mechanize import Browser
import requests
import re
import os

link = 'http://rtw.ml.cmu.edu/rtw/kbbrowser/ontology.php'
pageFile = urllib.urlopen(link).read()
soup = BeautifulSoup(pageFile)

print soup.head.contents[0].name
print soup.html.contents[1].name

Why doesn't the head element have any siblings in this case?
I get:

AttributeError: 'NoneType' object has no attribute 'next_element'

when trying to access head.next_Sibling.

Answer 1

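This happens because text nodes are also part of contents. The whitespace between tags is parsed as text nodes, and a text node has no tag name, so its .name is None.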
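To make the text-node point concrete, here is a minimal sketch in Python 3 syntax, using the bs4 package. It runs against an inline string rather than the live page; the HTML constant is an assumption that only mirrors the shape of the markup in the question, and the "html.parser" backend is chosen for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString, Tag

# Assumed stand-in for the downloaded page (same shape as the question's HTML).
HTML = """<html>
    <head>
        <link href="../css/browser.css" rel="stylesheet" type="text/css"/>
    </head>
    <body id="ontology">
    </body>
</html>"""

soup = BeautifulSoup(HTML, "html.parser")

# The first child of <head> is the whitespace before <link>, not <link> itself.
first_child = soup.head.contents[0]

# Likewise, the node right after <head> is whitespace, not <body>.
after_head = soup.head.next_sibling

# Keeping only Tag instances drops the whitespace text nodes.
tag_children = [child for child in soup.head.contents if isinstance(child, Tag)]

# find_next_sibling() only considers tags, so it skips over text nodes.
body = soup.head.find_next_sibling("body")

print(repr(first_child), isinstance(first_child, NavigableString))
print([child.name for child in tag_children])
print(body.get("id"))
```

If you do need to walk contents directly, filtering on Tag (or using find_next_sibling) is one way to avoid landing on whitespace.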

Use CSS selectors to locate the category list instead of walking the contents attribute. For example, here is how to list the top-level categories:

for li in soup.select("body#ontology > ul > li"):
    print li.find_all("a")[-1].text
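The selector above can be tried offline with a small self-contained sketch in Python 3 syntax. The body of the real page is elided in the question, so the markup and the category names below (category_a, category_b, not_top_level) are hypothetical and serve only to show how the child combinator (>) restricts the match to top-level list items.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real <body> content is not shown in the question.
HTML = """
<html><body id="ontology">
<ul>
  <li><a href="#">+</a> <a href="#a">category_a</a></li>
  <li><a href="#">+</a> <a href="#b">category_b</a></li>
</ul>
<div>
  <ul><li><a href="#c">not_top_level</a></li></ul>
</div>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# "body#ontology > ul > li" matches only <li> elements whose <ul> parent
# is a direct child of <body id="ontology">; the list inside <div> is skipped.
top_level = soup.select("body#ontology > ul > li")

# Take the last <a> in each item, as in the answer's loop.
names = [li.find_all("a")[-1].get_text(strip=True) for li in top_level]

for name in names:
    print(name)
```

Note that find_all searches recursively, so if an item contained a nested list, the last link could belong to a descendant rather than to the item itself.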

Parsing a NELL knowledge base page with Beautiful Soup
