Question

Some packages parse a DOM tree from XML content, for example https://docs.python.org/2/library/xml.dom.minidom.html.

But I don't want to target XML, only the HTML content of web pages.

from htmldom import htmldom
dom = htmldom.HtmlDom( "http://www.yahoo.com" ).createDom()
# Find all the links present on a page and print their "href" values
a = dom.find( "a" )
for link in a:
    print( link.attr( "href" ) )

But I get this error for it:

Error while reading url: http://www.yahoo.com
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/htmldom/htmldom.py", line 333, in createDom
    raise Exception
Exception

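Look, I have already checked BeautifulSoup, but it is not what I want. BeautifulSoup only works on HTML pages; it fails if the page content is loaded dynamically with JavaScript. I also don't want to parse elements with getElementByClassName and similar methods, but with something like dom.children(0).children(1).

So is there any way, using a headless browser or Selenium, to parse the whole DOM tree structure so that I can reach the target element by going through child and sub-child?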

Answer 1

The Python Selenium API gives you everything you might need. You can start with

html = driver.find_element_by_tag_name("html")

and then go from there

body = driver.find_element_by_tag_name("body")

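body.children(x-1) is equivalent to "body.find_element_by_xpath('./*[' + str(x) + ']')". Note the leading "./": without it the XPath is evaluated from the document root instead of from body. You don't need BeautifulSoup or any other DOM traversal framework, but you certainly can use one by getting the page source and letting another library such as BeautifulSoup parse it.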
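The child-index equivalence above can be wrapped in a small helper. This is only a sketch: nth_child and descend are hypothetical names, not part of Selenium, and the code assumes the generic find_element(by, value) method of a WebElement, where the string "xpath" has the same value as By.XPATH.

```python
XPATH = "xpath"  # assumption: same string value as selenium's By.XPATH


def nth_child(element, index):
    """Return the index-th (0-based) element child of a Selenium WebElement."""
    # XPath positions are 1-based, and the leading "./" keeps the
    # search relative to this element instead of the document root.
    return element.find_element(XPATH, "./*[{}]".format(index + 1))


def descend(element, *indices):
    """Follow a chain of child indices, like dom.children(i).children(j)."""
    for index in indices:
        element = nth_child(element, index)
    return element
```

With a running driver, something like descend(driver.find_element("xpath", "/*"), 1, 0) would then return the first element child of the second child of the root element, which on a typical page is the first element inside body.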

Answer 2

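Yes, but it is not simple enough to include the code in an SO post. You are on the right track, though.

Basically, you need to use a headless renderer of your choice (e.g. Selenium) to download all the resources and execute the JavaScript. There is really no use reinventing the wheel there.

Then you need to echo the HTML from the headless renderer to a file on the page-ready event (every headless browser I have worked with offers this). At that point you can use BeautifulSoup on that file to navigate the DOM. BeautifulSoup does support child-based traversal: http://www.crummy.com/software/BeautifulSoup/bs4/doc/#going-down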
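To illustrate the second step, here is a rough sketch of child-based navigation over HTML that has already been rendered and saved. It deliberately swaps BeautifulSoup for html.parser from the standard library so that it runs without extra packages; Node, TreeBuilder and parse_html are hypothetical names made up for this example, and the markup string stands in for the file written by the headless renderer.

```python
from html.parser import HTMLParser

# Elements that never get a closing tag, so they must not stay open.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.kids = []   # element children only
        self.text = ""   # text found directly inside this element

    def children(self, index):
        """Return the index-th (0-based) element child."""
        return self.kids[index]


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#document")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.kids.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Close the nearest open element with this name; ignore stray end tags.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


def parse_html(markup):
    builder = TreeBuilder()
    builder.feed(markup)
    builder.close()
    return builder.root


markup = ("<html><head><title>Demo</title></head>"
          "<body><div><p>first</p><p>second</p></div></body></html>")
dom = parse_html(markup)
# html -> body -> div -> second p
target = dom.children(0).children(1).children(0).children(1)
print(target.tag, target.text)
```

This toy parser does not repair badly broken markup the way a real HTML parser does, so for production use BeautifulSoup (whose .contents and .children attributes give the same kind of positional access) is the safer choice.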

Is there a way to parse the DOM tree of a website's content?

2 answers: