Python and lxml.html: get_element_by_id output problem

Date: 2014-12-26 07:10:30

Tags: python html html-parsing lxml lxml.html

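I am currently trying to extract data from an HTML file. The code I am using seems to work, but not the way I expect. I can get some items but not all of them, and I wonder whether this has something to do with the size of the file I am trying to read.

I am currently trying to parse the source of this webpage.

The page is 4,500 lines long, so it is fairly large. I have been using this page because I want to make sure the code works on large files.

The code I am using is: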

import lxml.html
import lxml
import urllib2

webHTML = urllib2.urlopen('http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html').read()
webHTML = lxml.html.fromstring(webHTML)
productDetails = webHTML.get_element_by_id('productDetails')
for element in productDetails:
    print element.text_content()

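When I use an element id such as 'mm3', or something else near the top of the page, this gives the expected output, but when I use the element id 'productDetails' I get no output. At least, that is what happens on my current setup.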
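One way to narrow this down is to tell apart "the id was not found" from "the id was found but the element has no children". The sketch below is written for Python 3 and uses a small hypothetical snippet in place of the real page; the markup, the `SAMPLE` name, and the `describe` helper are assumptions made for illustration. It relies on `get_element_by_id` raising `KeyError` for a missing id, and on iteration over an lxml element yielding only its direct children.

```python
import lxml.html

# Hypothetical stand-in for the real page: 'mm3' has children,
# 'productDetails' exists but is empty.
SAMPLE = (
    '<div>'
    '<ul id="mm3"><li>Home</li><li>Store</li></ul>'
    '<h3 id="productDetails"></h3>'
    '<p>Product description text.</p>'
    '</div>'
)

doc = lxml.html.fromstring(SAMPLE)


def describe(root, element_id):
    """Report whether an id is missing, empty, or has child elements."""
    try:
        element = root.get_element_by_id(element_id)
    except KeyError:
        # Raised by lxml.html when no element carries this id.
        return 'missing'
    # len() of an lxml element is the number of direct children,
    # which is exactly what a for loop over the element visits.
    if len(element) == 0:
        return 'empty'
    return 'has %d children' % len(element)


for element_id in ('mm3', 'productDetails', 'noSuchId'):
    print(element_id, '->', describe(doc, element_id))
```

A silent loop with no `KeyError`, as described in the question, corresponds to the 'empty' case here: the element was found, but the parsed tree has nothing inside it.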

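1 Answer:

Answer 0: (score: 1)

I am afraid lxml.html cannot parse this particular HTML source correctly. It parses the h3 tag with id="productDetails" as an empty element (and that is in the default "recover" mode):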

<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
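To see for yourself what the parser built, you can serialize the element and look at its neighbours. This is only a sketch in Python 3: the `SAMPLE` snippet is a hypothetical reduction that already has the empty h3 in it, not the real page, so it shows how to inspect the tree rather than reproducing the parsing problem itself.

```python
import lxml.html

# Hypothetical reduced markup: the h3 is empty and the text sits beside it.
SAMPLE = (
    '<div>'
    '<h3 class="productDescription2" id="productDetails" '
    'itemprop="description"></h3>'
    '<p>Product description text.</p>'
    '</div>'
)

doc = lxml.html.fromstring(SAMPLE)
heading = doc.get_element_by_id('productDetails')

# Serialize only this element (no trailing text) to see what lxml holds.
markup = lxml.html.tostring(heading, encoding='unicode', with_tail=False)
print(markup)

# No children and no text means the element is empty in the parsed tree.
print(len(heading), repr(heading.text))

# The next sibling shows where the neighbouring content ended up.
sibling = heading.getnext()
print(sibling.tag, sibling.text_content())
```

If the content you expected inside the element turns up in a sibling instead, the parser has closed the element earlier than the source markup suggests.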

Switch to BeautifulSoup with the html5lib parser (which is extremely lenient):

from urllib2 import urlopen
from bs4 import BeautifulSoup

url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
soup = BeautifulSoup(urlopen(url), 'html5lib')

for element in soup.find(id='productDetails').find_all():
    print element.text

Prints:

Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.

outrunner

...
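The code above is Python 2 (`urllib2` and the `print` statement). For reference, here is a rough Python 3 sketch of the same BeautifulSoup calls, run against a small inline snippet so that it does not depend on the page still being online. The `SAMPLE` markup is hypothetical, and the `beautifulsoup4` and `html5lib` packages are assumed to be installed. To fetch a live page instead, `urllib.request.urlopen` replaces `urllib2.urlopen` in Python 3.

```python
from bs4 import BeautifulSoup

# Hypothetical markup with block-level children nested inside the h3.
SAMPLE = (
    '<h3 class="productDescription2" id="productDetails" '
    'itemprop="description">'
    '<p>First paragraph.</p>'
    '<p>Second paragraph.</p>'
    '</h3>'
)

# html5lib builds the tree following the HTML5 parsing rules.
soup = BeautifulSoup(SAMPLE, 'html5lib')

details = soup.find(id='productDetails')

# find_all() with no arguments returns every descendant tag.
texts = [element.get_text() for element in details.find_all()]
for text in texts:
    print(text)
```

Note that `find_all()` walks all descendants, not just direct children, so nested tags contribute their text more than once; pass `recursive=False` if you only want the direct children.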