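I am currently trying to extract data from an HTML file. The code I am using seems to work, but not the way I expect. I can get some items but not all of them, and I am wondering whether this is related to the size of the file I am trying to read.
I am currently trying to parse the source of this webpage.
The page is about 4,500 lines long, so it is a fairly large file. I have been using it because I want to make sure the code works on large files.
The code I am using is: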
# Python 2 (urllib2 and the print statement)
import urllib2
import lxml.html
url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
webHTML = urllib2.urlopen(url).read()
# Parse the downloaded source into an element tree
webHTML = lxml.html.fromstring(webHTML)
productDetails = webHTML.get_element_by_id('productDetails')
# Iterating over an element yields its direct child elements
for element in productDetails:
    print element.text_content()
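This gives the expected output when I use an element id of 'mm3' or something else near the top of the page, but if I use the element id 'productDetails' I get no output. At least, that is what happens on my current setup.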
Answer 0 (score: 1)
I'm afraid lxml.html cannot handle this particular HTML source. It parses the h3 tag with id="productDetails" as an empty element (and this is in the default "recover" mode):
<h3 class="productDescription2" id="productDetails" itemprop="description"></h3>
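To check what lxml actually built for an element, one option is to serialize it back to markup and count its children. The sketch below is only an illustration under stated assumptions: `describe_element` is a hypothetical helper name, and the inline markup is an invented example in which the h3 really is empty, so it demonstrates the diagnostic rather than reproducing the parser's recovery behaviour on the real page.

```python
import lxml.html


def describe_element(html_text, element_id):
    """Summarize what lxml parsed for the element with the given id."""
    root = lxml.html.fromstring(html_text)
    element = root.get_element_by_id(element_id)
    return {
        # len() of an element is the number of direct child elements
        "children": len(element),
        # all text inside the element, including nested tags
        "text": element.text_content(),
        # the element serialized back to markup
        "markup": lxml.html.tostring(element, encoding="unicode"),
    }


# Invented sample: the h3 is empty and the text sits in a sibling <p>.
sample = '<html><body><h3 id="productDetails"></h3><p>Motor specs</p></body></html>'
info = describe_element(sample, "productDetails")
print(info["children"], repr(info["text"]))
print(info["markup"])
```

When `children` is 0, a loop such as `for element in productDetails` has nothing to iterate over, which matches the missing output described in the question.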
Switch to the html5lib parser (which is extremely lenient) with BeautifulSoup:
# Python 2
from urllib2 import urlopen
from bs4 import BeautifulSoup
url = 'http://hobbyking.com/hobbyking/store/__39036__Turnigy_Multistar_2213_980Kv_14Pole_Multi_Rotor_Outrunner.html'
# html5lib tolerates broken markup; it must be installed separately
soup = BeautifulSoup(urlopen(url), 'html5lib')
# find_all() with no arguments returns every descendant tag
for element in soup.find(id='productDetails').find_all():
    print element.text
Prints:
Looking for the ultimate power system for your next Multi-rotor project? Look no further!The Turnigy Multistar outrunners are designed with one thing in mind - maximising Multi-rotor performance! They feature high-end magnets, high quality bearings and all are precision balanced for smooth running, these motors are engineered specifically for multi-rotor use.These include a prop adapter and have a built in aluminium mount for quick and easy installation on your multi-rotor frame.
outrunner
...
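For readers on Python 3, where `urllib2` no longer exists, the same approach might look roughly like the sketch below. It is an adaptation, not the code from the answer: `fetch` and `texts_by_id` are hypothetical helper names, the inline markup is an invented stand-in for the downloaded page, and the `html5lib` package has to be installed separately for the default parser to work.

```python
from urllib.request import urlopen  # Python 3 replacement for urllib2.urlopen

from bs4 import BeautifulSoup


def fetch(url):
    """Download the raw bytes of a page."""
    with urlopen(url) as response:
        return response.read()


def texts_by_id(html, element_id, parser="html5lib"):
    """Return the text of every tag nested inside the element with this id."""
    soup = BeautifulSoup(html, parser)
    container = soup.find(id=element_id)
    if container is None:
        return []
    # find_all() with no arguments yields every descendant tag.
    return [tag.get_text() for tag in container.find_all()]


# Invented sample; for the real page, pass fetch(url) instead.
sample = '<h3 id="productDetails"><span>first</span><span>second</span></h3>'
for text in texts_by_id(sample, "productDetails", parser="html.parser"):
    print(text)
```

Because `find_all()` returns only descendant tags, any text sitting directly inside the container (not wrapped in a child tag) would not appear in this list; calling `get_text()` on the container itself would include it.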