Suppose we have:
import urllib2
from lxml import etree
url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)
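where the URL is a standard eBay search results page with some filters applied:
I would like to extract the product prices, e.g. $40.00, $34.95, and so on.
There are a few candidate XPaths (suggested by Firebug, the XPath Checker Firefox add-on, and manual inspection of the source):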
/html/body/div[5]/div[2]/div[3]/div/div[1]/div/div[3]/div/div[1]/div/w-root/div/div/ul/li[1]/ul[1]/li[1]/span
id('item3d00cf865e')/x:ul[1]/x:li[1]/x:span
//span[@class ='bold bidsold']
Choosing the last one:
xpathselector="//span[@class ='bold bidsold']"
tree.xpath(xpathselector)
This returns a list of Element objects, as expected. When I read the .text attribute of each one, I expected to get the prices. Instead I get:
In [17]: tree.xpath(xpathselector)
Out[17]:
['\n\t\t\t\t\t',
u' 1\xc2\xa0103.78',
'\n\t\t\t\t\t',
u' 1\xc2\xa0048.28',
'\n\t\t\t\t\t',
' 964.43',
'\n\t\t\t\t\t',
' 922.43',
'\n\t\t\t\t\t',
' 922.43',
'\n\t\t\t\t\t',
' 275.67',
'\n\t\t\t\t\t',
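Each value looks like a price, but (i) the prices are far higher than the ones shown on the web page, and (ii) I don't understand what all the newlines and tabs are doing there. Am I doing something fundamentally wrong in trying to extract the prices?
I normally use WebDriver for this kind of thing and locate elements with CSS selectors, XPath and class names. In this case I don't want any browser interaction, which is why I am using urllib2 and lxml for the first time.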
Answer 0 (score: 1)
I see two possible scenarios:
I would suggest checking the following next:
Answer 1 (score: 1)
I wrote two examples in Python.
Example 1:
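import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    # Select the price spans of the sold listings.
    xpathselector = "//span[@class ='bold bidsold']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with a code point below 64 (digits, punctuation,
        # whitespace); this drops letters and the \xc2\xa0 characters.
        # strip() then removes the surrounding tabs and newlines.
        print "".join(filter(lambda x: ord(x) < 64, i.text)).strip()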
Example 2:
import urllib2
from lxml import etree

if __name__ == '__main__':
    url = "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    # Same as Example 1, but the XPath union (|) also matches
    # spans whose class is 'sboffer'.
    xpathselector = "//span[@class ='bold bidsold']|//span[@class='sboffer']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with a code point below 64 (digits, punctuation,
        # whitespace), then strip the surrounding tabs and newlines.
        print "".join(filter(lambda x: ord(x) < 64, i.text)).strip()
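The `ord(x) < 64` filter keeps every character below code point 64, which includes punctuation such as `$` and `,` as well as digits. If numeric values are what you want, a more explicit alternative is to keep only digits and the decimal point and convert the result to a number. The sketch below is only an illustration: `parse_price` is a hypothetical helper name, the sample strings are copied from the output in the question, and it assumes `.` is the decimal separator (a page that uses a decimal comma would need different handling).

```python
import re
from decimal import Decimal, InvalidOperation


def parse_price(text):
    """Return the numeric part of a scraped price string as a Decimal, or None.

    Hypothetical helper; assumes '.' is the decimal separator.
    """
    if text is None:
        return None
    # Drop everything except digits and the decimal point. This removes tabs,
    # newlines, currency symbols and the '\xc2\xa0' thousands separator.
    digits = re.sub(r"[^0-9.]", "", text)
    try:
        return Decimal(digits)
    except InvalidOperation:
        # Nothing numeric was left (for example a whitespace-only string).
        return None


# Sample values taken from the output shown in the question.
samples = [u"\n\t\t\t\t\t", u" 1\xc2\xa0103.78", u" 964.43"]
prices = [p for p in (parse_price(s) for s in samples) if p is not None]
print(prices)
```

In the loops above this would be called as `parse_price(i.text)`. Whitespace-only entries come back as `None` and can be skipped, as the list comprehension does.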