Question

Suppose I have:

import urllib2
from lxml import etree

url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
response = urllib2.urlopen(url)
htmlparser = etree.HTMLParser()
tree = etree.parse(response, htmlparser)

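where the URL is a standard eBay search results page with some filters applied.

I would like to extract the product prices, e.g. $40.00, $34.95, and so on.

There are a few possible XPaths (provided by Firebug, the XPath Checker Firefox add-on, and manual inspection of the source):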

/html/body/div[5]/div[2]/div[3]/div/div[1]/div/div[3]/div/div[1]/div/w-root/div/div/ul/li[1]/ul[1]/li[1]/span
id('item3d00cf865e')/x:ul[1]/x:li[1]/x:span
//span[@class ='bold bidsold']

Choosing the last one:

xpathselector="//span[@class ='bold bidsold']"

tree.xpath(xpathselector) then returns a list of Element objects, as expected. When I read their .text attribute, I expected to get the prices. Instead I get:

In [17]: tree.xpath(xpathselector)
Out[17]: 
['\n\t\t\t\t\t',
 u' 1\xc2\xa0103.78',
 '\n\t\t\t\t\t',
 u' 1\xc2\xa0048.28',
 '\n\t\t\t\t\t',
 ' 964.43',
 '\n\t\t\t\t\t',
 ' 922.43',
 '\n\t\t\t\t\t',
 ' 922.43',
 '\n\t\t\t\t\t',
 ' 275.67',
 '\n\t\t\t\t\t',

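The values look like prices, but (i) they are far higher than the prices shown on the web page, and (ii) I don't understand what all the newlines and tabs are doing there. Is there something fundamentally wrong with the way I am trying to extract the prices?

I usually use WebDriver for this kind of thing and locate elements with CSS selectors, XPath, and class names. In this case, however, I don't want any browser interaction, which is why I am using urllib2 and lxml for the first time.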


Answer 1

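I see two possible scenarios:

It looks like eBay checks your locale and converts prices into your country's currency. When you open the page in a browser, it probably reads some browser settings; when you run your code, it may read the settings from somewhere else.
eBay may adjust the prices with JavaScript (client-side), in which case you cannot capture them with a parser.

I suggest checking the following next:

Check which currency is used when you run your code.
Check the page source and confirm that the prices there are exactly the same as the ones you see in the browser.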
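The first check can be sketched roughly as follows. This is only an illustration written for Python 3: the HTML fragment is invented (eBay's real markup may differ), the `RUB` currency code is a hypothetical example, and the helper name `price_span_texts` is made up. It relies on one thing that is certain about lxml: an element's `.text` holds only the text before its first child element, while `itertext()` walks all the text inside it. If the price spans contain a child element, that would be consistent with the whitespace-only strings in the question.

```python
from lxml import etree

# Invented fragment for illustration only; eBay's real markup may differ.
SAMPLE_HTML = b"""
<html><body>
  <span class="bold bidsold">
      <b>RUB</b> 1\xc2\xa0103.78</span>
  <span class="bold bidsold">
      <b>RUB</b> 964.43</span>
</body></html>
"""


def price_span_texts(html_bytes):
    """Return (.text, full text) pairs for each matching price span."""
    parser = etree.HTMLParser(encoding="utf-8")
    root = etree.fromstring(html_bytes, parser)
    spans = root.xpath("//span[@class='bold bidsold']")
    # .text stops at the first child element; itertext() includes the
    # child's text and the text that follows it (the child's tail).
    return [(span.text, "".join(span.itertext())) for span in spans]


for own_text, full_text in price_span_texts(SAMPLE_HTML):
    # repr() makes tabs, newlines and non-breaking spaces visible.
    print(repr(own_text), "|", repr(full_text))
```

Printing with `repr()` shows any currency code and any non-breaking space that the server sent, which can then be compared with what the browser displays.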

Answer 2

I wrote two examples in Python.

Example 1:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
    xpathselector="//span[@class ='bold bidsold']"
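    for i in tree.xpath(xpathselector):
        # Keep only characters with code point < 64 (digits, '.', '$', spaces);
        # this drops letters and the non-breaking space. i.text may be None.
        print "".join(filter(lambda x: ord(x)<64, i.text or "")).strip()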

Example 2:

import urllib2
from lxml import etree

if __name__ == '__main__':
    url =  "http://www.ebay.com/sch/i.html?rt=nc&LH_Complete=1&_nkw=Under+Armour+Dauntless+Backpack&LH_Sold=1&_sacat=0&LH_BIN=1&_from=R40&_sop=3&LH_ItemCondition=1000"
    response = urllib2.urlopen(url)
    htmlparser = etree.HTMLParser()
    tree = etree.parse(response, htmlparser)
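    # Same as Example 1, but also match spans with class 'sboffer'.
    xpathselector="//span[@class ='bold bidsold']|//span[@class='sboffer']"
    for i in tree.xpath(xpathselector):
        # Keep only characters with code point < 64; i.text may be None.
        print "".join(filter(lambda x: ord(x)<64, i.text or "")).strip()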
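Both examples rely on the same filter: it keeps only characters whose code point is below 64 (digits, the decimal point, `$`, spaces and other punctuation) and drops letters and the non-breaking space. As a minimal sketch, here is that filter in isolation, written for Python 3 and applied to strings in the style of the question's output. The helper name `keep_low_codepoints` and the `RUB` sample are made up for illustration.

```python
def keep_low_codepoints(text):
    """Keep characters with code point < 64, then trim surrounding whitespace.

    Accepts None because lxml's .text is None when an element has no
    leading text.
    """
    if text is None:
        return ""
    return "".join(ch for ch in text if ord(ch) < 64).strip()


samples = [
    "\n\t\t\t\t\t",      # whitespace only, as in the question's output
    " 1\u00a0103.78",    # non-breaking space used as a thousands separator
    "RUB 964.43",        # hypothetical currency code before the amount
    "$40.00",
]

for raw in samples:
    print(repr(raw), "->", repr(keep_low_codepoints(raw)))
```

The filter also keeps commas, so a price such as `1,103.78` would still need its separator removed before it can be converted to a number.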

Extracting text from a span with lxml?