Question

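Given this page, I would like to get the value of the Style ID:

I used the browser's developer tools to get a unique selector:

li.attribute-list-item:nth-child(1) > span:nth-child(1)

Then I used urllib2 and lxml's CSS support: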

import urllib2
from lxml import etree
from lxml.cssselect import CSSSelector

# url holds the address of the product page
req = urllib2.Request(url, headers={'User-Agent': "Magic Browser"})
con = urllib2.urlopen(req)
htmlparser = etree.HTMLParser()
tree = etree.parse(con, htmlparser)
# Compile the selector copied from the developer tools
x = CSSSelector('li.attribute-list-item:nth-child(1) > span:nth-child(1)')

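If I then take the text value of a single element of x(tree):

it gives me the text 'Style ID' rather than the actual value that follows it. This is what the markup looks like:

How do I get the number (555088 117 in this case)? I am also open to BeautifulSoup-based suggestions.

Edit: I am specifically looking for a CSS-based approach (class names or selectors).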

Answer 1

Using requests + lxml:

import requests
from lxml import html

response = requests.get("http://www.flightclub.com/air-jordan-1-retro-high-og-unc-white-dk-powder-blue-012304")
tree = html.fromstring(response.content)

style_id = tree.xpath('//ul[@class="mb-padding product-attribute-list"]/li[@class="attribute-list-item"][1]/text()[2]')[0].replace(',','').strip()
print(style_id)

Output:

555088 117

Note:

To avoid IndexError: list index out of range in case the site structure changes, you can replace:

style_id = tree.xpath('//ul[@class="mb-padding product-attribute-list"]/li[1]/text()[2]')[0].replace(',','').strip()

with:

style_id = ''.join(tree.xpath('//ul[@class="mb-padding product-attribute-list"]/li[1]/text()[2]')).replace(',','').strip()
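
The difference between the two variants can be sketched on a page where the attribute list is missing. The markup in CHANGED_PAGE is a hypothetical example of a changed site, and the names CHANGED_PAGE and XPATH are made up for illustration.

```python
from lxml import html

# Hypothetical page after a site change (an assumption): the attribute
# list is gone, so the XPath below matches nothing.
CHANGED_PAGE = "<html><body><p>Product unavailable</p></body></html>"

XPATH = '//ul[@class="mb-padding product-attribute-list"]/li[1]/text()[2]'

tree = html.fromstring(CHANGED_PAGE)
nodes = tree.xpath(XPATH)  # an empty list on this page

# Variant 1: indexing the result raises IndexError when nothing matches.
try:
    indexed = nodes[0].replace(',', '').strip()
except IndexError:
    indexed = None

# Variant 2: joining the result yields an empty string instead of raising.
joined = ''.join(nodes).replace(',', '').strip()

print(repr(indexed))
print(repr(joined))
```

This prints None and then ''. The join variant fails silently, so the caller probably still wants to check for an empty string before using the value.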

How to extract the text between two spans using lxml (or BeautifulSoup)?
