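XPath returns an empty list (namespace problem?)

Time: 2016-06-25 18:12:49

Tags: python xpath namespaces

I expect the following code to return the text "In Stock" or "Out of Stock" (it checks an online store's inventory), but it only returns "[]". The XPath was copied from the browser's element inspector and appears to be valid. I have read online that namespaces can cause this kind of problem. Any hints?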

from lxml import html
import requests

url = 'http://www.thesource.ca/en-ca/computers-and-tablets/computer-accessories/mice/logitech-m310-wireless-mouse/p/2618659'
path = '//*[@id="content"]/section/section/div/font/div[7]/div/div[1]/div[2]/ul/li[1]/div/text()'

page = requests.get(url)
tree = html.fromstring(page.content)
stock = tree.xpath(path)
print(stock)

Edit: solution based on Padraic Cunningham's post.

It is still not the most elegant, since it relies on some positional paths, but at least it works:

from lxml import html
import requests
import re

# in stock example URL
#url = 'http://www.thesource.ca/en-ca/computers-and-tablets/computer-accessories/mice/logitech-m310-wireless-mouse/p/2618659'

# out of stock example URL
url = 'http://www.thesource.ca/en-ca/computers-and-tablets/computer-accessories/mice/microsoft-basic-optical-mouse/p/108029878'

path = '//ul[@class="availability"]/li[./div[1]]'
inner_path = './div[1]/text()'

page = requests.get(url)
tree = html.fromstring(page.content)
stock = tree.xpath(path)
current = stock[0].xpath(inner_path)

print(current[0])
if re.search(r'in.*stock.*online', current[0], flags=re.IGNORECASE):
    print("Success!")
else:
    print("Keep waiting...")

1 Answer:

Answer 0: (score: 1)

Your XPath is wrong:

from lxml import html
import requests

url = 'http://www.thesource.ca/en-ca/computers-and-tablets/computer-accessories/mice/logitech-m310-wireless-mouse/p/2618659'
path = '//ul[@class="availability"]/li[./div[@class="availability-text in-stock"]]'

page = requests.get(url)
tree = html.fromstring(page.content)

stock = tree.xpath(path)
current = stock[0].xpath('./div[@class="availability-text in-stock"]/text()')
print(current[0])
for node in stock[1:]:
    print(node.xpath('./div[@class="availability-text in-stock"]/a/@aria-label'))

This gives you:

  In Stock Online
In Stock   YORKDALE  MALL
In Stock   LAWRENCE SQUARE

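The availability information is in the unordered list with the availability class. Our XPath pulls every li child that contains a div with the availability-text in-stock class; all of those divs except the first have an anchor inside: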

            <a class="underline"
            aria-label="In Stock &nbsp; YORKDALE  MALL"
            title="View Store Details"
            href="#product-store-availability">
                YORKDALE  MALL</a>

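You can see that the aria-label contains both the availability and the store.

If you want to separate the availability from the store, you can split on the &nbsp; (a non-breaking space, which is "\xa0" once the page has been parsed):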

print(node.xpath('./div[@class="availability-text in-stock"]/a/@aria-label')[0].split("\xa0"))

Which gives:

['In Stock ', ' YORKDALE  MALL']
['In Stock ', ' LAWRENCE SQUARE']
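
The split results still carry stray spaces around each part. A minimal sketch of a helper that tidies them up is shown here; the name split_label is hypothetical and not part of lxml, and it assumes the label contains at most one non-breaking space separating the two parts:

```python
def split_label(label):
    # Split on the first non-breaking space (U+00A0), then trim the
    # ordinary spaces left on either side of it.
    availability, _, store = label.partition("\xa0")
    return availability.strip(), store.strip()


# Example with the aria-label value shown above.
print(split_label("In Stock \xa0 YORKDALE  MALL"))
# prints ('In Stock', 'YORKDALE  MALL')
```

If the label has no non-breaking space at all, the second element of the tuple is simply an empty string.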

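Your browser tools are essential when scraping; just don't rely on the XPath or selector they give you when you right-click and choose "copy xpath/selector". Look at the page source instead and try to find an id or class name associated with the content you are trying to parse.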
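
As a rough illustration of selecting by class rather than by position, the sketch below runs a class-token XPath against a small hand-written document. The markup is hypothetical, modelled on the anchor fragment shown earlier, and the out-of-stock class name is an assumption; the real page may use something different. Matching the availability-text token with contains() lets one path pick up the divs whatever their stock state:

```python
from lxml import html

# Hypothetical markup modelled on the anchor fragment above; the real page
# may differ, and the "out-of-stock" class name is an assumption.
SAMPLE = """
<html><body>
<ul class="availability">
  <li><div class="availability-text in-stock"> In Stock Online</div></li>
  <li><div class="availability-text in-stock">
    <a class="underline" aria-label="In Stock &nbsp; YORKDALE  MALL"
       href="#product-store-availability">YORKDALE  MALL</a></div></li>
  <li><div class="availability-text out-of-stock">
    <a class="underline" aria-label="Out of Stock &nbsp; LAWRENCE SQUARE"
       href="#product-store-availability">LAWRENCE SQUARE</a></div></li>
</ul>
</body></html>
"""

# Match the class token "availability-text" whatever other classes are present.
TOKEN = 'contains(concat(" ", normalize-space(@class), " "), " availability-text ")'
PATH = '//ul[@class="availability"]/li/div[' + TOKEN + ']'

tree = html.fromstring(SAMPLE)
divs = tree.xpath(PATH)

# The first div holds the online status as plain text.
online_status = divs[0].text_content().strip()

# The remaining divs carry their status in the anchor's aria-label.
store_labels = [label for div in divs[1:] for label in div.xpath('./a/@aria-label')]

print(online_status)
print(store_labels)
```

Because the path depends only on the list's class and a class token, it keeps working if the site wraps the list in extra containers, which is exactly where a copied positional path tends to break.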

If you only want the first one, you can still do it with XPath:

from lxml import html
import requests

url = 'http://www.thesource.ca/en-ca/computers-and-tablets/computer-accessories/mice/logitech-m310-wireless-mouse/p/2618659'
path = '(//ul[@class="availability"]/li/div[@class="availability-text in-stock"])[1]/text()'

page = requests.get(url)
tree = html.fromstring(page.content)
stock = tree.xpath(path)
success = {"in", "stock"}

# succeed when every required word appears in the text; extra words such as "online" are allowed
if stock and success.issubset(stock[0].lower().split()):
    print("Success")
else:
    print("Failure")
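
The word check can also be pulled out into a small function so it can be tried without fetching the page. This is only a sketch: the name is_in_stock is hypothetical, and it treats the required words as a subset of the words in the text, so extra words such as "online" do not cause a failure:

```python
REQUIRED_WORDS = {"in", "stock"}


def is_in_stock(text):
    # Lower-case the text and break it into whitespace-separated words,
    # then require every word in REQUIRED_WORDS to be present.
    words = set(text.lower().split())
    return REQUIRED_WORDS.issubset(words)


print(is_in_stock(" In Stock Online"))  # prints True
print(is_in_stock("Out of Stock"))      # prints False
print(is_in_stock(""))                  # prints False
```

A phrase such as "Not In Stock" would still pass this test, so the word list should be adjusted to whatever wording the site actually uses.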