Question

I'm trying to extract a product ID from the onclick attribute inside a "preceding-sibling" element, which is a ul tag (id="ShowProductImages").

The number I want to extract comes directly after ?pid=, for example:

...list/ViewAll?pid=234565&amp;image=206...

Here is the markup I'm trying to extract it from:

<ul id="ShowProductImages" class="imageView">
    <li><a href="" target="_blank" onClick="javascript:initWindow('http://products.example.com/products/list/ViewAll?pid=234565&amp;image=754550',520,520,100,220);return false;"><img src="http://content.example.com/assets/images/products/j458jk.jpg" width="200" height="150" alt="Product image description here" border="0"></a></li>        
</ul>

<div class="description">
    Description here...
</div>

I'm using XPath to select the onclick attribute and a regular expression to extract the ID. This is the code I'm using (it doesn't work):

def parse(self, response):
  sel = HtmlXPathSelector(response)
  products_path = sel.xpath('//div[@class="description"]')
  for product_path in products_path:
   product = Product()
   product['product_pid'] = product_path.xpath('preceding-sibling::ul[@id="ShowProductImages"][1]//li/a[1]/@onclick').re(r'(?:pid=)(.+?)(?:\'|$)')
   yield product

Any suggestions? I'm not quite sure where I'm going wrong.

Thanks in advance for your help.

Answer 1

I suggest you try this instead: select starting from the ul, and test for its <div class="description"> sibling in a predicate:

sel.xpath("""//ul[following-sibling::div[@class="description"]]
                 [@id="ShowProductImages"]
                 /li/a[1]/@onclick""").re(r'(?:pid=)(\d+)')

I also changed the regular expression so that it only matches digits.
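To see what the regex change does, here is a minimal sketch using only Python's standard `re` module rather than Scrapy itself. It assumes the onclick value has already been entity-decoded (`&amp;` turned into `&`), and the variable names `onclick`, `original`, and `revised` are made up for illustration:

```python
import re

# The onclick value from the question's HTML, assuming the HTML parser
# has already decoded &amp; to &.
onclick = (
    "javascript:initWindow('http://products.example.com/products/list/"
    "ViewAll?pid=234565&image=754550',520,520,100,220);return false;"
)

# Pattern from the question: the lazy group runs until the closing quote,
# so it also picks up the trailing "&image=..." parameter.
original = re.findall(r"(?:pid=)(.+?)(?:\'|$)", onclick)

# Pattern from the answer: \d+ stops at the first non-digit character.
revised = re.findall(r"(?:pid=)(\d+)", onclick)

print(original)  # ['234565&image=754550']
print(revised)   # ['234565']
```

So even when the XPath selects the right attribute, the question's pattern would likely return more than just the product ID, which is why restricting the capture group to digits helps.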

XPath / regex problem in a Scrapy spider
