How to restrict a Scrapy spider to scraping only certain XPaths

Time: 2013-06-25 12:12:50

Tags: python scrapy

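I am trying to scrape a website. On a product page I want to scrape the product description, but how do I select only the product description?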

link to page

XPath: hxs.select('//div[@class="product-shop"]/p/text()').extract()

The HTML is very large, so please refer to the link above.

I only want to select the product description, not the other details.

If I do this:

[" ".join([i.strip() for i in hxs.select('//div[@class="product-shop"]/p/text()').extract()])]

Output:
[u'Itemcode: 12BTS28271 Brand: BASICS InStock - Ships within 2 business days. Tip: 90% of our shipments reach within 4 business days! This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

But I only want:

[u'This product is part of the Basics T.shirts line made of 100% Cotton. Stripes Muscle Fit T.shirts that come in Green Color. Casual that comes with Henley away.']

1 answer:

Answer 0: (score: 2)

Right-clicking the element in the Elements panel in Chrome and copying its XPath gives me:

//*[@id="product_addtocart_form"]/div[2]/div[1]/p[3]

which points to

<p>This product is part of the Basics T.shirts line made of 100% Cotton.<br>
                        Stripes Muscle Fit T.shirts that come in Green Color.<br>
                        Casual that comes with Henley away.</p>

Trying the same XPath on this page also points to the description there:

<p>This product is part of the Basics Shirts line made of 100% Cotton.<br>
                    Plain Slim Fit Shirts that come in Orange Color.<br>
                    Casual that comes with Button Down away.</p>

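So it looks like all you need to do is call that XPath on the page and you are set. You should still verify that the XPath works in all cases, because a positional path like this can easily break when the layout differs from page to page.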
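
A minimal sketch of that idea, under stated assumptions: the standard library's xml.etree.ElementTree is swapped in for Scrapy's selector so the snippet runs on its own, and SAMPLE_PAGE is a hypothetical, well-formed stand-in for the real product page (only the three description lines come from the page quoted above; the surrounding div layout is guessed so that the positional path resolves). The helper extract_description is an illustrative name, not part of Scrapy. In the spider itself, the same expression would go to hxs.select(...) as in the question.

```python
import xml.etree.ElementTree as ET

# XPath copied from Chrome, with a leading "." because ElementTree
# only accepts relative paths.
DESCRIPTION_XPATH = ".//*[@id='product_addtocart_form']/div[2]/div[1]/p[3]"

# Hypothetical, simplified page. ElementTree needs well-formed markup,
# so <br> is written as <br/> here.
SAMPLE_PAGE = """
<html><body>
<form id="product_addtocart_form">
  <div class="product-img-box"></div>
  <div>
    <div class="product-shop">
      <p>Itemcode: 12BTS28271</p>
      <p>Brand: BASICS</p>
      <p>This product is part of the Basics T.shirts line made of 100% Cotton.<br/>
         Stripes Muscle Fit T.shirts that come in Green Color.<br/>
         Casual that comes with Henley away.</p>
    </div>
  </div>
</form>
</body></html>
"""


def extract_description(page_source):
    """Return the description text, or None if the XPath matches nothing."""
    root = ET.fromstring(page_source)
    nodes = root.findall(DESCRIPTION_XPATH)
    if not nodes:
        # The positional path did not resolve on this page layout.
        return None
    # itertext() yields the paragraph text plus the text after each <br/>.
    parts = (part.strip() for part in nodes[0].itertext())
    return " ".join(part for part in parts if part)


if __name__ == "__main__":
    print(extract_description(SAMPLE_PAGE))
```

Returning None when nothing matches makes it easy to log the pages where the positional path does not resolve, which is one way to check that the XPath holds across the whole site.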