Extract information from all matching nodes without looping over the XPath results

Time: 2017-11-14 16:26:04

Tags: python-2.7 xpath scrapy

<ul class="products-grid">
    <li class="item">
        <div class="product-block">
            <div class="product-block-inner">
                <a href="#" title="Product A" class="product-image"><img src="#/producta.jpg"></a>
                <h2 class="product-name"><a href="#">Product A</a></h2>
                <div class="price-box">
                    <span class="regular-price" id="#">
                        <span class="price">Rs 1,849</span>
                    </span>
                </div>
            </div>
        </div>
    </li>
    <li class="item">
        <div class="product-block">
            <div class="product-block-inner">
                <a href="#" title="Product B" class="product-image"><img src="#/productb.jpg"></a>
                <h2 class="product-name"><a href="#">Product B</a></h2>
                <div class="price-box">
                    <span class="regular-price" id="#">
                        <span class="price">Rs 1,849</span>
                    </span>
                </div>
            </div>
        </div>
    </li>
</ul>

At the moment I am scraping the items in a loop:

products = response.xpath('//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]').extract()

After getting the product-block-inner nodes I save them to products, and then I have to loop over them like this:

for product in products:
    # parse the div.product-block-inner further
    # to get the name, price, image, etc.
    # then save them to a dict and yield it
    pass

Is it possible to get the text and href of every div.product-block-inner in one final list, without looping?

1 answer:

Answer 0 (score: 1)

Yes, but the result is confusing to work with. For example, you could try this:

products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
).css(
    '.product-name a::attr(href), .product-name a::text, .price::text'
).extract()
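
To illustrate why the combined selector is awkward to work with: extract() returns one flat list of strings, so the values still have to be regrouped per product. The sketch below is plain Python and does not call Scrapy. The contents of flat are an assumption about what the selector returns for the two sample products (values in document order: href, name, price), and regroup and FIELDS are hypothetical names introduced only for this example.

```python
# Assumed flat result for the two sample products, in document order:
# href, name, price for Product A, then the same for Product B.
flat = ['#', 'Product A', 'Rs 1,849', '#', 'Product B', 'Rs 1,849']

# Hypothetical field order matching the assumption above.
FIELDS = ('url', 'name', 'price')


def regroup(values, fields=FIELDS):
    """Chunk a flat list of extracted strings into one dict per product."""
    size = len(fields)
    if len(values) % size != 0:
        # A missing value somewhere shifts everything after it.
        raise ValueError('flat list does not divide evenly into products')
    return [
        dict(zip(fields, values[i:i + size]))
        for i in range(0, len(values), size)
    ]


items = regroup(flat)
```

Note that the length check only catches some misalignments: if one product lacks a price and another lacks a name, the list still divides evenly and the fields are silently mixed up. Selecting the fields per node avoids that problem.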

But I would recommend always looping over the nodes (by the way, why are you calling extract() when assigning to products?):

products = response.xpath(
    '//ul[@class="products-grid"]//li//div[@class="product-block"]//div[@class="product-block-inner"]'
)
for product in products:
    yield {
        'name': product.css('.product-name a::text').extract_first(),
        'url': product.css('.product-name a::attr(href)').extract_first(),
        'price': product.css('.price::text').extract_first(),
    }

(I used CSS selectors in this case because the equivalent XPath is longer, but the same result can be achieved with XPath.)