Adding a condition when scraping

Asked: 2015-04-20 03:22:27

Tags: python web-scraping scrapy conditional-statements

I'm trying to scrape a web page, and it seems that each listing uses a different div depending on how much the user paid or on the type of listing.

Example:

<div class="figuration Web company-stats">
...information I want to scrape...
</div>

<div class="figuration Commercial" >
...information I want to scrape...
</div>

There seem to be more than 3 types of div, so I'd like to know if there is a way to select every div whose class contains the first word ("figuration").

Here is my spider code:

import scrapy

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from pagina.items import PaginaItem
from scrapy.contrib.linkextractors import LinkExtractor


class MySpider(CrawlSpider):
    name = "pagina"
    allowed_domains = ["paginasamarillas.com.co"]
    start_urls = ["http://www.paginasamarillas.com.co/busqueda/bicicletas-medellin"]
    rules = (
        Rule(SgmlLinkExtractor(restrict_xpaths=('//ul[@class="paginator"]',)),
             callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for sel in response.xpath('//div[@class="figuration Web company-stats"]'):
            item = PaginaItem()
            item['nombre'] = sel.xpath('.//h2[@class="titleFig"]/a/text()').extract()
            #item['lugar'] = sel.xpath('.//div[@class="infoContact"]/div/h3/text()').extract()
            #item['numero'] = sel.xpath('.//div[@class="infoContact"]/span/text()').extract()
            #item['pagina'] = sel.xpath('.//div[@class="infoContact"]/a/@href').extract() 
            #item['sobre'] = sel.xpath('.//p[@class="CopyText"]/div/h3/text()').extract()
            yield item

2 Answers:

Answer 0 (score: 2):

Use a CSS selector:

for sel in response.css('div.figuration'):
    ...
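CSS class selectors match whole, whitespace-separated class tokens, which is why `div.figuration` picks up every variant shown in the question. A minimal stdlib-only sketch of that matching rule (no Scrapy needed; `has_class_token` is a hypothetical helper for illustration, not a Scrapy API):

```python
def has_class_token(class_attr: str, token: str) -> bool:
    """True if `token` appears as a whole whitespace-separated word
    in the class attribute, which is what CSS class matching tests."""
    return token in class_attr.split()

# Both div variants from the question share the "figuration" token:
print(has_class_token("figuration Web company-stats", "figuration"))  # True
print(has_class_token("figuration Commercial", "figuration"))         # True
```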

Answer 1 (score: 1):

The CSS selector mentioned above works, but if you want to use an XPath selector, you can use this:

for each in response.xpath('//div[contains(@class,"figuration")]'):
    ...

In fact, response.xpath('//div[contains(@class,"figuration")]') and response.css('div.figuration') can be used interchangeably.
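One caveat worth knowing: XPath's `contains(@class, "figuration")` is a plain substring test, while the CSS selector matches whole class tokens, so the two can disagree on a class that merely contains the word. A small stdlib sketch of the difference (the helper names are illustrative, not Scrapy API):

```python
def xpath_contains(class_attr: str, needle: str) -> bool:
    # What contains(@class, "figuration") tests: a plain substring.
    return needle in class_attr

def css_class(class_attr: str, token: str) -> bool:
    # What div.figuration tests: a whole whitespace-separated token.
    return token in class_attr.split()

# Both agree on the divs from the question:
print(xpath_contains("figuration Commercial", "figuration"))  # True
print(css_class("figuration Commercial", "figuration"))       # True

# But they disagree on a class that only contains the word:
print(xpath_contains("prefiguration", "figuration"))  # True
print(css_class("prefiguration", "figuration"))       # False
```

For the classes on this page the substring test is safe, but a stricter XPath equivalent of the CSS behaviour is `//div[contains(concat(" ", normalize-space(@class), " "), " figuration ")]`.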