Question

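I'm using the following spider to scrape several fields from this website. The problem is that the URL I get is the same one for all 16 models on a page, and then another single URL is again applied to all 16 models on the next page. I can't pin down the problem with the url XPath. Can you point out what is wrong with my url XPath? Thanks. P.S. The other fields work fine and match. The missing price fields belong to out-of-stock models.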

class ZoomSpider(CrawlSpider):
    name = "zoom2"
    allowed_domains = ["zoomer.ge"]
    start_urls = [
        "http://zoomer.ge/index.php?cid=35&act=search&category=1&search_type=mobile"
    ]

    rules = (Rule(SgmlLinkExtractor(allow=(r"index.php\?cid=35&act=search&category=1&search_type=mobile&page=\d*", )),
                  callback="parse_items", follow=True),)

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//div[@class="productContainer"]/div[5]/div[@class="productListContainer"]')
        items = []
        for t in titles:
            item = ZoomerItem()
            url = sel.xpath('//div[@class="productListImage"]/a/@href').extract()
            item["brand"] = t.xpath('div[3]/text()').re(r'^([\w\-]+)')
            item["price"] = t.xpath('div[@class="productListPrice"]/div/text()').extract()
            item["model"] = t.xpath('div[3]/text()').re(r'\s+(.*)$')[0].strip()
            item["url"] = urljoin("http://zoomer.ge", url[0])

            items.append(item)

        return items


Answer 1

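You need to use a relative XPath. With the XPath you are using now, every iteration searches the whole page, so url[0] is always the first link on that page. You should use: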

t.xpath('.//div[@class="productListImage"]/a/@href').extract()

Note the dot at the beginning. The XPath should be relative to a specific selector, which in your case is 't' in the for loop.

This is a very common mistake, and it's described in the Scrapy docs.
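To see the difference without running the spider, here is a minimal sketch. It assumes a made-up two-product page and uses the standard library's xml.etree.ElementTree as a stand-in for Scrapy's Selector, so the markup, the hrefs and the variable names wrong and right are hypothetical. The point is only that a search started from the document root returns the same first link on every iteration, while a search started from the current node returns that node's own link.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup: two product containers, each with its own link.
PAGE = """
<div class="productContainer">
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-a">A</a></div>
  </div>
  <div class="productListContainer">
    <div class="productListImage"><a href="/phone-b">B</a></div>
  </div>
</div>
"""

root = ET.fromstring(PAGE)
containers = root.findall(".//div[@class='productListContainer']")

# Searching from the document root inside the loop (the mistake in the
# question): the first match is the same link on every iteration.
wrong = [root.findall(".//div[@class='productListImage']/a")[0].get("href")
         for _ in containers]

# Searching from the current container t (the fix): each item gets its own link.
right = [t.findall(".//div[@class='productListImage']/a")[0].get("href")
         for t in containers]

print(wrong)
print(right)
```

In Scrapy the same idea is expressed by calling xpath on t with a path that starts with a dot, as in the line above.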
