Next page and Scrapy spider not working

Date: 2014-05-17 22:33:20

Tags: python scrapy

I am trying to follow the pagination on this website, where the next-page numbering is generated rather strangely. Instead of normal indexing, the next pages look like this:

new/v2.php?cat=69&pnum=2&pnum=3
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4
new/v2.php?cat=69&pnum=2&pnum=3&pnum=4&pnum=5

So my scraper goes into a loop and never stops, scraping items from pages like this:

DEBUG: Scraped from <200 http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=1&pnum=1&pnum=2&pnum=3>

and so on. Although the scraped items are correct and match the target, the crawler never stops, requesting the pages again and again.
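To see why Scrapy's duplicate filter never kicks in here: each "next page" URL carries a longer and longer query string, so no two of them are literally the same request. A quick check with the standard library (illustration only, not part of the spider):

from urlparse import urlparse, parse_qsl

url = "http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2&pnum=3"
print parse_qsl(urlparse(url).query)
# [('cat', '69'), ('pnum', '1'), ('pnum', '2'), ('pnum', '3')]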

My spider looks like this:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


from mymobile.items import MymobileItem


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=("new/v2.php\?cat=69&pnum=\d*",)),
             callback="parse_items", follow=True),
    )

    def parse_items(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            items.append(item)

        return items

Any suggestions on how to tame it?

2 Answers:

Answer 0 (score: 4)

As far as I understand it, all the page numbers appear on your start URL http://mymobile.ge/new/v2.php?cat=69&pnum=1, so you can use follow=False: the rule is then applied only once, but it extracts all the page links in that first pass.

I tried it with:

from scrapy.item import Item, Field
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.selector import Selector
import re
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from urlparse import urljoin


class MmobySpider(CrawlSpider):
    name = "mmoby2" 
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    rules = (
        Rule(SgmlLinkExtractor(allow=("new/v2\.php\?cat=69&pnum=\d*",)),
             callback="parse_items", follow=False),
    )

    def parse_items(self, response):
        sel = Selector(response)
        print response.url

And ran it like:

scrapy crawl mmoby2

The request count was 6, with output like:

...
2014-05-18 12:20:35+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: None)
2014-05-18 12:20:36+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1
2014-05-18 12:20:37+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=4
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=2
2014-05-18 12:20:38+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=5
2014-05-18 12:20:39+0200 [mmoby2] DEBUG: Crawled (200) <GET http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3> (referer: http://mymobile.ge/new/v2.php?cat=69&pnum=1)
http://mymobile.ge/new/v2.php?cat=69&pnum=1&pnum=3
2014-05-18 12:20:39+0200 [mmoby2] INFO: Closing spider (finished)
2014-05-18 12:20:39+0200 [mmoby2] INFO: Dumping Scrapy stats:
        {'downloader/request_bytes': 1962,
         'downloader/request_count': 6,
         'downloader/request_method_count/GET': 6,
         ...
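The parse_items above only prints the URL to show what gets crawled; to actually collect items, you would put the extraction from the question back in. A minimal sketch, reusing the asker's XPaths (it assumes the question's MymobileItem and urljoin imports are present):

    def parse_items(self, response):
        sel = Selector(response)
        # same extraction as in the question, now reached once per listing page
        for t in sel.xpath('//table[@width="1000"]//td/table[@class="probg"]'):
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])
            yield item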

Answer 1 (score: 4)

If extracting the links with SgmlLinkExtractor fails, you can always fall back to a plain Scrapy Spider: extract the next-page link with selectors/XPaths, then yield a Request for the next page with a callback to parse it. The process stops by itself when there is no next-page link.

Something like this should work for you:

from scrapy.spider import Spider
from scrapy.http import Request
from scrapy.selector import Selector
from urlparse import urljoin

from mymobile.items import MymobileItem

class MmobySpider(Spider):
    name = "mmoby2"
    allowed_domains = ["mymobile.ge"]
    start_urls = [
        "http://mymobile.ge/new/v2.php?cat=69&pnum=1"
    ]

    def parse(self, response):
        sel = Selector(response)
        titles = sel.xpath('//table[@width="1000"]//td/table[@class="probg"]')
        items = []
        for t in titles:
            url = t.xpath('tr//a/@href').extract()
            item = MymobileItem()
            item["brand"] = t.xpath('tr[2]/td/text()').re('^([\w\-]+)')
            item["model"] = t.xpath('tr[2]/td/text()').re('\s+(.*)$')
            item["price"] = t.xpath('tr[3]/td//text()').re('^([0-9\.]+)')
            item["url"] = urljoin("http://mymobile.ge/new/", url[0])

            yield item

        # extract next page link
        next_page_xpath = "//td[span]/following-sibling::td[1]/a[contains(@href, 'num')]/@href"
        next_page = sel.xpath(next_page_xpath).extract()

        # if there is next page yield Request for it
        if next_page:
            next_page = urljoin(response.url, next_page[0])
            yield Request(next_page, callback=self.parse)

Since the page has no semantic markup at all, the XPath for the next-page link isn't pretty, but it should work fine.
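If you would rather keep the CrawlSpider from the question, another option is to canonicalize the links before they are requested, so Scrapy's built-in duplicate filter can prune the growing URLs. This is only a sketch of mine, and it assumes the last pnum in the query string is the page the server actually serves:

from urlparse import urlparse, parse_qsl, urlunparse
from urllib import urlencode

def dedupe_pnum(links):
    # Keep only the last value of each repeated query parameter, so
    # ...&pnum=1&pnum=2&pnum=3 collapses to ...&pnum=3 and the built-in
    # duplicate filter drops requests that resolve to the same page.
    for link in links:
        parts = urlparse(link.url)
        query = dict(parse_qsl(parts.query))  # later values win
        link.url = urlunparse(parts[:4] + (urlencode(sorted(query.items())), parts[5]))
    return links

Hook it into the rule with Rule(..., follow=True, process_links=dedupe_pnum).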