How to follow links within a website in Scrapy

Asked: 2014-06-11 13:41:09

Tags: python web-scraping scrapy web-crawler

I wrote a Python spider class for Scrapy:

from scrapy.item import Item, Field
from scrapy.spider import Spider
from scrapy.selector import Selector

class MyItem(Item):
    content = Field()

class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['www.hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir/']

    def parse(self, response):
        sel = Selector(response)
        # Collect the text of every link inside an <h4> heading
        h4 = sel.xpath("//h4/a/text()").extract()

        # Yield one item per extracted title
        for t4 in h4:
            title4 = MyItem()
            title4['content'] = t4
            yield title4

I would like to know how to extract the links behind these items and follow them to other pages.

Second question:

Can you tell me how to follow the site's links and crawl their content, page by page?

1 answer:

Answer 0 (score: 2)

You need to use CrawlSpider instead of the regular Spider class. It supports the concepts of Rules and Link Extractors, which extract links and follow them.

Example (following all links that have service/\w+ in them):

from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field


class MyItem(Item):
    content = Field()


class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir']

    # Follow every link whose URL matches service/\w+ and hand each
    # downloaded page to parse_item
    rules = (
        Rule(SgmlLinkExtractor(allow=(r'service/\w+', )), callback='parse_item'),
    )

    def parse_item(self, response):
        print(response.url)

        item = MyItem()
        item['content'] = response.body
        return item
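
Alternatively, you can keep the plain Spider from the question and follow links by hand: extract the href attributes and yield Request objects with a callback. A minimal sketch, assuming the same <h4> link structure as the question; the spider name and the choice to store the raw body are illustrative, not part of the original answer:

from urlparse import urljoin  # Python 2, consistent with the answer above

from scrapy.http import Request
from scrapy.item import Item, Field
from scrapy.selector import Selector
from scrapy.spider import Spider


class MyItem(Item):
    content = Field()


class ManualFollowSpider(Spider):
    name = 'manual_follow_spider'  # hypothetical name
    allowed_domains = ['hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir/']

    def parse(self, response):
        sel = Selector(response)
        # Grab the href of every link inside an <h4> heading
        for url in sel.xpath("//h4/a/@href").extract():
            # Schedule the linked page for download; Scrapy calls
            # parse_item with the response once it arrives
            yield Request(urljoin(response.url, url), callback=self.parse_item)

    def parse_item(self, response):
        item = MyItem()
        item['content'] = response.body
        yield item

Either spider runs the usual way, e.g. scrapy crawl test_spider -o items.json, which also exports the collected items to a file.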