I wrote a Python class for Scrapy:
from scrapy.item import Item, Field
from scrapy.spider import Spider
from scrapy.selector import Selector

class MyItem(Item):
    content = Field()

class TestSpider(Spider):
    name = 'test_spider'
    allowed_domains = ['www.hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir/']

    def parse(self, response):
        sel = Selector(response)
        h4 = sel.xpath("//h4/a/text()").extract()
        for t4 in h4:
            title4 = MyItem()
            title4['content'] = t4
            yield title4
I'd like to know how to extract the links behind these items and crawl through to other pages.
Second question: could you tell me how to crawl the linked content, page by page?
Answer 0 (score: 2):
You need to use CrawlSpider instead of the regular Spider class. It supports the concepts of Rules and LinkExtractors, which extract links and follow them. Example (following all links that contain service/\w+):
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.item import Item, Field

class MyItem(Item):
    content = Field()

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir']

    rules = (
        # follow every link whose URL matches service/\w+ and hand the
        # response to parse_item
        Rule(SgmlLinkExtractor(allow=('service/\w+', ), ), callback='parse_item'),
    )

    def parse_item(self, response):
        print response.url
        item = MyItem()
        item['content'] = response.body
        return item
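Note that the scrapy.contrib package used above was deprecated in Scrapy 1.0 and removed in later releases, so those imports fail on current versions. Here is a minimal sketch of the same spider under the modern module paths (scrapy.spiders and scrapy.linkextractors), with LinkExtractor in place of the old SgmlLinkExtractor; the crawling logic is otherwise unchanged:

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor

class MyItem(scrapy.Item):
    content = scrapy.Field()

class TestSpider(CrawlSpider):
    name = 'test_spider'
    allowed_domains = ['hamshahrionline.ir']
    start_urls = ['http://www.hamshahrionline.ir']

    rules = (
        # LinkExtractor replaces the removed SgmlLinkExtractor; the rule
        # still follows every link matching service/\w+ and calls parse_item
        Rule(LinkExtractor(allow=(r'service/\w+',)), callback='parse_item'),
    )

    def parse_item(self, response):
        # self.logger replaces the Python 2 print statement
        self.logger.info('crawled %s', response.url)
        item = MyItem()
        item['content'] = response.body
        yield item

Either version can be run with scrapy crawl test_spider from inside a project, or by saving the file and using scrapy runspider yourfile.py.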