Is there any way to control my CrawlSpider so that it does not crawl outside the original domains specified in the start_urls list?
I tried the following, but it did not work for me :(:
import os
from scrapy.selector import Selector
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.item import Item, Field
from scrapy.settings import Settings
from scrapy.settings import default_settings
from selenium import webdriver
from urlparse import urlparse
import csv
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
default_settings.DEPTH_LIMIT = 3
DOWNLOADER_MIDDLEWARES = {
    'grimes2.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None
}
Can anyone help me? Thanks.
Answer 0 (score: 1)
allowed_domains is an optional list of strings containing the domains that this spider is allowed to crawl. If OffsiteMiddleware is enabled, requests for URLs not belonging to the domain names in this list will not be followed.
See how it is used in the Scrapy tutorial:

from scrapy.spider import BaseSpider
class DmozSpider(BaseSpider):
    name = "dmoz"
    # Only links on dmoz.org (and its subdomains) will be followed when
    # the default OffsiteMiddleware is enabled.
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        open(filename, 'wb').write(response.body)
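
Since the question uses a CrawlSpider rather than a BaseSpider, below is a minimal sketch of the same idea with the old-style scrapy.contrib imports from the question. The spider name, the example.com start URL, and the parse_item callback are placeholders, not code from the question; allowed_domains is derived from the start URLs with urlparse so that, with the default OffsiteMiddleware enabled, links pointing to other hosts are simply dropped.

from urlparse import urlparse
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

# Placeholder start URLs; kept at module level so the class body below
# can reuse them when building allowed_domains.
START_URLS = ["http://www.example.com/"]

class OnsiteSpider(CrawlSpider):
    name = "onsite"
    start_urls = START_URLS
    # Hosts of the start URLs; requests to any other domain are filtered
    # out by OffsiteMiddleware before they are downloaded.
    allowed_domains = [urlparse(url).netloc for url in START_URLS]

    rules = (
        # Follow every link found on allowed pages and hand each response
        # to parse_item.
        Rule(SgmlLinkExtractor(), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        self.log("crawled %s" % response.url)

Note that allowed_domains also matches subdomains of the listed hosts: listing "example.com" allows "blog.example.com", whereas listing "www.example.com" does not, so it may be worth stripping a leading "www." from the netloc if subdomains should be crawled as well.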