scrapy: prevent a CrawlSpider from crawling links to the facebook site

Asked: 2013-12-08 20:52:55

Tags: python scrapy

Is there any way I can control my CrawlSpider so that it does not crawl outside the original domains specified in my start_urls list? I tried the following, but it did not work for me :(:

import os
from scrapy.selector import Selector
from scrapy.contrib.exporter import CsvItemExporter
from scrapy.item import Item, Field
from scrapy.settings import Settings
from scrapy.settings import default_settings 
from selenium import webdriver
from urlparse import urlparse
import csv    
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log

# Attempt to cap crawl depth by mutating Scrapy's default settings at import time
default_settings.DEPTH_LIMIT = 3

# Attempt to register a custom downloader middleware and disable redirects
DOWNLOADER_MIDDLEWARES = {
    'grimes2.middlewares.CustomDownloaderMiddleware': 543,
    'scrapy.contrib.downloadermiddleware.redirect.RedirectMiddleware': None,
}
Can anyone help me? Thanks.

1 Answer:

Answer 0 (score: 1)

allowed_domains: an optional list of strings containing the domains that this spider is allowed to crawl. If OffsiteMiddleware is enabled, requests for URLs that do not belong to the domain names specified in this list will not be followed.

See how it is used in the scrapy tutorial:
from scrapy.spider import BaseSpider

class DmozSpider(BaseSpider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        # Name the file after the last non-empty path segment of the URL.
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
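
Applied to the question's setup, here is a minimal CrawlSpider sketch (the spider name, the example.com domain, and the parse_item callback are illustrative, not taken from the original code). With allowed_domains set, the default OffsiteMiddleware filters out links to facebook.com or any other offsite domain before they are downloaded:

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class MySpider(CrawlSpider):
    name = "example"
    # OffsiteMiddleware (enabled by default) drops requests whose domain
    # is not listed here, so offsite links are never followed.
    allowed_domains = ["example.com"]
    start_urls = ["http://www.example.com/"]

    rules = (
        # Follow every extracted link; offsite ones are filtered out.
        Rule(SgmlLinkExtractor(), callback="parse_item", follow=True),
    )

    def parse_item(self, response):
        self.log("Visited %s" % response.url)

Note that settings such as DEPTH_LIMIT are normally declared in the project's settings.py rather than by mutating scrapy.settings.default_settings at import time, as the question's code attempts.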