Scrapy filters my requests as offsite even though my links are within the allowed domain

Asked: 2014-12-29 16:26:59

Tags: python python-2.7 web-scraping scrapy

I have a basic Python script that is supposed to look up phones on phonearena. I initialize it like this:

import nltk
import scrapy


class PASpider(scrapy.Spider):
    name = "pabot"
    allowed_domains = ["http://www.phonearena.com/"]
    start_urls = ["http://www.phonearena.com/phones"]

    # Initialize the bot (the device name is hard-coded for now)
    def __init__(self):
        device = "Nexus 6"
        words = nltk.word_tokenize(device)
        query = "http://www.phonearena.com/phones/word/"

        # Join the lowercased words with URL-encoded spaces
        for word in words:
            query += word.lower() + "%20"

        # Strip the trailing "%20"
        query = query[:-3]
        self.start_urls = [query]

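So far so good, but when I try to access a phone's page I get a "Filtered offsite request to X" error. That should normally mean the URL is outside the allowed domains, but I can't figure out why it applies here. This is the code that extracts the link, followed by the console output: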

    def parse_search(self, response):
        self.log(Fore.RED + Style.BRIGHT + "Web-spider started." + Fore.RESET + Style.RESET_ALL, level=log.INFO)
        self.log(Fore.GREEN + Style.BRIGHT + "type: " + str(type(response)) + Fore.RESET + Style.RESET_ALL, level=log.INFO)

        device = Device()

        target = Selector(response=response).xpath('//a[re:test(@class, "s_thumb")]//@href').extract()
        self.log(Fore.WHITE + Style.BRIGHT + "Target link: " + target[0] + Fore.RESET + Style.RESET_ALL, level=log.INFO)

        return scrapy.Request('http://www.phonearena.com'+target[0], callback=self.parse_item)

http://i.imgur.com/hoWUaxT.png (I don't have enough reputation to post images)

Any idea what could be causing this?

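Edit: Thanks @alecxe, I had to use allowed_domains = ["phonearena.com"]. The entries must be bare domain names, not full URLs with a scheme and trailing slash.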
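
To illustrate why the URL-style entry fails, here is a minimal sketch of the kind of hostname check an offsite filter performs. It is not Scrapy's actual source; the helper names build_host_regex and is_offsite are hypothetical, and the assumption is that the filter compares only the request's hostname against the allowed_domains entries (matching the domain itself or any subdomain).

```python
import re

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def build_host_regex(allowed_domains):
    """Hypothetical helper: regex matching an allowed domain or any of its subdomains."""
    escaped = "|".join(re.escape(domain) for domain in allowed_domains)
    return re.compile(r"^(.*\.)?(%s)$" % escaped)


def is_offsite(url, allowed_domains):
    """Hypothetical helper: True if the URL's hostname is not covered by allowed_domains."""
    host = urlparse(url).hostname or ""
    return not build_host_regex(allowed_domains).search(host)


url = "http://www.phonearena.com/phones"

# A full URL never equals a bare hostname, so the request is treated as offsite.
print(is_offsite(url, ["http://www.phonearena.com/"]))

# A bare domain matches "www.phonearena.com" as a subdomain of "phonearena.com".
print(is_offsite(url, ["phonearena.com"]))
```

Under these assumptions the hostname "www.phonearena.com" can never match the string "http://www.phonearena.com/", which would explain why every follow-up request was filtered until the entry was changed to the bare domain.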

0 answers:

No answers yet