I have a basic Python script that is supposed to look up phones on phonearena, and I initialize it like this:
class PASpider(scrapy.Spider):
    name = "pabot"
    allowed_domains = ["http://www.phonearena.com/"]
    start_urls = ["http://www.phonearena.com/phones"]
    # Initialize the bot, takes a device name
    def __init__(self):
        device = "Nexus 6"
        words = nltk.word_tokenize(device)
        query = "http://www.phonearena.com/phones/word/"
        for word in words:
            query += word.lower()+"%20"
        query = query[0:len(query)-3]
        self.start_urls = [query]
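Everything works fine so far, but when I try to access a phone's page I get a "Filtered offsite request to X" error. Normally that means the URL is outside the allowed domains, but I can't figure out why it applies here. This is the code that extracts the link, followed by the console output: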
    def parse_search(self,response):
        self.log(Fore.RED + Style.BRIGHT + "Web-spider started." + Fore.RESET + Style.RESET_ALL, level=log.INFO)
        self.log(Fore.GREEN + Style.BRIGHT + "type: " + str(type(response)) + Fore.RESET + Style.RESET_ALL, level=log.INFO)
        device = Device()
        target = Selector(response=response).xpath('//a[re:test(@class, "s_thumb")]//@href').extract()
        self.log(Fore.WHITE + Style.BRIGHT + "Target link: " + target[0] + Fore.RESET + Style.RESET_ALL, level=log.INFO)
        return scrapy.Request('http://www.phonearena.com'+target[0], callback=self.parse_item)
http://i.imgur.com/hoWUaxT.png (I don't have enough reputation to post images)
Any idea what might be causing this?
Edit: Thanks @alecxe, I had to use allowed_domains = ["phonearena.com"].
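For anyone hitting the same error, here is a rough sketch of why the full URL failed. As far as I understand it, the offsite filter compares only the hostname of each request against the entries in allowed_domains, so an entry that carries the scheme and a trailing slash can never match. The helper below is hypothetical (is_allowed is my own name, not part of Scrapy's API) and only approximates that check:

```python
import re
from urllib.parse import urlparse


def is_allowed(url, allowed_domains):
    """Return True if the URL's hostname equals, or is a subdomain of, an allowed entry.

    Hypothetical helper for illustration only; it approximates an
    offsite check and is not Scrapy's actual implementation.
    """
    host = urlparse(url).hostname or ""
    # Each entry is treated as a literal domain, optionally preceded by subdomains.
    pattern = r"^(.*\.)?(%s)$" % "|".join(re.escape(d) for d in allowed_domains)
    return re.match(pattern, host) is not None


url = "http://www.phonearena.com/phones"

# The hostname is "www.phonearena.com", which never equals a full URL string.
print(is_allowed(url, ["http://www.phonearena.com/"]))  # False

# A bare domain matches the hostname and any of its subdomains.
print(is_allowed(url, ["phonearena.com"]))  # True
```

So the entries should be bare domain names, with no "http://" prefix and no trailing slash.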