Scrapy rule extractor does not use the proxy

Time: 2016-05-19 09:57:05

Tags: python proxy web-scraping scrapy http-proxy

My rule for the link extractor is as follows; it is used for pagination:


    rules = (
        Rule(SgmlLinkExtractor(allow=(), restrict_xpaths=('//a[@class="button next"]',)),
             callback="parse_items", follow=True),
    )

I am making the request here:

    def start_requests(self):
        CrawlSpider.start_requests(self)
        in_url = self.base_url + "Ahmedabad"
        print "******************req_url********************", in_url
        req = Request(in_url, dont_filter=True)
        req.meta['proxy'] = "http://52.71.9.25:8080"
        yield req

But when I use the proxy, the rule extractor does not iterate; if I remove the req.meta['proxy'] line, it works fine.

Please help me resolve this issue.
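The pagination requests are created by the CrawlSpider rule itself, not by start_requests(), and they do not inherit the meta['proxy'] value set on the seed request, so only the first request goes through the proxy. Below is a minimal sketch (not from the original post) of one way to send the rule-generated requests through the same proxy by giving the Rule a process_request callback; the spider name, the base_url placeholder, and the use of LinkExtractor (in place of the deprecated SgmlLinkExtractor) are illustrative assumptions.

    # Sketch: attach the proxy to every request the rule extracts, so the
    # pagination requests use it as well.
    from scrapy import Request
    from scrapy.linkextractors import LinkExtractor
    from scrapy.spiders import CrawlSpider, Rule

    PROXY = "http://52.71.9.25:8080"

    def set_proxy(request, response=None):
        # Scrapy 1.x calls this with just the request; Scrapy 2.x also passes
        # the response, hence the optional second argument.
        request.meta['proxy'] = PROXY
        return request

    class CitySpider(CrawlSpider):
        name = "city_spider"                  # assumed name, not in the question
        base_url = "http://example.com/"      # placeholder, not in the question

        rules = (
            Rule(LinkExtractor(restrict_xpaths=('//a[@class="button next"]',)),
                 callback="parse_items", follow=True,
                 process_request=set_proxy),
        )

        def start_requests(self):
            in_url = self.base_url + "Ahmedabad"
            req = Request(in_url, dont_filter=True)
            req.meta['proxy'] = PROXY         # proxy for the seed request
            yield req

        def parse_items(self, response):
            # existing parsing logic goes here
            pass

An alternative with the same effect is a small custom downloader middleware whose process_request(request, spider) method sets request.meta['proxy'] for every outgoing request, which applies the proxy to rule-generated requests without touching the rules at all.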

0 answers:

No answers