How do I use a proxy file in scrapy?

Asked: 2017-10-27 11:34:04

Tags: proxy scrapy http-proxy

I obtained a list of proxies with proxybroker:

sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt

Using grep, I converted each entry from the format <Proxy US 0.00s [] 104.131.6.78:80> to 104.131.6.78:80:

grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv
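
The same extraction can also be done with Python's re module; this is just an illustrative sketch that assumes proxies.txt contains lines in the <Proxy ...> format shown above:

import re

# Illustrative sketch: pull the host:port pairs out of proxybroker's output.
pattern = re.compile(r'(?:\d{1,3}\.){3}\d{1,3}:\d+')
with open('proxies.txt') as src, open('proxy.csv', 'w') as dst:
    for line in src:
        match = pattern.search(line)
        if match:
            dst.write(match.group(0) + '\n')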

All the proxies in proxy.csv then have the following format:

cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128

I wrote my spider following the page Multiple Proxies.

Here is my spider skeleton -- test.py:

import csv
import random

import scrapy
from scrapy import Request


class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["xxxx.com"]
    start_urls = ["http://xxxx.com/"]  # placeholder, matching allowed_domains

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.timeout = 10
        # Load the proxy pool from proxy.csv (one host:port per line).
        with open('proxy.csv') as csvfile:
            reader = csv.reader(csvfile)
            self.proxy_pool = [row[0] for row in reader]

    def start_requests(self):
        for url in self.start_urls:
            yield self.get_request(url)

    def get_request(self, url):
        req = Request(url=url, callback=self.parse)
        if self.proxy_pool:
            # Pick a random proxy from the pool for this request.
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

    def parse(self, response):
        pass  # do something with the response

I run the spider with scrapy runspider test.py.

When the spider runs, I get this error message:

Connection was refused by other side: 111: Connection refused.

Using the same proxies from proxybroker, I downloaded the URL set with my own code instead of scrapy.
For simplicity, all broken proxy IPs are kept rather than removed. The snippet below is only meant to test whether the proxy IPs can be used, not to download the full URL set. The program structure is as follows:

import csv
import time
import urllib.request

data_dir = "/tmp/"

urls = []  # list of URLs to download; omit how to get it.

# Load the proxy pool from proxy.csv (one host:port per line).
with open(data_dir + 'proxy.csv') as csvfile:
    reader = csv.reader(csvfile)
    ippool = [row[0] for row in reader]
ip_len = len(ippool)
ipth = 0

for ith, item in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0
    while ipth < ip_len and flag == 1:
        try:
            # Route this request through the current proxy.
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(item).read().decode("utf8")
            with open(data_dir + str(ith), "w") as fh:
                fh.write(response)
            flag = 0
            print(item + " downloaded")
        except Exception:
            print("cannot download " + item)
        finally:
            # Move on to the next proxy either way, so a dead proxy
            # does not get retried forever.
            ipth = ipth + 1

Many URLs can be downloaded with the proxies grabbed by proxybroker, so it is clear that:

  1. Many proxy IPs can be grabbed with proxybroker, and many of them are free and stable.
  2. There are some errors in my scrapy code.
  3. How can I fix the errors in my scrapy code?

1 Answer:

Answer 0 (score: 0)

Try using scrapy-proxies.
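
If it is not already installed, the package can usually be installed from PyPI; the package name scrapy_proxies here is assumed from the middleware path used in the settings below:

sudo pip install scrapy_proxies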

In settings.py, you can make the following changes:

# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
    'scrapy_proxies.RandomProxy': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}

# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'

# Proxy mode
# 0 = Every request uses a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Put a custom proxy to use in the settings
PROXY_MODE = 0

# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
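
Note that the sample PROXY_LIST entries above include the http:// scheme, while the proxy.csv built earlier contains bare host:port pairs. A minimal sketch along these lines converts the file into that format (the output name list.txt is only an example; point PROXY_LIST at wherever the file is written):

with open('proxy.csv') as src, open('list.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if line:
            # Prepend the scheme so each entry looks like http://host:port.
            dst.write('http://' + line + '\n')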

Hope this helps, as it also solved my problem.