I obtained a proxy list with proxybroker:
sudo pip install proxybroker
proxybroker grab --countries US --limit 100 --outfile proxies.txt
Then I used grep to convert each entry from the format <Proxy US 0.00s [] 104.131.6.78:80> to 104.131.6.78:80:

grep -oP '([0-9]+\.){3}[0-9]+:[0-9]+' proxies.txt > proxy.csv
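The same extraction can also be done in Python when grep is unavailable. A sketch using the same file names as the command above:

import re

# pull every ip:port pair out of the proxybroker output
with open('proxies.txt') as src, open('proxy.csv', 'w') as dst:
    for match in re.finditer(r'(\d+\.){3}\d+:\d+', src.read()):
        dst.write(match.group(0) + '\n')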
All proxies in proxy.csv are now in the following format:
cat proxy.csv
104.131.6.78:80
104.197.16.8:3128
104.131.94.221:8080
63.110.242.67:3128
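Since some of the grabbed proxies are already dead, pre-filtering the list can save time later. A minimal sketch, assuming http://example.com/ as the test URL and good_proxy.csv as a hypothetical output file:

import csv
import urllib.request

TEST_URL = "http://example.com/"  # assumption: any stable page works

good = []
with open('proxy.csv') as f:
    for row in csv.reader(f):
        proxy = row[0]
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({'http': proxy}))
        try:
            opener.open(TEST_URL, timeout=5).read()
            good.append(proxy)  # proxy answered in time; keep it
        except Exception:
            pass  # dead or slow proxy; skip it

with open('good_proxy.csv', 'w') as f:
    f.write('\n'.join(good))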
I wrote my scraper based on the web page Multiple Proxies. Here is my skeleton structure, test.py:
import csv
import random
import scrapy

class TestSpider(scrapy.Spider):
    name = "quotes"
    allowed_domains = ["xxxx.com"]

    def __init__(self, *args, **kw):
        super().__init__(*args, **kw)
        self.timeout = 10
        # load the proxy pool from proxy.csv
        with open('proxy.csv') as csvfile:
            self.proxy_pool = [row[0] for row in csv.reader(csvfile)]

    def start_requests(self):
        # url is elided in the original post; route through get_request
        # so a proxy is attached
        yield self.get_request(url)

    def get_request(self, url):
        req = scrapy.Request(url=url)
        if self.proxy_pool:
            # attach a random proxy to the request
            req.meta['proxy'] = random.choice(self.proxy_pool)
        return req

    def parse(self, response):
        # do something
        pass
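One likely culprit for the refused connections: Scrapy's HttpProxyMiddleware expects meta['proxy'] to be a full URL with a scheme, while proxy.csv stores bare ip:port pairs. A hedged tweak to get_request, assuming all of the proxies speak plain HTTP:

    def get_request(self, url):
        req = scrapy.Request(url=url)
        if self.proxy_pool:
            # Scrapy wants "http://104.131.6.78:80", not a bare "104.131.6.78:80"
            req.meta['proxy'] = 'http://' + random.choice(self.proxy_pool)
        return req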
Running scrapy runspider test.py fails with:

Connection was refused by other side: 111: Connection refused.
Using the same proxies from proxybroker, I then downloaded the URL set my own way instead of through Scrapy. For simplicity, broken proxy IPs are kept in the pool rather than removed. The following snippet only tests whether the proxy IPs are usable; it does not download the complete URL set. The program structure is as follows:
import csv
import time
import urllib.request

data_dir = "/tmp/"
urls = set()  # omitted: how the URL set is built

# load the proxy pool
with open(data_dir + 'proxy.csv') as csvfile:
    ippool = [row[0] for row in csv.reader(csvfile)]

ip_len = len(ippool)
ipth = 0

for ith, item in enumerate(urls):
    time.sleep(2)
    flag = 1
    if ipth >= ip_len:
        ipth = 0
    while ipth < ip_len and flag == 1:
        try:
            # route this download through the current proxy
            handler = urllib.request.ProxyHandler({'http': ippool[ipth]})
            opener = urllib.request.build_opener(handler)
            urllib.request.install_opener(opener)
            response = urllib.request.urlopen(item).read().decode("utf8")
            with open(data_dir + str(ith), "w") as fh:
                fh.write(response)
            ipth = ipth + 1  # rotate to the next proxy
            flag = 0
            print(item + " downloaded")
        except Exception:
            ipth = ipth + 1  # this proxy failed; try the next one
            print("can not download " + item)
Many URLs can be downloaded with the proxies grabbed by proxybroker, so clearly many of the grabbed proxy IPs are free and stable. How do I fix the error in my Scrapy spider?
Answer 0 (score: 0)
Try using scrapy-proxies. In your settings.py you can make the following changes:
# Retry many times since proxies often fail
RETRY_TIMES = 10
# Retry on most error codes since proxies fail for different reasons
RETRY_HTTP_CODES = [500, 503, 504, 400, 403, 404, 408]
DOWNLOADER_MIDDLEWARES = {
'scrapy.downloadermiddlewares.retry.RetryMiddleware': 90,
'scrapy_proxies.RandomProxy': 100,
'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}
# Proxy list containing entries like
# http://host1:port
# http://username:password@host2:port
# http://host3:port
# ...
PROXY_LIST = '/path/to/proxy/list.txt'
# Proxy mode
# 0 = Every request gets a different proxy
# 1 = Take only one proxy from the list and assign it to every request
# 2 = Use a custom proxy set in the settings
PROXY_MODE = 0
# If proxy mode is 2, uncomment this line:
#CUSTOM_PROXY = "http://host1:port"
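Note that the PROXY_LIST entries include the http:// scheme, while the proxy.csv built earlier holds bare ip:port pairs. A small converter sketch (writing to the placeholder path used above):

# prepend the scheme that scrapy-proxies expects
with open('proxy.csv') as src, open('/path/to/proxy/list.txt', 'w') as dst:
    for line in src:
        line = line.strip()
        if line:
            dst.write('http://' + line + '\n')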
Hope this helps; it solved the same problem for me.