I have a list of proxies like this that I want to use while scraping with Python:
proxies_ls = [ '149.56.89.166:3128',
'194.44.176.116:8080',
'14.203.99.67:8080',
'185.87.65.204:63909',
'103.206.161.234:63909',
'110.78.177.100:65103']
and I created a function, crawlSite(url), that scrapes a URL using bs4 and the requests module. This is the code:
# Libraries for crawling and regex
from bs4 import BeautifulSoup
import requests
from fake_useragent import UserAgent
import re
# Library for dates
import datetime
from time import gmtime, strftime
# Libraries for writing logs
import os
import errno
# Libraries for the random delay
import time
import random

print('BOT started: ' + datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))

proxies_ls = ['149.56.89.166:3128',
              '194.44.176.116:8080',
              '14.203.99.67:8080',
              '185.87.65.204:63909',
              '103.206.161.234:63909',
              '110.78.177.100:65103']

def crawlSite(url):
    # Chrome emulation
    ua = UserAgent()
    header = {'user-agent': ua.chrome}
    random.shuffle(proxies_ls)
    # Random delay
    print('before the delay: ' + datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))
    tempoRandom = random.randint(1, 5)
    time.sleep(tempoRandom)
    try:
        randProxy = random.choice(proxies_ls)
        # Getting the webpage, creating a Response object emulated with Chrome, with a 30 sec timeout.
        response = requests.get(url, proxies={'https': randProxy}, headers=header, timeout=30)
        print(response)
        print('Response received: ' + datetime.datetime.now().strftime('%d-%m-%Y %H:%M:%S'))
        # Avoid HTTP request errors
        if response.status_code == 404:
            raise ConnectionError("HTTP Response [404] - The requested resource could not be found")
        elif response.status_code == 409:
            raise ConnectionError("HTTP Response [409] - Possible Cloudflare DNS resolution error")
        elif response.status_code == 403:
            raise ConnectionError("HTTP Response [403] - Permission denied error")
        elif response.status_code == 503:
            raise ConnectionError("HTTP Response [503] - Service unavailable error")
        print('RR Status {}'.format(response.status_code))
        # Extracting the source code of the page.
        data = response.text
    except ConnectionError:
        try:
            proxies_ls.remove(randProxy)
        except ValueError:
            pass
        randProxy = random.choice(proxies_ls)
    return BeautifulSoup(data, 'lxml')
What I want is to make sure that only proxies from that list are used. The random part

randProxy = random.choice(proxies_ls)

works fine, but the part that checks whether the proxy is valid does not, mainly because I still get a 200 response with a made-up proxy.
If I cut the list down to:

proxies_ls = ['149.56.89.166:3128']

with a proxy that does not work, I still get a 200 response! (I tried a proxy checker like https://pt.infobyip.com/proxychecker.php and it doesn't work...)
So my questions are (numbered to make them easier to answer): a) Why do I get a 200 response instead of a 4xx response? b) How can I force the request to actually use the proxy?
Thanks,
Eunito.
Answer 0 (score: 0)
Reading the documentation carefully, you have to specify the following in the dictionary:
http://docs.python-requests.org/en/master/user/advanced/#proxies
A "working" dictionary should look like this:
proxies = {
'https': 'socks5://localhost:9050'
}
This will proxy https requests only. That means it will not proxy http.
So, in order to proxy all web traffic, you should configure your dict like this:
proxies = {
    'https': 'socks5://localhost:9050',
    'http': 'socks5://localhost:9050'
}
And of course, substitute the IP address where necessary. See the following example of what happens otherwise:
$ python
>>> import requests
>>> proxies = {'https':'http://149.58.89.166:3128'}
>>> # Get an HTTP page (this goes around the proxy)
>>> response = requests.get("http://www.example.com/",proxies=proxies)
>>> response.status_code
200
>>> # Get an HTTPS page (so it goes through the proxy)
>>> response = requests.get("https://www.example.com/", proxies=proxies)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 70, in get
return request('get', url, params=params, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/api.py", line 56, in request
return session.request(method=method, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 488, in request
resp = self.send(prep, **send_kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 609, in send
r = adapter.send(request, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/requests/adapters.py", line 485, in send
raise ProxyError(e, request=request)
requests.exceptions.ProxyError: HTTPSConnectionPool(host='www.example.com', port=443): Max retries exceeded with url: / (Caused by ProxyError('Cannot connect to proxy.', NewConnectionError('<requests.packages.urllib3.connection.VerifiedHTTPSConnection object at 0x7f7d1f448c10>: Failed to establish a new connection: [Errno 110] Connection timed out',)))
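Applied to the original question, a minimal sketch could look like the following. It assumes the entries in proxies_ls are plain HTTP proxies (hence the http:// prefix) and uses https://httpbin.org/ip, an IP-echo service, to confirm which address the target actually sees:

import random
import requests

proxies_ls = ['149.56.89.166:3128',
              '194.44.176.116:8080']

randProxy = random.choice(proxies_ls)
# Prefix the proxy with a scheme and register it for both http and
# https so that no request can silently bypass it.
proxies = {'http': 'http://' + randProxy,
           'https': 'http://' + randProxy}

# httpbin.org/ip echoes back the caller's IP; if the proxy is really
# being used, this prints the proxy's address instead of yours.
response = requests.get('https://httpbin.org/ip', proxies=proxies, timeout=30)
print(response.json())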
Answer 1 (score: 0)
So basically, if I get your question right, you just want to check whether a proxy is valid or not.
With an exception handler, you can do it like this:
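A minimal sketch, assuming the proxies_ls list from the question and https://httpbin.org/ip (an IP-echo service) as the test URL:

import requests

proxies_ls = ['149.56.89.166:3128',
              '194.44.176.116:8080']

def is_valid_proxy(proxy, test_url='https://httpbin.org/ip'):
    # Route both http and https through the proxy so the check
    # cannot silently bypass it.
    proxies = {'http': 'http://' + proxy,
               'https': 'http://' + proxy}
    try:
        response = requests.get(test_url, proxies=proxies, timeout=10)
        return response.status_code == 200
    except requests.exceptions.RequestException:
        # Covers ProxyError, ConnectTimeout, ConnectionError, etc.
        return False

working = [p for p in proxies_ls if is_valid_proxy(p)]
print('Working proxies:', working)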