I am trying to set up a proxy on a Scrapy project. To get a proxy I am using the free tier from https://proxy.webshare.io/, which provides an address, port, username and password. I followed the instructions from this answer: "1 - Create a new file called 'middlewares.py' and save it in your scrapy project and add the following code to it:"
import base64

class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
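(Note: in Python 2, base64.encodestring appends a trailing newline to its output, so the Proxy-Authorization header built this way ends up malformed. A variant of the same middleware using base64.b64encode, which emits no newline, avoids that; the placeholders are the same as above:)

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        # b64encode, unlike encodestring, adds no trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass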
Filling in my own proxy details, my middlewares.py became:
import base64

class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "sarnencj:password"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
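(Note that this version sends two different sets of credentials: one embedded in the proxy URL, sarnencj-us-1:kd99722l2k7y, and another, sarnencj:password, base64-encoded into the Proxy-Authorization header. They disagree, so at most one of them can be correct, and the proxy may well be rejecting the header value.)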
But when I run the spider, I get the following error:
2018-04-30 21:44:30 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
The middlewares are enabled in settings.py as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'moocs.middlewares.ProxyMiddleware': 100,
}
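(The full log below flags the scrapy.contrib import path as deprecated. The same settings with the current path, keeping the same relative ordering since the middleware with the lower number runs its process_request first, would presumably be:)

DOWNLOADER_MIDDLEWARES = {
    'moocs.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}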
EDIT

Here is the complete log:
2018-05-02 12:28:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: moocs)
2018-05-02 12:28:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-05-02 12:28:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'BOT_NAME': 'moocs'}
2018-05-02 12:28:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-05-02 12:28:39 [boto] DEBUG: Retrieving credentials from metadata server.
2018-05-02 12:28:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-05-02 12:28:40 [boto] ERROR: Unable to read instance data, giving up
2018-05-02 12:28:40 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
ScrapyDeprecationWarning)
2018-05-02 12:28:40 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-05-02 12:28:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-05-02 12:28:40 [scrapy] INFO: Enabled item pipelines:
2018-05-02 12:28:40 [scrapy] INFO: Spider opened
2018-05-02 12:28:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-02 12:28:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-02 12:28:42 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:44 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] INFO: Closing spider (finished)
2018-05-02 12:28:45 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
'downloader/request_bytes': 909,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 2, 16, 58, 45, 996708),
'log_count/DEBUG': 5,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 5, 2, 16, 58, 40, 255414)}
2018-05-02 12:28:45 [scrapy] INFO: Spider closed (finished)
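(Two details stand out in this log. First, the "Enabled downloader middlewares" line lists ProxyMiddleware but no HttpProxyMiddleware at all, presumably because HttpProxyMiddleware in Scrapy of this vintage disabled itself when no proxy environment variables were set, leaving everything to the custom middleware. Second, the ScrapyDeprecationWarning confirms the old scrapy.contrib import path mentioned above.)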
I then tried attaching the proxy middleware from the spider class itself:
import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader
from urlparse import urljoin

from moocs.items import MoocsItem, MoocsReviewItem

class MoocsSpiderSpider(scrapy.Spider):
    name = "moocs_spider"
    #allowed_domains = ["https://www.coursetalk.com/subjects/data-science/courses"]
    start_urls = (
        'https://www.coursetalk.com/subjects/data-science/courses',
    )
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'moocs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        #print response.body
        courses_xpath = '//*[@class="course-listing-card"]//a[contains(@href, "/courses/")]/@href'
        courses_url = [urljoin(response.url, relative_url) for relative_url in response.xpath(courses_xpath).extract()]
        for course_url in courses_url[0:30]:
            print course_url
            yield Request(url=course_url, callback=self.parse_reviews)
In middlewares.py:
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
Now I get a different error:
2018-05-03 18:07:17 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: Could not open CONNECT tunnel.
2018-05-03 18:07:17 [scrapy] INFO: Closing spider (finished)
2018-05-03 18:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/request_bytes': 245,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'finished',
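("Could not open CONNECT tunnel" generally means the proxy refused the CONNECT request, which is what happens when no usable credentials reach it. As far as I can tell, Scrapy releases as old as the 1.0.3 shown in the log do not extract the user:password embedded in meta['proxy']; support for that arrived in later releases. A minimal sketch of a workaround, keeping the proxy URL credential-free and sending Proxy-Authorization explicitly so the tunneling code can forward it on CONNECT, would be:)

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # keep credentials out of the URL; send them in the header instead
        request.meta['proxy'] = "http://proxyserver.webshare.io:3128"
        creds = base64.b64encode("sarnencj-us-1:kd99722l2k7y")
        request.headers['Proxy-Authorization'] = 'Basic ' + creds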
EDIT2

I am using Linux Mint 17. Scrapy is not installed in a virtual environment. The output of "pip freeze" is:
Warning: cannot find svn location for apsw==3.8.2-r1
BeautifulSoup==3.2.1
CherryPy==3.2.2
EasyProcess==0.2.2
Flask==0.11.1
GDAL==2.1.0
GraphLab-Create==1.6.1
Jinja2==2.8
Mako==0.9.1
Markdown==2.4
MarkupSafe==0.18
PAM==0.4.2
Pillow==2.3.0
PyAudio==0.2.7
PyInstaller==2.1
PyVirtualDisplay==0.2
PyYAML==3.11
Pygments==2.0.2
Routes==2.0
SFrame==2.1
SQLAlchemy==0.8.4
Scrapy==1.0.3
Send2Trash==1.5.0
Shapely==1.5.17
Sphinx==1.2.2
Theano==0.8.2
Twisted==16.2.0
Twisted-Core==13.2.0
Twisted-Names==13.2.0
Twisted-Web==13.2.0
Werkzeug==0.11.10
adblockparser==0.7
## FIXME: could not find svn URL in dependency_links for this package:
apsw==3.8.2-r1
apt-xapian-index==0.45
apturl==0.4.1ubuntu4
argparse==1.2.1
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bokeh==0.11.1
boto==2.41.0
branca==0.1.1
bz2file==0.98
captcha-solver==0.1.1
certifi==2015.9.6.2
characteristic==14.3.0
chardet==2.0.1
click==5.1
cloudpickle==0.2.1
colorama==0.2.5
command-not-found==0.3
configglue==1.1.2
cssselect==0.9.1
cssutils==0.9.10
cymem==1.31.2
debtagshw==0.1
decorator==4.0.2
defer==1.0.6
deluge==1.3.6
dirspec==13.10
dnspython==1.11.1
docutils==0.11
drawnow==0.71.1
duplicity==0.6.23
enum34==1.1.6
feedparser==5.1.3
folium==0.2.1
functools32==3.2.3-2
futures==3.0.5
gensim==0.13.1
geocoder==1.8.2
geolocation-python==0.2.2
geopandas==0.2.1
geopy==1.11.0
gmplot==1.1.1
googlemaps==2.4.2
gyp==0.1
html5lib==0.999
httplib2==0.8
ipykernel==4.0.3
ipython==4.0.0
ipython-genutils==0.1.0
ipywidgets==4.0.3
itsdangerous==0.24
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==4.0.2
jupyter-core==4.4.0
jupyterlab==0.31.8
jupyterlab-launcher==0.10.5
lockfile==0.8
lxml==3.3.3
matplotlib==1.3.1
mechanize==0.2.5
mistune==0.7.1
mpmath==0.19
murmurhash==0.26.4
mysql-connector-python==1.1.6
nbconvert==4.0.0
nbformat==4.3.0
netifaces==0.8
nltk==3.2.1
nose==1.3.1
notebook==5.4.0
numpy==1.14.0
oauth2==1.9.0.post1
oauthlib==1.1.2
oneconf==0.3.7
opencage==1.1.4
pandas==0.22.0
paramiko==1.10.1
path.py==7.6
patsy==0.4.1
pexpect==3.1
pickleshare==0.5
piston-mini-client==0.7.5
plac==0.9.6
plotly==2.0.6
preshed==0.46.4
protobuf==2.5.0
psutil==5.0.1
psycopg2==2.4.5
ptyprocess==0.5
py==1.4.31
pyOpenSSL==0.13
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycrypto==2.6.1
pycups==1.9.66
pycurl==7.19.3
pygobject==3.12.0
pyinotify==0.9.4
pymongo==3.3.0
pyparsing==2.0.1
pyserial==2.7
pysmbc==1.0.14.1
pyspatialite==3.0.1
pysqlite==2.6.3
pytesseract==0.2.0
pytest==2.9.2
python-Levenshtein==0.12.0
python-apt==0.9.3.5
python-dateutil==2.6.1
python-debian==0.1.21-nmu2ubuntu2
python-libtorrent==0.16.13
pytz==2017.3
pyxdg==0.25
pyzmq==14.7.0
qt5reactor==0.3
qtconsole==4.0.1
queuelib==1.4.2
ratelim==0.1.6
reportlab==3.0
repoze.lru==0.6
requests==2.10.0
requests-oauthlib==0.6.2
roman==2.0.0
scikit-learn==0.17
scipy==0.17.1
scrapy-random-useragent==0.1
scrapy-splash==0.7.1
seaborn==0.7.0
selenium==2.53.6
semver==2.6.0
service-identity==14.0.0
sessioninstaller==0.0.0
shub==1.3.4
simpledbf==0.2.6
simplegeneric==0.8.1
simplejson==3.3.1
singledispatch==3.4.0.3
six==1.11.0
smart-open==1.3.3
smartystreets.py==0.2.4
spacy==0.101.0
sputnik==0.9.3
spyder==2.3.9
statsmodels==0.6.1
stevedore==0.14.1
subprocess32==3.2.7
sympy==1.0
system-service==0.1.6
terminado==0.8.1
tesseract==0.1.3
textblob==0.11.1
textrazor==1.2.2
thinc==5.0.8
tornado==4.3
traitlets==4.3.2
tweepy==3.3.0
uTidylib==0.2
urllib3==1.7.1
utils==0.9.0
vboxapi==1.0
vincent==0.4.4
virtualenv==15.0.2
virtualenv-clone==0.2.4
virtualenvwrapper==4.1.1
w3lib==1.12.0
wordcloud==1.2.1
wsgiref==0.1.2
yelp==1.0.2
zope.interface==4.0.5
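(One thing that jumps out of this list: pyOpenSSL==0.13 and the Twisted 13.x/16.x packages are several years old, and a stale pyOpenSSL is a common cause of the "ssl handshake failure" errors shown earlier, so upgrading pyOpenSSL, and ideally Scrapy itself, is worth trying independently of the proxy question.)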
A commenter suggested running the following and seeing whether it works and loads the page:
curl -v --proxy "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128" "https://www.coursetalk.com/subjects/data-science/courses"
EDIT3

Here is the current log from that curl command:
> Host: www.coursetalk.com:443
> Proxy-Authorization: Basic c2FybmVuY2otdXMtMTprZDk5NzIybDJrN3k=
> User-Agent: curl/7.35.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 Connection established
< Date: Fri, 04 May 2018 22:02:00 GMT
< Age: 0
< Transfer-Encoding: chunked
* CONNECT responded chunked
< Proxy-Connection: keep-alive
< Server: Webshare
<
* Proxy replied OK to CONNECT request
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
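(This curl log shows the proxy itself working: "Proxy replied OK to CONNECT request", and the TLS handshake with the target then proceeds. The credentials and the tunnel are therefore fine outside Scrapy, which narrows the problem down to how Scrapy passes the credentials to the proxy.)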
Answer 0 (score: 1)
I think this issue may be related to the order in which your ProxyMiddleware is attached. I updated your code and ran it as below:
from scrapy import Spider

class Test(Spider):
    name = "proxyapp"
    start_urls = ["https://www.coursetalk.com/subjects/data-science/courses"]
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'jobs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        print(response.text)
middlewares.py
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
I ran the code and it works fine. The Scrapy version I tested with is:

Scrapy==1.5.0

Just to be 100% sure the proxy was actually used, I pointed it at ipinfo.io/json. Believe me, I am not sitting in Delaware, or even in the USA for that matter.
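(The version difference may be the whole story here: the question's log shows Scrapy 1.0.3, and newer releases, apparently from around 1.4 on, handle user:password embedded in meta['proxy'], while older ones ignored it. That would explain the same middleware working on 1.5.0 but failing on 1.0.3.)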
Answer 1 (score: -1)
Enable HttpProxyMiddleware and pass the proxy URL in the request meta.

Settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 10,
}
Spider
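(The code block that originally followed here was replaced by unrelated content during page extraction, so only the answer's description survives. A minimal sketch of what it describes, a spider passing the proxy URL in the request meta, with the spider name and callback being placeholders, might look like:)

import scrapy

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"

    def start_requests(self):
        # pass the proxy url in the request meta
        yield scrapy.Request(
            "https://www.coursetalk.com/subjects/data-science/courses",
            meta={'proxy': 'http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)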