I am trying to set up a proxy on a Scrapy project. To get a proxy I am using the free tier from https://proxy.webshare.io/, which provides an address, port, username and password. I followed the instructions from this answer: "1 - Create a new file called 'middlewares.py' and save it in your scrapy project and add the following code to it:"
import base64

class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "USERNAME:PASSWORD"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
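(Note: in Python 2, base64.encodestring appends a trailing newline to its output, so the Proxy-Authorization header built this way ends up malformed. A variant of the same middleware using base64.b64encode, which emits no newline, avoids that; the placeholders are the same as above:)

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta['proxy'] = "http://YOUR_PROXY_IP:PORT"
        proxy_user_pass = "USERNAME:PASSWORD"
        # b64encode, unlike encodestring, adds no trailing newline
        encoded_user_pass = base64.b64encode(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass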
Filling in my own proxy details, my middlewares.py became:
import base64

class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"

        # Use the following lines if your proxy requires authentication
        proxy_user_pass = "sarnencj:password"
        # setup basic authentication for the proxy
        encoded_user_pass = base64.encodestring(proxy_user_pass)
        request.headers['Proxy-Authorization'] = 'Basic ' + encoded_user_pass
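(Note that this version sends two different sets of credentials: one embedded in the proxy URL, sarnencj-us-1:kd99722l2k7y, and another, sarnencj:password, base64-encoded into the Proxy-Authorization header. They disagree, so at most one of them can be correct, and the proxy may well be rejecting the header value.)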
But when I run the spider, I get the following error:
2018-04-30 21:44:30 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
The middlewares are enabled in settings.py as follows:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 110,
    'moocs.middlewares.ProxyMiddleware': 100,
}
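(The full log below flags the scrapy.contrib import path as deprecated. The same settings with the current path, keeping the same relative ordering since the middleware with the lower number runs its process_request first, would presumably be:)

DOWNLOADER_MIDDLEWARES = {
    'moocs.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
}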
EDIT

Here is the complete log:
2018-05-02 12:28:38 [scrapy] INFO: Scrapy 1.0.3 started (bot: moocs)
2018-05-02 12:28:38 [scrapy] INFO: Optional features available: ssl, http11, boto
2018-05-02 12:28:38 [scrapy] INFO: Overridden settings: {'NEWSPIDER_MODULE': 'moocs.spiders', 'SPIDER_MODULES': ['moocs.spiders'], 'BOT_NAME': 'moocs'}
2018-05-02 12:28:39 [scrapy] INFO: Enabled extensions: CloseSpider, TelnetConsole, LogStats, CoreStats, SpiderState
2018-05-02 12:28:39 [boto] DEBUG: Retrieving credentials from metadata server.
2018-05-02 12:28:39 [boto] ERROR: Caught exception reading instance data
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/boto/utils.py", line 210, in retry_url
r = opener.open(req, timeout=timeout)
File "/usr/lib/python2.7/urllib2.py", line 404, in open
response = self._open(req, data)
File "/usr/lib/python2.7/urllib2.py", line 422, in _open
'_open', req)
File "/usr/lib/python2.7/urllib2.py", line 382, in _call_chain
result = func(*args)
File "/usr/lib/python2.7/urllib2.py", line 1214, in http_open
return self.do_open(httplib.HTTPConnection, req)
File "/usr/lib/python2.7/urllib2.py", line 1184, in do_open
raise URLError(err)
URLError: <urlopen error [Errno 101] Network is unreachable>
2018-05-02 12:28:40 [boto] ERROR: Unable to read instance data, giving up
2018-05-02 12:28:40 [py.warnings] WARNING: /usr/local/lib/python2.7/dist-packages/scrapy/utils/deprecate.py:155: ScrapyDeprecationWarning: `scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware` class is deprecated, use `scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware` instead
ScrapyDeprecationWarning)
2018-05-02 12:28:40 [scrapy] INFO: Enabled downloader middlewares: ProxyMiddleware, HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2018-05-02 12:28:40 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2018-05-02 12:28:40 [scrapy] INFO: Enabled item pipelines:
2018-05-02 12:28:40 [scrapy] INFO: Spider opened
2018-05-02 12:28:40 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2018-05-02 12:28:40 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2018-05-02 12:28:42 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 1 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:44 [scrapy] DEBUG: Retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 2 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] DEBUG: Gave up retrying <GET https://www.coursetalk.com/subjects/data-science/courses> (failed 3 times): [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: [<twisted.python.failure.Failure OpenSSL.SSL.Error: [('SSL routines', 'SSL3_READ_BYTES', 'ssl handshake failure')]>]
2018-05-02 12:28:45 [scrapy] INFO: Closing spider (finished)
2018-05-02 12:28:45 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 3,
'downloader/exception_type_count/twisted.web._newclient.ResponseNeverReceived': 3,
'downloader/request_bytes': 909,
'downloader/request_count': 3,
'downloader/request_method_count/GET': 3,
'finish_reason': 'finished',
'finish_time': datetime.datetime(2018, 5, 2, 16, 58, 45, 996708),
'log_count/DEBUG': 5,
'log_count/ERROR': 3,
'log_count/INFO': 7,
'log_count/WARNING': 1,
'scheduler/dequeued': 3,
'scheduler/dequeued/memory': 3,
'scheduler/enqueued': 3,
'scheduler/enqueued/memory': 3,
'start_time': datetime.datetime(2018, 5, 2, 16, 58, 40, 255414)}
2018-05-02 12:28:45 [scrapy] INFO: Spider closed (finished)
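(Two details stand out in this log. First, the "Enabled downloader middlewares" line lists ProxyMiddleware but no HttpProxyMiddleware at all, presumably because HttpProxyMiddleware in Scrapy of this vintage disabled itself when no proxy environment variables were set, leaving everything to the custom middleware. Second, the ScrapyDeprecationWarning confirms the old scrapy.contrib import path mentioned above.)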
I then tried attaching the proxy middleware from the spider class itself:
import scrapy
from scrapy import Request
from scrapy.loader import ItemLoader
from urlparse import urljoin

from moocs.items import MoocsItem, MoocsReviewItem

class MoocsSpiderSpider(scrapy.Spider):
    name = "moocs_spider"
    #allowed_domains = ["https://www.coursetalk.com/subjects/data-science/courses"]
    start_urls = (
        'https://www.coursetalk.com/subjects/data-science/courses',
    )
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'moocs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        #print response.body
        courses_xpath = '//*[@class="course-listing-card"]//a[contains(@href, "/courses/")]/@href'
        courses_url = [urljoin(response.url, relative_url) for relative_url in response.xpath(courses_xpath).extract()]
        for course_url in courses_url[0:30]:
            print course_url
            yield Request(url=course_url, callback=self.parse_reviews)
In middlewares.py:
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
Now I get a different error:
2018-05-03 18:07:17 [scrapy] ERROR: Error downloading <GET https://www.coursetalk.com/subjects/data-science/courses>: Could not open CONNECT tunnel.
2018-05-03 18:07:17 [scrapy] INFO: Closing spider (finished)
2018-05-03 18:07:17 [scrapy] INFO: Dumping Scrapy stats:
{'downloader/exception_count': 1,
'downloader/exception_type_count/scrapy.core.downloader.handlers.http11.TunnelError': 1,
'downloader/request_bytes': 245,
'downloader/request_count': 1,
'downloader/request_method_count/GET': 1,
'finish_reason': 'finished',
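("Could not open CONNECT tunnel" generally means the proxy refused the CONNECT request, which is what happens when no usable credentials reach it. As far as I can tell, Scrapy releases as old as the 1.0.3 shown in the log do not extract the user:password embedded in meta['proxy']; support for that arrived in later releases. A minimal sketch of a workaround, keeping the proxy URL credential-free and sending Proxy-Authorization explicitly so the tunneling code can forward it on CONNECT, would be:)

import base64

class ProxyMiddleware(object):
    def process_request(self, request, spider):
        # keep credentials out of the URL; send them in the header instead
        request.meta['proxy'] = "http://proxyserver.webshare.io:3128"
        creds = base64.b64encode("sarnencj-us-1:kd99722l2k7y")
        request.headers['Proxy-Authorization'] = 'Basic ' + creds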
EDIT2

I am using Linux Mint 17. Scrapy is not installed in a virtual environment. The output of "pip freeze" is:
Warning: cannot find svn location for apsw==3.8.2-r1
BeautifulSoup==3.2.1
CherryPy==3.2.2
EasyProcess==0.2.2
Flask==0.11.1
GDAL==2.1.0
GraphLab-Create==1.6.1
Jinja2==2.8
Mako==0.9.1
Markdown==2.4
MarkupSafe==0.18
PAM==0.4.2
Pillow==2.3.0
PyAudio==0.2.7
PyInstaller==2.1
PyVirtualDisplay==0.2
PyYAML==3.11
Pygments==2.0.2
Routes==2.0
SFrame==2.1
SQLAlchemy==0.8.4
Scrapy==1.0.3
Send2Trash==1.5.0
Shapely==1.5.17
Sphinx==1.2.2
Theano==0.8.2
Twisted==16.2.0
Twisted-Core==13.2.0
Twisted-Names==13.2.0
Twisted-Web==13.2.0
Werkzeug==0.11.10
adblockparser==0.7
## FIXME: could not find svn URL in dependency_links for this package:
apsw==3.8.2-r1
apt-xapian-index==0.45
apturl==0.4.1ubuntu4
argparse==1.2.1
backports-abc==0.4
backports.ssl-match-hostname==3.4.0.2
beautifulsoup4==4.4.1
bokeh==0.11.1
boto==2.41.0
branca==0.1.1
bz2file==0.98
captcha-solver==0.1.1
certifi==2015.9.6.2
characteristic==14.3.0
chardet==2.0.1
click==5.1
cloudpickle==0.2.1
colorama==0.2.5
command-not-found==0.3
configglue==1.1.2
cssselect==0.9.1
cssutils==0.9.10
cymem==1.31.2
debtagshw==0.1
decorator==4.0.2
defer==1.0.6
deluge==1.3.6
dirspec==13.10
dnspython==1.11.1
docutils==0.11
drawnow==0.71.1
duplicity==0.6.23
enum34==1.1.6
feedparser==5.1.3
folium==0.2.1
functools32==3.2.3-2
futures==3.0.5
gensim==0.13.1
geocoder==1.8.2
geolocation-python==0.2.2
geopandas==0.2.1
geopy==1.11.0
gmplot==1.1.1
googlemaps==2.4.2
gyp==0.1
html5lib==0.999
httplib2==0.8
ipykernel==4.0.3
ipython==4.0.0
ipython-genutils==0.1.0
ipywidgets==4.0.3
itsdangerous==0.24
jsonschema==2.6.0
jupyter==1.0.0
jupyter-client==5.2.2
jupyter-console==4.0.2
jupyter-core==4.4.0
jupyterlab==0.31.8
jupyterlab-launcher==0.10.5
lockfile==0.8
lxml==3.3.3
matplotlib==1.3.1
mechanize==0.2.5
mistune==0.7.1
mpmath==0.19
murmurhash==0.26.4
mysql-connector-python==1.1.6
nbconvert==4.0.0
nbformat==4.3.0
netifaces==0.8
nltk==3.2.1
nose==1.3.1
notebook==5.4.0
numpy==1.14.0
oauth2==1.9.0.post1
oauthlib==1.1.2
oneconf==0.3.7
opencage==1.1.4
pandas==0.22.0
paramiko==1.10.1
path.py==7.6
patsy==0.4.1
pexpect==3.1
pickleshare==0.5
piston-mini-client==0.7.5
plac==0.9.6
plotly==2.0.6
preshed==0.46.4
protobuf==2.5.0
psutil==5.0.1
psycopg2==2.4.5
ptyprocess==0.5
py==1.4.31
pyOpenSSL==0.13
pyasn1==0.1.9
pyasn1-modules==0.0.8
pycrypto==2.6.1
pycups==1.9.66
pycurl==7.19.3
pygobject==3.12.0
pyinotify==0.9.4
pymongo==3.3.0
pyparsing==2.0.1
pyserial==2.7
pysmbc==1.0.14.1
pyspatialite==3.0.1
pysqlite==2.6.3
pytesseract==0.2.0
pytest==2.9.2
python-Levenshtein==0.12.0
python-apt==0.9.3.5
python-dateutil==2.6.1
python-debian==0.1.21-nmu2ubuntu2
python-libtorrent==0.16.13
pytz==2017.3
pyxdg==0.25
pyzmq==14.7.0
qt5reactor==0.3
qtconsole==4.0.1
queuelib==1.4.2
ratelim==0.1.6
reportlab==3.0
repoze.lru==0.6
requests==2.10.0
requests-oauthlib==0.6.2
roman==2.0.0
scikit-learn==0.17
scipy==0.17.1
scrapy-random-useragent==0.1
scrapy-splash==0.7.1
seaborn==0.7.0
selenium==2.53.6
semver==2.6.0
service-identity==14.0.0
sessioninstaller==0.0.0
shub==1.3.4
simpledbf==0.2.6
simplegeneric==0.8.1
simplejson==3.3.1
singledispatch==3.4.0.3
six==1.11.0
smart-open==1.3.3
smartystreets.py==0.2.4
spacy==0.101.0
sputnik==0.9.3
spyder==2.3.9
statsmodels==0.6.1
stevedore==0.14.1
subprocess32==3.2.7
sympy==1.0
system-service==0.1.6
terminado==0.8.1
tesseract==0.1.3
textblob==0.11.1
textrazor==1.2.2
thinc==5.0.8
tornado==4.3
traitlets==4.3.2
tweepy==3.3.0
uTidylib==0.2
urllib3==1.7.1
utils==0.9.0
vboxapi==1.0
vincent==0.4.4
virtualenv==15.0.2
virtualenv-clone==0.2.4
virtualenvwrapper==4.1.1
w3lib==1.12.0
wordcloud==1.2.1
wsgiref==0.1.2
yelp==1.0.2
zope.interface==4.0.5
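(One thing that jumps out of this list: pyOpenSSL==0.13 and the Twisted 13.x/16.x packages are several years old, and a stale pyOpenSSL is a common cause of the "ssl handshake failure" errors shown earlier, so upgrading pyOpenSSL, and ideally Scrapy itself, is worth trying independently of the proxy question.)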
A commenter suggested running the following and seeing whether it works and loads the page:
curl -v --proxy "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128" "https://www.coursetalk.com/subjects/data-science/courses"
EDIT3

Here is the current log from that curl command:
> Host: www.coursetalk.com:443
> Proxy-Authorization: Basic c2FybmVuY2otdXMtMTprZDk5NzIybDJrN3k=
> User-Agent: curl/7.35.0
> Proxy-Connection: Keep-Alive
>
< HTTP/1.1 200 Connection established
< Date: Fri, 04 May 2018 22:02:00 GMT
< Age: 0
< Transfer-Encoding: chunked
* CONNECT responded chunked
< Proxy-Connection: keep-alive
< Server: Webshare
<
* Proxy replied OK to CONNECT request
* successfully set certificate verify locations:
* CAfile: none
CApath: /etc/ssl/certs
* SSLv3, TLS handshake, Client hello (1):
* SSLv3, TLS handshake, Server hello (2):
* SSLv3, TLS handshake, CERT (11):
* SSLv3, TLS handshake, Server key exchange (12):
* SSLv3, TLS handshake, Server finished (14):
* SSLv3, TLS handshake, Client key exchange (16):
* SSLv3, TLS change cipher, Client hello (1):
* SSLv3, TLS handshake, Finished (20):
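(This curl log shows the proxy itself working: "Proxy replied OK to CONNECT request", and the TLS handshake with the target then proceeds. The credentials and the tunnel are therefore fine outside Scrapy, which narrows the problem down to how Scrapy passes the credentials to the proxy.)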
Answer 0 (score: 1)
I think this issue may be related to the order in which your ProxyMiddleware is attached. I updated your code and ran it as below:
from scrapy import Spider

class Test(Spider):
    name = "proxyapp"
    start_urls = ["https://www.coursetalk.com/subjects/data-science/courses"]
    custom_settings = {
        'DOWNLOADER_MIDDLEWARES': {
            'jobs.middlewares.ProxyMiddleware': 100
        }
    }

    def parse(self, response):
        print(response.text)
middlewares.py
class ProxyMiddleware(object):
    # overwrite process request
    def process_request(self, request, spider):
        # Set the location of the proxy
        request.meta['proxy'] = "http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128"
I ran the code and it works fine. The Scrapy version I tested with is:

Scrapy==1.5.0

Just to be 100% sure the proxy was actually used, I pointed it at ipinfo.io/json. Believe me, I am not sitting in Delaware, or even in the USA for that matter.
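(The version difference may be the whole story here: the question's log shows Scrapy 1.0.3, and newer releases, apparently from around 1.4 on, handle user:password embedded in meta['proxy'], while older ones ignored it. That would explain the same middleware working on 1.5.0 but failing on 1.0.3.)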
Answer 1 (score: -1)
Enable HttpProxyMiddleware and pass the proxy URL in the request meta.

Settings
DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 10,
}
Spider
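(The code block that originally followed here was replaced by unrelated content during page extraction, so only the answer's description survives. A minimal sketch of what it describes, a spider passing the proxy URL in the request meta, with the spider name and callback being placeholders, might look like:)

import scrapy

class ProxySpider(scrapy.Spider):
    name = "proxy_spider"

    def start_requests(self):
        # pass the proxy url in the request meta
        yield scrapy.Request(
            "https://www.coursetalk.com/subjects/data-science/courses",
            meta={'proxy': 'http://sarnencj-us-1:kd99722l2k7y@proxyserver.webshare.io:3128'},
            callback=self.parse,
        )

    def parse(self, response):
        print(response.text)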