Programmatically resetting the TCP connection in Scrapy

Asked: 2019-06-27 04:36:22

Tags: python http https tcp scrapy

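I am currently scraping a website that bans an IP address if it requests too many pages within a short period. When this happens, the site returns a 403 status code in the response. Unless the IP address changes, every subsequent request from the crawler fails.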

So I added an HTTP proxy. It is a hub that hosts hundreds of IPs and assigns a random IP to each TCP connection.

import requests

proxies = {"https": "https://user:pass@proxyservice.com"}
s = requests.Session()

# A Session reuses one TCP connection, so the proxy keeps the same exit IP.
print("\n persisted connection:")
for i in range(3):
    print(s.get("https://ifconfig.co", proxies=proxies).text)

# A bare requests.get() opens a new connection each time, so the IP changes.
print("\n new connection every request:")
for i in range(3):
    print(requests.get("https://ifconfig.co", proxies=proxies).text)

persisted connection:
123.123.123.123
123.123.123.123
123.123.123.123

new connection every request:
123.111.111.111
123.222.222.222
123.110.110.110

I use Scrapy in my project. By default it uses persistent connections, which means each connection keeps using the same proxy IP:

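import scrapy
from scrapy import Request


class TestSpider(scrapy.Spider):
    name = 'test'
    allowed_domains = ['ifconfig.co']
    custom_settings = {
        "CONCURRENT_REQUESTS": 2
    }

    def __init__(self, *args, **kwargs):
        # Let scrapy.Spider run its own initialisation first.
        super().__init__(*args, **kwargs)
        self.url = "https://ifconfig.co"
        self.headers = {"user-agent": "curl"}
        self.proxy = "https://user:pass@proxyservice.com"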

    def start_requests(self):
        # We have 2 concurrent requests (persisted connections).
        yield Request(url=self.url, headers=self.headers, meta={"proxy": self.proxy}, dont_filter=True)
        yield Request(url=self.url, headers=self.headers, meta={"proxy": self.proxy}, dont_filter=True)

    def parse(self, response):
        self.logger.info(response.text)
        yield Request(url=self.url, headers=self.headers, meta={"proxy": self.proxy}, dont_filter=True)

Which gives us:

2019-06-27 12:13:21 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:21 [test] INFO: 38.xx.xx.199

2019-06-27 12:13:21 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:22 [test] INFO: 38.xx.xx.199

2019-06-27 12:13:22 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:22 [test] INFO: 38.xx.xx.199

2019-06-27 12:13:22 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:23 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:23 [test] INFO: 38.xx.xx.199

2019-06-27 12:13:23 [test] INFO: 181.xx.xx.197

2019-06-27 12:13:24 [test] INFO: 38.xx.xx.199

How can I reset the TCP connection to the proxy server so that I get a new IP address when the response is a 403?

1 answer:

Answer 0: (score: 0)

It turns out that the proxy service I am using (https://luminati.io) supports refreshing the IP by adding a parameter to the username field:

username-session-%rndint:pass@proxyservice.com

The IP region can be changed the same way: username-country-us:pass@proxyservice.com
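
For the 403 case from the question, one way to apply this is a small Scrapy downloader middleware that re-issues the request with a fresh session value in the proxy username. The following is only a minimal sketch under several assumptions: the names rotating_proxy_url, RotateProxyOn403Middleware, max_rotations and the "proxy_rotations" meta key are hypothetical; %rndint is taken to mean "any random integer"; and combining the session and country parameters in one username is assumed rather than confirmed for this provider.

```python
import random


def rotating_proxy_url(user, password, host, session_id=None, country=None):
    """Build a proxy URL whose username carries the provider's parameters.

    The "-session-<id>" and "-country-<cc>" suffixes follow the formats
    quoted in the answer; putting both in one username is an assumption.
    """
    if session_id is None:
        # Assumption: a new random integer makes the provider pick a new IP.
        session_id = random.randint(0, 10**9)
    username = f"{user}-session-{session_id}"
    if country:
        username += f"-country-{country}"
    return f"https://{username}:{password}@{host}"


class RotateProxyOn403Middleware:
    """Downloader middleware sketch: retry 403 responses with a new session."""

    max_rotations = 5  # hypothetical cap so a hard ban cannot loop forever

    def __init__(self, user="user", password="pass", host="proxyservice.com"):
        # Placeholder credentials, matching the ones used in the question.
        self.user = user
        self.password = password
        self.host = host

    def process_response(self, request, response, spider):
        if response.status != 403:
            return response
        rotations = request.meta.get("proxy_rotations", 0)
        if rotations >= self.max_rotations:
            # Give up and hand the 403 on to the rest of the pipeline.
            return response
        meta = dict(request.meta)
        meta["proxy"] = rotating_proxy_url(self.user, self.password, self.host)
        meta["proxy_rotations"] = rotations + 1
        # Returning a Request from process_response reschedules it.
        return request.replace(meta=meta, dont_filter=True)
```

To try it, the class would be registered in the DOWNLOADER_MIDDLEWARES setting under whatever module path the project uses. Whether the copied request also needs its existing Proxy-Authorization header cleared depends on the Scrapy version, so that is worth checking against the version in use.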