Sharing the USER_AGENT between scrapy_fake_useragent and the cfscrape scrapy extension

Date: 2017-01-11 11:12:54

Tags: python web-scraping scrapy user-agent scrapy-spider

I am trying to use cfscrape, privoxy and tor, and scrapy_fake_useragent to build a scraper for a Cloudflare-protected website.

I use the cfscrape python extension to bypass the Cloudflare protection with scrapy, and scrapy_fake_useragent to inject random real-world USER_AGENT strings into the headers.

As the cfscrape documentation states: you must use the same user-agent string for obtaining tokens and for making requests with those tokens, otherwise Cloudflare will flag you as a bot.
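That constraint can be captured in a small helper that threads a single user-agent string through both the token fetch and the request headers. This is a hypothetical sketch: `build_session_params` and `fake_get_tokens` are illustrative names, and the token-fetching callable is injected so the pairing logic can be shown without hitting Cloudflare; in real code you would pass `cfscrape.get_tokens` instead.

```python
def build_session_params(url, user_agent, fetch_tokens):
    """Fetch clearance cookies with `user_agent` and return (cookies, headers)
    that reuse the exact same string, as Cloudflare requires.

    `fetch_tokens` is any callable with the shape of cfscrape.get_tokens:
    it takes (url, user_agent) and returns (cookies_dict, user_agent).
    """
    cookies, agent = fetch_tokens(url, user_agent)
    headers = {'User-Agent': agent}
    return cookies, headers


# Stubbed token fetcher standing in for cfscrape.get_tokens:
def fake_get_tokens(url, user_agent):
    return {'cf_clearance': 'abc123', '__cfduid': 'xyz'}, user_agent

cookies, headers = build_session_params(
    'https://example.com', 'MyAgent/1.0', fake_get_tokens)
# headers['User-Agent'] is guaranteed to match the agent used for the tokens
```

The point of returning both values from one call is that there is a single place where the user agent is decided, so the token cookies and the request headers cannot drift apart.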

To collect the cookies needed by `cfscrape`, I need to redefine the `start_requests` method in my spider class, like this:

    def start_requests(self):
        cf_requests = []
        for url in self.start_urls:
            # Fetch the Cloudflare clearance cookies and the user agent
            # that was used to obtain them.
            token, agent = cfscrape.get_tokens(url)
            self.logger.info("agent = %s", agent)
            cf_requests.append(scrapy.Request(url=url,
                                              cookies=token,
                                              headers={'User-Agent': agent}))
        return cf_requests

My problem is that the user agent collected in `start_requests` differs from the one randomly picked by scrapy_fake_useragent, as you can see:

2017-01-11 12:15:08 [airports] INFO: agent = Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:41.0) Gecko/20100101 Firefox/41.0
2017-01-11 12:15:08 [scrapy.core.engine] INFO: Spider opened
2017-01-11 12:15:08 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2017-01-11 12:15:08 [scrapy.extensions.telnet] DEBUG: Telnet console listening on 127.0.0.1:6023
2017-01-11 12:15:08 [scrapy_fake_useragent.middleware] DEBUG: Assign User-Agent Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10 to Proxy http://127.0.0.1:8118

I defined my extensions in `settings.py`, in this order:

RANDOM_UA_PER_PROXY = True
HTTPS_PROXY = 'http://127.0.0.1:8118'
COOKIES_ENABLED = True

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware': None,
    'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400,
    'flight_project.middlewares.ProxyMiddleware': 100,
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 110,
    }
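The question references `flight_project.middlewares.ProxyMiddleware` but does not show its code. A minimal sketch of what such a middleware typically looks like, assuming privoxy listens on `http://127.0.0.1:8118` (as the `HTTPS_PROXY` setting suggests):

```python
class ProxyMiddleware(object):
    # Hypothetical reconstruction; the original middleware is not shown.
    # Routes every request through the local privoxy instance, which in
    # turn forwards traffic to tor.
    proxy = 'http://127.0.0.1:8118'

    def process_request(self, request, spider):
        # Scrapy's built-in HttpProxyMiddleware (priority 110, running
        # after this one at 100) honours request.meta['proxy'].
        request.meta['proxy'] = self.proxy
```

Because this middleware only sets `meta['proxy']`, it runs independently of the user-agent middleware; the ordering that matters here is that it runs before `HttpProxyMiddleware`.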

I need the same user agent in both places, so how can I pass the user agent randomly provided by scrapy_fake_useragent to the cfscrape `get_tokens` call in my `start_requests` method?

1 Answer:

Answer 0: (score: 2)

I finally found the answer with the help of the scrapy_user_agent developer. Disable the line `'scrapy_fake_useragent.middleware.RandomUserAgentMiddleware': 400` in `settings.py`, then write this source code:

    import cfscrape
    import scrapy
    from fake_useragent import UserAgent

    class AirportsSpider(scrapy.Spider):
        name = "airports"
        start_urls = ['https://www.flightradar24.com/data/airports']
        allowed_domains = ['flightradar24.com']

        ua = UserAgent()
        ...

        def start_requests(self):
            cf_requests = []
            # Pick one random user agent up front and reuse it for every
            # request, so the token-fetching agent matches the crawling agent.
            user_agent = self.ua.random
            self.logger.info("RANDOM user_agent = %s", user_agent)
            for url in self.start_urls:
                token, agent = cfscrape.get_tokens(url, user_agent)
                self.logger.info("token = %s", token)
                self.logger.info("agent = %s", agent)

                cf_requests.append(scrapy.Request(url=url,
                                                  cookies=token,
                                                  headers={'User-Agent': agent}))
            return cf_requests
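Why this works: `self.ua.random` is evaluated once, before the loop, so every token fetch and every outgoing request carries the same string. The broken setup picked a fresh agent per request. A stdlib-only sketch of the difference (`UA_POOL` and the helper names are illustrative, not part of fake_useragent):

```python
import random

UA_POOL = ['Firefox/41.0', 'Safari/534.53.10', 'Chrome/55.0']  # illustrative

def agents_per_request(urls, rng):
    # Mimics the broken setup: a fresh random agent for every request,
    # which can disagree with the agent used to fetch the tokens.
    return [rng.choice(UA_POOL) for _ in urls]

def agent_once(urls, rng):
    # Mimics the accepted answer: one agent chosen up front, reused.
    agent = rng.choice(UA_POOL)
    return [agent for _ in urls]

urls = ['https://example.com/a', 'https://example.com/b', 'https://example.com/c']
# agent_once always yields a single distinct agent across all requests
assert len(set(agent_once(urls, random.Random(0)))) == 1
```

The trade-off is that the whole crawl now shares one user agent, so the `RANDOM_UA_PER_PROXY` behaviour of the disabled middleware is lost; randomness only varies between crawl runs, not between requests.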