我想在脚本中添加代理。
我该怎么做?我必须使用Selenium还是Scrapy?
我认为Scrapy正在发出初始请求,因此使用scrapy会很有意义。但是我到底该怎么办?
您能推荐任何可靠的代理列表吗?
这是我当前的脚本:
# -*- coding: utf-8 -*-
import scrapy
from scrapy import Spider
from scrapy.selector import Selector
from scrapy.http import Request
from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import re
import csv
from time import sleep
class PostsSpider(Spider):
name = 'posts'
allowed_domains = ['xyz']
start_urls = ('xyz',)
def parse(self, response):
with open("urls.txt", "rt") as f:
start_urls = [url.strip() for url in f.readlines()]
for url in start_urls:
self.driver = webdriver.Chrome('C:\webdrivers\chromedriver.exe')
self.driver.get(url)
try:
self.driver.find_element_by_id('currentTab').click()
sleep(3)
self.logger.info('Sleeping for 5 sec.')
self.driver.find_element_by_xpath('//*[@id="_blog-menu"]/div[2]/div/div[2]/a[3]').click()
sleep(7)
self.logger.info('Sleeping for 7 sec.')
except NoSuchElementException:
self.logger.info('Blog does not exist anymore')
while True:
try:
element = self.driver.find_element_by_id('last_item')
self.driver.execute_script("arguments[0].scrollIntoView(0, document.documentElement.scrollHeight-5);", element)
sleep(3)
self.driver.find_element_by_id('last_item').click()
sleep(7)
except NoSuchElementException:
self.logger.info('No more tipps')
sel = Selector(text=self.driver.page_source)
allposts = sel.xpath('//*[@class="block media _feedPick feed-pick"]')
for post in allposts:
username = post.xpath('.//div[@class="col-sm-7 col-lg-6 no-padding"]/a/@title').extract()
publish_date = post.xpath('.//*[@class="bet-age text-muted"]/text()').extract()
yield {'Username': username,
'Publish date': publish_date}
self.driver.close()
break
答案 0 :(得分:1)
一种简单的方法是设置gradlew clean build
和http_proxy
环境变量。
您可以在启动脚本之前在环境中设置它们,或者可以在脚本的开头添加它们:
https_proxy
要获取公开代理的列表,如果您只是在Google中搜索,就会发现很多信息。
答案 1 :(得分:1)
您应该阅读Scrapy ProxyMiddleware,以最好地进行探索。其中也提到了使用代理的方法