How can I use proxies in a Python web-scraping script that scrapes data from Amazon? I need to learn how to use proxies in the script below.

The script is here:
import scrapy
from urls import start_urls
import re

class BbbSpider(scrapy.Spider):

    AUTOTHROTTLE_ENABLED = True
    name = 'bbb_spider'
    # start_urls = ['http://www.bbb.org/chicago/business-reviews/auto-repair-and-service-equipment-and-supplies/c-j-auto-parts-in-chicago-il-88011126']

    def start_requests(self):
        for x in start_urls:
            yield scrapy.Request(x, self.parse)

    def parse(self, response):
        brickset = str(response)
        NAME_SELECTOR = 'normalize-space(.//div[@id="titleSection"]/h1[@id="title"]/span[@id="productTitle"]/text())'
        #PAGELINK_SELECTOR = './/div[@class="info"]/h3[@class="n"]/a/@href'
        ASIN_SELECTOR = './/table/tbody/tr/td/div[@class="content"]/ul/li[./b[text()="ASIN: "]]//text()'
        #LOCALITY = 'normalize-space(.//div[@class="info"]/div/p/span[@class="locality"]/text())'
        #PRICE_SELECTOR = './/div[@id="price"]/table/tbody/tr/td/span[@id="priceblock_ourprice"]//text()'
        PRICE_SELECTOR = '#priceblock_ourprice'
        STOCK_SELECTOR = 'normalize-space(.//div[@id="availability"]/span/text())'
        PRODUCT_DETAIL_SELECTOR = './/table//div[@class="content"]/ul/li//text()'
        PRODUCT_DESCR_SELECTOR = 'normalize-space(.//div[@id="productDescription"]/p/text())'
        IMAGE_URL_SELECTOR = './/div[@id="imgTagWrapperId"]/img/@src'

        yield {
            'name': response.xpath(NAME_SELECTOR).extract_first().encode('utf8'),
            'pagelink': response.url,
            #'asin' : str(re.search("<li><b>ASIN: </b>([A-Z0-9]+)</li>",brickset).group(1).strip()),
            'price' : str(response.css(PRICE_SELECTOR).extract_first().encode('utf8')),
            'stock' : str(response.xpath(STOCK_SELECTOR).extract_first()),
            'product_detail' : str(response.xpath(PRODUCT_DETAIL_SELECTOR).extract()),
            'product_description' : str(response.xpath(PRODUCT_DESCR_SELECTOR).extract()),
            'img_url' : str(response.xpath(IMAGE_URL_SELECTOR).extract_first()),
        }
And the start_urls file is here:

start_urls = [
    'https://www.amazon.co.uk/d/Hair-Care/Loreal-Majirel-Hair-Colour-Tint-Golden-Mahogany/B0085L50QU',
    'https://www.amazon.co.uk/d/Hair-Care/Michel-Mercier-Ultimate-Detangling-Wooden-Brush-Normal/B00TE1WH7U',
]
Answer 0 (score: 1)
As far as I know, there are two ways to use proxies with Python code:

1. Set the environment variables http_proxy and https_proxy. This is probably the simplest way to use a proxy.

Windows:
set http_proxy=http://proxy.myproxy.com
set https_proxy=https://proxy.myproxy.com
python get-pip.py
Linux / OS X:
export http_proxy=http://proxy.myproxy.com
export https_proxy=https://proxy.myproxy.com
sudo -E python get-pip.py
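For completeness, the same variables can also be set from inside Python itself before any connection is opened; a minimal sketch, with an assumed proxy host and port:

import os

# Assumed proxy address; replace with your own proxy before running.
os.environ['http_proxy'] = 'http://proxy.myproxy.com:8080'
os.environ['https_proxy'] = 'https://proxy.myproxy.com:8080'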
2. Since Scrapy 0.8, support for HTTP proxies is provided through the HTTP proxy downloader middleware; see HttpProxyMiddleware. This middleware sets the HTTP proxy to use for a request by setting the proxy meta value of the Request object (a small sketch of that mechanism follows the variable list below). Like the Python standard library modules urllib and urllib2, it obeys the following environment variables:
http_proxy
https_proxy
no_proxy
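To illustrate the meta mechanism mentioned above, here is a minimal sketch of a custom downloader middleware that picks a proxy per request. RandomProxyMiddleware and the PROXY_LIST setting are made-up names for this example, not part of Scrapy itself:

import random

class RandomProxyMiddleware(object):
    """Assigns a random proxy to every outgoing request via request.meta['proxy']."""

    def __init__(self, proxies):
        self.proxies = proxies

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is an assumed custom setting, e.g. ['http://1.2.3.4:8080', ...]
        return cls(crawler.settings.getlist('PROXY_LIST'))

    def process_request(self, request, spider):
        # The downloader honours whatever proxy ends up in request.meta['proxy']
        request.meta['proxy'] = random.choice(self.proxies)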
Hope this helps.
Answer 1 (score: 0)
If you want to do it inside the code, do it like this:
def start_requests(self):
    for x in start_urls:
        req = scrapy.Request(x, self.parse)
        req.meta['proxy'] = 'your_proxy_ip_here'
        yield req
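If your proxy requires a username and password (an assumption about your setup, not part of the original answer), one common approach is to send a Proxy-Authorization header along with the meta value, for example using w3lib's basic_auth_header helper, which ships with Scrapy:

from w3lib.http import basic_auth_header

req = scrapy.Request(x, self.parse)
req.meta['proxy'] = 'http://your_proxy_ip_here:8080'                        # assumed proxy URL
req.headers['Proxy-Authorization'] = basic_auth_header('user', 'password')  # assumed credentials
yield req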
And don't forget to put this in your settings.py file:
DOWNLOADER_MIDDLEWARES = {
    'scrapy.contrib.downloadermiddleware.httpproxy.HttpProxyMiddleware': 1,
}
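One caveat: the scrapy.contrib path only exists in old Scrapy releases. On Scrapy 1.0 and later the middleware lives at a different module path (and is already enabled by default), so if you do override the setting there it would look like this instead:

DOWNLOADER_MIDDLEWARES = {
    'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware': 1,
}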