I want to scrape http://www.3andena.com/. The site starts off in Arabic and stores the language setting in a cookie. If you try to reach the English version directly by URL (http://www.3andena.com/home.php?sl=en), it breaks and returns a server error.
So I want to set the cookie value "store_language" to "en", and then start scraping the site with that cookie in place.
I'm using a CrawlSpider with a few rules.
Here's the code:
from scrapy.spider import BaseSpider
from scrapy.selector import HtmlXPathSelector
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy import log
from bkam.items import Product
from scrapy.http import Request
import re

class AndenaSpider(CrawlSpider):
    name = "andena"
    domain_name = "3andena.com"
    start_urls = ["http://www.3andena.com/Kettles/?objects_per_page=10"]

    product_urls = []

    rules = (
        # The following rule is for pagination
        Rule(SgmlLinkExtractor(allow=(r'\?page=\d+$',)), follow=True),

        # The following rule is for product details
        Rule(SgmlLinkExtractor(restrict_xpaths=('//div[contains(@class, "products-dialog")]//table//tr[contains(@class, "product-name-row")]/td'), unique=True), callback='parse_product', follow=True),
    )

    def start_requests(self):
        # Hit the language switcher first, hoping the cookie takes effect
        yield Request('http://3andena.com/home.php?sl=en', cookies={'store_language': 'en'})

        for url in self.start_urls:
            yield Request(url, callback=self.parse_category)

    def parse_category(self, response):
        hxs = HtmlXPathSelector(response)
        self.product_urls.extend(hxs.select('//td[contains(@class, "product-cell")]/a/@href').extract())

        for product in self.product_urls:
            yield Request(product, callback=self.parse_product)

    def parse_product(self, response):
        hxs = HtmlXPathSelector(response)
        items = []
        item = Product()

        '''
        some parsing
        '''

        items.append(item)
        return items

SPIDER = AndenaSpider()
Here is the log:
2012-05-30 19:27:13+0000 [andena] DEBUG: Redirecting (301) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://3andena.com/home.php?sl=en>
2012-05-30 19:27:14+0000 [andena] DEBUG: Redirecting (302) to <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098> from <GET http://www.3andena.com/home.php?sl=en&xid_479d9=97656c0c5837f87b8c479be7c6621098>
2012-05-30 19:27:14+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/Kettles/?objects_per_page=10> (referer: None)
2012-05-30 19:27:15+0000 [andena] DEBUG: Crawled (200) <GET http://www.3andena.com/B-and-D-Concealed-coil-pan-kettle-JC-62.html> (referer: http://www.3andena.com/Kettles/?objects_per_page=10)
Answer 0 (score: 10)
Modify your code as follows:
def start_requests(self):
    for url in self.start_urls:
        yield Request(url, cookies={'store_language': 'en'}, callback=self.parse_category)
Scrapy's Request objects accept an optional cookies keyword argument; see the documentation here.
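If you want to confirm that the cookie is actually being sent, Scrapy's built-in COOKIES_DEBUG setting logs every Cookie header going out and coming back. A minimal sketch, assuming a standard project layout:

# settings.py
# COOKIES_DEBUG is a standard Scrapy setting; when enabled, the crawl log
# prints "Sending cookies to: ..." / "Received cookies from: ..." lines,
# so you can verify that store_language=en is attached to each request.
COOKIES_DEBUG = True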
Answer 1 (score: 7)
This is how I do it as of Scrapy 0.24.6:
from scrapy.contrib.spiders import CrawlSpider, Rule

class MySpider(CrawlSpider):
    ...

    def make_requests_from_url(self, url):
        request = super(MySpider, self).make_requests_from_url(url)
        request.cookies['foo'] = 'bar'
        return request
Scrapy calls the spider's make_requests_from_url with the URLs in the start_urls attribute. The code above lets the default implementation create the request and then adds a foo cookie with the value bar (or changes the cookie to the value bar if, against all odds, the request produced by the default implementation already carries a foo cookie).
In case you are wondering what happens with requests that are not created from start_urls: Scrapy's cookie middleware will remember the cookies set by the code above and set them on all future requests that share a domain with the requests where you explicitly added cookies.
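If you need to keep several sessions apart instead of relying on that domain-wide sharing, the same cookie middleware supports separate cookie jars via the cookiejar key in Request.meta (a documented feature in the Scrapy versions this answer targets). A sketch, with a hypothetical spider name:

from scrapy.spider import Spider
from scrapy.http import Request

class LanguageJarSpider(Spider):
    name = "language_jar"  # illustrative name, not from the question

    def start_requests(self):
        # One jar per language session; the jar id can be any hashable value.
        yield Request('http://www.3andena.com/home.php?sl=en',
                      meta={'cookiejar': 'en'},
                      cookies={'store_language': 'en'})

    def parse(self, response):
        # The cookiejar key is not sticky: pass it along explicitly,
        # otherwise follow-up requests fall back to the default jar.
        yield Request('http://www.3andena.com/Kettles/',
                      meta={'cookiejar': response.meta['cookiejar']},
                      callback=self.parse)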
Answer 2 (score: 3)
Straight from the Scrapy documentation for Requests and Responses, you need something like this:
request_with_cookies = Request(url="http://www.3andena.com", cookies={'store_language':'en'})
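Tied back to the original spider, a minimal sketch of issuing that request; the class skeleton and callback name come from the question's code, the rest is illustrative:

from scrapy.contrib.spiders import CrawlSpider
from scrapy.http import Request

class AndenaSpider(CrawlSpider):
    name = "andena"

    def start_requests(self):
        # Attach the language cookie to the very first request, so the
        # session is already English before any category page is fetched.
        request_with_cookies = Request(url="http://www.3andena.com",
                                       cookies={'store_language': 'en'},
                                       callback=self.parse_category)
        yield request_with_cookies

    def parse_category(self, response):
        # ... extract product links, as in the question's parse_category ...
        pass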