I want to scrape data from multiple pages with Scrapy. To get the data on the second page, I have to rely on a cookie to carry the search term, because the term does not appear in the URL.
The URL of the first page is:
http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky
The URL of the second page is:
http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10
I have seen many questions on Stack Overflow where the cookies are already known before the crawl starts, but I can only obtain the cookie after scraping the first page. So I would like to know how to handle this. Here is my code:
__author__ = 'Rabbit'

from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    url_base = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"
    start_urls = [url_base]

    def parse(self, response):
        sel = Selector(response)
        # result rows alternate between the "odd" and "even" CSS classes
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item
Answer 0 (score: 0)
Scrapy receives and keeps track of the cookies sent by servers, and sends them back on subsequent requests, just like any regular web browser does; see more info here.
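If you want to watch this happening while you debug, Scrapy has a built-in setting that logs every cookie it sends and receives (this goes in your project's settings.py):

# settings.py
# Log every Cookie request header and Set-Cookie response header,
# which makes it easy to confirm the search-term cookie is being kept.
COOKIES_DEBUG = True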
I don't know how your code handles pagination, but it should look something like this:
from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector

from scrapy_Data.items import EPGD


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = "man"
    start_urls = ["http://epgd.biosino.org/EPGD/search/textsearch.jsp?textquery=man&submit=Feeling+Lucky"]

    def parse(self, response):
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            item['taxID'] = map(unicode.strip, site.xpath('td[2]/a/text()').extract())
            item['familyID'] = map(unicode.strip, site.xpath('td[3]/a/text()').extract())
            item['chromosome'] = map(unicode.strip, site.xpath('td[4]/text()').extract())
            item['symbol'] = map(unicode.strip, site.xpath('td[5]/text()').extract())
            item['description'] = map(unicode.strip, site.xpath('td[6]/text()').extract())
            yield item

        # the session cookie set by the first response is sent along automatically
        yield Request('http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=10',
                      callback=self.parse_second_url)

    def parse_second_url(self, response):
        # do your thing
        pass
The second request carries the cookies from the first request.
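For completeness, a minimal sketch of what parse_second_url() could look like, assuming the second page uses the same table markup as the first (the XPath and item fields are copied from the question):

    def parse_second_url(self, response):
        # the server recognizes the session cookie, so this page already
        # contains the results for the original search term
        sel = Selector(response)
        sites = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')
        for site in sites:
            item = EPGD()
            item['genID'] = map(unicode.strip, site.xpath('td[1]/a/text()').extract())
            # ... fill in the remaining fields as in parse()
            yield item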
Answer 1 (score: -1)
I just noticed that the question you posted here is the same as the one you posted earlier in this post, which I answered yesterday. So I am posting my answer here again and leaving the rest to the moderators...
Your example worked for me once link parsing and request yielding were added to your parse() function, so perhaps the page generates some server-side cookies. It failed, however, when using a proxy service that downloads through multiple IPs, such as Scrapy's Crawlera.
The solution is to put the 'textquery' parameter into the request URL manually:
import urlparse
from urllib import urlencode

from scrapy import Request
from scrapy.spiders import Spider
from scrapy.selector import Selector


class EPGD_spider(Spider):
    name = "EPGD"
    allowed_domains = ["epgd.biosino.org"]
    term = 'calb'
    base_url = "http://epgd.biosino.org/EPGD/search/textsearch.jsp?currentIndex=0&textquery=%s"
    start_urls = [base_url % term]

    def update_url(self, url, params):
        # merge the given params into the URL's existing query string
        url_parts = list(urlparse.urlparse(url))
        query = dict(urlparse.parse_qsl(url_parts[4]))
        query.update(params)
        url_parts[4] = urlencode(query)
        url = urlparse.urlunparse(url_parts)
        return url

    def parse(self, response):
        sel = Selector(response)
        genes = sel.xpath('//tr[@class="odd"]|//tr[@class="even"]')

        for gene in genes:
            item = {}
            item['genID'] = map(unicode.strip, gene.xpath('td[1]/a/text()').extract())
            # ...
            yield item

        # follow the pagination links, re-adding the search term to each URL
        urls = sel.xpath('//div[@id="nviRecords"]/span[@id="quickPage"]/a/@href').extract()
        for url in urls:
            url = response.urljoin(url)
            url = self.update_url(url, params={'textquery': self.term})
            yield Request(url)
Details of the update_url() function come from Lukasz's solution:
Add params to given URL in Python
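That snippet is Python 2; on Python 3 the same helper would use urllib.parse instead (a sketch for reference, not part of the original answer):

from urllib.parse import urlparse, parse_qsl, urlencode, urlunparse

def update_url(url, params):
    # split the URL, merge params into its query string, reassemble
    url_parts = list(urlparse(url))
    query = dict(parse_qsl(url_parts[4]))
    query.update(params)
    url_parts[4] = urlencode(query)
    return urlunparse(url_parts)

# update_url("http://example.com/page?a=1", {"textquery": "calb"})
# -> "http://example.com/page?a=1&textquery=calb"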