System: Windows 10, Python 2.7.15, Scrapy 1.5.1
Goal: retrieve the text of every linked item on the target site from its HTML markup, including the items revealed via the "+ See More Archives" button (six are shown at a time).
Target site: https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info
Progress so far: Python and Scrapy are installed successfully. The following code...
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        urls = [
            'https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
        ]
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...successfully produces the following results (when exported with -o to .csv)...
href,eventtype,eventmonth,eventdate,eventyear
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-08-02,Competitive Standard Constructed League, August ,2, 2018
/en/articles/archive/mtgo-standings/pauper-constructed-league-2018-08-01,Pauper Constructed League, August ,1, 2018
/en/articles/archive/mtgo-standings/competitive-modern-constructed-league-2018-07-31,Competitive Modern Constructed League, July ,31, 2018
/en/articles/archive/mtgo-standings/pauper-challenge-2018-07-30,Pauper Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/legacy-challenge-2018-07-30,Legacy Challenge, July ,30, 2018
/en/articles/archive/mtgo-standings/competitive-standard-constructed-league-2018-07-30,Competitive Standard Constructed League, July ,30, 2018
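One small thing I noticed in the output above: the month and year fields come back with surrounding whitespace (" August ", " 2018"). A helper like the following (my own addition, not part of the spider yet) could strip that before yielding each row from parse():

```python
def clean(values):
    """Strip surrounding whitespace from extracted strings, dropping empties."""
    return [v.strip() for v in values if v.strip()]

# e.g. applied to the extracted month field:
print(clean([' August ']))  # ['August']
```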
However, the spider never touches any of the information hidden behind the Ajax button. I have done a great deal of Googling and combed through the documentation, example articles, and "help me" posts. My impression is that to make the spider actually see the Ajax-embedded information, I need to simulate some kind of request. Depending on the source, the right kind of request may involve XHR, Scrapy's FormRequest, or something else. I am far too new to web architecture to guess at the answer.
I put together a first version of the code that calls FormRequest. It still seems to reach the initial page, but incrementing the one parameter that appears to change (observed by inspecting the XHR calls actually made when the button on the page is clicked) seems to have no effect. That code is here...
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    custom_settings = {
        # specifies exported fields and order
        'FEED_EXPORT_FIELDS': ["href", "eventtype", "eventmonth", "eventdate", "eventyear"],
    }

    def start_requests(self):
        for i in range(1, 10):
            yield scrapy.FormRequest(
                url='https://magic.wizards.com/en/content/deck-lists-magic-online-products-game-info',
                formdata={
                    'l': 'en',
                    'f': '9041',
                    'search-result-theme': '',
                    'limit': '6',
                    'fromDate': '',
                    'toDate': '',
                    'event_format': '0',
                    'sort': 'DESC',
                    'word': '',
                    'offset': str(i * 6),
                },
                callback=self.parse,
            )

    def parse(self, response):
        for event in response.css('div.article-item-extended'):
            yield {
                'href': event.css('a::attr(href)').extract(),
                'eventtype': event.css('h3::text').extract(),
                'eventmonth': event.css('span.month::text').extract(),
                'eventdate': event.css('span.day::text').extract(),
                'eventyear': event.css('span.year::text').extract(),
            }
...produces the same results as before, except that the six output lines are repeated, as a block, nine times.
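In case it helps an answerer, here is the direction I was considering next: requesting the AJAX endpoint directly with GET instead of POSTing to the page URL. The endpoint path below is purely a guess on my part — the real URL would have to be copied from the browser's network tab when the "+ See More Archives" button is clicked — but the query parameters are the ones I observed in the XHR call:

```python
# Sketch: build GET URLs for the archive AJAX endpoint instead of POSTing
# to the page itself. BASE is a HYPOTHETICAL path and must be replaced
# with the actual XHR URL seen in the browser's developer tools.
try:
    from urllib import urlencode        # Python 2 (as on my system)
except ImportError:
    from urllib.parse import urlencode  # Python 3

BASE = 'https://magic.wizards.com/en/see-more-ajax'  # hypothetical

def build_ajax_url(offset, limit=6):
    """Return the URL for one page of results, `limit` items per page."""
    params = [
        ('l', 'en'),
        ('f', '9041'),
        ('limit', str(limit)),
        ('sort', 'DESC'),
        ('offset', str(offset)),
    ]
    return BASE + '?' + urlencode(params)

# A spider's start_requests could then yield one scrapy.Request per page:
#     for i in range(10):
#         yield scrapy.Request(build_ajax_url(i * 6), callback=self.parse)
print(build_ajax_url(6))
```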
Can someone help point out what I am missing? Thank you in advance.
Postscript: whenever I ask for help with a coding problem, I always seem to fall flat on my face. If I have done something wrong here, please have mercy on me, and I will do my best to correct it.