I'm trying to use Scrapy to crawl news articles along with their comments. In my case, an article and its comments live on different web pages, as in the example below.
(1) The link to the news article itself (the same URL as below, without the trailing comments/ segment). http://www.theglobeandmail.com/opinion/editorials/if-britain-leaves-the-eu-will-scotland-leave-britain/article32480429/
(2) The link to the comments associated with that article. http://www.theglobeandmail.com/opinion/editorials/if-britain-leaves-the-eu-will-scotland-leave-britain/article32480429/comments/
I want my program to understand that (1) and (2) are related. I also want to make sure that (2) is scraped immediately after (1), rather than having other pages crawled in between. I'm using the following rules to crawl both the article pages and the comments pages:
rules = (
    Rule(LinkExtractor(allow=r'\/article\d+\/$'), callback="parse_articles"),
    Rule(LinkExtractor(allow=r'\/article\d+\/comments\/$'), callback="parse_comments"),
)
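As a quick sanity check (not part of the original question), the two `allow` patterns above can be tested against the example URLs with plain `re`, no Scrapy required:

```python
import re

# The two patterns from the allow= arguments above
article_pat = r'\/article\d+\/$'
comments_pat = r'\/article\d+\/comments\/$'

base = ('http://www.theglobeandmail.com/opinion/editorials/'
        'if-britain-leaves-the-eu-will-scotland-leave-britain/')
article_url = base + 'article32480429/'
comments_url = base + 'article32480429/comments/'

# Each pattern matches exactly one of the two URL kinds
print(bool(re.search(article_pat, article_url)))    # True: article pattern matches article URL
print(bool(re.search(comments_pat, comments_url)))  # True: comments pattern matches comments URL
print(bool(re.search(article_pat, comments_url)))   # False: no cross-match, thanks to the $ anchor
```

Because both patterns are anchored with `$`, the article rule cannot accidentally fire on a comments URL.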
I tried issuing an explicit Request from the article's parse callback, like this:
comments_url = response.url + 'comments/'
print('comments url: ', comments_url)
return Request(comments_url, callback=self.parse_comments)
But it didn't work. How can I tell the crawler to scrape the comments page immediately after the article page?
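One common pitfall with `response.url + 'comments/'` is a missing trailing slash on the article URL. A small, hedged sketch (stdlib only, the example URLs are illustrative) shows how the comments URL can be derived safely either way:

```python
from urllib.parse import urljoin

def comments_url(article_url):
    # Ensure a trailing slash before appending the comments/ segment;
    # plain string concatenation breaks when the slash is missing.
    if not article_url.endswith('/'):
        article_url += '/'
    return urljoin(article_url, 'comments/')

print(comments_url('http://example.com/article32480429/'))
# → http://example.com/article32480429/comments/
print(comments_url('http://example.com/article32480429'))
# → http://example.com/article32480429/comments/ (same result without the slash)
```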
Answer 0 (score: 0)
You need to issue the request for the comments page manually.
Every article page your CrawlSpider finds should have a corresponding comments-page URL, right?
In that case, you can simply chain the comments-page request inside your parse_article() method:
from scrapy import Request
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


class MySpider(CrawlSpider):
    rules = (
        Rule(LinkExtractor(allow=r'\/article\d+\/$'), callback="parse_article"),
    )
    comments_le = LinkExtractor(allow=r'\/article\d+\/comments\/$')

    def parse_article(self, response):
        item = dict()
        # fill up your item
        ...
        # find the comments url on the article page
        comments_links = self.comments_le.extract_links(response)
        if comments_links:
            # yield a request and carry your half-complete item over with it
            yield Request(comments_links[0].url, self.parse_comments,
                          meta={'item': item})
        else:
            yield item

    def parse_comments(self, response):
        # retrieve your half-complete item
        item = response.meta['item']
        # add the comment data to your item
        ...
        yield item
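The item hand-off above can be illustrated without Scrapy at all. This is a hedged, framework-free sketch of the same two-step pattern: `parse_article` and `parse_comments` here are plain functions standing in for the spider callbacks, and the tuple return values mimic yielding either a finished item or a follow-up request that carries the half-complete item:

```python
def parse_article(url, title, comments_link):
    # Build the half-complete item from the article page
    item = {'url': url, 'title': title}
    if comments_link:
        # Hand the item over to the comments step,
        # like Request(comments_link, parse_comments, meta={'item': item})
        return ('request', comments_link, item)
    return ('item', None, item)

def parse_comments(comments, carried_item):
    # Retrieve the half-complete item and finish it
    carried_item['comments'] = comments
    return ('item', carried_item)

kind, link, item = parse_article('http://example.com/article1/',
                                 'Some headline',
                                 'http://example.com/article1/comments/')
kind, final = parse_comments(['first!', 'nice article'], item)
print(final['title'], len(final['comments']))
# → Some headline 2
```

The key point is that the article callback never emits a partial item when a comments link exists; the item is only yielded once the second step has filled it in.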