I want to collect the titles and abstracts of some articles. The website pages look like this:
Page 1 (list of conferences):
Conf1, year
Conf2, year
....
Page 2 (list of articles for each Conf):
Article1, title
Article2, title
....
Page 3 (the page for each Article):
Title
Abstract
I want to collect the articles for each conference (along with other information about the conference, such as the year). First, I don't know whether I need a framework like Scrapy for this, or whether I should just write a plain Python program. Looking into Scrapy, I can write a spider like the following to collect the conference information:
# -*- coding: utf-8 -*-
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        # First table of conferences on the Anthology front page
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[1]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }
        # Second table of conferences
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table[2]/tbody/tr/th/a'):
            yield {
                'name': conf.xpath('./text()').extract_first(),
                'link': conf.xpath('./@href').extract_first(),
            }
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url is not None:
            yield scrapy.Request(response.urljoin(next_page_url))
However, I have to click each conference's link to see its articles, and I haven't found many examples showing how to collect the rest of the data I need with Scrapy. Could you guide me on how to crawl the article pages while collecting the data for each conference?
Answer 0 (score: 2)
You can write the code as follows:
import scrapy

class ToScrapeSpiderXPath(scrapy.Spider):
    name = 'toscrape-xpath'
    start_urls = [
        'https://www.aclweb.org/anthology/',
    ]

    def parse(self, response):
        for conf in response.xpath('//*[@id="main-container"]/div/div[2]/main/table/tbody/tr/th/a'):
            item = {'name': conf.xpath('./text()').extract_first(),
                    'link': response.urljoin(conf.xpath('./@href').extract_first())}
            # Follow each conference link, passing the partial item along in meta
            yield scrapy.Request(item['link'], callback=self.parse_listing,
                                 meta={'item': item})
        next_page_url = response.xpath('//li[@class="next"]/a/@href').extract_first()
        if next_page_url:
            yield scrapy.Request(response.urljoin(next_page_url), callback=self.parse)

    def parse_listing(self, response):
        """
        Parse the listing page urls here
        :param response:
        :return:
        """
        # Fetch the article urls here ==> listing_urls
        # for url in listing_urls:
        #     yield scrapy.Request(url, callback=self.parse_details)

    def parse_details(self, response):
        """
        Parse the article details here
        :param response:
        :return:
        """
        # Fetch the article details here ==> details
        # yield details
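To make the two stubs concrete for this site, here is a minimal sketch of how they could be filled in. Note that the XPath selectors for the article links, the title, and the abstract are assumptions about the ACL Anthology markup and need to be verified against the live pages:

    def parse_listing(self, response):
        # The conference item built in parse() travels along in response.meta
        item = response.meta['item']
        # Assumed selector for the links to individual article pages
        for href in response.xpath('//main//a[contains(@class, "align-middle")]/@href').extract():
            yield scrapy.Request(response.urljoin(href),
                                 callback=self.parse_details,
                                 meta={'item': dict(item)})

    def parse_details(self, response):
        item = response.meta['item']
        # Assumed selectors for the title and abstract on the article page
        item['title'] = ' '.join(response.xpath('//h2[@id="title"]//text()').extract())
        item['abstract'] = response.xpath('//div[contains(@class, "acl-abstract")]//span//text()').extract_first()
        yield item

Each yielded item then combines the conference name and link from the front page with the title and abstract scraped from the article page; dict(item) copies the conference data so the parallel article requests do not share one mutable dict.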
You can also export the scraped items to a file, for example as CSV:
scrapy crawl toscrape-xpath -o output.csv
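Scrapy infers the feed format from the file extension, so the same command with a .json suffix writes JSON output instead:
scrapy crawl toscrape-xpath -o output.json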