I am trying to scrape article text plus metadata that is spread across several pages. To do that, I pass the output of one parse function along to the loader in the next callback, like this:
from scrapy import Spider
from scrapy.loader import ItemLoader

from ..items import PeriodicalsScraperItem  # adjust to your project's items module


class PrzewodnikKatolickiSpider(Spider):
    name = 'przewodnik_katolicki'
    allowed_domains = ['przewodnik-katolicki.pl']
    start_urls = ['https://www.przewodnik-katolicki.pl/Archiwum?rok=wszystkie']

    def parse(self, response):
        self.logger.info('Parse function called on {}'.format(response.url))
        issues = response.xpath("//ul[@class='lista-artykulow']/li")
        for issue in issues:
            # check if this periodical is a regular issue
            issue_name_number = issue.xpath('.//h3[@class="naglowek-0"]/a/text()').get().split()
            if issue_name_number[0] == 'Przewodnik' and issue_name_number[1] == 'Katolicki':
                loader = ItemLoader(item=PeriodicalsScraperItem(), response=response, selector=issue)
                loader.add_value('issue_name', 'Przewodnik Katolicki')
                loader.add_value('issue_number', issue_name_number[2])
                loader.add_xpath('issue_cover_url', './/div[@class="zdjecie"]/a/img/@src')
                issue_url = issue.xpath('.//h3[@class="naglowek-0"]/a/@href').get()
                # go to the issue page and pass along the issue info collected so far
                yield response.follow(issue_url, callback=self.parse_issue,
                                      meta={'periodical_item': loader.load_item()})
        next_page = response.xpath('.//a[@title="Następna"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_issue(self, response):
        # check if all articles are available to read
        if not response.xpath('//h3[@class="naglowek-0 klodka"]'):
            loader = ItemLoader(item=response.meta['periodical_item'], response=response)
            loader.add_value('issue_url', response.url)
            articles = response.xpath('.//li[@class="zajawka-art"]/h3[@class="naglowek-0 "]/a/@href').getall()
            for article in articles:
                yield response.follow(article, callback=self.parse_article,
                                      meta={'periodical_item': loader.load_item()})

    def parse_article(self, response):
        loader = ItemLoader(item=response.meta['periodical_item'], response=response)
        loader.add_xpath('article_tags', './/div[@class="tagi clearfix"]/ul/li/span/a/text()')
        loader.add_value('article_url', response.url)
        yield loader.load_item()
Unfortunately, my output is a combination of several items. For example, article_tags is a list containing tags from three different articles, together with the article_url links to those articles. Like here:
{'article_tags': ['dziecko',
'eucharystia',
'hostia',
'Ksiądz',
'oświadczenie',
'Bełchatów',
'komunia',
'Kościół',
'księża',
'policja',
'nato',
'Polska',
'Rosja'],
'article_url': ['https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Milosc-bywa-impulsywna',
'https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Jak-uderzyc-zeby-bolalo',
'https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Niebezpieczne-marzenia-o-zblizeniu-z-Rosja'],
'issue_cover_url': ['/getattachment/d3a9f548-a69b-4c51-b0e3-375c5e702aed/.aspx?width=150'],
'issue_name': ['Przewodnik Katolicki'],
'issue_number': ['46/2019'],
'issue_url': ['https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019']}
How can I make each item contain the data of only one article, instead of merging several articles into a single item?
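My current guess (an assumption, not confirmed): every per-article request carries a reference to the same underlying item dict via meta, so each callback's appends accumulate into one shared object. A minimal sketch with plain dicts, simulating that mechanism and an independent-copy alternative (the tag values are just samples from the output above; in the spider this would roughly correspond to passing meta={'periodical_item': copy.deepcopy(loader.load_item())}):

```python
import copy

# Simulated issue-level item, like what parse_issue() builds once.
issue_item = {'issue_name': ['Przewodnik Katolicki'], 'article_tags': []}

# Shared reference: both "article requests" carry the SAME dict,
# so tags from different articles end up merged together.
a = issue_item
b = issue_item
a['article_tags'].append('nato')
b['article_tags'].append('komunia')
print(a['article_tags'])  # ['nato', 'komunia'] -- merged, like the output above

# Independent copies: each request gets its own item, so tags stay separate.
base = {'issue_name': ['Przewodnik Katolicki'], 'article_tags': []}
c = copy.deepcopy(base)
d = copy.deepcopy(base)
c['article_tags'].append('nato')
d['article_tags'].append('komunia')
print(c['article_tags'])  # ['nato']
print(d['article_tags'])  # ['komunia']
```

If this diagnosis is right, is deep-copying the loaded item before each response.follow() the idiomatic fix, or is there a cleaner ItemLoader pattern for fan-out like this?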