Scrapy item contains data from multiple pages. Why?

Asked: 2019-12-20 12:25:09

Tags: python web-scraping scrapy screen-scraping

I am trying to scrape article text together with metadata that is spread across multiple pages. To do this, I pass the output of one parse function on to the next via an item loader, like this:

from scrapy import Spider
from scrapy.loader import ItemLoader

from ..items import PeriodicalsScraperItem  # item defined in my project (import path per standard project layout)


class PrzewodnikKatolickiSpider(Spider):
    name = 'przewodnik_katolicki'
    allowed_domains = ['przewodnik-katolicki.pl']
    start_urls = ['https://www.przewodnik-katolicki.pl/Archiwum?rok=wszystkie']

    def parse(self, response):
        self.logger.info('Parse function called on {}'.format(response.url))
        issues = response.xpath("//ul[@class='lista-artykulow']/li")
        for issue in issues:
            # check if this periodical is a regular issue
            issue_name_number = issue.xpath('.//h3[@class="naglowek-0"]/a/text()').get().split()
            if issue_name_number[0] == 'Przewodnik' and issue_name_number[1] == 'Katolicki':
                loader = ItemLoader(item=PeriodicalsScraperItem(), response=response, selector=issue)
                loader.add_value('issue_name', 'Przewodnik Katolicki')
                loader.add_value('issue_number', issue_name_number[2])
                loader.add_xpath('issue_cover_url', './/div[@class="zdjecie"]/a/img/@src')
                issue_url = issue.xpath('.//h3[@class="naglowek-0"]/a/@href').get()
                # go to the issue page and pass along the issue info collected so far
                yield response.follow(issue_url, callback=self.parse_issue,
                                      meta={'periodical_item': loader.load_item()})
        next_page = response.xpath('.//a[@title="Następna"]/@href').get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

    def parse_issue(self, response):
        # check if all articles are available to read
        if not response.xpath('//h3[@class="naglowek-0 klodka"]'):
            loader = ItemLoader(item=response.meta['periodical_item'], response=response)
            loader.add_value('issue_url', response.url)
            articles = response.xpath('.//li[@class="zajawka-art"]/h3[@class="naglowek-0 "]/a/@href').getall()
            for article in articles:
                yield response.follow(article, callback=self.parse_article,
                                      meta={'periodical_item': loader.load_item()})

    def parse_article(self, response):
        loader = ItemLoader(item=response.meta['periodical_item'], response=response)
        loader.add_xpath('article_tags', './/div[@class="tagi clearfix"]/ul/li/span/a/text()')
        loader.add_value('article_url', response.url)
        yield loader.load_item()

Unfortunately, my output is a combination of several items. For example, article_tags is a list containing the tags of three different articles, and article_url holds the links to those three articles. Like here:

{'article_tags': ['dziecko',
              'eucharystia',
              'hostia',
              'Ksiądz',
              'oświadczenie',
              'Bełchatów',
              'komunia',
              'Kościół',
              'księża',
              'policja',
              'nato',
              'Polska',
              'Rosja'],
'article_url': ['https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Milosc-bywa-impulsywna',
             'https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Jak-uderzyc-zeby-bolalo',
             'https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019/Opinie/Niebezpieczne-marzenia-o-zblizeniu-z-Rosja'],
'issue_cover_url': ['/getattachment/d3a9f548-a69b-4c51-b0e3-375c5e702aed/.aspx?width=150'],
'issue_name': ['Przewodnik Katolicki'],
'issue_number': ['46/2019'],
'issue_url': ['https://www.przewodnik-katolicki.pl/Archiwum/2019/Przewodnik-Katolicki-46-2019']}
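
If I understand ItemLoader correctly, initializing a loader with an existing item re-adds that item's current values, so every pass through a loader can accumulate values instead of replacing them. A minimal standalone sketch of what I mean (a plain dict used as the item for brevity; this matches the behavior I see in my Scrapy version):

from scrapy.loader import ItemLoader

item = {'article_tags': ['old-tag']}

# the loader starts out with the values already present on the item...
loader = ItemLoader(item=item)
loader.add_value('article_tags', 'new-tag')

# ...so the reloaded item holds both the old and the new value
print(loader.load_item())  # {'article_tags': ['old-tag', 'new-tag']}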

How can I make each item contain the data of only one article, instead of merging several articles into one item?
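
My guess is that all article requests from one issue share the single item object returned by loader.load_item() in parse_issue, so each parse_article call keeps appending to the same item. Would handing every request its own copy be the right fix? A sketch of what I have in mind (using copy.deepcopy; untested):

import copy

    def parse_issue(self, response):
        # check if all articles are available to read
        if not response.xpath('//h3[@class="naglowek-0 klodka"]'):
            loader = ItemLoader(item=response.meta['periodical_item'], response=response)
            loader.add_value('issue_url', response.url)
            item = loader.load_item()
            articles = response.xpath('.//li[@class="zajawka-art"]/h3[@class="naglowek-0 "]/a/@href').getall()
            for article in articles:
                # deep-copy so every article request carries an independent item
                yield response.follow(article, callback=self.parse_article,
                                      meta={'periodical_item': copy.deepcopy(item)})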

0 Answers