Question

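I wrote a script that uses Scrapy to find links in a first stage, then follow those links and extract content from the destination pages in a second stage. Scrapy does this, but it follows the links out of order. I expect output like this: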

link1 | data_extracted_from_link1_destination_page
link2 | data_extracted_from_link2_destination_page
link3 | data_extracted_from_link3_destination_page
.
.
.

But I get:

link1 | data_extracted_from_link2_destination_page
link2 | data_extracted_from_link3_destination_page
link3 | data_extracted_from_link1_destination_page
.
.
.

Here is my code:

import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'

        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())

            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):

        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}

What is wrong with my code?

Answer 1

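The yields are not synchronized: Scrapy processes requests asynchronously, so the link text yielded in parse and the content yielded in parse_contents come out as separate items that are not paired up. Use meta to pass the link text along with the request, then yield both values together from the callback. Docs: https://doc.scrapy.org/en/latest/topics/request-response.html
Code: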

import scrapy
class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']
    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        linkText = response.meta['LinkText']
        yield {"LinkContent": sContent,"LinkText": linkText}
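
To see why carrying the link text with the request fixes the pairing, here is a minimal standard-library sketch. It is not Scrapy code and does not reflect how Scrapy is implemented; it only simulates responses that finish out of order. The names fetch, fetch_with_meta, and crawl, the page names, and the delays are all made up for illustration.

```python
import asyncio


async def fetch(url, delay):
    """Hypothetical stand-in for a network request that takes `delay` seconds."""
    await asyncio.sleep(delay)
    return f"data_from_{url}"


async def fetch_with_meta(link_text, url, delay):
    """Carry the link text alongside the request, like meta={"LinkText": ...}."""
    content = await fetch(url, delay)
    # Both values are emitted together, so they cannot be mismatched.
    return {"LinkText": link_text, "LinkContent": content}


async def crawl(links):
    """Collect items in whatever order the simulated responses complete."""
    tasks = [
        asyncio.create_task(fetch_with_meta(text, url, delay))
        for text, url, delay in links
    ]
    items = []
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)
    return items


# Assumed delays: the first link is the slowest, so it will usually finish last.
links = [
    ("link1", "page1", 0.03),
    ("link2", "page2", 0.01),
    ("link3", "page3", 0.02),
]
items = asyncio.run(crawl(links))

for item in items:
    print(item["LinkText"], "|", item["LinkContent"])
```

The items may still arrive in a different order than the links were listed, but each link text stays attached to the data from its own page. If the original order matters as well, the items can be sorted after the crawl.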

Making Scrapy follow links in order
