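I wrote a script that uses Scrapy to find links in a first stage and then follow those links to extract content from the destination pages in a second stage. Scrapy does this, but it follows the links out of order, i.e. I expect output like this: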
link1 | data_extracted_from_link1_destination_page
link2 | data_extracted_from_link2_destination_page
link3 | data_extracted_from_link3_destination_page
.
.
.
But I get:
link1 | data_extracted_from_link2_destination_page
link2 | data_extracted_from_link3_destination_page
link3 | data_extracted_from_link1_destination_page
.
.
.
Here is my code:
import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            yield {"LinkText": LinkText}
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents)

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        yield {"LinkContent": sContent}
What is wrong with my code?
Answer 0 (score: 1)
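The yields are not synchronized. Scrapy schedules requests asynchronously, so the item yielded in parse and the item yielded in parse_contents are emitted independently, and responses can come back in any order. You should use meta to pass the link text along with the request, then yield both values as a single item in the callback.
Doc: https://doc.scrapy.org/en/latest/topics/request-response.html
Code: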
import scrapy


class firstSpider(scrapy.Spider):
    name = "ipatranscription"
    start_urls = ['http://www.phonemicchart.com/transcribe/biglist.html']

    def parse(self, response):
        body = response.xpath('./body/div[3]/div[1]/div/a')
        LinkTextSelector = './text()'
        LinkDestSelector = './@href'
        for link in body:
            LinkText = link.xpath(LinkTextSelector).extract_first()
            LinkDest = response.urljoin(link.xpath(LinkDestSelector).extract_first())
            # Attach the link text to the request so it reaches the callback
            # together with the matching response.
            yield scrapy.Request(url=LinkDest, callback=self.parse_contents, meta={"LinkText": LinkText})

    def parse_contents(self, response):
        lContent = response.xpath("/html/body/div[3]/div[1]/div/center/span/text()").extract()
        sContent = ""
        for i in lContent:
            sContent += i
        sContent = sContent.replace("\n", "").replace("\t", "")
        # Read back the link text that parse() stored on the request.
        linkText = response.meta['LinkText']
        # Yield one item that holds both values, so they cannot be mismatched.
        yield {"LinkContent": sContent, "LinkText": linkText}
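
To see why carrying the link text with the request keeps the pairs together, here is a minimal sketch that does not use Scrapy at all. It assumes asyncio as a rough stand-in for Scrapy's asynchronous downloader. LINKS, fetch, crawl and the delay values are hypothetical names and numbers made up for illustration, not part of the original spider.

```python
import asyncio

# Hypothetical link texts mapped to hypothetical destination pages.
LINKS = {"link1": "page1", "link2": "page2", "link3": "page3"}


async def fetch(label, page, delay):
    """Pretend to download `page`; `label` travels with the request, like meta."""
    await asyncio.sleep(delay)  # simulated network latency
    # Both values are returned together, so they cannot drift apart.
    return {"LinkText": label, "LinkContent": "data_from_" + page}


async def crawl(delays):
    """Start all requests at once and collect items as the responses finish."""
    tasks = [
        asyncio.create_task(fetch(label, page, delay))
        for (label, page), delay in zip(LINKS.items(), delays)
    ]
    items = []
    for finished in asyncio.as_completed(tasks):
        items.append(await finished)
    return items


# Different delays mean responses usually finish in a different order
# than the requests were issued.
items = asyncio.run(crawl([0.15, 0.05, 0.10]))
for item in items:
    print(item["LinkText"], "|", item["LinkContent"])
```

Whatever order the simulated responses finish in, each item still pairs the right link text with the right content, which is what meta achieves in the spider above. If the output also has to appear in the original link order, one option is to pass an index alongside the link text and sort the items afterwards. Scrapy 1.7 and later also provide cb_kwargs on Request for passing such values to the callback.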