How to pass information from one method to another

Time: 2020-05-08 15:34:57

Tags: python web-scraping scrapy web-crawler metadata

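I am scraping data from a website, which requires collecting data from individual candidate profiles. The catch is that part of the data has to be extracted from the profile summary on the search results page, while the rest can only be extracted after opening the profile itself.

The fields to extract from the summary snippet are: 1. Work Authorization 2. Candidate Name 3. Image ID

The remaining data can be extracted once the profile is opened.

Problem:

I have written the spider and want to pass the data for the fields above from one method to another. Currently, when I run the spider, these three fields hold the same repeated values for every candidate profile on a given page. I am new to web scraping and Python. Can you help me?

I am attaching my spider code and items.py file for reference: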

import scrapy
from urllib.parse import urljoin
from hbs_candidates.items import HbsCandidatesItem

domain = 'https://www.myvisajobs.com'
url = 'https://www.myvisajobs.com/CV/Search.aspx?DG=Bachelor&P=1'
page_scraped = 2
classes = ['HighLight: ', 'Membership: ', 'Honor: ', 'Skills: ', 'Degree: ', 'Career Level: ', 'Certification: ','Occupation: ', 'Reference: ', 'Target Locations: ', 'Career Title: ', 'Goal: ', 'Target Title:']


class InfoSpider(scrapy.Spider):
    name = 'inform'
    start_urls = [url]
    # page_no = 1

    def parse(self, response):
        wa_temp = []
        items = HbsCandidatesItem()
        tables = response.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr""")
        names_temp = tables.css('b a::text').extract()
        images_temp = [domain + x for x in response.css('img::attr(src)').extract()[1:]]
        for i in range(len(tables)):
            wa = str(tables.xpath("""//*[@id="ctl00_ctl00_ContentPlaceHolder1_ContentPlaceHolder1_divContent"]/center/table/tr[3]/td[2]/text()[6]""").extract()).split('Work Authorization: ')[1]
            if wa is not None:
                temp_wa = wa
            else:
                temp_wa = 'N/A'
            wa_temp.append(temp_wa)
        my_list = response.css('b a::attr(href)').extract()
        for i in range(len(my_list)):
            url_final = urljoin(url, my_list[i])
            temp_url = response.urljoin(url_final)
            items['Candidate Name'] = names_temp[i]
            items['Image ID'] = images_temp[i]
            items['Work Authorization'] = wa_temp[i]
            request = scrapy.Request(temp_url, callback=self.parse_can_contents)
            request.cb_kwargs['items'] = items
            yield request

    def parse_can_contents(self, response, items):
        # Code that scrapes the profile page and assigns the values to `items`
        # -----------
        # -------------

        # I want to access the values passed from the parse method here
        yield items
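
One possible explanation for the repeated values, offered as an assumption and not something confirmed in the post: `parse` creates a single `items` object before the loop and attaches that same object to every request. The callbacks usually run only after the loop has moved on, so each of them sees whatever was written last. A minimal sketch in plain Python (no Scrapy involved; `schedule_shared` and `schedule_fresh` are hypothetical names used only for illustration) reproduces the effect:

```python
def schedule_shared(names):
    """Mimic the spider: one dict is created once and reused for every row."""
    item = {}
    pending = []
    for name in names:
        item["Candidate Name"] = name  # overwrites the previous row's value
        pending.append(item)           # every entry refers to the same object
    return pending


def schedule_fresh(names):
    """Create a new dict inside the loop so each row keeps its own values."""
    pending = []
    for name in names:
        item = {"Candidate Name": name}
        pending.append(item)
    return pending


names = ["Ann", "Bob", "Cid"]
shared = [row["Candidate Name"] for row in schedule_shared(names)]
fresh = [row["Candidate Name"] for row in schedule_fresh(names)]
print(shared)  # ['Cid', 'Cid', 'Cid']
print(fresh)   # ['Ann', 'Bob', 'Cid']
```

If this is indeed the cause, moving the item creation inside the second `for` loop of `parse` should give each request its own copy of the three fields.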

items.py code:

from scrapy.item import Item, Field


class HbsCandidatesItem(Item):
    def __setitem__(self, key, value):
        if key not in self.fields:
            self.fields[key] = Field()
        self._values[key] = value

I hope this is clear. Please ask if anything in the question is ambiguous. Thanks!

0 answers:

No answers