How do I iterate over a list of selectors when using ItemLoaders in Scrapy? Please explain in detail

Time: 2018-03-30 06:59:06

Tags: python web-scraping scrapy scrapy-spider

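I am trying to scrape a list of UN member states and their details. Here is my approach without item loaders.

Here I get a parent tag for each UN member that contains all of its details, such as name, date of joining, website, phone numbers and UN headquarters address. Not every country has a website, phone numbers or the other child details. I loop over the parent tags, extract the details one by one into variables, and then assign those variables to the item.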

import scrapy
from learn_scrapy.items import UNMemberItem

class UNMemberDetails(scrapy.Spider):
    name = 'UN_details'
    start_urls = ['http://www.un.org/en/member-states/index.html']

    def parse(self, response):
        """
        Get the details of the UN members
        """
        members_tag = response.css('div.member-state.col-md-12')
        #item_list = []
        for member in members_tag:
            member_name = member.css('span.member-state-name::text').extract()
            member_join_date = member.css('span.date-display-single::text').extract()
            member_website = member.css('div.site > a::text').extract()
            member_phone = member.css('div.phone > ul > li::text').extract()
            member_address = member.css('div.mail > a::text').extract()
            member_national_holiday = member.css('div.national-holiday::text').extract()
            UN_member = UNMemberItem()
            UN_member['country_name'] = member_name
            UN_member['join_date'] = member_join_date
            if len(member_website) == 0:
                member_website = 'NA'
            UN_member['website'] = member_website
            if len(member_phone) == 0:
                member_phone = 'NA'
            UN_member['phone'] = member_phone
            if len(member_address) == 0:
                member_address = 'NA'
            UN_member['mail_address'] = member_address
            UN_member['national_holiday'] = member_national_holiday
            print(UN_member)
            UN_member = str(UN_member)
            #item_list.append(UN_members)
            with open('un_members_list.txt', 'a') as f:
                f.write(UN_member + "\n")

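This is my progress with item loaders so far. I get the complete list of countries in a single item, but I want one country per item. What should I do in this case?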

import scrapy

from learn_scrapy.items import UNMemberItem
from scrapy.loader import ItemLoader

class UNMemberDetails(scrapy.Spider):
    name = 'UN_details_loader'
    start_urls = ['http://www.un.org/en/member-states/index.html']

    def parse(self, response):
        item_loader_object = ItemLoader(UNMemberItem(), response=response)
        # nested_css() matches every member block on the page at once
        nested_loader = item_loader_object.nested_css('div.member-state.col-md-12')
        nested_loader.add_css('country_name', 'span.member-state-name::text')
        nested_loader.add_css('join_date', 'span.date-display-single::text')
        nested_loader.add_css('website', 'div.site > a::text')
        nested_loader.add_css('phone', 'div.phone > ul > li::text')
        nested_loader.add_css('mail_address', 'div.mail > a::text')
        nested_loader.add_css('national_holiday', 'div.national-holiday::text')
        # Yields a single item holding the values of all members
        yield item_loader_object.load_item()

1 Answer:

Answer 0: (score: 0)

After some research, I found the solution.

Instead of this

def parse(self, response):
    item_loader_object = ItemLoader(UNMemberItem(), response=response)

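you have to pass the selector argument to the ItemLoader constructor. The ItemLoader will then extract the item fields from the specified selector instead of from the whole response (the entire web page).

It is like cutting one part of the page out of the whole response and picking your item fields from that part only, while the loop walks through all such parts of the page.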

def parse(self, response):
    # member_tag is a single 'div.member-state.col-md-12' selector taken from the response
    item_loader_object = ItemLoader(UNMemberItem(), selector=member_tag)

The new code needs to look something like this

def parse(self, response):
    members_tag = response.css('div.member-state.col-md-12')
    for member in members_tag:
        # One loader per member block, scoped to that block's selector
        item_loader = ItemLoader(UNMemberItem(), selector=member)
        item_loader.add_css('country_name', 'span.member-state-name::text')
        item_loader.add_css('join_date', 'span.date-display-single::text')
        item_loader.add_css('website', 'div.site > a::text')
        item_loader.add_css('phone', 'div.phone > ul > li::text')
        item_loader.add_css('mail_address', 'div.mail > a::text')
        item_loader.add_css('national_holiday', 'div.national-holiday::text')
        yield item_loader.load_item()

This code is much cleaner than the first snippet in the question, and it gets the job done.