How to extract text from different elements

Posted: 2019-05-05 08:14:18

Tags: scrapy

I am building and developing a new Scrapy spider.

I am using Windows 10 and the spider runs. My problem is extracting text from different elements: the text sometimes sits in an element (a strong tag or a p tag) identified by a class, and sometimes in one identified by an id, but I need a single selector that extracts the description text in every case.

Please check out the website links:

https://exhibits.otcnet.org/otc2019/Public/eBooth.aspx?IndexInList=404&FromPage=Exhibitors.aspx&ParentBoothID=&ListByBooth=true&BoothID=193193&fromFeatured=1

https://exhibits.otcnet.org/otc2019/Public/eBooth.aspx?IndexInList=0&FromPage=Exhibitors.aspx&ParentBoothID=&ListByBooth=true&BoothID=202434

https://exhibits.otcnet.org/otc2019/Public/eBooth.aspx?IndexInList=1218&FromPage=Exhibitors.aspx&ParentBoothID=&ListByBooth=true&BoothID=193194&fromFeatured=1


Screenshots: https://prnt.sc/nkl1vc, https://prnt.sc/nkl1zy, https://prnt.sc/nkl247


    # -*- coding: utf-8 -*-
    import scrapy


    class OtcnetSpider(scrapy.Spider):
        name = 'otcnet'
        # allowed_domains = ['otcnet.org']
        start_urls = ['https://exhibits.otcnet.org/otc2019/Public/Exhibitors.aspx?Index=All&ID=26006&sortMenu=107000']

        def parse(self, response):
            links = response.css('a.exhibitorName::attr(href)').extract()

            for link in links:
                ab_link = response.urljoin(link)

                yield scrapy.Request(ab_link, callback=self.parse_p)


        def parse_p(self, response):
            url = response.url

            Company = response.xpath('//h1/text()').extract_first()
            if Company:
                Company = Company.strip()
            Country = response.xpath('//*[@class="BoothContactCountry"]/text()').extract_first()

            State = response.xpath('//*[@class="BoothContactState"]/text()').extract_first()
            if State:
                State = State.strip()
            Address1 = response.xpath('//*[@class="BoothContactAdd1"]/text()').extract_first()
            City = response.xpath('//*[@class="BoothContactCity"]/text()').extract_first()
            if City:
                City = City.strip()
            zip_c = response.xpath('//*[@class="BoothContactZip"]/text()').extract_first()

            Address = str(Address1) + ' ' + str(City) + ' ' + str(State) + ' ' + str(zip_c)

            Website = response.xpath('//*[@id="BoothContactUrl"]/text()').extract_first()
            Booth = response.css('.eBoothControls li:nth-of-type(1)::text').extract_first()
            if Booth:
                Booth = Booth.replace('Booth: ', '')

            # This is the field I cannot fill: the description text is spread
            # over elements whose tags/classes/ids differ from page to page.
            Description = ''

            Products = response.css('.caption b::text').extract()
            Products = ', '.join(Products)
            vid_bulien = response.css('.aa-videos span.hidden-md::text').extract_first()
            if vid_bulien=="Videos":
                vid_bulien = "Yes"
            else:
                vid_bulien = "No"
            Video_present = vid_bulien
            Conference_link = url
            Categories = response.css('.ProductCategoryLi a::text').extract()
            Categories = ', '.join(Categories)

            # Drop the 'None' placeholders left by missing address parts
            Address = Address.replace('None', '')

            yield {
                'Company': Company,
                'Country': Country,
                'State': State,
                'Address': Address,
                'Website': Website,
                'Booth': Booth,
                'Description': Description,
                'Products': Products,
                'Video_present': Video_present,
                'Conference_link': Conference_link,
                'Categories': Categories,
            }

I expect the output to include the description assembled from those different elements.

1 Answer:

Answer 0: (score: 0)

Following this post and the excellent answer by @dimitre-novatchev, you need to find a node-set intersection. For your page, $ns1 is:

//p[@class="BoothProfile"]/following-sibling::p

$ns2 is:

//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p
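For context, the generic node-set intersection formula from that answer (often called the Kayessian method) is:

$ns1[count(. | $ns2) = count($ns2)]

The expression below is simply this formula with the two node-sets above substituted in.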

So the p elements you need to process are:

//p[@class="BoothProfile"]/following-sibling::p[count(.|//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p) = count(//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p)]

You can use the following Scrapy code:

    for p_elem in response.xpath('//p[@class="BoothProfile"]/following-sibling::p[count(.|//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p) = count(//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p)]'):
        # using string() to stringify <p>
        Description += p_elem.xpath('string(.)').extract_first()
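
If it helps, here is a minimal sketch of how that loop could sit inside parse_p, right after Description = ''. The names booth_profile_xpath and paragraphs are only illustrative, and stripping the pieces and joining them with newlines is an assumption about the desired output, not part of the answer above:

    # Illustrative sketch (assumed to live inside parse_p, after `Description = ''`).
    # booth_profile_xpath is the intersection expression from above, split for readability.
    booth_profile_xpath = (
        '//p[@class="BoothProfile"]/following-sibling::p'
        '[count(.|//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p)'
        ' = count(//p[@class="BoothProfile"]/following-sibling::div[1]/preceding-sibling::p)]'
    )
    paragraphs = []
    for p_elem in response.xpath(booth_profile_xpath):
        # string(.) concatenates all descendant text of the <p>
        text = p_elem.xpath('string(.)').extract_first()
        if text and text.strip():
            paragraphs.append(text.strip())
    Description = '\n'.join(paragraphs)

Joining with '\n' is just one choice; a space or ', ' would work the same way, and the empty-string check simply skips paragraphs that contain no text.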