Question

I want to scrape the website linked below. I need several parameters, and this is the best solution I have found so far. However, I need to scrape two different parts of the page, and I have no idea how to combine them properly (as columns of the same item), which is why I need your help. I am also open to a better solution. I also need to skip some rows that are scraped incorrectly, and I don't want to add null rows. Here is my output as a file: http://s7.dosya.tc/server14/tnx4u0/test.json.zip.html

Really, the table loop has to sit inside the base loop, but I wrote it this way for now to show it more clearly. Thanks a lot.

    from scrapy import Spider


    class KingsatSpider(Spider):
        name = 'kingsat'
        # allowed_domains takes bare domain names, not full URLs
        allowed_domains = ['tr.kingofsat.net']
        start_urls = ['https://tr.kingofsat.net/tvsat-turksat4a.php']

        def parse(self, response):
            tables = response.xpath('//*[@class="fl"]/tr')
            bases = response.xpath('//table[@class="frq"]/tr')

            for base in bases:
                # First item: the header fields only
                yield {
                    'Frekans': base.xpath('.//td[3]/text()').extract_first(),
                    'Polarizasyon': base.xpath('.//td[4]/text()').extract_first(),
                    'Kapsam': base.xpath('.//td[6]/a/text()').extract_first(),
                    'SR': base.xpath('.//td[9]/a[1]/text()').extract_first(),
                    'FEC': base.xpath('.//td[9]/a[2]/text()').extract_first(),
                }

                # Separate items: the channel fields only (not linked to the header)
                for table in tables:
                    yield {
                        'channel': table.xpath('.//td[3]/a/text()').extract_first(),
                        'V-PID': table.xpath('.//td[9]/text()[1]').extract_first(),
                        'A-PID': table.xpath('.//td[10]/text()[1]').extract_first(),
                    }

Answer 1

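The page has the structure

base (header)
table with many rows
base (header)
table with many rows

etc.

You get all the headers in bases and all the rows in tables as separate items, but you have to get each table as a single element so that you can create pairs (base, table). Then you should get the rows from each table, so that every row is matched with the correct base.

In the xpath I select tables without the trailing tr, so that I can create pairs (base, table-with-all-its-rows).

Then I can get the rows from each table and yield them together with its base.

I can't test it. Maybe you will have to skip the first base: zip(bases[1:], tables)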

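    # Header rows
    bases = response.xpath('//table[@class="frq"]/tr')
    # Whole tables (no trailing /tr), so each table keeps all of its rows
    tables = response.xpath('//*[@class="fl"]')

    # Pair each header row with the table at the same position
    for base, table in zip(bases, tables):
        rows = table.xpath('.//tr')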
        for row in rows:
            yield {
                'Frekans':      base.xpath('.//td[3]/text()').extract_first(),
                'Polarizasyon': base.xpath('.//td[4]/text()').extract_first(),
                'Kapsam':       base.xpath('.//td[6]/a/text()').extract_first(),
                'SR':           base.xpath('.//td[9]/a[1]/text()').extract_first(),
                'FEC':          base.xpath('.//td[9]/a[2]/text()').extract_first(),
                'channel' :     row.xpath('.//td[3]/a/text()').extract_first(),
                'V-PID' :       row.xpath('.//td[9]/text()[1]').extract_first(),
                'A-PID' :       row.xpath('.//td[10]/text()[1]').extract_first(),
            }
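
The question also asks to skip badly scraped rows and to avoid null rows, which the loop above does not do. A minimal sketch of one way to handle that, assuming the header fields and the row fields have already been extracted into plain dicts. The helper name `combine_rows`, the choice of `channel` as the required field, and the sample values are all hypothetical and not part of the original answer:

```python
def combine_rows(base_fields, rows, required=("channel",)):
    """Merge one header dict into each of its rows, skipping incomplete rows."""
    for row in rows:
        # Skip rows where a required field is missing (None) or blank.
        if any(not (row.get(key) or "").strip() for key in required):
            continue
        # Header fields and row fields end up in one combined item.
        yield {**base_fields, **row}


# Hypothetical sample data standing in for values returned by extract_first().
base = {"Frekans": "11000.00", "Polarizasyon": "H"}
rows = [
    {"channel": "Channel A", "V-PID": "100", "A-PID": "200"},
    {"channel": None, "V-PID": None, "A-PID": None},  # e.g. a separator row
]
items = list(combine_rows(base, rows))
```

Inside `parse`, the same idea would probably look like building one dict of header fields per `base`, one dict per `row`, and then using `yield from combine_rows(...)` instead of yielding every row unconditionally.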

Answer 2

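If the tables are related to the bases, then just split them into two parts; that is the best way to solve the problem. If they are not related to each other and their counts are the same, you can use the following approach.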

    def parse(self, response):
        tables = response.xpath('//*[@class="fl"]/tr')
        bases = response.xpath('//table[@class="frq"]/tr')

        # Pair the i-th header row with the i-th table row (assumes equal counts)
        for i in range(len(bases)):
            yield {
                'Frekans': bases[i].xpath('.//td[3]/text()').extract_first(),
                'A-PID': tables[i].xpath('.//td[10]/text()[1]').extract_first(),
            }

If their counts are not the same, you can only treat them as a whole. You can then process them in a pipeline.
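
As a rough illustration of the pipeline idea, here is a sketch of a Scrapy item pipeline that drops incomplete items. The class name `DropIncompletePipeline` and the list of required fields are assumptions made for this example, and the `try`/`except` fallback exists only so the sketch can be run without Scrapy installed:

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:
    # Stand-in so this sketch can run without Scrapy installed.
    class DropItem(Exception):
        pass


class DropIncompletePipeline:
    """Drop items that are missing any of the required fields."""

    # Assumed field names, taken from the spiders above; adjust as needed.
    required_fields = ("Frekans", "channel")

    def process_item(self, item, spider):
        for field in self.required_fields:
            # A field that is absent, None, or empty counts as missing.
            if not item.get(field):
                raise DropItem(f"missing field: {field}")
        return item
```

A pipeline like this is enabled through the `ITEM_PIPELINES` setting of the project, keyed by the import path of the class, which depends on your own project layout.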

How to combine yield
