How do I retrieve a list of items from an HTML table using XPath?

Time: 2019-01-06 10:59:29

Tags: python xpath web-scraping scrapy

I am trying to extract the information from a table into a dictionary in Python 3.7.

The HTML of the table looks like this:

            <dl class="rlxr-specs__block-list">
                <dt class="rlxr-specs__block-list--name">heading</dt>
                <dd class="rlxr-specs__definition-content">
                    <div class="rlxr-specs__definition-title">Key1</div>
                    <span class="rlxr-specs__definition-desc">bla</span>
                </dd>
                <dd class="rlxr-specs__definition-content">
                    <div class="rlxr-specs__definition-title">Key2</div>
                    <span class="rlxr-specs__definition-desc">blub</span>
                </dd>

My best guess is:

items = {}
for row in response.xpath('//dd[@class="rlxr-specs__definition-content"]'):
    items[row.xpath('./div/text()').extract_first()] = items[row.xpath('./span/text()').extract_first()]

I get a KeyError that comes from another part of the page, so something in my XPath selector must be wrong.

More information:

>>> for row in response.xpath('//dd[@class="rlxr-specs__definition-content"]'):
...     print(row.xpath('./div/text()'))
... 
[<Selector xpath='./div/text()' data='Gehäuse'>]
[<Selector xpath='./div/text()' data='Aufbau des Oyster Gehäuses'>]
[<Selector xpath='./div/text()' data='Durchmesser'>]
[<Selector xpath='./div/text()' data='Material'>]
[<Selector xpath='./div/text()' data='Lünette'>]
[<Selector xpath='./div/text()' data='Aufzugskrone'>]
[<Selector xpath='./div/text()' data='Uhrglas'>]
[<Selector xpath='./div/text()' data='Wasserdichtheit'>]
[<Selector xpath='./div/text()' data='Manufakturwerk'>]
[<Selector xpath='./div/text()' data='Kaliber'>]
[<Selector xpath='./div/text()' data='Ganggenauigkeit'>]
[<Selector xpath='./div/text()' data='Funktionen'>]
[<Selector xpath='./div/text()' data='Oszillator'>]
[<Selector xpath='./div/text()' data='Aufzug'>]
[<Selector xpath='./div/text()' data='Gangreserve'>]
[<Selector xpath='./div/text()' data='Armband'>]
[<Selector xpath='./div/text()' data='Material'>]
[<Selector xpath='./div/text()' data='Schließe'>]
[<Selector xpath='./div/text()' data='Zifferblatt'>]
[<Selector xpath='./div/text()' data='Edelsteinfassung'>]
[]
>>> for row in response.xpath('//dd[@class="rlxr-specs__definition-content"]'):
...     print(row.xpath('./span/text()'))
... 
[<Selector xpath='./span/text()' data='Oyster, 28 mm, Edelstahl Oystersteel und'>]
[<Selector xpath='./span/text()' data='Monoblock-Mittelteil, verschraubter Gehä'>]
[<Selector xpath='./span/text()' data='28 mm'>]
[<Selector xpath='./span/text()' data='Rolesor Everose (Kombination aus Edelsta'>]
[<Selector xpath='./span/text()' data='Diamantlünette'>]
[<Selector xpath='./span/text()' data='Verschraubbare Twinlock-Aufzugskrone mit'>]
[<Selector xpath='./span/text()' data='Kratzfestes Saphirglas, Zykloplupe\xa0zur\xa0V'>]
[<Selector xpath='./span/text()' data='Bis 100 Meter Tiefe wasserdicht'>]
[<Selector xpath='./span/text()' data='Mechanisches Perpetual-Uhrwerk, Selbstau'>]
[<Selector xpath='./span/text()' data='2236, Rolex Manufakturwerk'>]
[<Selector xpath='./span/text()' data='-2/+2 Sekunden pro Tag, gemessen nach de'>]
[<Selector xpath='./span/text()' data='Stunden-, Minuten- und Sekundenzeiger im'>]
[]
[<Selector xpath='./span/text()' data='Selbstaufzugsmechanismus, in beide Richt'>]
[<Selector xpath='./span/text()' data='Circa 55 Stunden'>]
[<Selector xpath='./span/text()' data='Jubilé, fünfreihig'>]
[<Selector xpath='./span/text()' data='Rolesor Everose (Kombination aus Edelsta'>]
[<Selector xpath='./span/text()' data='Verdeckte Crownclasp-Faltschließe'>]
[<Selector xpath='./span/text()' data='Helles Perlmuttzifferblatt mit Diamanten'>]
[<Selector xpath='./span/text()' data='Diamanten, Fassung 18 Karat Gold'>]
[<Selector xpath='./span/text()' data='Chronometer der Superlative  (COSC + Rol'>]
>>> 

How can I get the table into a dictionary?

1 answer:

Answer 0: (score: 0)

Check whether the title and description values exist, and set a default value when one of them is missing.
