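How do I select all nodes between certain headers?

Time: 2019-04-30 20:07:34

Tags: python-3.x xpath scrapy css-selectors

Each <header> tag contains a conference title. Each <ul> tag contains the links for that conference.

While scraping the site, I am trying to associate each <header> tag with the links in its <ul> tags. But I don't know how to select only the <ul> tags that are siblings sitting between two specific <header> tags.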

HTML:

<header>... 0 ... </header>
<ul class="publ-list">... 0 ...</ul>
<header>... 1 ... </header> 
<ul class="publ-list">... 0 ...</ul>
<header>... 2 ... </header>
<ul class="publ-list">... 0 ...</ul>
<p>...</p>
<ul class="publ-list">... 1 ...</ul>
<header>... 3 ...</header>
<ul class="publ-list">... 0 ...</ul>
<ul class="publ-list">... 1 ...</ul>
<ul class="publ-list">... 2 ....</ul>
<ul class="publ-list">... 3 ....</ul>
<ul class="publ-list">... 4 ....</ul>
<header>... 4 ...</header>

Examples:

  • The <ul> tags that are siblings between header [0] and header [1]

    <ul class="publ-list">... 0 ...</ul>
    
  • The <ul> tags that are siblings between header [2] and header [3]

    <ul class="publ-list">... 0 ...</ul>
    <ul class="publ-list">... 1 ...</ul>
    

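Some cases:

  • There may be several <ul> tags between two header tags
  • Sometimes there is a <p> tag between the <ul> tags
  • All of the tags are siblings!
  • Every <ul> has the "publ-list" class

My code: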

TITLE_OF_EDITIONS_SELECTOR = 'header h2'
GROUP_OF_TYPES_OF_EDITION_SELECTOR = ".publ-list"

editions = {}
size_editions = len(response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR))
i = 0
while i < size_editions:

    # Get the title of the conference (the i-th header is paired with the i-th <ul>)
    title_edition_conference = response.css(TITLE_OF_EDITIONS_SELECTOR)[i]


    # Get the data and links of the <ul> tags (".publ-list")
    TYPES_OF_CONFERENCE = response.css(GROUP_OF_TYPES_OF_EDITION_SELECTOR)[i]
    TYPE = TYPES_OF_CONFERENCE.css('.entry')
    types_of_edition = {}
    size_type_editions = 0
    for type_of_conference in TYPE:
        title_type = type_of_conference.css('.data .title ::text').extract()
        link_type = type_of_conference.css('.publ ul .drop-down .body ul li a ::attr(href)').extract_first()
        types_of_edition[size_type_editions] = {
            "title": title_type,
            "link": link_type,
            }
        size_type_editions = size_type_editions + 1

    editions[i] = {
        "title_edition_conference": title_edition_conference,
        "types_of_edition": types_of_edition
        }
    i = i + 1

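Problems with my code:

  • Sometimes there are many <ul> tags
  • Sometimes there is a <p> tag, which breaks my XPath, so I only get the <ul> tags before it.

I tested this with jQuery in the Google Chrome console, for example: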

"$($('header')[0]).nextUntil($('header')[1])"

But how can I select this with XPath or a CSS selector? Thanks!

3 answers:

Answer 0 (score: 0)

Try using following-sibling, like this:

>>> txt = """<header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <p>...</p>
... <ul class="publ-list">...</ul>
... <header>..</header>
... <ul class="publ-list">...</ul>
... <ul class="publ-list">...</ul>
... <header>..</header>"""
>>> from scrapy import Selector
>>> sel = Selector(text=txt)
>>> sel.xpath('//header/following-sibling::*[not(self::header)]').extract()
[u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<p>...</p>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>', u'<ul class="publ-list">...</ul>']

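So with //header/following-sibling::*[not(self::header)] we select every sibling that follows a header and is not itself a header. Note that the result is one flat list: it does not say which header each node belongs to.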

Answer 1 (score: 0)

This may be what you are looking for.

html = """
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<p>...</p>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
<ul class="publ-list">...</ul>
<header>..</header>
<ul class="publ-list">...</ul>
"""

Note that I added one <ul> before the first <header>..</header> and another after the last one.

This expression

 //ul[   
preceding-sibling::header 
    and 
following-sibling::header
   ]

should select every <ul> tag except the two I added before the first header and after the last one, and it should not select any <p> tags.

Answer 2 (score: 0)

The following combination of a CSS selector and a Python for loop can solve this task.

from parsel import Selector

html  = """
<ul class="publ-list">p1</ul>
<header>h1</header>
<ul class="publ-list">p2</ul>
<header>h2</header>
<ul class="publ-list">p3</ul>
<header>h3</header>
<ul class="publ-list">p4</ul>
<p>p_tag_1</p>
<ul class="publ-list">p5</ul>
<header>h4</header>
<ul class="publ-list">p6</ul>
<ul class="publ-list">p7</ul>
<header>h5</header>
<ul class="publ-list">p8</ul>
"""
response = Selector(text=html)
tags = response.css("header, ul")
output = {}
key = False
for t in tags:
    if key and "<ul" in t.css("*").extract_first():
        output[key].append(t.css("::text").extract_first())
    elif "<header>" in t.css("*").extract_first():
        key = t.css("::text").extract_first()
        if key not in output.keys():
            output[key]=[]
    else:
        pass
print(output)

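Output: {'h1': ['p2'], 'h2': ['p3'], 'h3': ['p4', 'p5'], 'h4': ['p6', 'p7'], 'h5': ['p8']}

This CSS selector, tags = response.css("header, ul"), returns a list of the <header> and <ul> tags in the same order as they appear in the HTML.

After that, we can iterate over the returned tags with a for loop and pick out the data we need.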