Previous and next sibling problem - Python - BS4

Asked: 2017-11-06 13:50:42

Tags: python web-scraping beautifulsoup

Sorry to bother you with such a simple question, but I'm losing my mind.

I'm trying to extract one specific piece of information from the HTML below. In this case I want the XXXX (the text, to be more specific).

    <div id="links">
        <h3 id="financial">
            Financial S<span class="linktype">Commer</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
        <a href="http:" target="_blank">We</a> | xxxx<br/>
        <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="services">
            Services<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">To</a> | xxxx <br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="dr">
            Dr<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="physical">
            Phys<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>

I'm using BS4 to process it:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        print(titulo)
        print(link)

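My question is how to get the XXXXX, which is the description of the link. As you can see, it is not inside the 'a' tag but after the "|" and, I think, before the "br /". (By the way, I don't understand why there is a "br /" when no "br" was opened before it. Is that normal?)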

I tried the previous and next siblings.

    for x in xpto:
        desc = x.parent.find_next_sibling('a')
        desc2 = x.parent.find_previous_sibling('b')
        print(desc)
        print(desc2)

Both gave me 'None' as the result. Does anyone know what is going on?

Update

I'd like to do it within the loop. Something like this:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        desc = x.parent.find_next_sibling('a')
        print(titulo)
        print(desc)
        print(link)

I built the xpto object like this:

    xpto = links.find_all(['h3', 'a']) # which works for the title and link.

To be able to get the desc object, I think I should change xpto to something like this:

    xpto = links.find_all(['h3', 'a'], a.next.next.strip(' |')) #it would include the thing and after I would be able to do the loop. But I have no idea how to do such a complex findAll.

Sorry, guys. Web scraping is really hard!

Thanks for your help =D

btw: Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04), MacBook, macOS Sierra 10.12.6

1 answer:

Answer 0 (score: 0)

You can just use next twice and then strip the parts of the text you don't want. For example:

    from bs4 import BeautifulSoup

    html = """
        <div id="links">
            <h3 id="financial">
                Financial S<span class="linktype">Commer</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
            <a href="http:" target="_blank">We</a> | xxxx<br/>
            <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="services">
                Services<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">To</a> | xxxx <br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="dr">
                Dr<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Eu</a> | xxxx<br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>
            <h3 id="physical">
                Phys<span class="linktype">Commercial Links</span>
            </h3>
            <hr/>
            <a href="http:" target="_blank">Eu</a> | xxxx<br/>
            <a href="http:" target="_blank">On</a> | xxxx <br/>
            <div class="up"><a href="#top" title="Back to top">Λ</a></div>"""

    soup = BeautifulSoup(html, "html.parser")
    div = soup.find('div', id='links')

    for el in div.find_all(['a', 'h3']):
        if el.name == 'a':
            if 'target' in el.attrs:        # Only 'a' tags with target
                print("link text '{}', link '{}', desc '{}'".format(el.text, el['href'], el.next.next.strip(' |\n')))
        else:
            el.span.clear()     # Remove 'Commercial Links' (if not needed)
            print("h3_title '{}'".format(el.get_text(strip=True)))

This displays:

h3_title 'Financial S'
link text 'Ea', link 'http:', desc 'xxxxx'
link text 'We', link 'http:', desc 'xxxx'
link text 'HQ', link 'http:', desc 'xxxxx'
h3_title 'Services'
link text 'To', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Dr'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Phys'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
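
As for why the sibling calls in the question returned None, a likely explanation is that `x.parent` is the whole `<div id="links">`, which has no `a` or `b` tags next to it, and that `find_next_sibling` only ever returns tags, never plain text. The description is a text node sitting right after each `a`, so `next_sibling` on the `a` itself should reach it. The following is only a rough sketch under those assumptions, using a trimmed-down copy of the HTML and the illustrative names `first_a`, `parent_result`, `next_tag` and `rows`, none of which come from the question:

```python
from bs4 import BeautifulSoup, NavigableString

# Trimmed-down copy of the HTML from the question (assumption: same structure).
html = """
<div id="links">
    <h3 id="financial">Financial S<span class="linktype">Commer</span></h3>
    <hr/>
    <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
    <a href="http:" target="_blank">We</a> | xxxx<br/>
    <div class="up"><a href="#top" title="Back to top">Λ</a></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
links = soup.find("div", id="links")
first_a = links.find("a", target="_blank")

# The parent is <div id="links">; no <a> follows it at its own level.
parent_result = first_a.parent.find_next_sibling("a")

# find_next_sibling skips text nodes, so from the first <a> it lands on the next <a>.
next_tag = first_a.find_next_sibling("a")

# next_sibling returns whatever node comes next, including text such as " | xxxxx".
rows = []
for a in links.find_all("a", target="_blank"):
    node = a.next_sibling
    # Guard against an <a> that is followed by a tag (or nothing) instead of text.
    desc = node.strip(" |\n") if isinstance(node, NavigableString) else ""
    rows.append((a.get_text(), a.get("href"), desc))

print(parent_result)
print(next_tag)
print(rows)
```

This is equivalent in spirit to calling next twice: the first step from the `a` goes into its own text, the second lands on the text that follows the tag.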

You will sometimes see <br />, which is used with XHTML documents; <br> is more common.
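
To illustrate the point about `br`, here is a small check, assuming the built-in "html.parser" (other parsers may wrap the fragment in extra tags). `br` is a void element, so it never has a separate closing tag, and both spellings should end up as the same tree. The variable names are made up for this sketch:

```python
from bs4 import BeautifulSoup

# The same fragment written with the XHTML-style and the plain HTML spelling.
with_slash = BeautifulSoup("one<br/>two", "html.parser")
without_slash = BeautifulSoup("one<br>two", "html.parser")

# Both parse to an empty <br> element with the text "two" as its next sibling.
same_tree = str(with_slash) == str(without_slash)
print(same_tree)
print(repr(without_slash.br.next_sibling))
```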