Question

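Sorry to bother you with such a simple question, but I'm losing my mind.

I'm trying to get one specific piece of information out of the HTML below. In this case, I want the XXXX (the text, to be more specific).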

    <div id="links">
        <h3 id="financial">
            Financial S<span class="linktype">Commer</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
        <a href="http:" target="_blank">We</a> | xxxx<br/>
        <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="services">
            Services<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">To</a> | xxxx <br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="dr">
            Dr<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="physical">
            Phys<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>

I'm using BS4 to process it:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        print(titulo)
        print(link)

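My problem is how to get the XXXXX, which is the description of the link. As you can see, it is not inside the 'a' tag but after the "|" and, I think, before the "br /". (By the way, I don't understand why there is a "br /" when no "br" was opened before it. Is that normal?)

I tried the previous and next siblings.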

    for x in xpto:
        desc = x.parent.find_next_sibling('a')
        desc2 = x.parent.find_previous_sibling('b')
        print(desc)
        print(desc2)

Both gave me 'None' as the result. Does anyone know what is going on?

Update

I would like to do this inside the same loop, something like this:

    for x in xpto:
        titulo = x.text #to get the Name link. Worked
        link = str(x.get("href")) #To get just the link. Worked too.
        desc = x.parent.find_next_sibling('a')
        print(titulo)
        print(desc)
        print(link)

I built the xpto object like this:

    xpto = links.find_all(['h3', 'a']) #which works with the title and link.

To be able to get the desc object, I think I should change xpto to something like this:

    xpto = links.find_all(['h3', 'a'], a.next.next.strip(' |')) #it would include the thing and after I would be able to do the loop. But I have no idea how to do such a complex findAll.

Sorry, guys. Web scraping is really hard!

Thanks for your help =D

btw: Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04), MacBook, Sierra 10.12.6

Answer 1

You can just use next twice and then strip off the parts of the text you don't want. For example:

from bs4 import BeautifulSoup

html = """
    <div id="links">
        <h3 id="financial">
            Financial S<span class="linktype">Commer</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
        <a href="http:" target="_blank">We</a> | xxxx<br/>
        <a href="http:" target="_blank">HQ</a> | xxxxx<br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="services">
            Services<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">To</a> | xxxx <br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="dr">
            Dr<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>
        <h3 id="physical">
            Phys<span class="linktype">Commercial Links</span>
        </h3>
        <hr/>
        <a href="http:" target="_blank">Eu</a> | xxxx<br/>
        <a href="http:" target="_blank">On</a> | xxxx <br/>
        <div class="up"><a href="#top" title="Back to top">Λ</a></div>"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', id='links')

for el in div.find_all(['a', 'h3']):
    if el.name == 'a':
        if 'target' in el.attrs:        # Only 'a' tags with target
            print("link text '{}', link '{}', desc '{}'".format(el.text, el['href'], el.next.next.strip(' |\n')))
    else:
        el.span.clear()     # Remove 'Commercial Links' (if not needed)
        print("h3_title '{}'".format(el.get_text(strip=True)))

This would display:

h3_title 'Financial S'
link text 'Ea', link 'http:', desc 'xxxxx'
link text 'We', link 'http:', desc 'xxxx'
link text 'HQ', link 'http:', desc 'xxxxx'
h3_title 'Services'
link text 'To', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Dr'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Phys'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
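
As a possible variation on the approach above, the description can also be read from the link's `next_sibling`, since the text " | xxxx" is a text node sitting directly after each `</a>`. The sketch below uses a shortened copy of the question's markup and a hypothetical helper named `link_rows` (not part of BeautifulSoup). It also suggests why the sibling attempts in the question may have printed None: for these links, `x.parent` is the enclosing `<div id="links">`, so `find_next_sibling('a')` looks among that div's siblings rather than after the link itself.

```python
from bs4 import BeautifulSoup, NavigableString

# Shortened copy of the question's markup (one section only, div closed here).
html = """
<div id="links">
    <h3 id="financial">Financial S<span class="linktype">Commer</span></h3>
    <hr/>
    <a href="http:" target="_blank">Ea</a> | xxxxx<br/>
    <a href="http:" target="_blank">We</a> | xxxx<br/>
    <div class="up"><a href="#top" title="Back to top">Λ</a></div>
</div>
"""


def link_rows(container):
    """Hypothetical helper: (text, href, description) for each <a> with a target attribute."""
    rows = []
    for a in container.find_all('a', attrs={'target': True}):
        node = a.next_sibling  # the text node right after </a>, e.g. " | xxxxx"
        # Guard against links that are followed by a tag or by nothing at all.
        desc = node.strip(' |\n') if isinstance(node, NavigableString) else ''
        rows.append((a.get_text(strip=True), a['href'], desc))
    return rows


soup = BeautifulSoup(html, "html.parser")
links = soup.find('div', id='links')
rows = link_rows(links)
for row in rows:
    print(row)

# The parent div has no sibling tags in this snippet, so this search finds nothing.
print(links.find_next_sibling('a'))
```

Filtering on the `target` attribute skips the "Back to top" links, in the same way as the `'target' in el.attrs` check above.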

You will sometimes see `<br />`, which is used in XHTML documents; `<br>` is more common.
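
To illustrate the point about the line-break tag, here is a small sketch, assuming the same `html.parser` backend used above. It parses both spellings and checks that the tag ends up empty either way, with the following text as its sibling, so there is no separate opening "br" to look for.

```python
from bs4 import BeautifulSoup

# The same fragment written with the XHTML-style and the HTML-style line break.
xhtml_style = BeautifulSoup('<a href="http:">Ea</a> | xxxxx<br/>next', "html.parser")
html_style = BeautifulSoup('<a href="http:">Ea</a> | xxxxx<br>next', "html.parser")

for soup in (xhtml_style, html_style):
    br = soup.br
    # <br> is a void element: it has no children and needs no closing tag,
    # so the text before and after it are its siblings in both cases.
    print(repr(br.previous_sibling), list(br.children), repr(br.next_sibling))
```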
