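Sorry to bother you with such a simple question, but I'm at my wits' end.
I'm trying to extract one specific piece of information from the HTML below. In this case I want the XXXX (the text, to be specific).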
<div id="links">
<h3 id="financial">
Financial S<span class="linktype">Commer</span>
</h3>
<hr/>
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx<br/>
<a href="http:" target="_blank">HQ</a> | xxxxx<br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="services">
Services<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">To</a> | xxxx <br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="dr">
Dr<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="physical">
Phys<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
I'm using BS4 to process it:
for x in xpto:
    titulo = x.text  # to get the link name. Worked.
    link = str(x.get("href"))  # to get just the link. Worked too.
    print(titulo)
    print(link)
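My question is how to get the XXXXX, which is the description of the link. As you can see, it is not inside the 'a' tag but after the "|" and, I think, before the "br /". (By the way, I don't understand why there is a "br /" with no "br" before it to open it. Is that normal?)
I tried the previous and next siblings.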
for x in xpto:
    desc = x.parent.find_next_sibling('a')
    desc2 = x.parent.find_previous_sibling('b')
    print(desc)
    print(desc2)
Both gave me 'None' as the result. Does anyone know what is going on?
I'd like to do it together with the other loop. Something like this:
for x in xpto:
    titulo = x.text  # to get the link name. Worked.
    link = str(x.get("href"))  # to get just the link. Worked too.
    desc = x.parent.find_next_sibling('a')
    print(titulo)
    print(desc)
    print(link)
I built the xpto object like this:
xpto = links.find_all(['h3', 'a'])  # which works for the title and link.
To be able to get desc, I think I should change xpto to something like this:
xpto = links.find_all(['h3', 'a'], a.next.next.strip(' |'))  # it would include the description, and then I could run the loop. But I have no idea how to write such a complex findAll.
Sorry, guys. Web scraping is really hard!
Thanks for your help =D
btw: Python 3.6.1 (v3.6.1:69c0db5050, Mar 21 2017, 01:21:04), MacBook, macOS Sierra 10.12.6
Answer 0 (score: 0)
You can just use next twice, and then strip off the parts of the text you don't want. For example:
from bs4 import BeautifulSoup
html = """
<div id="links">
<h3 id="financial">
Financial S<span class="linktype">Commer</span>
</h3>
<hr/>
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx<br/>
<a href="http:" target="_blank">HQ</a> | xxxxx<br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="services">
Services<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">To</a> | xxxx <br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="dr">
Dr<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
<h3 id="physical">
Phys<span class="linktype">Commercial Links</span>
</h3>
<hr/>
<a href="http:" target="_blank">Eu</a> | xxxx<br/>
<a href="http:" target="_blank">On</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>"""
soup = BeautifulSoup(html, "html.parser")
div = soup.find('div', id='links')
for el in div.find_all(['a', 'h3']):
    if el.name == 'a':
        if 'target' in el.attrs:  # Only 'a' tags with target
            print("link text '{}', link '{}', desc '{}'".format(el.text, el['href'], el.next.next.strip(' |\n')))
    else:
        el.span.clear()  # Remove 'Commercial Links' (if not needed)
        print("h3_title '{}'".format(el.get_text(strip=True)))
This displays:
h3_title 'Financial S'
link text 'Ea', link 'http:', desc 'xxxxx'
link text 'We', link 'http:', desc 'xxxx'
link text 'HQ', link 'http:', desc 'xxxxx'
h3_title 'Services'
link text 'To', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Dr'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
h3_title 'Phys'
link text 'Eu', link 'http:', desc 'xxxx'
link text 'On', link 'http:', desc 'xxxx'
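As for why the sibling attempts in the question returned None: x.parent is the enclosing <div id="links">, so x.parent.find_next_sibling('a') looks for an 'a' tag next to that div, not next to the link itself. Below is a minimal sketch of an alternative that reads the text node directly after each link with next_sibling. It assumes the description is always the node immediately following the 'a' tag; the helper name link_description and the shortened HTML are made up for illustration.

```python
from bs4 import BeautifulSoup
from bs4.element import NavigableString

# Shortened, hypothetical version of the HTML from the question
html = """
<div id="links">
<a href="http:" target="_blank">Ea</a> | xxxxx<br/>
<a href="http:" target="_blank">We</a> | xxxx <br/>
<div class="up"><a href="#top" title="Back to top">Λ</a></div>
</div>
"""


def link_description(a_tag):
    """Return the text that directly follows a_tag, or None if there is none."""
    node = a_tag.next_sibling
    if isinstance(node, NavigableString):
        # Drop the leading " | " separator and surrounding whitespace
        text = node.strip(" |\n")
        return text or None
    return None


soup = BeautifulSoup(html, "html.parser")
links = soup.find("div", id="links")

# target=True keeps only 'a' tags that have a target attribute,
# which skips the "Back to top" link
rows = []
for a in links.find_all("a", target=True):
    rows.append((a.get_text(), a["href"], link_description(a)))

for row in rows:
    print(row)

# What the question's code searched: siblings of the parent div
parent_sibling = links.find_next_sibling("a")
print(parent_sibling)
```

Checking the node type first avoids calling strip on a tag when a link happens to be followed directly by another element.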
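You will sometimes see <br />; that form is used with XHTML documents, while <br> is more common. Either way it is a void element, so there is no separate opening and closing pair.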
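A quick way to check that both spellings end up as the same tree, at least with the html.parser builder used above. This is only a small sketch and the sample strings are made up:

```python
from bs4 import BeautifulSoup

# The same fragment written XHTML-style and HTML-style
xhtml_style = BeautifulSoup('<a href="http:">Ea</a> | xxxxx<br/>', "html.parser")
html_style = BeautifulSoup('<a href="http:">Ea</a> | xxxxx<br>', "html.parser")

# A void element never has children, whichever way it was written
br_children = list(html_style.br.children)

# Both inputs serialize to the same markup
same_tree = str(xhtml_style) == str(html_style)

print(br_children)
print(same_tree)
```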