I have an HTML snippet that I'm parsing with BeautifulSoup:
<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>
With a simple select() and get_text() call:
soup.select('h1:nth-child(1)')[0].get_text().strip()
I get the following (\n = newline):
Zhuzhou Wide-Ocean Motor \n\n\n See more information
But I would like to get rid of the "See more information" text, which is inside the <a> tag.
I've tried to use decompose(), but it doesn't work on the result of select(). How can I get decompose() to work?
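Answer 0 (score: 2)
Here are a few options.
Option 1:
One workaround is to split the text on "\n" and drop the items that contain only whitespace, which leaves a list of the individual text elements. In this case you only need the first item.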
import bs4
html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''
soup = bs4.BeautifulSoup(html, 'html.parser')
# Split on newlines, drop whitespace-only pieces, keep the first one
text = [item.strip() for item in soup.text.split('\n') if item.strip()][0]
print(text)
Output:
Zhuzhou Wide-Ocean Motor
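A variant of the split approach, shown here only as a sketch: BeautifulSoup's stripped_strings generator yields each text node already stripped and skips whitespace-only nodes, so the manual split and filter are not needed. The variable names strings and first_text are made up for this example, and it still assumes the wanted text is the first text node in the <h1>.

```python
from bs4 import BeautifulSoup

html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields every text node under <h1> with surrounding
# whitespace removed, skipping nodes that are whitespace only.
strings = list(soup.h1.stripped_strings)

# Assumption: the text we want is the first non-empty text node.
first_text = strings[0]
print(first_text)
```

Like the split version, this depends on the order of the text nodes, so it breaks if other text is added before the company name.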
Option 2:
Find the <a> tag and take its previous sibling:
import bs4
html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''
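soup = bs4.BeautifulSoup(html, 'html.parser')
# previous_sibling is the text node directly before <a>
# (previousSibling is the older, deprecated spelling)
text = soup.find('a').previous_sibling.strip()
print(text)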
Output:
Zhuzhou Wide-Ocean Motor
Option 3:
This is probably how I would approach it.
Find the <img> tag and take its next sibling:
import bs4
html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''
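soup = bs4.BeautifulSoup(html, 'html.parser')
# next_sibling is the text node directly after <img>
# (nextSibling is the older, deprecated spelling)
text = soup.find('img').next_sibling.strip()
print(text)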
Output:
Zhuzhou Wide-Ocean Motor
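The sibling lookups above depend on the wanted text sitting directly next to one particular tag. As a rough alternative, assuming the goal is "all text that belongs to <h1> itself and not to its child tags", you could keep only the direct children that are text nodes. The names direct_text and text are just illustrative.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

# Keep only text nodes that are direct children of <h1>;
# text nested inside <a> is a child of <a>, so it is ignored.
direct_text = [
    child.strip()
    for child in h1.children
    if isinstance(child, NavigableString) and child.strip()
]

text = ' '.join(direct_text)
print(text)
```

This leaves the parsed tree untouched, which may matter if you still need the link later.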
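Answer 1 (score: 1)
The other answer already covers the techniques needed to get the desired text. However, if you still want to use .decompose() or .extract(), the following should work: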
from bs4 import BeautifulSoup
htmlelem= """
<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>
"""
soup = BeautifulSoup(htmlelem, 'lxml')
# select() returns a list, so remove each matching <a> one at a time
for elem in soup.select("a.button"):
    elem.extract()
item = soup.select_one("h1").get_text(strip=True)
print(item)
Output:
Zhuzhou Wide-Ocean Motor
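As for why decompose() did not work in the question: select() returns a list of tags, and decompose() is a method of an individual tag, not of that list. A minimal sketch of the same idea with decompose(), using the built-in html.parser so that lxml is not required (the selector 'h1 a.button' and the variable name title are choices made for this example):

```python
from bs4 import BeautifulSoup

html = '''<h1>
<img src="CHN.jpg" alt="image">
Zhuzhou Wide-Ocean Motor
<a class="button" href="/en/top300">
See more information
</a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')

# Calling decompose() on the list returned by select() fails,
# so call it on each matching tag instead.
for link in soup.select('h1 a.button'):
    link.decompose()  # removes the tag and its contents from the tree

title = soup.select_one('h1').get_text(strip=True)
print(title)
```

Note that decompose() modifies the soup in place, so the link is gone for any later lookups. If you still need the removed tag, extract() returns it instead of destroying it.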