Question

I have some HTML that I'm parsing with BeautifulSoup:

<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>

With a simple select and get_text:

soup.select('h1:nth-child(1)')[0].get_text().strip()

I get the following (\n marks a newline):

Zhuzhou Wide-Ocean Motor \n\n\n See more information

But I would like to get rid of the "See more information" text, which is inside the <a> tag. I've tried to use decompose(), but it doesn't work on a select result. How can I get decompose() to work?

Answer 1

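Here are a few options.

Option 1:

One workaround is to split the text on '\n' and discard the whitespace-only pieces, which gives you a list with one entry per text element. In this case you only need the first item.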

import bs4

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# Split on newlines, drop whitespace-only pieces, and keep the first item
text = [item.strip() for item in soup.text.split('\n') if item.strip() != ''][0]

print(text)

Output:

Zhuzhou Wide-Ocean Motor
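As a rough alternative to splitting by hand, the same "first non-empty piece of text" idea can be sketched with the stripped_strings generator that BeautifulSoup provides on tags. This is not part of the original answer; the variable name first_line is only for illustration, and it assumes the wanted text is the first non-blank string inside the <h1>.

```python
from bs4 import BeautifulSoup

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')

# stripped_strings yields every text node with surrounding whitespace
# removed and skips the whitespace-only ones, so the first item is the
# company name (assumption: it comes before any other text in the <h1>).
first_line = next(soup.h1.stripped_strings)

print(first_line)
```

Like Option 1, this depends on the order of the text in the markup, so it would pick up the wrong string if the link came first.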

Option 2:

Find the <a> tag and take its previous sibling, which is the text node in front of it:

import bs4

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# previous_sibling is the bs4 name; previousSibling is a deprecated alias
text = soup.find('a').previous_sibling.strip()
print(text)

Output:

Zhuzhou Wide-Ocean Motor

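Option 3:

This is probably how I would do it. Find the <img> tag and take its next sibling, which is the text node that follows it:

import bs4

html = '''<h1>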
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# next_sibling is the bs4 name; nextSibling is a deprecated alias
text = soup.find('img').next_sibling.strip()
print(text)

Output:

Zhuzhou Wide-Ocean Motor
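The sibling lookups above work because the company name is a bare text node sitting directly inside the <h1>, between the <img> and the <a>. A minimal sketch of that idea, assuming you want all of the <h1>'s own text and none of the text nested in child tags, is to keep only the direct children that are strings. The names own_text and title are made up for this example.

```python
from bs4 import BeautifulSoup, NavigableString

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

# Direct children of <h1> are either Tag objects (<img>, <a>) or
# NavigableString text nodes. Keep the non-blank text nodes only, so
# the text nested inside <a> is never looked at.
own_text = [
    child.strip()
    for child in h1.children
    if isinstance(child, NavigableString) and child.strip()
]
title = ' '.join(own_text)

print(title)
```

This leaves the parsed tree untouched, which may matter if the <a> element is still needed later, for example to read its href.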

Answer 2

The other answer already covers the techniques needed to get the desired text. However, if you still want to use .decompose() or .extract(), the following should work:

from bs4 import BeautifulSoup

htmlelem= """
<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>
"""

soup = BeautifulSoup(htmlelem, 'lxml')
# select() returns a list of tags, so remove each matching <a> in turn
for elem in soup.select("a.button"):
    elem.extract()
item = soup.select_one("h1").get_text(strip=True)
print(item)

Output:

Zhuzhou Wide-Ocean Motor
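On the original question of why decompose() appears not to work on a select result: select() returns a list of Tag objects, and decompose() is a method of an individual Tag, not of the list. A minimal sketch of the decompose() variant follows, using html.parser so that lxml is not required. The names h1, link and title are chosen for this example, and the removal is limited to links inside the <h1> on the assumption that other links on the page should be kept.

```python
from bs4 import BeautifulSoup

html = """
<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>
"""

soup = BeautifulSoup(html, "html.parser")

# select_one() returns a single Tag (or None), unlike select(),
# which always returns a list.
h1 = soup.select_one("h1")

# decompose() must be called on each Tag, not on the list itself.
# It removes the tag and its contents from the tree and destroys them.
for link in h1.select("a"):
    link.decompose()

title = h1.get_text(strip=True)
print(title)
```

The difference from extract() is that extract() returns the removed element so it can still be used, while decompose() discards it. Either way the tree is modified in place, so run this only after you have read anything you need from the <a> element.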
