Remove (decompose) an <a> element from a BeautifulSoup select result

Date: 2019-02-02 13:21:50

Tags: python web-scraping beautifulsoup

I have some HTML that I'm parsing with BeautifulSoup:

<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>

With a simple select and get_text:

soup.select('h1:nth-child(1)')[0].get_text().strip()

I get (\n = newline):

Zhuzhou Wide-Ocean Motor \n\n\n See more information

But I would like to get rid of the "See more information" text, which is inside the <a> tag. I've tried to use decompose(), but it doesn't work on a select result. How can I get decompose() to work?

2 answers:

Answer 0 (score: 2)

Here are a few options.

Option 1:

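One approach is to split the text on '\n' and discard the whitespace-only pieces, which gives you a list of the individual text elements. In this case you only need the first item.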

import bs4

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# Split on newlines, drop whitespace-only pieces, and keep the first one
text = [item.strip() for item in soup.text.split('\n') if item.strip()][0]

print(text)

Output:

Zhuzhou Wide-Ocean Motor
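
The same idea can also be sketched with Beautiful Soup's .stripped_strings generator, which yields each non-empty text fragment with the surrounding whitespace already removed. This is only a minimal sketch; it assumes, as in the HTML above, that the wanted text is the first fragment inside the <h1>.

```python
from bs4 import BeautifulSoup

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')

# .stripped_strings yields the stripped, non-empty strings in document order
fragments = list(soup.h1.stripped_strings)

# Assumption: the wanted text is the first fragment inside the <h1>
text = fragments[0]
print(text)
```

If the <a> came before the company name, the first fragment would be the link text instead, so this depends on the element order just like the split-based version.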

Option 2:

Find the <a> tag and take its previous sibling:

import bs4

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# The text node directly before <a> holds the company name
text = soup.find('a').previous_sibling.strip()
print(text)

Output:

Zhuzhou Wide-Ocean Motor

Option 3:

This is probably how I would do it. Find the <img> tag and take its next sibling:

import bs4

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>'''


soup = bs4.BeautifulSoup(html, 'html.parser')

# The text node directly after <img> holds the company name
text = soup.find('img').next_sibling.strip()
print(text)

Output:

Zhuzhou Wide-Ocean Motor
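
Options 2 and 3 both rely on the text sitting directly next to a particular tag. A variant that does not depend on the order of the siblings is to collect only the text nodes that are direct children of the <h1>, which skips anything nested inside <a>. The sketch below assumes that all of the wanted text is a direct child of the <h1>, as it is in the question's HTML.

```python
from bs4 import BeautifulSoup

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')
h1 = soup.find('h1')

# string=True matches text nodes; recursive=False limits the search to
# direct children, so the text inside <a> is not included
direct_strings = h1.find_all(string=True, recursive=False)

text = ''.join(direct_strings).strip()
print(text)
```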

Answer 1 (score: 1)

The other answer already covers the tricks you need to get the desired text. However, if you still want to use .decompose() or .extract(), the following should work:

from bs4 import BeautifulSoup

htmlelem= """
<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>                    
</h1>
"""

soup = BeautifulSoup(htmlelem, 'lxml')
# select() returns a list of tags, so remove each matching <a> individually
for elem in soup.select("a.button"):
    elem.extract()
item = soup.select_one("h1").get_text(strip=True)
print(item)

Output:

Zhuzhou Wide-Ocean Motor
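
As for why decompose() appeared not to work on a select result: select() returns a list of tags, while decompose() is a method of an individual tag, so it has to be called on each item of the list rather than on the list itself. A minimal sketch of the decompose() variant follows. It uses the built-in 'html.parser' instead of 'lxml' only to avoid the extra dependency, which is a choice made for this example.

```python
from bs4 import BeautifulSoup

html = '''<h1>
    <img src="CHN.jpg" alt="image">
        Zhuzhou Wide-Ocean Motor
    <a class="button" href="/en/top300">
        See more information
    </a>
</h1>'''

soup = BeautifulSoup(html, 'html.parser')

# Call decompose() on each selected tag, not on the list returned by select()
for link in soup.select('h1 a'):
    link.decompose()  # removes the tag from the tree and destroys it

item = soup.select_one('h1').get_text(strip=True)
print(item)
```

Both extract() and decompose() modify the parsed tree in place; extract() returns the removed tag in case it is still needed, while decompose() discards it.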