Extracting strings from HTML

Time: 2017-12-07 11:36:34

Tags: python beautifulsoup

I have the following element:

<div class="column4">
        Unlimited Subscription<br/> Discount for Monthly <br/> Total Amount
    </div>

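How can I extract the three strings as three separate elements using only Beautiful Soup? String conversion and regular expressions may not be used.

Expected output: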

Unlimited Subscription
Discount for Monthly 
Total Amount

3 answers:

Answer 0: (score: 2)

To get the individual strings, you can take the children of the div element and filter them by type (html is assumed to hold the markup from the question):

>>> import bs4
>>> bs = bs4.BeautifulSoup(html, "html.parser")
>>> div = bs.find(attrs={"class":"column4"})
>>> [c.strip() for c in div.children if type(c) is bs4.element.NavigableString]
['Unlimited Subscription', 'Discount for Monthly', 'Total Amount']

Or, more briefly, use div.stripped_strings (or just div.strings if you do not want the strings stripped).
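The shorter form mentioned in this answer can be sketched as a standalone script. This is a minimal illustration rather than the answerer's original snippet; the variable names and the choice of the built-in "html.parser" backend are assumptions.

```python
from bs4 import BeautifulSoup

# The markup from the question (variable name is illustrative).
html = """
<div class="column4">
        Unlimited Subscription<br/> Discount for Monthly <br/> Total Amount
    </div>
"""

soup = BeautifulSoup(html, "html.parser")
div = soup.find("div", class_="column4")

# stripped_strings yields each text node inside the div with surrounding
# whitespace removed, skipping nodes that contain only whitespace.
strings = list(div.stripped_strings)
print(strings)
# ['Unlimited Subscription', 'Discount for Monthly', 'Total Amount']
```

Because the result is a list, each string is available as its own element (strings[0], strings[1], strings[2]), which is what the question asks for.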

Answer 1: (score: 0)

If you want the output displayed exactly as shown above, you can use the following:

from bs4 import BeautifulSoup

html_elem ="""
<div class="column4">
    Unlimited Subscription<br/> Discount for Monthly <br/> Total Amount
</div>
"""
soup = BeautifulSoup(html_elem, 'lxml')
for item in soup.select(".column4"):
    # Replace each <br/> tag with a newline so the text splits across lines
    for data in item.select("br"):
        data.replace_with("\n")
    print(item.text.strip())

Output:

Unlimited Subscription
Discount for Monthly 
Total Amount
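A possible alternative that avoids modifying the tree is get_text() with a separator. This is only a sketch and not part of the original answer. It assumes the built-in "html.parser" backend, so lxml is not required, and unlike the code above it strips the whitespace around each line.

```python
from bs4 import BeautifulSoup

html_elem = """
<div class="column4">
    Unlimited Subscription<br/> Discount for Monthly <br/> Total Amount
</div>
"""

soup = BeautifulSoup(html_elem, "html.parser")
for item in soup.select(".column4"):
    # Join the stripped text nodes with newlines; the <br/> tags are left
    # in place and simply contribute no text of their own.
    text = item.get_text(separator="\n", strip=True)
    print(text)
```

Note that this still produces one string containing newlines. For three separate elements, iterating over stripped_strings is the more direct route.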

Answer 2: (score: -1)

from bs4 import BeautifulSoup
html_doc = """<div class="column4">
        Unlimited Subscription<br/> Discount for Monthly <br/> Total Amount
    </div>
"""
soup = BeautifulSoup(html_doc, 'html.parser')
# Prints the div's text as a single string, not as three separate elements
print(soup.find("div").text.strip())