Question

I have an HTML file, and I want to grab the text in this block, shown below:

 <strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>

I want it to display as:

User Name
@UserName

How can I do this with Beautiful Soup?

Answer 1

Use the "text" attribute. For example:

>>> b = BeautifulSoup.BeautifulStoneSoup(open('/tmp/x.html'), convertEntities=BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES)

>>> print b.find(attrs={"id": "container"}).text
User Name‏@UserName

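In x.html, I have a div with the id "container" that contains the HTML you provided. Note that I used BeautifulStoneSoup to convert &rlm; into \u200f. To insert a line break (which a browser would not introduce), just replace '\u200f' with '\n'.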
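The replacement step described above can be sketched as follows. This is a minimal sketch, not the answer's original code: it assumes Python 3 and swaps the old BeautifulSoup 3 API for bs4 with the built-in "html.parser", and it inlines the markup in a wrapper div with id "container" instead of reading /tmp/x.html.

```python
from bs4 import BeautifulSoup

# Assumption: the question's markup wrapped in a div with id "container",
# inlined here instead of being read from /tmp/x.html.
html = '''<div id="container">
    <strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>
</div>'''

soup = BeautifulSoup(html, "html.parser")
container = soup.find(attrs={"id": "container"})

# strip=True trims each text node and drops the whitespace-only ones,
# leaving the right-to-left mark (U+200F) between the two names.
raw_text = container.get_text(strip=True)

# Replace the right-to-left mark with a newline, as the answer suggests.
text = raw_text.replace("\u200f", "\n")
print(text)
```

The parser decodes &rlm; to U+200F on its own here, so no extra entity-conversion option is needed in this variant.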

Answer 2

from bs4 import BeautifulSoup

html = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

# Name the parser explicitly; bs4 warns when it has to guess one
soup = BeautifulSoup(html, 'html.parser')

username = soup.find(attrs={'class':'username js-action-profile-name'}).text
fullname = soup.find(attrs={'class':'fullname js-action-profile-name'}).text

print(fullname)
print(username)

Output:

User Name
@UserName

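Two notes:

If you are starting something new or just learning BS, use bs4.
You will probably load the HTML from an external file, so replace html with a file object.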
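As a rough illustration of the second note, the sketch below passes an open file object to bs4 instead of a string. It assumes Python 3; the temporary file is only there to keep the example self-contained, and in practice you would open your own HTML file. Matching on a single class with class_ is a variation of my own, not part of the answer above.

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

# Write the sample markup to a temporary file (a stand-in for a real HTML file).
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(markup)
    path = tmp.name

try:
    # BeautifulSoup accepts an open file object as well as a string.
    with open(path, encoding="utf-8") as fh:
        soup = BeautifulSoup(fh, "html.parser")
finally:
    os.remove(path)

# class_ matches a tag if any one of its classes equals the given value.
fullname = soup.find("strong", class_="fullname").get_text()
username = soup.find("span", class_="username").get_text()

print(fullname)
print(username)
```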

Answer 3

This assumes index.html contains the markup from the question:

import BeautifulSoup

def displayUserInfo():

    soup = BeautifulSoup.BeautifulSoup(open("index.html"))
    fullname_ele = soup.find(attrs={"class": "fullname js-action-profile-name"})
    fullname = fullname_ele.contents[0]
    print fullname

    username_ele = soup.find(attrs={"class": "username js-action-profile-name"})
    username = ""
    for child in username_ele.findChildren():
        username += child.contents[0]
    print username

if __name__ == '__main__':
    displayUserInfo()

# prints:
# User Name
# @UserName
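For readers on Python 3, the child-by-child approach above might look like this with bs4. This is a sketch under assumptions: the markup is inlined instead of being read from index.html, and find_all(recursive=False) stands in for the BeautifulSoup 3 findChildren() call.

```python
from bs4 import BeautifulSoup

html = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

soup = BeautifulSoup(html, "html.parser")

username_ele = soup.find("span", class_="username")

# Visit only the direct child tags (<s> and <b>); .string returns the
# single text node inside each of them.
parts = [child.string for child in username_ele.find_all(recursive=False)]
username = "".join(parts)

print(username)
```

Building the name from the child tags makes the structure explicit, though calling get_text() on the span would give the same string in one step.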

Beautiful Soup - extracting classes from an HTML file

3 answers: