Beautiful Soup - extracting elements by class from an HTML file

Time: 2012-03-12 02:28:44

Tags: python html beautifulsoup

I have an HTML file, and I want to grab the text inside this block:

 <strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>

I want it displayed as:

User Name
@UserName

How can I do this with Beautiful Soup?

3 answers:

Answer 0 (score: 1)

Use the "text" attribute. For example:

>>> b = BeautifulSoup.BeautifulStoneSoup(open('/tmp/x.html'), convertEntities=BeautifulSoup.BeautifulStoneSoup.HTML_ENTITIES)

>>> print b.find(attrs={"id": "container"}).text
User Name‏@UserName

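In x.html I have a div with the id "container" that wraps the HTML you provided. Note that I used BeautifulStoneSoup with convertEntities so that the &rlm; entity is converted to '\u200f'. To insert a line break (a browser would not introduce one), just replace '\u200f' with '\n'.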
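The replacement step described above is a plain string operation. Here is a minimal sketch, assuming the extracted text is exactly the string shown in the output (the variable names `text` and `lines` are illustrative only):

```python
# Text as returned by .text above: the two names separated by U+200F
# (RIGHT-TO-LEFT MARK), which is what the &rlm; entity decodes to.
text = "User Name\u200f@UserName"

# Swap the invisible mark for a newline so each name is on its own line.
lines = text.replace("\u200f", "\n").split("\n")

for line in lines:
    print(line)
```

Splitting on the newline afterwards is optional; it is only there to make the two values easy to use separately.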

Answer 1 (score: 1)

from bs4 import BeautifulSoup

html = '''<strong class="fullname js-action-profile-name">User Name</strong>
    <span>&rlm;</span>
    <span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

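# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(html, 'html.parser')

# Match on the exact value of the class attribute
username = soup.find(attrs={'class':'username js-action-profile-name'}).text
fullname = soup.find(attrs={'class':'fullname js-action-profile-name'}).text

print(fullname)
print(username)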

Output:

User Name
@UserName

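Two notes:

  1. If you are starting fresh or just learning BS, use bs4.

  2. You will probably be loading the HTML from an external file, so replace html with a file object.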
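As a rough sketch of the second note, bs4 accepts an open file object in place of a string. The temporary file below only stands in for a real HTML file on disk, and the names `markup` and `path` are illustrative assumptions; the example also uses `class_` to match a single class rather than the full attribute value:

```python
import os
import tempfile

from bs4 import BeautifulSoup

markup = '''<strong class="fullname js-action-profile-name">User Name</strong>
<span>&rlm;</span>
<span class="username js-action-profile-name"><s>@</s><b>UserName</b></span>'''

# Write the markup to a temporary file (a stand-in for your real HTML file).
with tempfile.NamedTemporaryFile("w", suffix=".html", delete=False,
                                 encoding="utf-8") as tmp:
    tmp.write(markup)
    path = tmp.name

# Pass the open file object straight to BeautifulSoup.
with open(path, encoding="utf-8") as fh:
    soup = BeautifulSoup(fh, "html.parser")
os.remove(path)

# class_ matches an element that has this class among its classes.
fullname = soup.find(class_="fullname").get_text()
username = soup.find(class_="username").get_text()

print(fullname)
print(username)
```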

Answer 2 (score: 0)

This assumes index.html contains the markup from the question:

import BeautifulSoup

def displayUserInfo():

    soup = BeautifulSoup.BeautifulSoup(open("index.html"))
    fullname_ele = soup.find(attrs={"class": "fullname js-action-profile-name"})
    fullname = fullname_ele.contents[0]
    print fullname

    username_ele = soup.find(attrs={"class": "username js-action-profile-name"})
    username = ""
    for child in username_ele.findChildren():
        username += child.contents[0]
    print username

if __name__ == '__main__':
    displayUserInfo()

# prints:
# User Name
# @UserName
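
For readers on bs4 rather than BeautifulSoup 3, the child-by-child concatenation might be written more compactly with `stripped_strings`. This is only a sketch under that assumption, using an inline copy of the username span instead of index.html:

```python
from bs4 import BeautifulSoup

# Inline copy of the username span from the question (stands in for index.html).
markup = ('<span class="username js-action-profile-name">'
          '<s>@</s><b>UserName</b></span>')

soup = BeautifulSoup(markup, "html.parser")
username_ele = soup.find(attrs={"class": "username js-action-profile-name"})

# stripped_strings yields every text node under the element, whitespace-trimmed,
# so joining them reproduces the loop over findChildren().
username = "".join(username_ele.stripped_strings)
print(username)
```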