Get the content of a div inside a div using BeautifulSoup?

Time: 2015-12-18 06:09:08

Tags: python beautifulsoup

I want to get the content of the div with the class "gt-read". Inside that div there is another div with a different class. Here is the script snippet:

Script:

data = """
    <div class='gt-read'>
        <!-- no need -->
        <!-- some no need -->

        <b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

        <div class='some-class'>
            <script>
                #...
                Js script here
                #...
            </script>
        </div>
    </div>
    """
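from bs4 import BeautifulSoup

soup = BeautifulSoup(data, 'lxml')
get_class = soup.find("div", {"class" : "gt-read"})
# Text only, with the tags stripped
print('notices', get_class.get_text())
# The whole div, including the comments and the nested div
print('notices', get_class)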

I want a result like this:

<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>

Please help.

1 Answer:

Answer 0: (score: 2)

The following should produce what you need:

from bs4 import BeautifulSoup, Comment  

data = """
    <div class='gt-read'>
        <!-- no need -->
        <!-- some no need -->

        <b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

        <div class='some-class'>
            <script>
                #...
                Js script here
                #...
            </script>
        </div>
    </div>
    """
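soup = BeautifulSoup(data, 'lxml')
get_class = soup.find("div", {"class" : "gt-read"})

# Remove every comment node inside the div
comments = get_class.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()

# Remove the first nested div (and the script inside it)
get_class.find("div").extract()
# decode_contents() returns the inner HTML as a str
text = get_class.decode_contents().strip()

print(text)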

This gives you the following output:

<b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

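This finds the div with the class gt-read, extracts (removes) all comments and the nested div tag from it, and returns the remaining markup.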
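
The same steps can be wrapped in a small reusable function. This is only a sketch under a few assumptions: the helper name `inner_html_without_divs` is made up for illustration and is not part of BeautifulSoup, it uses the standard-library `html.parser` backend instead of `lxml` so that no extra install is needed, and it removes every nested div rather than only the first one.

```python
from bs4 import BeautifulSoup, Comment


def inner_html_without_divs(html, class_name):
    """Return the inner HTML of the first <div class=class_name>,
    with comments and nested divs removed.

    Hypothetical helper for illustration; returns None if no such div exists.
    """
    # 'html.parser' ships with Python; swap in 'lxml' if it is installed.
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        return None

    # Drop comment nodes.
    for comment in container.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Drop every nested div, together with whatever it contains.
    for nested in container.find_all("div"):
        nested.extract()

    # decode_contents() gives the remaining children as a str.
    return container.decode_contents().strip()


if __name__ == "__main__":
    sample = """
    <div class='gt-read'>
        <!-- no need -->
        <b>Bold text</b> - some text here <br/>
        <strong> Author Name</strong>
        <div class='some-class'><script>var x = 1;</script></div>
    </div>
    """
    print(inner_html_without_divs(sample, "gt-read"))
```

Returning None when the div is missing avoids the AttributeError that a chained call on the result of find() would otherwise raise.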