Question

I want to get the contents of the div with the class "gt-read". Inside that div there is another div with a different class. Here is the script snippet:

Script:

from bs4 import BeautifulSoup

data = """
    <div class='gt-read'>
        <!-- no need -->
        <!-- some no need -->

        <b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

        <div class='some-class'>
            <script>
                #...
                Js script here
                #...
            </script>
        </div>
    </div>
    """
soup = BeautifulSoup(data, 'lxml')
get_class = soup.find("div", {"class": "gt-read"})
print('notices', get_class.get_text())
print('notices', get_class)

I want a result like this:

<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>

Please help.

Answer 1

The following should give you what you need:

from bs4 import BeautifulSoup, Comment  

data = """
    <div class='gt-read'>
        <!-- no need -->
        <!-- some no need -->

        <b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

        <div class='some-class'>
            <script>
                #...
                Js script here
                #...
            </script>
        </div>
    </div>
    """
soup = BeautifulSoup(data, 'lxml')
get_class = soup.find("div", {"class": "gt-read"})

# Remove every HTML comment inside the div
for comment in get_class.find_all(string=lambda s: isinstance(s, Comment)):
    comment.extract()

# Remove the nested div, then serialize what is left as a string
get_class.find("div").extract()
text = get_class.decode_contents().strip()

print(text)

This gives the following output:

<b>Bold text</b> - some text here <br/>
        lorem ipsum here <br/>
        <strong> Author Name</strong>

This finds the div with the class gt-read, extracts all comments and the nested div tag from it, and returns the remaining markup.
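
If the same cleanup is needed in more than one place, the steps can be wrapped in a small helper. The following is only a sketch: the function name clean_inner_html and the shortened sample markup are made up for illustration, and it uses the built-in 'html.parser' instead of 'lxml' so that it runs without the extra dependency. It also removes every nested div rather than just the first, and returns None when no matching div is found.

```python
from bs4 import BeautifulSoup, Comment


def clean_inner_html(html, class_name):
    """Return the inner HTML of the first div with `class_name`,
    without comments or nested divs. Returns None if there is no such div.
    (Hypothetical helper, not part of BeautifulSoup.)"""
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        return None

    # Drop every comment node inside the container.
    for comment in container.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Drop nested divs one at a time until none are left.
    nested = container.find("div")
    while nested is not None:
        nested.decompose()
        nested = container.find("div")

    # Serialize the remaining children as a string.
    return container.decode_contents().strip()


# Shortened sample markup, assumed for this example only.
sample = """
<div class='gt-read'>
    <!-- no need -->
    <b>Bold text</b> - some text here <br/>
    <strong> Author Name</strong>
    <div class='some-class'><script>var x = 1;</script></div>
</div>
"""

result = clean_inner_html(sample, "gt-read")
print(result)
```

decompose() is used for the nested divs because their contents are not needed afterwards; extract() would work as well if you wanted to keep the removed tag around.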
