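I want to get the contents of a div by its class, "gt-read". That div also contains another div with a different class. Here is the script snippet:
Script:
from bs4 import BeautifulSoup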
data = """
<div class='gt-read'>
<!-- no need -->
<!-- some no need -->
<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>
<div class='some-class'>
<script>
#...
Js script here
#...
</script>
</div>
</div>
"""
soup = BeautifulSoup(data, 'lxml')
# Look up the div by the class actually used in the markup above.
get_class = soup.find("div", {"class": "gt-read"})
print('notices', get_class.get_text())
print('notices', get_class)
I want a result like this:
<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>
Please help.
Answer 0 (score: 2)
The following should give you what you need:
from bs4 import BeautifulSoup, Comment
data = """
<div class='gt-read'>
<!-- no need -->
<!-- some no need -->
<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>
<div class='some-class'>
<script>
#...
Js script here
#...
</script>
</div>
</div>
"""
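soup = BeautifulSoup(data, 'lxml')
get_class = soup.find("div", {"class": "gt-read"})
# Collect every comment node inside the div and remove each one.
comments = get_class.find_all(string=lambda text: isinstance(text, Comment))
for comment in comments:
    comment.extract()
# Remove the nested div (and the script inside it).
get_class.find("div").extract()
# decode_contents() returns the remaining inner markup as a str.
text = get_class.decode_contents().strip()
print(text)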
This gives you the following output:
<b>Bold text</b> - some text here <br/>
lorem ipsum here <br/>
<strong> Author Name</strong>
This finds the div with the gt-read class, extracts all comments and the nested div tag, and returns the remaining markup.
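If you need this in more than one place, the same steps could be wrapped in a small helper. This is only a sketch: the function name inner_markup is made up for illustration, it uses the built-in "html.parser" instead of lxml so no extra parser is required, and it assumes you want every nested div removed, not just the first one.

```python
from bs4 import BeautifulSoup, Comment


def inner_markup(html, class_name):
    """Return the inner HTML of the first div with class_name,
    minus comments and nested divs, or None if no such div exists.

    Hypothetical helper for illustration; not part of BeautifulSoup.
    """
    soup = BeautifulSoup(html, "html.parser")
    container = soup.find("div", class_=class_name)
    if container is None:
        return None

    # Remove every comment node inside the container.
    for comment in container.find_all(string=lambda s: isinstance(s, Comment)):
        comment.extract()

    # Remove nested divs one at a time until none are left.
    nested = container.find("div")
    while nested is not None:
        nested.decompose()
        nested = container.find("div")

    return container.decode_contents().strip()
```

Calling inner_markup(data, "gt-read") on the markup from the question should return the three lines shown above. Returning None when the class is not found avoids the AttributeError you would otherwise get from calling methods on the result of a failed find().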