BeautifulSoup4 findChildren() returns empty

Date: 2018-08-15 03:34:53

Tags: python html parsing beautifulsoup

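I'm trying to use the findChildren() function. Basically, I want all the <p> tags under a particular <h3> tag. I'm trying some simple code, but children comes back empty. h3 returns the correct line (see the print(h3) comment), and print(type(children)) prints the type: <class 'bs4.element.ResultSet'>. Please tell me what I'm doing wrong.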

import re
from bs4 import BeautifulSoup

soup = BeautifulSoup(contents, 'html.parser')
h3 = soup.find('h3', text=re.compile('chapter', re.IGNORECASE))
print(h3)  # prints <h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
children = h3.findChildren('p')
print(type(children))  # prints <class 'bs4.element.ResultSet'>

I also tried h3.findChildren('p', Recursive=True) and children = h3.findChildren(Recursive=True). Those come back empty as well.

Here is the part of the HTML I'm trying to scrape:

<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
    <span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
    <span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
    <span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
<p></p>

3 answers:

Answer 0: (score: 0)

In the sample you provided, the h3 node has no child tags. All of the p nodes sit outside it.
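To make the parent/sibling distinction concrete, here is a minimal sketch. The `html` string is a shortened, hypothetical stand-in for the asker's `contents`, and the variable names are illustrative only.

```python
from bs4 import BeautifulSoup

# Hypothetical, shortened stand-in for the asker's `contents`.
html = """
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p dir="ltr"><span>Stephanus Grayland did not try to hide his smile.</span></p>
<p></p>
"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# The <h3> contains only its text node, so there is no <p> inside it to find.
children_p = h3.findChildren("p")

# The <p> tags share the <h3>'s parent, so they appear as following siblings.
sibling_p = h3.find_next_siblings("p")

print(len(children_p), len(sibling_p))
```

With this input the first count is 0 and the second is 2, which matches the empty ResultSet described in the question.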

If you wrap the content in a div (for example), you can see that you are using the right technique:

>>> soup = BeautifulSoup('<div>' + contents + '</div>', 'html.parser')
>>> div = soup.find('div')
>>> div.findChildren('p')
[<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>, <p> </p>]
>>> 

Edit

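As you mentioned in the comment above, the h3 and p nodes are siblings in the content you provided. I'm not sure it makes sense for p elements to be children of an h3, but if they were, it would look like this: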

<h3>
This content is within the h3 tag
<p>this is a child of h3</p>
<p>another child</p>
</h3>
<p>this is not a child of h3 as it is after the h3 close tag</p>
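As a quick check of that layout, the sketch below parses the snippet with `html.parser`, which keeps the nesting as written. Other parsers may restructure the markup differently, so treat the result as specific to this parser. The variable names are illustrative.

```python
from bs4 import BeautifulSoup

nested = """
<h3>
This content is within the h3 tag
<p>this is a child of h3</p>
<p>another child</p>
</h3>
<p>this is not a child of h3 as it is after the h3 close tag</p>
"""

nested_soup = BeautifulSoup(nested, "html.parser")
nested_h3 = nested_soup.find("h3")

# Only the two <p> tags written inside the <h3> are returned here.
inner = [p.get_text() for p in nested_h3.findChildren("p")]
print(inner)
```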

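It isn't clear what your criteria are for selecting p nodes in the sample content. A simple soup.find_all('p') would return all of these tags, but I suspect you need to restrict it somehow so that other content isn't included. Can you elaborate? You may just want something like this: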

>>> soup = BeautifulSoup(content, 'html.parser')
>>> h3 = soup.find('h3')
>>> h3.find_next_sibling('p')
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
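If the goal turns out to be every paragraph belonging to one heading rather than only the first, one possible approach is to walk `next_siblings` and stop at the next `<h3>`. This is a sketch under that assumption. The `paragraphs_after` helper and the sample markup are hypothetical and not part of Beautiful Soup or the asker's page.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample with two chapters at the same nesting level.
html = """
<h3>CHAPTER ONE</h3>
<p>first paragraph</p>
<p>second paragraph</p>
<h3>CHAPTER TWO</h3>
<p>belongs to chapter two</p>
"""


def paragraphs_after(heading):
    """Collect <p> siblings that follow `heading`, up to the next <h3>."""
    collected = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # skip whitespace text nodes between tags
        if sibling.name == "h3":
            break  # the next chapter starts here
        if sibling.name == "p":
            collected.append(sibling)
    return collected


soup = BeautifulSoup(html, "html.parser")
first_h3 = soup.find("h3")
chapter_one = [p.get_text() for p in paragraphs_after(first_h3)]
print(chapter_one)
```

For this sample, `chapter_one` holds the two paragraphs of the first chapter and nothing from the second.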

Answer 1: (score: 0)

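Thanks to those who responded. My problem was that the <h3> and the <p> "children" are siblings, not parent and child. I think these posts are what I'm after in terms of code, but my comment above still stands. http://stackoverflow.com/questions/51571609/… and http://stackoverflow.com/questions/51852588/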

Answer 2: (score: 0)

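Thank you for your patience. I had to figure out how to get the HTML structure, tidy the HTML, and write it to a file so I could see the relationships better. The pages I need to process (I didn't write them) have the structure shown below. After building the bs4 structure, I found that the content I need starts at the <article..> tag and ends at the start of the next <script...> code here</script> <h3>Comments</h3>. I'm not sure how to end a search between two different tags. I was able to grab everything between the <h3> tag and the next <h3> tag, but that pulls in the <script> section, which I don't want. Thanks again for your continued help! -Meghan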
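For the structure shown below, one way to stop at whichever boundary comes first might be to walk the heading's siblings and break on a set of stop tags. This is only a sketch. The `page` string is a heavily shortened, hypothetical version of the markup, and `STOP_TAGS` and `chapter_text` are illustrative names, not Beautiful Soup APIs.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical, heavily shortened version of the page structure.
page = """
<article class="item-pageDarkening">
<h3>CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p>text here</p>
<span></span>
<p>dljlg</p>
<script type="text/javascript">Komento.ready(function($) {});</script>
<div id="section-kmt"><h3 class="kmt-title">Leave your comments</h3></div>
</article>
"""

# Assumption: either tag marks the end of the chapter text.
STOP_TAGS = {"script", "h3"}


def chapter_text(heading, stop_tags=STOP_TAGS):
    """Return the text of <p> siblings after `heading`, up to the first stop tag."""
    parts = []
    for sibling in heading.next_siblings:
        if not isinstance(sibling, Tag):
            continue  # ignore whitespace between tags
        if sibling.name in stop_tags:
            break  # reached the comment widget or the next heading
        if sibling.name == "p":
            parts.append(sibling.get_text())
    return parts


soup = BeautifulSoup(page, "html.parser")
article = soup.find("article")
heading = article.find("h3")
result = chapter_text(heading)
print(result)
```

Because the loop breaks at the first `<script>`, nothing from the comment section is collected. The `<span>` placeholders between paragraphs are skipped rather than treated as a boundary.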

....
<div id="rt-main" class="sa3-mb9">
                <div class="rt-container">
                    <div class="rt-grid-9 rt-push-3">
                                                                        <div class="rt-block">
                            <div id="rt-mainbody">
                                <div class="component-content">
                                    <article class="item-pageDarkening">
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p>&nbsp;</p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">text.. </span></p>
<p>&nbsp;</p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">text here</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"></span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>
<p>&nbsp;</p>
<p>dljlg</p>
<span></span>
<p>dljlg</p>
<span></span>
   <p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><em><span style="font-size: 16px; font-family: 'arial black', 'avant garde'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">&nbsp;</span></em></p> 

        <script type='text/javascript'>
Komento.ready(function($) {
    // declare master namespace variable for shared values
    Komento.component   = "com_content";
    Komento.cid         = "1211";
    Komento.contentLink = "...";
    Komento.sort        = "latest";
    Komento.loadedCount = parseInt(10);
    Komento.totalCount  = parseInt(56);

    if( Komento.options.konfig.enable_shorten_link == 0 ) {
        Komento.shortenLink = Komento.contentLink;
    }
});
</script>

<div id="section-kmt" class="theme-kuro">

    <script type="text/javascript">
    Komento.require()
    .library('dialog')
    .script(
        'komento.language',
        'komento.common',
        'komento.commentform'
    )
    .done(function($) {
        if($('.commentForm').exists()) {
            Komento.options.element.form = new Komento.Controller.CommentForm($('.commentForm'));
            Komento.options.element.form.kmt = Komento.options.element;
        }
    });

    </script>
    <div id="kmt-form" class="commentForm kmt-form clearfix">
                <a class="addCommentButton kmt-form-addbutton" href="javascript:void(0);"><b>Add comment</b></a>
                <div class="formArea kmt-form-area hidden">
            <h3 class="kmt-title">Leave your comments</h3>