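I am trying to use the findChildren() function. Basically, I want all the <p> tags under a specific <h3> tag. I am trying some simple code, but children comes back empty. h3 returns the correct line (see the print(h3) comment), and print(type(children)) prints the type <class 'bs4.element.ResultSet'>. Please tell me what I am doing wrong.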
import re
from bs4 import BeautifulSoup

# contents holds the HTML of the page being scraped
soup = BeautifulSoup(contents, 'html.parser')
h3 = soup.find('h3', text=re.compile('chapter', re.IGNORECASE))
print(h3) #result prints <h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
children = h3.findChildren('p')
print(type(children)) #returns type: <class 'bs4.element.ResultSet'>
I also tried h3.findChildren('p', Recursive=True) and children = h3.findChildren(Recursive=True). Both came back empty as well.
Here is the part of the HTML I am trying to scrape:
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
<p></p>
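Answer 0 (score: 0)
In the example you provided, the h3 node has no child nodes. All of the p nodes are outside its scope.
If you wrap the content in a div (for example), you can see that you are using the right technique: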
>>> soup = BeautifulSoup('<div>' + contents + '</div>', 'html.parser')
>>> div = soup.find('div')
>>> div.findChildren('p')
[<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>, <p> </p>]
>>>
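Edit
As you mentioned in the comments above, the h3 and p nodes are siblings in the content you provided. I'm not sure it makes sense for p elements to be children of an h3, but if they were, it would look like this: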
<h3>
This content is within the h3 tag
<p>this is a child of h3</p>
<p>another child</p>
</h3>
<p>this is not a child of h3 as it is after the h3 close tag</p>
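To make the parent/child versus sibling distinction concrete, here is a minimal sketch that parses the nested markup above. It assumes the 'html.parser' builder used in the question; the variable names are only illustrative.

```python
from bs4 import BeautifulSoup

# Illustrative markup: two <p> tags nested inside the <h3>, one after it.
html = """
<h3>
This content is within the h3 tag
<p>this is a child of h3</p>
<p>another child</p>
</h3>
<p>this is not a child of h3 as it is after the h3 close tag</p>
"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# find_all() (findChildren() is an alias) searches only the tag's descendants.
nested = h3.find_all("p")
# find_next_siblings() searches tags at the same level that come after the tag.
following = h3.find_next_siblings("p")

print([p.get_text() for p in nested])
print([p.get_text() for p in following])
```

Only the two nested paragraphs are reachable through findChildren(); the third one has to be reached as a sibling.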
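It isn't clear what the criteria are for selecting p nodes in the sample content. A simple soup.find_all('p') would return all of these tags, but I suspect you need to restrict it somehow to avoid picking up other content. Can you elaborate? You may just want something like this: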
>>> soup = BeautifulSoup(contents, 'html.parser')
>>> h3 = soup.find('h3')
>>> h3.find_next_sibling('p')
<p dir="ltr" style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;">
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">Stephanus Grayland did not try to hide his smile of satisfaction . He had “eaten” lunch, but now, he sensed, he would truly </span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">feast</span>
<span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span>
</p>
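If more than the first paragraph is wanted, the plural form of the same call may be enough. The sketch below uses shortened, made-up markup in place of the real page and compares find_next_sibling() with find_next_siblings(); note that the plural call does not stop at a later heading.

```python
from bs4 import BeautifulSoup

# Shortened, made-up markup: the <h3> and the <p> tags are siblings.
html = """
<h3>CHAPTER ONE</h3>
<p>first paragraph</p>
<p></p>
<p>second paragraph</p>
"""

soup = BeautifulSoup(html, "html.parser")
h3 = soup.find("h3")

# Singular: only the first <p> that follows the heading at the same level.
first = h3.find_next_sibling("p")
# Plural: every following <p> sibling, in document order (empty ones included).
all_following = h3.find_next_siblings("p")

print(first.get_text())
print([p.get_text() for p in all_following])
```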
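Answer 1 (score: 0)
Thanks to those who responded. My problem was that the <h3> and the <p> tags are siblings, not parent/child. I think these posts are what I'm after code-wise, though my comment above still stands: http://stackoverflow.com/questions/51571609/… and http://stackoverflow.com/questions/51852588/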
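Answer 2 (score: 0)
Thank you for your patience. I had to work out how to get the html structure, tidy the html, and write it to a file so I could see the relationships better. The pages I need to process (I did not write them) have a structure like the one below. After building the bs4 structure, I found that the content I need starts at the <article..> tag and ends at the start of the next <script...> code here </script> <h3>Comments</h3>. I'm not sure how to end a search between two different tags. I was able to grab everything between the <h3> tag and the next <h3> tag, but that pulls in the <script> sections I don't want. Thanks again for your continued help! -Meghan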
....
<div id="rt-main" class="sa3-mb9">
<div class="rt-container">
<div class="rt-grid-9 rt-push-3">
<div class="rt-block">
<div id="rt-mainbody">
<div class="component-content">
<article class="item-pageDarkening">
<h3 style="text-align: center;">CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p> </p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;">text.. </span></p>
<p> </p>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">text here</span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; font-style: italic; vertical-align: baseline; white-space: pre-wrap;"></span><span style="font-size: 16px; font-family: 'Times New Roman'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;">.</span></p>
<p> </p>
<p>dljlg</p>
<span></span>
<p>dljlg</p>
<span></span>
<p style="line-height: 1.15; margin-top: 0pt; margin-bottom: 0pt;" dir="ltr"><em><span style="font-size: 16px; font-family: 'arial black', 'avant garde'; background-color: transparent; vertical-align: baseline; white-space: pre-wrap;"> </span></em></p>
<script type='text/javascript'>
Komento.ready(function($) {
// declare master namespace variable for shared values
Komento.component = "com_content";
Komento.cid = "1211";
Komento.contentLink = "...";
Komento.sort = "latest";
Komento.loadedCount = parseInt(10);
Komento.totalCount = parseInt(56);
if( Komento.options.konfig.enable_shorten_link == 0 ) {
Komento.shortenLink = Komento.contentLink;
}
});
</script>
<div id="section-kmt" class="theme-kuro">
<script type="text/javascript">
Komento.require()
.library('dialog')
.script(
'komento.language',
'komento.common',
'komento.commentform'
)
.done(function($) {
if($('.commentForm').exists()) {
Komento.options.element.form = new Komento.Controller.CommentForm($('.commentForm'));
Komento.options.element.form.kmt = Komento.options.element;
}
});
</script>
<div id="kmt-form" class="commentForm kmt-form clearfix">
<a class="addCommentButton kmt-form-addbutton" href="javascript:void(0);"><b>Add comment</b></a>
<div class="formArea kmt-form-area hidden">
<h3 class="kmt-title">Leave your comments</h3>
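
One possible way to end the search at whichever of two tags comes first is to walk the siblings of the chapter <h3> and break on a stop tag. This is only a sketch on a reduced, hypothetical version of the structure above; STOP_TAGS is an assumed name, and it assumes the first <script> is a sibling of the paragraphs inside <article>.

```python
from bs4 import BeautifulSoup

# Reduced, hypothetical version of the page: only the tag layout matters.
html = """
<article class="item-pageDarkening">
<h3>CHAPTER ONE - STEPHANUS GRAYLAND</h3>
<p>text here</p>
<span></span>
<p>dljlg</p>
<script type="text/javascript">Komento.ready(function($) {});</script>
<div id="section-kmt"><h3 class="kmt-title">Leave your comments</h3></div>
</article>
"""

STOP_TAGS = {"h3", "script"}  # assumed end markers; adjust to the real page

soup = BeautifulSoup(html, "html.parser")
article = soup.find("article")
h3 = article.find("h3")

chapter = []
# With no arguments, find_next_siblings() yields the following sibling tags.
for sibling in h3.find_next_siblings():
    if sibling.name in STOP_TAGS:
        break  # reached the next heading or the comment script
    if sibling.name == "p":
        chapter.append(sibling.get_text())

print(chapter)
```

Because the loop breaks at the first <script>, nothing from the comment form is collected, and the empty <span> tags between paragraphs are skipped.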