Using BeautifulSoup, extract an element's tags except for specified ones

Asked: 2016-07-21 14:51:35

Tags: python web-scraping beautifulsoup

I'm using BeautifulSoup 4 and Python 3.5+ to extract web data. I have the following HTML, from which I am extracting:

<div class="the-one-i-want">
    <p>
        content
    </p>
    <p>
        content
    </p>
    <p>
        content
    </p>
    <p>
        content
    </p>
    <ol>
        <li>
            list item
        </li>
        <li>
            list item
        </li>
    </ol>
    <div class="something-i-dont-want">
        content
    </div>
    <script class="something-else-i-dont-want">
        script
    </script>
    <p>
        content
    </p>
</div>

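All of the content that I want to extract is found within the <div class="the-one-i-want"> element. Right now, I'm using the following approach, which works most of the time: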

from bs4 import BeautifulSoup

soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').find_all('p')

This excludes scripts, oddly inserted divs, and otherwise unpredictable content such as ads or 'recommended content' type material.

Now, in some instances there are elements other than the <p> tags whose content is contextually important to the main content, such as lists.

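Is there a way to get the content from the <div class="the-one-i-want"> in a manner such as this:

soup = BeautifulSoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').find_all(desired_content_elements)

where desired_content_elements would include every element that I deem fit for that particular content? For example, all <p> tags and all <ol> and <li> tags, but no <div> or <script> tags.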

Perhaps noteworthy is my method of saving the content:

content_string = ''
for p in content:
    content_string += str(p)

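This approach collects the data in order of occurrence, which would be difficult to manage if I simply found the different element types through separate iterations. If possible, I would like to avoid having to merge split lists back into the order in which each element originally occurred in the content.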

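3 answers:

Answer 0 (score: 1)

You can pass a list of the tags you want:

content = soup.find('div', class_='the-one-i-want').find_all(["p", "ol", "whatever"])

If we run something similar on the URL of your question, looking for p and pre tags, you can see that we get both: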

for ele in soup.select_one("td.postcell").find_all(["pre", "p"]):
    print(ele)

<p>I'm using Beutifulsoup 4 and Python 3.5+ to extract webdata. I have the following html, from which I am extracting:</p>
<pre><code>&lt;div class="the-one-i-want"&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
    &lt;ol&gt;
        &lt;li&gt;
            list item
        &lt;/li&gt;
        &lt;li&gt;
            list item
        &lt;/li&gt;
    &lt;/ol&gt;
    &lt;div class='something-i-don't-want&gt;
        content
    &lt;/div&gt;
    &lt;script class="something-else-i-dont-want'&gt;
        script
    &lt;/script&gt;
    &lt;p&gt;
        content
    &lt;/p&gt;
&lt;/div&gt;
</code></pre>
<p>All of the content that I want to extract is found within the <code>&lt;div class="the-one-i-want"&gt;</code> element. Right now, I'm using the following methods, which work most of the time:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll('p')
</code></pre>
<p>This excludes scripts, weird insert <code>div</code>'s and otherwise un-predictable content such as ads or 'recommended content' type stuff.</p>
<p>Now, there are some instances in which there are elements other than just the <code>&lt;p&gt;</code> tags, which has content that is contextually important to the main content, such as lists.</p>
<p>Is there a way to get the content from the <code>&lt;div class="the-one-i-want"&gt;</code> in a manner as such:</p>
<pre><code>soup = Beautifulsoup(html.text, 'lxml')
content = soup.find('div', class_='the-one-i-want').findAll(desired-content-elements)
</code></pre>
<p>Where <code>desired-content-elements</code>would be inclusive of every element that I deemed fit for that particular content? Such as, all <code>&lt;p&gt;</code> tags, all <code>&lt;ol&gt;</code> and <code>&lt;li&gt;</code> tags, but no <code>&lt;div&gt;</code> or <code>&lt;script&gt;</code> tags.</p>
<p>Perhaps noteworthy, is my method of saving the content:</p>
<pre><code>content_string = ''
for p in content:
    content_string += str(p)
</code></pre>
<p>This approach collects the data, in order of occurrence, which would prove difficult to manage if I simply found different element types through different iteration processes. I'm looking to NOT have to manage re-construction of split lists to re-assemble the order in which each element originally occurred in the content, if possible.</p>
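
As a self-contained illustration of passing a list of tag names to find_all, here is a minimal sketch. The sample markup is hypothetical and only stands in for the real page, and 'html.parser' is used instead of 'lxml' purely so the snippet has no extra dependency. It shows that the matches come back in document order, which is what the question's string-building loop relies on.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<div class="the-one-i-want">
  <p>first</p>
  <ol><li>item one</li><li>item two</li></ol>
  <div class="something-i-dont-want">advert</div>
  <script>var x = 1;</script>
  <p>last</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")

# A list of names matches any of them; results are in document order.
content = container.find_all(["p", "ol"])

# Same accumulation as in the question, written as a join.
content_string = "".join(str(tag) for tag in content)
names = [tag.name for tag in content]
print(names)
print(content_string)
```

Note that find_all searches recursively by default, so adding "li" to the list here would emit every list item twice: once inside the string form of its <ol> and once on its own.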

Answer 1 (score: 0)

Would this work for you? It should loop through, adding the content you want while ignoring div and script tags.

content_string = ''
for p in content:
    # Skip any element that contains a nested <div> or <script>
    if p.find('div') or p.find('script'):
        continue
    content_string += str(p)
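
For comparison, here is a sketch of a related exclusion-based variant. It is not the answer's exact code: instead of checking whether an element contains a div or script, it walks the direct child tags of the container and skips those whose own name is unwanted. The sample markup and the name unwanted are assumptions made for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup standing in for the real page.
html = """
<div class="the-one-i-want">
  <p>first</p>
  <ol><li>item one</li><li>item two</li></ol>
  <div class="something-i-dont-want">advert</div>
  <script>var x = 1;</script>
  <p>last</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")

unwanted = {"div", "script"}  # hypothetical exclusion set

content_string = ""
# True matches every tag; recursive=False limits the search to direct children.
for child in container.find_all(True, recursive=False):
    if child.name in unwanted:
        continue  # skip the element itself, not just elements containing it
    content_string += str(child)

print(content_string)
```

Because only direct children are visited, the <ol> is added once with its <li> items inside it, and the original order is preserved.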

Answer 2 (score: 0)

You can do this easily with filter:

soup = BeautifulSoup(html.text, 'lxml')
desired_tags = {'p', 'ol'}  # add what you need
content = filter(lambda x: x.name in desired_tags,
                 soup.find('div', class_='the-one-i-want').children)

This iterates over all direct children of the div tag. If you want to do this recursively (you mentioned adding li tags), use .descendants instead of .children. Happy crawling!
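
The difference between .children and .descendants can be seen in a small sketch. The markup and the names direct and nested are hypothetical, and an explicit isinstance check against Tag is used so that text nodes are skipped regardless of how they expose a name.

```python
from bs4 import BeautifulSoup
from bs4.element import Tag

# Hypothetical sample markup; the unwanted div contains a nested <p>.
html = """
<div class="the-one-i-want">
  <p>intro</p>
  <ol><li>item one</li><li>item two</li></ol>
  <div class="something-i-dont-want"><p>nested ad</p></div>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", class_="the-one-i-want")
desired_tags = {"p", "ol", "li"}

def wanted(el):
    # Keep only real tags whose own name is in the desired set.
    return isinstance(el, Tag) and el.name in desired_tags

# Direct children only: one level below the container.
direct = [el.name for el in container.children if wanted(el)]

# All descendants: every level, in document order.
nested = [el.name for el in container.descendants if wanted(el)]

print(direct)  # ['p', 'ol']
print(nested)  # ['p', 'ol', 'li', 'li', 'p']
```

Two caveats follow from the output: with .descendants the <li> items appear in addition to their parent <ol>, so concatenating str() of each match would duplicate them, and the <p> nested inside the unwanted div is picked up as well.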