How can I scrape post content from different blogs with a single BeautifulSoup function in Python?

Time: 2018-07-23 15:27:28

Tags: python web-scraping beautifulsoup

HTML of one of the posts from the first blog:

<div class="entry-content">
		<p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
		<div id="wordads-preview-parent" class="wpcnt">
			<div class="wpa">
				<span class="wpa-about">Advertisements</span>
				<div class="u">
					<div class="wpa-notice">
						<p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
						<p class="wpa-buttons">
							<a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
							<a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
						</p>
					</div>
				</div>
			</div>
		</div>

HTML of one of the posts from the second blog:

<div class="entry-content">
			<h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>		</div><!-- .entry-content -->
	</div><!-- .entry-body -->

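Please help me scrape only the post content from this HTML, using code that works for both of these posts and that I can also reuse for other blogs.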

1 Answer:

Answer 0 (score: 0):

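The main problem is removing the unwanted advertisements and banners. I wrote a simple function, scrap_data(): you pass it the HTML as a string and it returns the scraped post content: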

data_1 = """
<div class="entry-content">
        <p>We are under the same sky.</p>
<p>You and I.</p>
<p>I share the soul of earth with you,</p>
<p>to contribute a verse too.</p>
<p>I have words to give,</p>
<p>a smile to offer.</p>
<p>You are at your right place.</p>
<p>You live ,you stay ,you move ,you play.</p>
<p>May also have works to do and words to say.</p>
<p>We may cross each other or not.</p>
<p>But the thing is, we are here,</p>
<p>in this instant;So what, not so clear.</p>
<p>But the powerful play goes on,</p>
<p>for you may contribute a verse.</p>
        <div id="wordads-preview-parent" class="wpcnt">
            <div class="wpa">
                <span class="wpa-about">Advertisements</span>
                <div class="u">
                    <div class="wpa-notice">
                        <p>Occasionally, some of your visitors may see an advertisement here, <br />as well as a <a href="https://en.support.wordpress.com/cookie-widget/" target="_blank">Privacy & Cookies banner</a> at the bottom of the page.<br/>You can hide ads completely by upgrading to one of our paid plans.</p>
                        <p class="wpa-buttons">
                            <a class="wpa-button is-primary" id="wordads-preview-more" href="https://wordpress.com/plans/141006071/?feature=no-adverts&utm_campaign=removeadsnotive" rel="nofollow" target="_blank">Upgrade now</a>
                            <a class="wpa-button" id="wordads-preview-dismiss" href="#">Dismiss message</a>
                        </p>
                    </div>
                </div>
            </div>
        </div>"""

data_2 = """
<div class="entry-content">
            <h2><span style="color:#000000;">There are lessons which aren&#8217;t taught</span></h2>
<h2><span style="color:#000000;">Everything black isn&#8217;t always dark<img data-attachment-id="38" data-permalink="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/ea530f2a5c6b48821056deb178ed1747/" data-orig-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg" data-orig-size="500,379" data-comments-opened="1" data-image-meta="{&quot;aperture&quot;:&quot;0&quot;,&quot;credit&quot;:&quot;&quot;,&quot;camera&quot;:&quot;&quot;,&quot;caption&quot;:&quot;&quot;,&quot;created_timestamp&quot;:&quot;0&quot;,&quot;copyright&quot;:&quot;&quot;,&quot;focal_length&quot;:&quot;0&quot;,&quot;iso&quot;:&quot;0&quot;,&quot;shutter_speed&quot;:&quot;0&quot;,&quot;title&quot;:&quot;&quot;,&quot;orientation&quot;:&quot;0&quot;}" data-image-title="ea530f2a5c6b48821056deb178ed1747" data-image-description="" data-medium-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" data-large-file="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=490" class="alignright  wp-image-38" src="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&#038;h=248" alt="ea530f2a5c6b48821056deb178ed1747" width="328" height="248" srcset="https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=328&amp;h=248 328w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=150&amp;h=114 150w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg?w=300&amp;h=227 300w, https://awistfulwind.files.wordpress.com/2017/04/ea530f2a5c6b48821056deb178ed1747.jpg 500w" sizes="(max-width: 328px) 100vw, 328px" /></span></h2>
<h2><span style="color:#000000;">Everything you love isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you need isn&#8217;t always desired</span></h2>
<h2><span style="color:#000000;">Everything you look isn&#8217;t always watched</span></h2>
<h2><span style="color:#000000;">And everything you do isn&#8217;t always what u did.</span></h2>
<h2><span style="color:#ff0000;">REMEMBER!!!!!</span></h2>
<div id="jp-post-flair" class="sharedaddy sd-like-enabled sd-sharing-enabled"><div class="sharedaddy sd-sharing-enabled"><div class="robots-nocontent sd-block sd-social sd-social-icon-text sd-sharing"><h3 class="sd-title">Share this:</h3><div class="sd-content"><ul><li class="share-press-this"><a rel="nofollow" data-shared="" class="share-press-this sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=press-this" rel="noopener noreferrer" target="_blank" title="Click to Press This!"><span>Press This</span></a></li><li class="share-twitter"><a rel="nofollow" data-shared="sharing-twitter-27" class="share-twitter sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=twitter" rel="noopener noreferrer" target="_blank" title="Click to share on Twitter"><span>Twitter</span></a></li><li class="share-facebook"><a rel="nofollow" data-shared="sharing-facebook-27" class="share-facebook sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=facebook" rel="noopener noreferrer" target="_blank" title="Click to share on Facebook"><span>Facebook</span></a></li><li class="share-google-plus-1"><a rel="nofollow" data-shared="sharing-google-27" class="share-google-plus-1 sd-button share-icon" href="https://awistfulwind.wordpress.com/2017/04/09/a-deeper-perspective/?share=google-plus-1" rel="noopener noreferrer" target="_blank" title="Click to share on Google+"><span>Google</span></a></li><li class="share-end"></li></ul></div></div></div><div class='sharedaddy sd-block sd-like jetpack-likes-widget-wrapper jetpack-likes-widget-unloaded' id='like-post-wrapper-127135943-27-5b54d1ab0f8b1' data-src='//widgets.wp.com/likes/index.html?ver=20180319#blog_id=127135943&amp;post_id=27&amp;origin=awistfulwind.wordpress.com&amp;obj_id=127135943-27-5b54d1ab0f8b1' data-name='like-post-frame-127135943-27-5b54d1ab0f8b1'><h3 class='sd-title'>Like this:</h3><div 
class='likes-widget-placeholder post-likes-widget-placeholder' style='height: 55px;'><span class='button'><span>Like</span></span> <span class="loading">Loading...</span></div><span class='sd-text-color'></span><a class='sd-link-color'></a></div></div>        </div><!-- .entry-content -->
    </div><!-- .entry-body -->"""

from bs4 import BeautifulSoup

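def scrap_data(data):
    """Return only the post text from a WordPress post's HTML string."""
    # The 'lxml' parser requires the lxml package to be installed.
    soup = BeautifulSoup(data, 'lxml')
    # Empty the WordAds preview block (the advertisement notice).
    for div in soup.select('div#wordads-preview-parent'):
        div.clear()
    # Empty the sharing/like widgets that follow the post.
    for div in soup.select('div#jp-post-flair'):
        div.clear()
    # What remains inside .entry-content is the post text itself.
    return soup.select_one('.entry-content').text.strip()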

print(scrap_data(data_1))
print('-' * 80)
print(scrap_data(data_2))
print('-' * 80)

Prints:

We are under the same sky.
You and I.
I share the soul of earth with you,
to contribute a verse too.
I have words to give,
a smile to offer.
You are at your right place.
You live ,you stay ,you move ,you play.
May also have works to do and words to say.
We may cross each other or not.
But the thing is, we are here,
in this instant;So what, not so clear.
But the powerful play goes on,
for you may contribute a verse.
--------------------------------------------------------------------------------
There are lessons which aren’t taught
Everything black isn’t always dark
Everything you love isn’t always desired
Everything you need isn’t always desired
Everything you look isn’t always watched
And everything you do isn’t always what u did.
REMEMBER!!!!!
--------------------------------------------------------------------------------
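
Since the question also asks about reusing this on other blogs, here is a minimal sketch of a more configurable variant. It is not part of the original answer: the name extract_post_text, the UNWANTED_SELECTORS tuple, and the shortened sample string are all hypothetical, and it assumes other blogs also wrap the post in an element matching .entry-content. It uses the built-in html.parser instead of lxml, removes the unwanted blocks with decompose(), and returns an empty list instead of raising when the container is missing.

```python
from bs4 import BeautifulSoup

# Hypothetical default list: the two widget containers seen in the sample posts.
# Other blogs or themes may need additional selectors.
UNWANTED_SELECTORS = ("div#wordads-preview-parent", "div#jp-post-flair")


def extract_post_text(html, content_selector=".entry-content",
                      unwanted=UNWANTED_SELECTORS):
    """Return the post text as a list of non-empty lines ([] if no container)."""
    soup = BeautifulSoup(html, "html.parser")
    content = soup.select_one(content_selector)
    if content is None:
        # The page does not use the expected container; nothing to extract.
        return []
    # Remove each unwanted block (ads, share buttons) from the tree entirely.
    for selector in unwanted:
        for tag in content.select(selector):
            tag.decompose()
    # Split the remaining text into lines and drop the blank ones.
    return [line.strip() for line in content.get_text().splitlines() if line.strip()]


# Shortened, made-up sample that mimics the structure of the two posts.
sample = """
<div class="entry-content">
<p>You and I.</p>
<div id="wordads-preview-parent" class="wpcnt"><span>Advertisements</span></div>
<h2><span>REMEMBER!!!!!</span></h2>
<div id="jp-post-flair"><h3>Share this:</h3></div>
</div>
"""

print(extract_post_text(sample))
```

Passing the selectors as parameters means a blog with a different theme only needs a different argument rather than a new function. The line splitting relies on the newlines already present between block elements in the source HTML, as in the two posts above.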