Using BeautifulSoup to extract multiple posts from a single blog archive page, without the scripts

Asked: 2014-07-01 03:41:19

Tags: python html python-2.7 html-parsing beautifulsoup

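I'm trying to scrape the author, title, date, and post content from a series of WordPress and Blogger blog archive pages. I've saved the pages locally so I'm not repeatedly pinging the server. I have the other parts working, but I can't seem to get all the posts from a page without also getting the "add-to-any" or "sociable" bits from the bottom, or other messy scripts. Here's where I am.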

import urllib2
from bs4 import BeautifulSoup
import re

file_list = open("hafiles.txt", "r")
posts_file = open("haposts.txt", "w")


for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl, "r") as ha_file:
        soup_ha = BeautifulSoup(ha_file)

    #works the second find gets rid of the sociable crap
    # this is the way it looks on the page <div class='post-body'>

    posts = soup_ha.find("div", class_="post-body").find_all("p")


    #tried a trick i saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
    #no joy
    #posts = soup_ha.find("div", class_="post-body")
    #text = [''.join(s.findAll(text=True))for s in posts.findAll('p')] 
    text = str(posts) + "\n" + "\n"
    posts_file.write (text)

print ("All done!")



file_list.close()
posts_file.close()

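So if I do a find_all and get all the posts (I'm not even sure I'm actually getting all of them), I also get the scripts. If I only use find, I can get one nice clean post in at least two ways. I have a list of files, and each file has several posts to extract. I've searched here on Stack Overflow and around the web.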

ETA: the input is a very messy web page with a lot of script at the top, all the CSS definitions for the page, and then:

<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
          var permlink='edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
          var permlink='http://edit';
          var title='edit';

          var spans = document.getElementsByTagName('span');
          var number = 0;
          for(i=0; i <spans.length; i++){
                var c = " " + spans[i].className + " ";
                if (c.indexOf("fullpost") != -1) {
                number++;
                }
                }

                if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
           memory = number;
           </script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>

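Phew! So I have maybe 20 or so files, each with 1 to 10 posts (this one has 2). What would be lovely is a CSV or Excel file with the columns

date author title postcontent

and one row per post. I'd settle for a file containing just the post content, with some blank space between posts. I'm fine with keeping some links, bold text, lists, and so on in the text, but I don't want all the messy scripts. Thanks.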

1 Answer:

Answer 0 (score: 1)

Here is an example for a single page that contains multiple posts:

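from bs4 import BeautifulSoup

# Name the parser explicitly so the result does not depend on what is installed.
with open('test.html') as f:
    soup = BeautifulSoup(f, 'html.parser')

posts = []
# class_='post' matches every <div> whose class list contains "post".
for post in soup.find_all('div', class_='post'):
    heading = post.find('h3', class_='post-title')
    if heading is None:
        # The posted HTML is cut off inside a third post; skip incomplete ones.
        continue
    title = heading.text.strip()
    author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
    # Only the first <p> of the body is read, which leaves the <script> out.
    content = post.find('div', class_='post-body').p.text.strip()
    # The date is in the <h2> that precedes the post <div>.
    date = post.find_previous_sibling('h2', class_='date-header').text.strip()

    posts.append({'title': title,
                  'author': author,
                  'content': content,
                  'date': date})
print(posts)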

For the HTML you posted, it prints:

[{'content': u'edit this is post text - what i want', 
  'date': u'27 February, 2007', 
  'author': u'this is the author name, also want, have way to get', 
  'title': u'edit'}, 
 {'content': u'edit post text, what I want.', 
  'date': u'26 February, 2007', 
  'author': u'edit author name', 
  'title': u'edit'}]
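
The question also asks for the whole post body (not just the first paragraph) and for one row per post in a CSV file. The following is a minimal sketch of one way that might be done, not a tested solution for the real archive pages. It assumes Python 3 and the same Blogger markup shown in the question; the names `extract_posts`, `write_csv`, `FIELDS`, and the `SAMPLE` HTML are made up for illustration. It drops `<script>` and `<style>` tags from each `post-body` before reading the text, and it skips any post that is missing one of the four fields.

```python
import csv
import io

from bs4 import BeautifulSoup

# Column order requested in the question (hypothetical constant name).
FIELDS = ["date", "author", "title", "content"]


def extract_posts(html):
    """Return one dict per complete post found in a Blogger-style archive page."""
    soup = BeautifulSoup(html, "html.parser")
    posts = []
    for post in soup.find_all("div", class_="post"):
        title = post.find("h3", class_="post-title")
        author = post.find("span", class_="post-author")
        body = post.find("div", class_="post-body")
        date = post.find_previous_sibling("h2", class_="date-header")
        # Assumed policy: skip posts that are cut off or missing a field.
        if any(tag is None for tag in (title, author, body, date)):
            continue
        # Remove inline scripts and styles so only readable content remains.
        for junk in body.find_all(["script", "style"]):
            junk.decompose()
        author_text = author.get_text(" ", strip=True)
        posts.append({
            "date": date.get_text(strip=True),
            "author": author_text.replace("Posted by", "", 1).strip(),
            "title": title.get_text(strip=True),
            # Join all remaining text in the body, one space between pieces.
            "content": body.get_text(" ", strip=True),
        })
    return posts


def write_csv(posts, handle):
    """Write the posts to an open text handle, one row per post."""
    writer = csv.DictWriter(handle, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(posts)


# Made-up sample: one complete post and one truncated post.
SAMPLE = """
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<h3 class='post-title'><a href='#'>First title</a></h3>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>First post text.</p>
<p>Second paragraph.</p>
<script type='text/javascript'>var title='x';</script>
</div>
<div class='post-footer'><span class='post-author'>
Posted by
Some Author
</span></div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='cut-off'></a>
</div>
</div>
"""

posts = extract_posts(SAMPLE)
out = io.StringIO()
write_csv(posts, out)
print(out.getvalue())
```

To cover every saved page, the same function could be called once per file name read from hafiles.txt, collecting the results into one list before writing. When writing to a real file in Python 3, the csv module documentation recommends opening it with `newline=''`. If links, bold text, and lists should survive, `body.decode_contents()` could replace the `get_text` call for the content field, since it returns the inner HTML left after the scripts are removed.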