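I'm trying to scrape the author, title, date and post content from a series of WordPress and Blogger blog archive pages. I've saved the pages locally so I'm not pinging the server repeatedly. I've gotten the other pieces working, but I can't seem to get all the posts out of a page without also getting the "add-to-any" or "sociable" widgets from the bottom, or other messy scripts. Here's where I am: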
import urllib2
from bs4 import BeautifulSoup
import re

file_list = open("hafiles.txt", "r")
posts_file = open("haposts.txt", "w")

for indurl in file_list:
    indurl = indurl.rstrip("\n")
    with open(indurl, "r") as ha_file:
        soup_ha = BeautifulSoup(ha_file)
        # works; the second find gets rid of the sociable crap
        # this is the way it looks on the page: <div class='post-body'>
        posts = soup_ha.find("div", class_="post-body").find_all("p")
        # tried a trick I saw on http://stackoverflow.com/questions/24458353/cleaning-text-string-after-getting-body-text-using-beautifulsoup
        # no joy
        #posts = soup_ha.find("div", class_="post-body")
        #text = [''.join(s.findAll(text=True))for s in posts.findAll('p')]
        text = str(posts) + "\n" + "\n"
        posts_file.write(text)

print("All done!")
file_list.close()
posts_file.close()
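So if I do a find_all and get all the posts (I'm not even sure I'm actually getting all of them), I also get the scripts. If I only use find, I can get clean posts in at least two ways. I have a list of files, and each file contains several posts to extract. I've searched here on Stack Overflow and on the web.
ETA: the input is a very messy web page, with a lot of script at the top, all the CSS definitions for the page, and then: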
<div id='main-wrapper'>
<div class='main section' id='main'><div class='widget Blog' id='Blog1'>
<div class='blog-posts'>
<h2 class='date-header'>27 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
<h3 class='post-title'>
<a href='http:// edited for anon.html'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit this is post text - what i want</p>
<script type='text/javascript'>
var permlink='edit';
var title='edit';
var spans = document.getElementsByTagName('span');
var number = 0;
for(i=0; i <spans.length; i++){
var c = " " + spans[i].className + " ";
if (c.indexOf("fullpost") != -1) {
number++;
}
}
if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
memory = number;
</script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
this is the author name, also want, have way to get
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='http://edit' title='permanent link'>2:53 pm</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>1 comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edi</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>26 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='5518681505930320089'></a>
<h3 class='post-title'>
<a href='edit'>edit</a>
</h3>
<div class='post-header-line-1'></div>
<div class='post-body'>
<style>span.fullpost{display:none;}</style>
<p>edit post text, what I want.</p>
<script type='text/javascript'>
var permlink='http://edit';
var title='edit';
var spans = document.getElementsByTagName('span');
var number = 0;
for(i=0; i <spans.length; i++){
var c = " " + spans[i].className + " ";
if (c.indexOf("fullpost") != -1) {
number++;
}
}
if(number != memory){document.write('<p></p><a href=' + permlink + '>"'+ title + '" continues...</a>') }
memory = number;
</script>
<div style='clear: both;'></div>
</div>
<div class='post-footer'>
<p class='post-footer-line post-footer-line-1'>
<span class='post-author'>
Posted by
edit author name
</span>
<span class='post-timestamp'>
at
<a class='timestamp-link' href='edit' title='permanent link'>9:00 am</a>
</span>
<span class='post-comment-link'>
<a class='comment-link' href='edit' onclick=''>5
comments</a>
</span>
<span class='post-backlinks post-comment-link'>
<a class='comment-link' href='edit'>Links to this post</a>
</span>
<span class='post-icons'>
<span class='item-control blog-admin pid-edit'>
<a href='edit' title='Edit Post'>
<img alt='' class='icon-action' height='18' src='http://img2.blogblog.com/img/icon18_edit_allbkg.gif' width='18'/>
</a>
</span>
</span>
</p>
<p class='post-footer-line post-footer-line-2'>
<span class='post-labels'>
Labels:
<a href='edit' rel='tag'>edit</a>,
<a href='edit' rel='tag'>edit</a>
</span>
</p>
<p class='post-footer-line post-footer-line-3'></p>
</div>
</div>
<h2 class='date-header'>22 February, 2007</h2>
<div class='post uncustomized-post-template'>
<a name='edit'></a>
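Phew! So I have maybe 20 or so files, each with 1 to 10 posts in it (this one has 2). Ideally I'd get a CSV or Excel file with date, author, title and post content in columns, one row per post.
I would settle for a file with just the post content, with some blank space between posts. I'm fine with keeping some links in the text and some bold, lists and so on, but I don't want all the messy scripts. Thanks.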
Answer 0 (score: 1)
Here is an example for a single page that contains multiple posts:
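from bs4 import BeautifulSoup

# Naming the parser explicitly avoids bs4's "no parser specified" warning.
soup = BeautifulSoup(open('test.html'), 'html.parser')

posts = []
# Each post sits in its own <div class='post ...'>.
for post in soup.find_all('div', class_='post'):
    title = post.find('h3', class_='post-title').text.strip()
    author = post.find('span', class_='post-author').text.replace('Posted by', '').strip()
    # Only the first <p> of the body is read, which skips the inline <script>.
    content = post.find('div', class_='post-body').p.text.strip()
    # The date header is the nearest <h2> before the post, at the same level.
    date = post.find_previous_sibling('h2', class_='date-header').text.strip()
    posts.append({'title': title,
                  'author': author,
                  'content': content,
                  'date': date})

print(posts)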
For the HTML you posted, it prints:
[{'content': u'edit this is post text - what i want',
'date': u'27 February, 2007',
'author': u'this is the author name, also want, have way to get',
'title': u'edit'},
{'content': u'edit post text, what I want.',
'date': u'26 February, 2007',
'author': u'edit author name',
'title': u'edit'}]
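The question also asks for every paragraph of each post (not only the first), without the inline scripts, written as one CSV row per post across a list of saved files. The following is a minimal sketch of one way to do that, not a tested solution for the real pages. It assumes Python 3, the built-in `html.parser`, and UTF-8 files; the function names `extract_posts`, `write_csv` and `scrape_files` are made up for this illustration, and the file names passed to `scrape_files` would be the asker's own.

```python
import csv

from bs4 import BeautifulSoup

FIELDS = ["date", "author", "title", "content"]


def extract_posts(html):
    """Return one dict per post found in a Blogger-style archive page."""
    soup = BeautifulSoup(html, "html.parser")
    rows = []
    for post in soup.find_all("div", class_="post"):
        body = post.find("div", class_="post-body")
        if body is None:
            continue  # e.g. a post that was cut off in the saved file
        # Drop inline <script>/<style> blocks before reading the text.
        for junk in body.find_all(["script", "style"]):
            junk.decompose()
        # Keep every non-empty line of the body, one per line.
        lines = (line.strip() for line in body.get_text().splitlines())
        content = "\n".join(line for line in lines if line)

        title = post.find("h3", class_="post-title")
        author = post.find("span", class_="post-author")
        date = post.find_previous_sibling("h2", class_="date-header")
        author_text = author.get_text().replace("Posted by", "") if author else ""
        rows.append({
            "date": date.get_text(strip=True) if date else "",
            "author": " ".join(author_text.split()),  # collapse whitespace
            "title": title.get_text(strip=True) if title else "",
            "content": content,
        })
    return rows


def write_csv(rows, out):
    """Write the post dicts to an open text file as CSV with a header row."""
    writer = csv.DictWriter(out, fieldnames=FIELDS)
    writer.writeheader()
    writer.writerows(rows)


def scrape_files(list_path, csv_path):
    """Read one saved-page path per line from list_path; write all posts to csv_path."""
    rows = []
    with open(list_path) as file_list:
        for line in file_list:
            path = line.strip()
            if not path:
                continue
            # Assumption: pages are UTF-8; undecodable bytes are replaced.
            with open(path, encoding="utf-8", errors="replace") as page:
                rows.extend(extract_posts(page.read()))
    with open(csv_path, "w", newline="", encoding="utf-8") as out:
        write_csv(rows, out)
```

With the file names from the question, this could be called as `scrape_files("hafiles.txt", "haposts.csv")`. This version keeps only plain text; to preserve links, bold and lists, one option would be to store `body.decode_contents()` after the `decompose()` loop instead of the extracted text.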