import urllib2,sys
from bs4 import BeautifulSoup,NavigableString
obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
obama_4427_html = urllib2.urlopen(obama_4427_url).read()
obama_4427_soup = BeautifulSoup(obama_4427_html)
# find the speech itself within the HTML
obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})
# convert soup to string for easier processing
obama_4427_str = str(obama_4427_div)
# list of characters to be removed from obama_4427_str
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
remove_char
for char in obama_4427_str:
    if char in obama_4427_str:
        obama_4427_replace = obama_4427_str.replace(remove_char,'')
obama_4427_replace = obama_4427_str.replace(remove_char,'')
print(obama_4427_replace)
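Using BeautifulSoup, I scraped one of Obama's speeches from the website above. Now I need to replace some residual HTML in an efficient way. I have stored the list of elements I want to delete in remove_char. I tried to write a simple for statement, but I get the error: TypeError: expected a character buffer object. I know this is a beginner's question, but how can I get around this?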
Answer 0 (score: 1)
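Since you are already using BeautifulSoup, you can use obama_4427_div.text directly instead of str(obama_4427_div) to get the properly formatted text. The text you get back will not contain any residual HTML elements.
Example: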
>>> obama_4427_div = obama_4427_soup.find('div', {'id': 'transcript'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)
Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;
With profound gratitude and great humility, I accept your nomination for the presidency of the United States.
Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
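As a small self-contained illustration of the same idea, here is a sketch that runs without fetching the page. The HTML snippet is made up for the example (it only mimics the tags listed in remove_char), and it assumes bs4 is installed and that the built-in html.parser is acceptable. It uses get_text() with a separator so that text from adjacent tags does not run together.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page, not the real transcript.
html = (
    '<div class="indent" id="transcript">'
    '<h2>Transcript</h2>'
    '<p>First line.<br/>Second line.</p>'
    '</div>'
)

soup = BeautifulSoup(html, "html.parser")

# Look the div up by its id only.
div = soup.find("div", id="transcript")

# separator puts a newline between text fragments; strip drops
# surrounding whitespace and empty fragments.
text = div.get_text(separator="\n", strip=True)
print(text)
```

Plain .text is equivalent to get_text() with no arguments, so it joins the fragments with nothing in between; the separator is only needed when the tags were the only thing dividing the text.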
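For completeness: to remove elements from a string, I would create a list of the elements to remove (like the remove_char list you created) and then call str.replace() on the string once for each element in the list. Your original code fails because str.replace() expects a single string as its first argument, not a list. Example: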
obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
for char in remove_char:
    obama_4427_str = obama_4427_str.replace(char,'')
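To make the difference concrete, the sketch below reproduces the failure and the fix on a short made-up string (the raw value is an assumption for illustration, not the real page). It also shows one possible single-pass alternative using re, which is my own addition rather than part of the answer above.

```python
import re

# Hypothetical sample input that mimics the leftover markup.
raw = ('<div class="indent" id="transcript"><h2>Transcript</h2>'
       '<p>Thank you.<br/>God bless.</p></div>')
remove_char = ['<br/>', '</p>', '</div>',
               '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']

# Passing the whole list to str.replace() raises TypeError,
# because the first argument must be a single string.
try:
    raw.replace(remove_char, '')
    failed = False
except TypeError:
    failed = True

# Fix: loop over the list and replace one fragment at a time,
# feeding each result into the next replacement.
cleaned = raw
for fragment in remove_char:
    cleaned = cleaned.replace(fragment, '')

# Alternative: build one pattern from the escaped fragments and
# remove them all in a single pass.
pattern = re.compile('|'.join(re.escape(f) for f in remove_char))
cleaned_once = pattern.sub('', raw)

print(failed)
print(cleaned)
print(cleaned_once)
```

Note that both variants glue neighbouring text together once the tags are gone, which is one more reason to prefer the BeautifulSoup text extraction when the soup object is already available.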