How do I remove items from a string if they belong to a list of strings?

Asked: 2015-09-15 19:38:58

Tags: python string list replace beautifulsoup

 import urllib2,sys
 from bs4 import BeautifulSoup,NavigableString

 obama_4427_url = 'http://www.millercenter.org/president/obama/speeches/speech-4427'
 obama_4427_html = urllib2.urlopen(obama_4427_url).read()

 obama_4427_soup = BeautifulSoup(obama_4427_html)

 # find the speech itself within the HTML

 obama_4427_div = obama_4427_soup.find('div',{'id': 'transcript'},{'class': 'displaytext'})

 # convert soup to string for easier processing

 obama_4427_str = str(obama_4427_div)

 # list of characters to be removed from obama_4427_str

 remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
 remove_char


 for char in obama_4427_str:
     if char in obama_4427_str:
         obama_4427_replace = obama_4427_str.replace(remove_char,'')


 obama_4427_replace = obama_4427_str.replace(remove_char,'')

 print(obama_4427_replace)

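Using BeautifulSoup, I scraped one of Obama's speeches from the website above. Now I need an efficient way to strip out the leftover HTML. I have stored the elements I want to remove in the list remove_char. I tried writing a simple for statement, but I get the error: TypeError: expected a character object buffer. I know this is a beginner question, but how can I fix it?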

1 answer:

Answer 0: (score: 1)

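Since you are already using BeautifulSoup, you can use obama_4427_div.text instead of str(obama_4427_div) to get properly formatted text directly. The text you get back will not contain any leftover HTML elements.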

Example -

>>> # find()'s third positional parameter is `recursive`, so only the id filter is passed here
>>> obama_4427_div = obama_4427_soup.find('div', {'id': 'transcript'})
>>> obama_4427_str = obama_4427_div.text
>>> print(obama_4427_str)

Transcript
To Chairman Dean and my great friend Dick Durbin; and to all my fellow citizens of this great nation;

With profound gratitude and great humility, I accept your nomination for the presidency of the United States.

Let me express my thanks to the historic slate of candidates who accompanied me on this ...
...
...
...
Thank you, God Bless you, and God Bless the United States of America.
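
The same idea can be tried without downloading the page. The snippet below is a minimal sketch, assuming a small hand-written HTML fragment (`sample_html` is a hypothetical stand-in, not the real transcript markup) and Python's built-in `html.parser` backend. It uses `get_text()`, the method form of `.text`, whose `separator` and `strip` arguments help keep line breaks between paragraphs.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the downloaded page; not the real transcript markup.
sample_html = (
    '<div class="indent" id="transcript">'
    '<h2>Transcript</h2>'
    '<p>First paragraph.<br/>Second line.</p>'
    '</div>'
)

# Name the parser explicitly so the result does not depend on which parsers are installed.
soup = BeautifulSoup(sample_html, "html.parser")
transcript_div = soup.find("div", id="transcript")

# get_text() returns only the text nodes; each one is stripped and joined with a newline.
plain_text = transcript_div.get_text(separator="\n", strip=True)
print(plain_text)
```

Because the tags are dropped by the parser rather than by string matching, this approach does not need a list of tags to remove.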

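For completeness: the TypeError occurs because str.replace() expects a string as its first argument, but remove_char is a list. To remove several elements from a string, I would create a list of the elements to remove (like your remove_char list) and then call str.replace() on the string once for each element in the list. Example -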

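obama_4427_str = str(obama_4427_div)
remove_char = ['<br/>','</p>','</div>','<div class="indent" id="transcript">','<h2>','</h2>','<p>']
# str.replace() accepts one string at a time, so loop over the list
for char in remove_char:
    # reassign on every pass so each removal builds on the previous one
    obama_4427_str = obama_4427_str.replace(char,'')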
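
The loop above can be checked on a short string without fetching the page. This is a minimal sketch, assuming a hand-written `sample` string and a helper name `remove_fragments` (both hypothetical, not from the original post). It also shows that passing the whole list to `str.replace()` raises a TypeError, which is the mistake in the question's code; the exact wording of the message differs between Python versions.

```python
def remove_fragments(text, fragments):
    """Return text with every string in fragments removed."""
    for fragment in fragments:
        # str.replace() takes a single string, so apply it once per fragment
        text = text.replace(fragment, '')
    return text


# Hypothetical sample markup, shaped like the tags listed in remove_char.
sample = ('<div class="indent" id="transcript"><h2>Transcript</h2>'
          '<p>Thank you.<br/>God bless.</p></div>')
remove_char = ['<br/>', '</p>', '</div>',
               '<div class="indent" id="transcript">', '<h2>', '</h2>', '<p>']

# Passing the list itself reproduces the kind of error from the question.
raised = False
try:
    sample.replace(remove_char, '')
except TypeError as exc:
    raised = True
    print('TypeError:', exc)

cleaned = remove_fragments(sample, remove_char)
print(cleaned)
```

Note that plain string replacement only removes tags that match the listed strings exactly, so a tag with different attributes or spacing would be left in place.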