HTML stripper raises an error

Time: 2015-01-12 14:17:16

Tags: python html stripping

I'm currently stripping the HTML out of some text like the following:

<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>

The stripper code I'm using is as follows:

from HTMLParser import HTMLParser

class MLStripper(HTMLParser):
    def __init__(self):
        self.reset()
        self.fed = []
    def handle_data(self, d):
        self.fed.append(d)
    def get_data(self):
        return ''.join(self.fed)

def strip_tags(html):
    s = MLStripper()
    print html
    s.feed(html)
    return s.get_data()

When I try to strip the tags from the paragraph above, I run into a problem:

para = strip_tags(paragraph)

---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-97-0f8917286c8e> in <module>()
      2 for key, val in film_links.items():
      3     paragraph = get_description_from_url( val, key)
----> 4     para = strip_tags(paragraph)
      5     film_genre_with_des.append([key, val, para])

<ipython-input-91-0c0e68f587c6> in strip_tags(html)
     13     s = MLStripper()
     14     print html
---> 15     s.feed(html)
     16     return s.get_data()

/Users/ruby/anaconda/lib/python2.7/HTMLParser.pyc in feed(self, data)
    114         as you want (may include '\n').
    115         """
--> 116         self.rawdata = self.rawdata + data
    117         self.goahead(0)
    118 

TypeError: cannot concatenate 'str' and 'Tag' objects

I'm not sure why this doesn't work. This is on Python 2.7, which is the version I'm using.

1 answer:

Answer 0 (score: 1)

Alternatively, you can use the BeautifulSoup HTML parser and simply get the text:

from bs4 import BeautifulSoup

data = '<p><b>Masala</b> films of <a href="/wiki/Cinema_of_India" title="Cinema of India">Indian cinema</a> are those that mix genres in one work. Typically these films freely mix <a href="/wiki/Action_film" title="Action film">action</a>, <a href="/wiki/Comedy_film" title="Comedy film">comedy</a>, <a href="/wiki/Romance_film" title="Romance film">romance</a>, and <a href="/wiki/Drama_film" title="Drama film">drama</a> or <a href="/wiki/Melodrama" title="Melodrama">melodrama</a>.<sup class="reference" id="cite_ref-Ganti2004_1-0"><a href="#cite_note-Ganti2004-1"><span>[</span>1<span>]</span></a></sup> They tend to be <a href="/wiki/Musical_film" title="Musical film">musicals</a> that include songs filmed in picturesque locations. The genre is named after the <a href="/wiki/Spice_mix" title="Spice mix">masala</a>, a mixture of <a href="/wiki/Spice" title="Spice">spices</a> in <a href="/wiki/Indian_cuisine" title="Indian cuisine">Indian cuisine</a>.<sup class="reference" id="cite_ref-2"><a href="#cite_note-2"><span>[</span>2<span>]</span></a></sup> According to <i><a href="/wiki/The_Hindu" title="The Hindu">The Hindu</a></i>, masala is the most popular genre of Indian cinema.<sup class="reference" id="cite_ref-3"><a href="#cite_note-3"><span>[</span>3<span>]</span></a></sup></p>'

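# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(data, 'html.parser')
print soup.get_text()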

Prints:

Masala films of Indian cinema are those that mix genres in one work. Typically these films freely mix action, comedy, romance, and drama or melodrama.[1] They tend to be musicals that include songs filmed in picturesque locations. The genre is named after the masala, a mixture of spices in Indian cuisine.[2] According to The Hindu, masala is the most popular genre of Indian cinema.[3]
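
As for the original error: the traceback shows `feed()` doing `self.rawdata + data`, and the message mentions a `Tag` object, which suggests `paragraph` is a parsed element (probably a BeautifulSoup `Tag` returned by `get_description_from_url`, though that is an assumption) rather than a string. A minimal sketch of keeping the original stripper and converting the input to markup text before feeding it might look like this. `FakeTag` is a hypothetical stand-in used only so the example runs without BeautifulSoup, and the import is written to work on Python 3 as well as 2.7.

```python
try:
    from html.parser import HTMLParser  # Python 3
except ImportError:
    from HTMLParser import HTMLParser  # Python 2.7, as in the question


class MLStripper(HTMLParser):
    """Collects only the text nodes seen while parsing."""

    def __init__(self):
        HTMLParser.__init__(self)  # sets up the parser's internal buffer
        self.fed = []

    def handle_data(self, d):
        self.fed.append(d)

    def get_data(self):
        return ''.join(self.fed)


def strip_tags(html):
    s = MLStripper()
    # feed() appends its argument to an internal string buffer, so anything
    # that is not already a string is converted to its markup text first.
    s.feed(str(html))
    s.close()
    return s.get_data()


class FakeTag(object):
    """Hypothetical stand-in for a parsed element such as a bs4 Tag."""

    def __init__(self, markup):
        self.markup = markup

    def __str__(self):
        return self.markup


sample = FakeTag('<p><b>Masala</b> films of '
                 '<a href="/wiki/Cinema_of_India">Indian cinema</a>.</p>')
print(strip_tags(sample))  # Masala films of Indian cinema.
```

If `paragraph` really is a BeautifulSoup `Tag`, calling `paragraph.get_text()` on it directly should give the same result without the custom stripper. On Python 2.7, `unicode(html)` may be a safer conversion than `str(html)` when the markup contains non-ASCII characters.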