`re` module: how do I remove all HTML tags?

Time: 2015-09-20 09:57:55

Tags: python

I'm writing a program in Python 3 that downloads a question from Stack Overflow. It is finished now, and this is the code:

import os
import re
import urllib.request

req = urllib.request.Request('https://stackoverflow.com/questions/32535816/use-for-loop-inside-another-for-in-python-3')

req.add_header("user-agent", "Mozilla/5.0 (X11; Linux x86_64)\
               AppleWebKit/537.36 (KHTML, like Gecko)\
               Chrome/45.0.2454.93 Safari/537.36")

html = urllib.request.urlopen(req)
webpage = html.read().decode('utf-8')

text = re.search(r'<div class="post-text" itemprop="text">.+?</div>',
                  webpage, re.S)

with open('text', 'w') as f:
    for i in text.group():
        f.write(i)

The output is:

<div class="post-text" itemprop="text">

<p>I'm trying to print a file in rainbow colors. But however I have a problem, here is my code:</p>

<pre><code>color = [91, 93, 92, 96, 94, 95]

with open(sys.argv[1]) as f:
for i in f.read():
    for c in color:
        print('\033[{0}m{1}\033[{0};m'
              .format(c, i), end='', flush=True)
</code></pre>

<p>the question is, I want the output like this: <code>Hello</code>(<code>H</code> in red, <code>e</code> in yellow, etc. )</p>

<p>but I got the output like this:<code>HHHHHeeeeellll...</code>(first <code>H</code> in red, second <code>H</code> in yello, etc.)</p>

<p>I know that because the first <code>for</code> will loop the second <code>for</code>. So how can I solve this?</p>
    </div>

I think it works well, but I want to remove all the HTML tags. I tried using `re.sub` like this:

text = re.sub('<.+?>', '', text)

But I got this error:

Traceback (most recent call last):
  File "1.py", line 18, in <module>
    text = re.sub('<.+?>', '', text)
  File "/usr/lib/python3.4/re.py", line 179, in sub
    return _compile(pattern, flags).sub(repl, string, count)
TypeError: expected string or buffer

What does this mean, and how can I fix it?
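
For reference, the traceback points at the type of `text`: `re.search` returns a match object (or `None`), not a string, while `re.sub` expects a string as its third argument. Below is a minimal sketch of one way to get past the error, by passing `text.group()` instead of `text`. The `strip_tags` helper and the sample `webpage` string are hypothetical stand-ins so the example runs without a network request; they are not part of the original program.

```python
import re

def strip_tags(fragment):
    """Remove anything that looks like an HTML tag (hypothetical helper)."""
    return re.sub(r'<.+?>', '', fragment)

# Hypothetical stand-in for the downloaded page, so the sketch runs offline.
webpage = ('<html><div class="post-text" itemprop="text">\n'
           '<p>I know that because the first <code>for</code> loops.</p>\n'
           '    </div></html>')

match = re.search(r'<div class="post-text" itemprop="text">.+?</div>',
                  webpage, re.S)

# re.search returns a match object (or None), not a str; re.sub needs the
# matched text, which match.group() provides.
plain = strip_tags(match.group()).strip() if match else ''
print(plain)
```

A regular expression like `<.+?>` is only a rough approximation of HTML and leaves entities such as `&amp;` untouched. If that matters, the standard library's `html.parser.HTMLParser` may be a more robust starting point.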
