Remove HTML codes from this tweet

Time: 2017-12-19 07:36:54

Tags: python html twitter

original_tweet = "I luv my &lt;3 iphone &amp; you’re awsm apple. DisplayIsAwesome, sooo happppppy  http://www.apple.com"

import HTMLParser

html_parser = HTMLParser.HTMLParser()

tweet = html_parser.unescape(original_tweet)

I get the following error when I run this code. Please help me get rid of it.

UnicodeDecodeError                        Traceback (most recent call last)
<ipython-input-12-58919c61b71f> in <module>()
----> 1 tweet = html_parser.unescape(original_tweet)
      2 tweet

C:\Users\vntja\Anaconda2\ds\lib\HTMLParser.pyc in unescape(self, s)
    474                     return '&'+s+';'
    475 
--> 476         return re.sub(r"&(#?[xX]?(?:[0-9a-fA-F]+|\w{1,8}));", replaceEntities, s)

C:\Users\vntja\Anaconda2\ds\lib\re.pyc in sub(pattern, repl, string, count, flags)
    153     a callable, it's passed the match object and must return
    154     a replacement string to be used."""
--> 155     return _compile(pattern, flags).sub(repl, string, count)
    156 
    157 def subn(pattern, repl, string, count=0, flags=0):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 4: ordinal not in range(128)

1 answer:

Answer 0 (score: 0)

Add this line at the top of your Python script:

# -*- coding: utf-8 -*-

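The tweet contains a non-ASCII character (the curly apostrophe ’, whose UTF-8 encoding starts with byte 0xe2), and Python is trying to decode it with the ASCII codec, which does not define that byte.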
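
Note that the coding line only tells Python 2 how to read the source file; a plain string literal is still a byte string there. A minimal sketch of one way to avoid the implicit ASCII decode is to hand `unescape` a Unicode string. This assumes the tweet originally contained the entities `&lt;` and `&amp;`; the `try`/`except` import and the `utf8_bytes` name are illustrative additions so that the same code also runs on Python 3, where `unescape` lives in the `html` module.

```python
# -*- coding: utf-8 -*-
try:
    # Python 3.4+: unescape is a function in the html module
    from html import unescape
except ImportError:
    # Python 2: fall back to the HTMLParser method used in the question
    from HTMLParser import HTMLParser
    unescape = HTMLParser().unescape

# The u prefix makes this a Unicode string on Python 2 (it is a no-op on Python 3).
# Assumption: the tweet carried the entities &lt; and &amp;.
original_tweet = (
    u"I luv my &lt;3 iphone &amp; you’re awsm apple. "
    u"DisplayIsAwesome, sooo happppppy  http://www.apple.com"
)

# If the tweet arrives as raw bytes instead (from a file or an API response),
# decode it explicitly before unescaping. utf8_bytes is a hypothetical stand-in.
utf8_bytes = original_tweet.encode("utf-8")
tweet = unescape(utf8_bytes.decode("utf-8"))
```

After this runs, `tweet` holds the text with `<` and `&` restored and the curly apostrophe left intact.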