Using the .replace() method on a string read from an HTML file containing Unicode

Time: 2016-09-29 17:07:52

Tags: python html string unicode io

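I want to read an .html file as raw text and replace instances of a substring containing Unicode characters with another substring. Suppose the file mm03.html contains just one line of text: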

<span style='font-size:14.0pt'>«test»</span>

I want to read mm03.html, parse its raw text into a string, and call replace so that the output looks like this:

<span style='font-size:14.0pt'>TEST</span>

On my first attempt, I wrote the following code...

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
print htmlFill
htmlFill = htmlFill.replace("«test»","TEST")
print htmlFill
htmlBase.close()

...expecting it to print the original line listed above first, followed by the second line. Instead, it printed the first line twice.
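One plausible explanation, sketched below under two assumptions that the question does not confirm: the file was saved in a single-byte Windows encoding such as cp1252 (the 0xab byte in the tracebacks further down hints at this), and the script itself was saved as UTF-8. In that case the literal and the file content are different byte sequences, so replace() finds no match and returns the string unchanged. The variable names are made up for the illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Assumption: the file on disk is cp1252, so the guillemets are the single bytes 0xAB and 0xBB.
file_bytes = u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>".encode("cp1252")

# Assumption: the script is saved as UTF-8, so the literal in the source is these bytes.
literal_bytes = u"\u00abtest\u00bb".encode("utf-8")

# Byte-for-byte search: the UTF-8 literal never occurs in the cp1252 data.
unchanged = file_bytes.replace(literal_bytes, b"TEST")

print(repr(literal_bytes))
print(literal_bytes in file_bytes)  # False
print(unchanged == file_bytes)      # True: nothing was replaced
```

If the two encodings really do differ like this, no error is raised at all, which would match the silent "first line twice" result.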

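OK. So it's probably a Unicode decoding problem, right? Maybe, but when I modified the code following the Unicode-related advice scattered all over this site, problems of various shades persisted. Moreover, the desired behavior can be achieved by explicitly defining htmlBase as...

htmlBase = """<span style='font-size:14.0pt'>«test»</span>"""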

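...which leads me to believe that I'm missing something about how Python reads HTML files. I tried opening mm03.html in 'w' mode, but that doesn't seem to work and tends to wipe out the original file. Perhaps strings read from a read-only file are themselves read-only, but I may be wrong, and that doesn't make much sense.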
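Two side notes on the paragraph above, as a small sketch: opening an existing file in 'w' mode truncates it immediately, which would explain the destroyed file, and Python strings are immutable no matter where they came from, so replace() always returns a new string instead of editing in place. The scratch file name below is hypothetical and exists only for the demonstration.

```python
from __future__ import print_function
import io
import os
import tempfile

# Hypothetical scratch file standing in for mm03.html.
path = os.path.join(tempfile.mkdtemp(), "scratch.html")
with io.open(path, "w", encoding="utf-8") as f:
    f.write(u"<span>test</span>")

# Opening in 'w' mode empties the file before anything is written.
io.open(path, "w", encoding="utf-8").close()
size_after_w = os.path.getsize(path)
print(size_after_w)  # 0

# replace() returns a new string; the original is left untouched.
original = u"<span>test</span>"
replaced = original.replace(u"test", u"TEST")
print(original)
print(replaced)
```

So the file mode has no bearing on whether the string can be "changed": the result of replace() simply has to be assigned to a name, as the scripts here already do.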

Here are a few script/output pairs I've already chewed through.

  1. Adding the character 'u' (without the quotes) before the string I want to replace

    # -*- coding: utf-8 -*-
    import codecs
    htmlBase = codecs.open("mm03.html",'r')
    htmlFill = htmlBase.read()
    print htmlFill
    htmlFill = htmlFill.replace(u"«test»","TEST")
    print htmlFill
    htmlBase.close()
    

    Output:

    <span style='font-size:14.0pt'>½test╗</span>
    Traceback (most recent call last):
      File "test2.py", line 6, in <module>
        htmlFill = htmlFill.replace(u"«test»","TEST")
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
    
  2. Applying .decode('utf-8') to the string returned by .read()

    # -*- coding: utf-8 -*-
    import codecs
    htmlBase = codecs.open("mm03.html",'r')
    htmlFill = htmlBase.read().decode('utf-8')
    print htmlFill
    htmlFill = htmlFill.replace(u"«test»","TEST")
    print htmlFill
    htmlBase.close()
    

    Output:

    Traceback (most recent call last):
      File "test2.py", line 4, in <module>
        htmlFill = htmlBase.read().decode('utf-8')
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
    
  3. Applying .encode('utf-8') to the string returned by .read()

    # -*- coding: utf-8 -*-
    import codecs
    htmlBase = codecs.open("mm03.html",'r')
    htmlFill = htmlBase.read().encode('utf-8')
    print htmlFill
    htmlFill = htmlFill.replace(u"«test»","TEST")
    print htmlFill
    htmlBase.close()
    

    Output:

    Traceback (most recent call last):
      File "test2.py", line 4, in <module>
        htmlFill = htmlBase.read().encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
    
  4. Applying .decode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

    # -*- coding: utf-8 -*-
    import codecs
    htmlBase = codecs.open("mm03.html",'r')
    htmlFill = htmlBase.read().decode('utf-8')
    print htmlFill
    htmlFill = htmlFill.replace("«test»","TEST")
    print htmlFill
    htmlBase.close()
    

    Output:

    Traceback (most recent call last):
      File "test2.py", line 4, in <module>
        htmlFill = htmlBase.read().decode('utf-8')
      File "C:\Python27\lib\encodings\utf_8.py", line 16, in decode
        return codecs.utf_8_decode(input, errors, True)
    UnicodeDecodeError: 'utf8' codec can't decode byte 0xab in position 31: invalid start byte
    
  5. Applying .encode('utf-8') to the string returned by .read(), without the 'u' prefix on the target substring

    # -*- coding: utf-8 -*-
    import codecs
    htmlBase = codecs.open("mm03.html",'r')
    htmlFill = htmlBase.read().encode('utf-8')
    print htmlFill
    htmlFill = htmlFill.replace("«test»","TEST")
    print htmlFill
    htmlBase.close()
    

    Output:

    Traceback (most recent call last):
      File "test2.py", line 4, in <module>
        htmlFill = htmlBase.read().encode('utf-8')
    UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 31: ordinal not in range(128)
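
The tracebacks above all stumble over byte 0xab at position 31, which is exactly where « would sit if the file stored it as a single byte. The sketch below assumes (the question cannot confirm it) that the file is cp1252 or Latin-1 rather than UTF-8, and shows how those bytes behave under the codecs involved. The cp437 entry is a guess at why the console printed ½test╗; the helper try_decode is made up for the illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import json

raw = b"\xabtest\xbb"  # assumed on-disk bytes for the guillemet-wrapped word

def try_decode(data, encoding):
    """Return the decoded text, or the error class name if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

encodings = ("ascii", "utf-8", "cp1252", "cp437")
results = {enc: try_decode(raw, enc) for enc in encodings}

# json.dumps escapes non-ASCII characters, so this prints safely on any console.
print(json.dumps(results, indent=2, sort_keys=True))
```

Under these assumptions, ascii and utf-8 both reject the bytes (matching the two error messages in the list), cp1252 yields the intended text, and cp437 yields the garbled console output.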
    

1 answer:

Answer 0 (score: 0)

You need to decode the string you want to replace before passing it to str.replace(). This worked for me:

# -*- coding: utf-8 -*-
import codecs
htmlBase = codecs.open("mm03.html",'r')
htmlFill = htmlBase.read()
htmlFill = codecs.decode(htmlFill,'utf-8')
substr = codecs.decode("«test»",'utf-8')
htmlFill = htmlFill.replace(substr,"TEST")
print htmlFill
htmlBase.close()
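
A variation on the same idea, offered as a sketch rather than a definitive fix: let io.open decode the file while reading, so everything downstream is Unicode text. The code above decodes as UTF-8; the 0xab byte in the question's tracebacks suggests the asker's file may instead be cp1252, so the ENCODING value below is an assumption to adjust. The stand-in file is created here only to keep the example self-contained.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function
import io
import os
import tempfile

ENCODING = "cp1252"  # assumption: must match how mm03.html was actually saved

# Build a stand-in for mm03.html in a temporary directory.
path = os.path.join(tempfile.mkdtemp(), "mm03.html")
with io.open(path, "w", encoding=ENCODING) as f:
    f.write(u"<span style='font-size:14.0pt'>\u00abtest\u00bb</span>")

# io.open decodes on read, so html_fill is Unicode text, not raw bytes.
with io.open(path, "r", encoding=ENCODING) as f:
    html_fill = f.read()

# Both arguments are Unicode, so no implicit ASCII conversion takes place.
html_fill = html_fill.replace(u"\u00abtest\u00bb", u"TEST")
print(html_fill)
```

Writing the search string with \u escapes sidesteps any dependence on how the script file itself is encoded; a literal u"«test»" works as well when the coding declaration matches the saved source.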