Question

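I want to test whether a byte string that I'm extracting from a file results in valid ISO-8859-15 encoded text. The first thing I came across was this similar case about UTF-8 validation: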

https://stackoverflow.com/a/5259160/1209004

Based on that, I thought I was being clever by doing something similar for ISO-8859-15. See the following demo code:

#! /usr/bin/env python
#

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15
    try:
        bytes.decode('iso-8859-15', 'strict')
        return(True)
    except UnicodeDecodeError:
        return(False)

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)
    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    isValidLatin = isValidISO885915(bytes)
    print(isValidLatin)

main()

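However, running this returns True, even though x95 is not a valid code point in ISO-8859-15! Am I overlooking something really obvious? (By the way, I tried this with Python 2.7.4 and 3.3, and the result is the same in both cases.)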

Answer 1

I think I've found a workable solution myself, so I might as well share it.

Looking at the code page layout of ISO 8859-15, I really only need to check for the presence of code points 00 - 1f and 7f - 9f. These correspond to the C0 and C1 control codes.
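To illustrate why a strict decode alone cannot catch these bytes: Python's iso-8859-15 codec assigns a character to every one of the 256 byte values, and it maps bytes 0x80 - 0x9f onto the C1 control characters U+0080 - U+009F, so no UnicodeDecodeError is ever raised. A minimal sketch, assuming Python 3 (the variable names are only for illustration):

```python
import unicodedata

# Decode every possible byte value; with this codec none of them raises.
all_bytes = bytes(range(256))
decoded = all_bytes.decode("iso-8859-15", "strict")

# Byte 0x95 comes out as the C1 control character U+0095 (category "Cc").
ch = b"\x95".decode("iso-8859-15")
print(len(decoded), hex(ord(ch)), unicodedata.category(ch))
# prints: 256 0x95 Cc
```

So the decode step only tells you that the bytes can be mapped to characters, not that those characters are printable text.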

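In my project I was already using something based on the code from another Stack Overflow answer (linked in the code comment below) to remove control characters (C0 + C1) from strings. So, taking that as a basis, I came up with this: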

#! /usr/bin/env python
#
import unicodedata

def removeControlCharacters(string):
    # Remove control characters from string
    # Based on: https://stackoverflow.com/a/19016117/1209004

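    # Tab, newline and return are part of C0, but are allowed in XML
    allowedChars = [u'\t', u'\n', u'\r']
    # Only category "Cc" covers the C0/C1 controls; testing for any "C"
    # category would also drop the soft hyphen (byte 0xAD, category "Cf")
    return "".join(ch for ch in string if
        unicodedata.category(ch) != "Cc" or ch in allowedChars)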

def isValidISO885915(bytes):
    # Test if bytes result in valid ISO-8859-15

    # Decode bytes to string
    try:
        string = bytes.decode("iso-8859-15", "strict")
    except UnicodeDecodeError:
        # A decode error means the input is not valid
        return False

    # Valid only if removing control characters leaves the string unchanged
    return removeControlCharacters(string) == string

def main():
    # Test bytes (byte x95 is not defined in ISO-8859-15!)

    bytes = b'\x4A\x70\x79\x6C\x79\x7A\x65\x72\x20\x64\x95\x6D\x6F\xFF'

    print(isValidISO885915(bytes)) 


main()

There may be more elegant/Pythonic ways to do this, but it seems to do the trick, and it works with both Python 2.7 and 3.3.
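As one possible alternative, the same check can be done on the raw bytes without decoding at all, since the code points to reject (00 - 1f and 7f - 9f) correspond directly to byte values. The following is only a sketch under that assumption, written for Python 3; the names `_CONTROL_BYTES` and `has_no_control_bytes` are made up for this example, and tab, newline and carriage return are allowed as in the code above:

```python
import re

# Bytes in the C0 range (0x00-0x1f), DEL (0x7f) and the C1 range (0x80-0x9f),
# except tab (0x09), newline (0x0a) and carriage return (0x0d).
_CONTROL_BYTES = re.compile(rb"[\x00-\x08\x0b\x0c\x0e-\x1f\x7f-\x9f]")

def has_no_control_bytes(data):
    """Return True if data contains none of the control bytes listed above."""
    return _CONTROL_BYTES.search(data) is None

print(has_no_control_bytes(b"Jpylyzer d\x95mo\xff"))  # prints: False
print(has_no_control_bytes(b"Jpylyzer demo\n"))       # prints: True
```

This avoids building an intermediate string, at the cost of hard-coding the byte ranges instead of relying on unicodedata.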

Check if bytes result in valid ISO 8859-15 (Latin) in Python
