How to read ANSI- and Unicode-encoded files in Python

Time: 2018-05-31 23:36:58

Tags: python python-3.x unicode encoding ansi

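I am trying to write a Python function that takes a root directory and a key phrase from the user. The function searches the whole directory tree and prints every line that contains the key phrase. At the moment my script can read and print lines from files saved with ANSI encoding, but not from files saved with Unicode encoding. Please let me know how to change my code so that the script can search files in both encodings. I am fairly new to Python, thanks!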

My Python script:

import os

def myFunction(rootdir, keyPhrases):

    path = rootdir # Enter the root directory you want to search from

    key_phrases = [keyPhrases] # Enter here the key phrases in the lines you hope to find 
    key_phrases = [i.replace('\n','') for i in key_phrases] #In case an \n is added to the end of the string when the parameter is passed to the function

    # This for loop allows all sub directories and files to be searched
    for (path, subdirs, files) in os.walk(path): 
        files = [f for f in os.listdir(path) if f.endswith('.txt') or f.endswith('.log')] # Specify here the format of files you hope to search from (ex: ".txt" or ".log")
        files.sort() # file is sorted list

        files = [os.path.join(path, name) for name in files] # Joins the path and the name, so the files can be opened and scanned by the open() function

        # The following for loop searches all files with the selected format
        for filename in files:

                # Opens the individual files and to read their lines
                with open(filename) as f:
                    f = f.readlines()

                # The following loop checks every line of the file for the key phrases entered by the user and prints each line that matches
                for line in f:
                    for phrase in key_phrases: 
                        if phrase in line:
                            print(line)
                            break 

    print("The end of the directory has been reached, if no lines are printed then that means the key phrase does not exist in the root directory you entered.")

1 answer:

Answer 0 (score: 1)

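In files that Windows calls "Unicode" (UTF-16), the first 2 bytes are usually the byte order mark (BOM), with the value 0xFF 0xFE. This indicates UTF-16 little-endian encoding. "ANSI" (usually Windows-1252) files have no such mark.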

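If you read a UTF-16 file as though it used a different, 8-bit-based encoding such as UTF-8, Windows-1252 or ASCII, the result is wrong. With ASCII or UTF-8 you get a UnicodeDecodeError, because 0xFF is not a valid ASCII byte and can never appear in UTF-8. With Windows-1252 the BOM bytes are valid characters (ÿ and þ), so decoding usually succeeds silently, but every ASCII character comes back followed by a NUL character, and a search for the key phrase finds nothing.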
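The behaviour of each codec can be checked without touching the disk. The following is a minimal sketch; the sample text and the helper name `try_decode` are made up for illustration.

```python
from codecs import BOM_UTF16_LE

# Bytes of a small UTF-16 little-endian "file", BOM included (made-up sample text).
raw = BOM_UTF16_LE + "error: disk full\n".encode("utf_16_le")

def try_decode(data, encoding):
    """Return the decoded text, or the exception name if decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        return type(exc).__name__

# Decode the same bytes with three wrong codecs and the right one.
results = {enc: try_decode(raw, enc) for enc in ("ascii", "utf_8", "cp1252", "utf_16")}
for enc, result in results.items():
    print(enc, repr(result))
```

Only `utf_16` gives back the original text. The `cp1252` result contains the same letters, but separated by NUL characters, so a test such as `"disk" in line` is false.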

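So, if you are sure that every file is encoded as either UTF-16-LE or Windows-1252, you can test for the UTF-16 BOM at the start of the file and, if it is found, open the file with that encoding: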

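import sys
from codecs import BOM_UTF16_LE

def get_file_encoding(filename, default=None):
    # Read the first two bytes in binary mode and compare them with the UTF-16 LE BOM
    with open(filename, 'rb') as f:
        if f.read(2) == BOM_UTF16_LE:
            # The 'utf_16' codec uses the BOM to pick the byte order and strips it from the text
            return 'utf_16'
        return default if default else sys.getdefaultencoding()

# filename and key_phrases are the variables from the script in the question
with open(filename, encoding=get_file_encoding(filename, 'windows_1252')) as f:
    for line in f:
        for phrase in key_phrases:
            if phrase in line:
                print(line)
                break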
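For completeness, here is one way the BOM check might be folded into the directory walk from the question. It is only a sketch: the function name `find_lines`, the `extensions` parameter and the use of `errors="replace"` are assumptions of this example, not part of the original script.

```python
import os
from codecs import BOM_UTF16_LE

def get_file_encoding(filename, default="cp1252"):
    """Return 'utf_16' if the file starts with a UTF-16 LE BOM, else `default`."""
    with open(filename, "rb") as f:
        return "utf_16" if f.read(2) == BOM_UTF16_LE else default

def find_lines(rootdir, key_phrases, extensions=(".txt", ".log")):
    """Yield (path, line) for every line that contains any of the key phrases."""
    for dirpath, _subdirs, names in os.walk(rootdir):
        for name in sorted(names):
            if not name.endswith(extensions):
                continue
            path = os.path.join(dirpath, name)
            # errors="replace" (an assumption here) keeps the scan going
            # when a file contains bytes the chosen codec cannot decode.
            with open(path, encoding=get_file_encoding(path), errors="replace") as f:
                for line in f:
                    if any(phrase in line for phrase in key_phrases):
                        yield path, line.rstrip("\n")
```

A call such as `for path, line in find_lines(rootdir, [keyPhrases]): print(line)` would then replace the nested loops in the question.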

Also, you might consider using a regular expression for the phrase matching, rather than looping over the possible phrases.
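A minimal sketch of that idea, assuming the phrases should be matched literally (the helper name `compile_phrases` and the sample lines are hypothetical):

```python
import re

def compile_phrases(key_phrases):
    """Build one pattern that matches any of the phrases literally."""
    # re.escape keeps characters such as '.' or '(' from acting as regex syntax.
    return re.compile("|".join(re.escape(p) for p in key_phrases))

pattern = compile_phrases(["disk full", "timeout (30s)"])
lines = ["ok", "error: disk full", "request timeout (30s) reached"]

# One search per line replaces the inner loop over phrases.
matches = [line for line in lines if pattern.search(line)]
print(matches)
```

Note that an empty phrase list would produce an empty pattern, which matches every line, so the caller should probably guard against that case.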