Question

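From a huge text file, I need to identify the lines that contain non-Latin characters (anything other than \w plus special characters). Technically, I should exclude every letter that is not a Latin letter. The output is stored in a log file for further processing. My attempts with re were unsuccessful. Can you see a smart way to identify and write out the lines that contain non-Latin characters?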

import re

# Raw string so the backslash in \w is passed to the regex engine unchanged
pattern = r'^\w+$'
regex = re.compile(pattern)
filename = "C:\\ImportTool\\import\\file.csv"
with open(filename, 'r', encoding='utf8', errors='ignore') as inputfile, \
     open(filename + '.clean', 'w', encoding="utf8") as outputfile, \
     open(filename + '.special', 'w', encoding="utf8") as outputfile_log:
    for line in inputfile:
        # Lines that do not match the pattern go to .clean, the rest to .special
        if regex.search(line) is None:
            outputfile.write(line)
        else:
            outputfile_log.write(line)
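
One likely reason the attempt above does not behave as intended: in Python 3, `\w` in a str pattern is Unicode-aware, so it matches Hebrew letters as well as Latin ones, and `^\w+$` only matches a line made up entirely of word characters, which a quoted, semicolon-separated row never is. Below is a minimal sketch of an alternative, assuming "non-Latin" means any alphabetic character whose Unicode name does not contain LATIN. The helper names `has_non_latin_letter` and `split_lines` are hypothetical and not part of the original post, and the sample rows are shortened versions of the data in the question.

```python
import unicodedata


def has_non_latin_letter(text):
    """Return True if any alphabetic character in text is not a Latin letter.

    Assumption: a letter counts as Latin when its Unicode name contains
    "LATIN" (this includes accented letters such as "é").
    """
    for ch in text:
        # Only letters are checked; digits, quotes, semicolons and
        # whitespace are ignored.
        if ch.isalpha() and "LATIN" not in unicodedata.name(ch, ""):
            return True
    return False


def split_lines(lines):
    """Partition an iterable of lines into (latin_only, non_latin) lists."""
    clean, special = [], []
    for line in lines:
        if has_non_latin_letter(line):
            special.append(line)
        else:
            clean.append(line)
    return clean, special


# Shortened sample rows: the second one contains Hebrew text.
sample = [
    '"100";"xxxxxxxxx";"00002";"ZM";"ybatuca";"1000.000 "\n',
    '"100";"ybatuca";"זמניים מחלקת תיפעול חיפה";"IL01"\n',
]
clean, special = split_lines(sample)
print(len(clean), len(special))
```

In the original loop, the condition could become `if has_non_latin_letter(line):` to send the line to the `.special` file and everything else to `.clean`. Lines are still read one at a time, so the file never has to fit in memory.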

For example, the following line should be excluded because it contains Hebrew:

"100";"xxxxxxxxx";"00002";"ZM";"B";"";"";"B";"R";"R";"X";"RR";"I02";"OxxH";"20161107";"ybatuca";"זמניים מחלקת תיפעול חיפה";"";"";"IL01";"";"";"";"";"1000.000 "

Identifying text in non-Latin character sets Ph

0 answers: