Question

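I'm trying to use chardet to infer the encoding of a very large tab-delimited file (> 4 million lines).

At the moment my script struggles, presumably because of the size of the file. I'd like to narrow it down to loading only the first x lines of the file, but I ran into difficulties when I tried to use readline().

The script as it stands is: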

import chardet
import os
filepath = os.path.join(r"O:\Song Pop\01 Originals\2017\FreshPlanet_SongPop_0517.txt")
rawdata = open(filepath, 'rb').readline()


print(rawdata)
result = chardet.detect(rawdata)
print(result)

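It works, but it only reads the first line of the file. My attempts at calling readline() more than once with a simple loop haven't worked well (possibly because the script opens the file in binary mode).

The output for one line is {'encoding': 'Windows-1252', 'confidence': 0.73, 'language': ''}

I'd like to know whether increasing the number of lines it reads would improve the encoding confidence.

Any help would be greatly appreciated.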

Answer 1

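I'm not particularly experienced with chardet, but I came across this post while debugging a problem of my own and was surprised that it had no answers. Sorry if this comes too late to be of any help to the OP, but for anyone else who stumbles upon this:

I'm not sure whether reading more of the file will improve the guessed encoding, but all you need to do to find out is test it: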

import chardet
testStr = b''
count = 0
with open('Huge File!', 'rb') as x:
    line = x.readline()
    while line and count < 50:  #Set based on lines you'd want to check
        testStr = testStr + line
        count = count + 1
        line = x.readline()
print(chardet.detect(testStr))
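
As a possible variation on the loop above, here is a minimal sketch that collects the first lines with itertools.islice instead of a manual counter. The helper name head_bytes and its parameters are made up for illustration; only the standard library is used inside the block.

```python
from itertools import islice


def head_bytes(path, max_lines):
    """Return the first max_lines lines of the file at path as one bytes object."""
    with open(path, "rb") as f:
        # Iterating a file opened in binary mode yields one bytes line at a time.
        # islice stops after max_lines, so the remainder of a huge file
        # is never pulled into memory.
        return b"".join(islice(f, max_lines))
```

With chardet installed, calling chardet.detect(head_bytes(path, n)) for a few values of n (say 50, 500, 5000) should show whether the reported confidence actually changes as the sample grows.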

In my case, I had a file that I believed contained multiple encodings, and I wrote the following to test it "line by line".

import chardet
with open('Huge File!', 'rb') as x:
    line = x.readline()
    curChar = chardet.detect(line)
    print(curChar)
    while line:
        # Detect each line once and print only when the result changes
        detected = chardet.detect(line)
        if curChar != detected:
            curChar = detected
            print(curChar)
        line = x.readline()
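
The loop above prints every change, which can get noisy on a long file. One alternative, sketched here under my own assumptions, is to tally how often each encoding name is reported. The function name tally_line_encodings and the detect parameter are hypothetical: detect stands for any callable shaped like chardet.detect (bytes in, dict with an 'encoding' key out), so the block itself does not import chardet.

```python
from collections import Counter
from itertools import islice


def tally_line_encodings(path, detect, max_lines=None):
    """Count how often each encoding name is reported, one detection per line.

    detect is assumed to behave like chardet.detect: it takes bytes and
    returns a dict that has an 'encoding' key (the value may be None).
    """
    counts = Counter()
    with open(path, "rb") as f:
        # islice(f, None) walks the whole file; a number caps the lines checked.
        for line in islice(f, max_lines):
            counts[detect(line)["encoding"]] += 1
    return counts
```

Something like tally_line_encodings('Huge File!', chardet.detect, max_lines=10000) would then give a rough picture of which encodings dominate. Keep in mind that a guess made from a single short line is likely to be less reliable than one made from a larger sample.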

Answer 2

Another example, using UniversalDetector:

#!/usr/bin/env python
from chardet.universaldetector import UniversalDetector


def detect_encode(file):
    detector = UniversalDetector()
    detector.reset()
    with open(file, 'rb') as f:
        for row in f:
            detector.feed(row)
            if detector.done: break

    detector.close()
    return detector.result

if __name__ == '__main__':
    print(detect_encode('example_file.csv'))

The loop breaks once confidence = 1.0, that is, as soon as detector.done is set. Useful for very large files.
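
If the file might contain extremely long lines, feeding fixed-size chunks instead of rows is one option. The following is only a sketch: feed_in_chunks, chunk_size and max_bytes are names I made up, and the detector argument is assumed to expose the same feed / done / close / result interface as chardet's UniversalDetector.

```python
def feed_in_chunks(detector, path, chunk_size=64 * 1024, max_bytes=None):
    """Feed the file to detector in fixed-size chunks and return detector.result.

    detector is expected to behave like chardet's UniversalDetector:
    feed(bytes), a done flag, close(), and a result attribute.
    """
    fed = 0
    with open(path, "rb") as f:
        while not detector.done:
            # Optional cap so an inconclusive file is not read to the end.
            if max_bytes is not None and fed >= max_bytes:
                break
            chunk = f.read(chunk_size)
            if not chunk:  # end of file
                break
            detector.feed(chunk)
            fed += len(chunk)
    detector.close()
    return detector.result
```

Usage would look like feed_in_chunks(UniversalDetector(), 'example_file.csv', max_bytes=10_000_000). The cap is a trade-off: it bounds the run time, but the detector sees less data.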

Answer 3

Another example, this one using the python-magic package, which does not load the file into memory:

import magic


def detect(
    file_path,
):
    return magic.Magic(
        mime_encoding=True,
    ).from_file(file_path)

Answer 4

import chardet

with open(filepath, 'rb') as rawdata:
    result = chardet.detect(rawdata.read(100000))
result
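
Because the snippet above only looks at the first 100000 bytes, the guess could be wrong for later parts of the file. One way to check it, sketched here with the standard library only, is to stream the whole file through an incremental decoder for the guessed encoding. The name first_decode_error and the chunk_size parameter are my own and not part of chardet.

```python
import codecs


def first_decode_error(path, encoding, chunk_size=1 << 20):
    """Return None if the whole file decodes as encoding, else the UnicodeDecodeError."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    with open(path, "rb") as f:
        try:
            # Read fixed-size chunks until read() returns b"" at end of file.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
            # Final flush: reports a multi-byte sequence cut off at end of file.
            decoder.decode(b"", final=True)
        except UnicodeDecodeError as exc:
            return exc
    return None
```

For example, first_decode_error(filepath, result['encoding']) returning None means every byte was decodable. That is weak evidence for single-byte encodings such as Windows-1252 or Latin-1, which accept almost any byte sequence, but a failure is a clear sign that the guess does not hold for the whole file.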

Using chardet to find the encoding of a very large file
