Question

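I need to determine which files in a directory are binary and which are text.

I tried using mimetypes, but in my case that is not a good idea because it cannot identify the MIME type of every file, and I have some unusual ones here... I only need to know one thing: binary or text. Simple, right? But I can't find a solution...

Thanks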

Answer 1

Thanks everybody, I found a solution that suits my problem. I found this code at http://code.activestate.com/recipes/173220/ and changed a small piece of it to fit my needs.

It works fine.

def istext(filename):
    # Read in binary mode so the bytes are not decoded or newline-translated
    with open(filename, "rb") as f:
        s = f.read(512)
    # Printable ASCII plus newline, carriage return, tab and backspace
    text_characters = bytes(range(32, 127)) + b"\n\r\t\b"
    if not s:
        # Empty files are considered text
        return True
    if b"\0" in s:
        # Files with null bytes are likely binary
        return False
    # Delete the text characters; whatever remains is non-text
    t = s.translate(None, text_characters)
    # If more than 30% non-text characters, then
    # this is considered a binary file
    if len(t) / len(s) > 0.30:
        return False
    return True
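
Since the question is about the files in a directory, here is a minimal sketch of how a check like this might be applied to a whole folder. `classify_directory` and `has_no_null_byte` are hypothetical names, not part of the recipe. The stand-in predicate is only there to keep the example self-contained; any function that takes a path and returns True for text (such as istext above) could be passed in instead.

```python
import os


def has_no_null_byte(path, blocksize=512):
    """Hypothetical stand-in predicate: text if the first block has no NUL byte."""
    with open(path, "rb") as f:
        return b"\0" not in f.read(blocksize)


def classify_directory(directory, is_text=has_no_null_byte):
    """Return (text_files, binary_files) for the regular files in directory."""
    text_files, binary_files = [], []
    for name in sorted(os.listdir(directory)):
        path = os.path.join(directory, name)
        if not os.path.isfile(path):
            continue  # skip subdirectories and other non-regular entries
        if is_text(path):
            text_files.append(name)
        else:
            binary_files.append(name)
    return text_files, binary_files
```

Calling `classify_directory(".", is_text=istext)` would then sort the current directory's files with the recipe's own heuristic.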

Answer 2

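It is inherently not simple. There is no way to know for sure, although in most cases you can make a reasonably good guess.

Things you might want to do:

Look for known magic numbers in binary signatures
Look for a Unicode byte order mark at the start of the file
If the file regularly goes 00 xx 00 xx 00 xx (for arbitrary xx), or the other way round, it is quite likely UTF-16
Otherwise, look for 0s in the file; a file with a 0 in it is unlikely to be a text file in a single-byte encoding.

But it is entirely heuristic. For example, it is quite possible to have a file that is both a valid text file and a valid image file. It would probably be nonsense as a text file, but legitimate in some encoding or other...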
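
The checks listed above could be sketched roughly like this. It is only an illustration under several assumptions: the function names are made up, the magic-number table is a tiny sample, the 0.9 threshold for the UTF-16 pattern is arbitrary, and treating empty input as text is a choice rather than a rule.

```python
import codecs

# Byte order marks, UTF-32 first so it is not mistaken for UTF-16.
BOMS = (codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE, codecs.BOM_UTF8,
        codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)

# A few well-known magic numbers (PNG, JPEG, GIF, PDF, ZIP, ELF, gzip).
# Illustrative only, far from complete.
MAGIC_NUMBERS = (b"\x89PNG\r\n\x1a\n", b"\xff\xd8\xff", b"GIF87a", b"GIF89a",
                 b"%PDF-", b"PK\x03\x04", b"\x7fELF", b"\x1f\x8b")


def looks_like_utf16(data):
    """True if the bytes follow the 00 xx 00 xx (or xx 00 xx 00) pattern."""
    if len(data) < 4:
        return False
    even, odd = data[0::2], data[1::2]
    even_zeros, odd_zeros = even.count(b"\0"), odd.count(b"\0")
    # One half is (almost) all zeros while the other half has none.
    return ((even_zeros >= 0.9 * len(even) and odd_zeros == 0) or
            (odd_zeros >= 0.9 * len(odd) and even_zeros == 0))


def guess_is_text(data):
    """Guess whether a block of bytes from the start of a file is text."""
    if not data:
        return True  # assumption: empty input counts as text
    if data.startswith(MAGIC_NUMBERS):
        return False
    if data.startswith(BOMS):
        return True
    if looks_like_utf16(data):
        return True
    # A zero byte is unlikely in single-byte-encoded text.
    return b"\0" not in data


def is_text_file(path, blocksize=512):
    """Apply guess_is_text to the first block of a file."""
    with open(path, "rb") as f:
        return guess_is_text(f.read(blocksize))
```

Like any version of this approach, it can be fooled in both directions, so the result should be read as a guess.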

Answer 3

If your script runs on *nix, you could use something like this:

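import subprocess

def is_text(fn):
    # -b (brief) leaves the file name out of the output, so a name
    # that happens to contain "text" cannot cause a false match
    result = subprocess.run(["file", "-b", fn], capture_output=True, text=True)
    return "text" in result.stdout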

Answer 4

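You can use python-magic, which is built on libmagic, to guess a file's MIME type. If you get back something in the "text/*" namespace, it is probably a text file, while anything else is probably a binary file.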
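
A minimal sketch of that idea, assuming the python-magic package (the one that provides `magic.from_file`) and the libmagic library are installed. The helper names `is_text_mime` and `is_text_file` are made up for this example.

```python
def is_text_mime(mime_type):
    """True if a MIME type string falls in the text/* namespace."""
    return mime_type.startswith("text/")


def is_text_file(path):
    """Guess whether a file is text from its libmagic MIME type."""
    # Imported here so is_text_mime stays usable without python-magic.
    import magic  # python-magic, a wrapper around libmagic
    return is_text_mime(magic.from_file(path, mime=True))
```

Keep in mind that some formats most people would call text, JSON for instance, may be reported under application/* rather than text/*, so you may want to extend the check for your own files.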

How can I identify binary and text files using Python?
