Question

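I'm trying to process a large collection of txt files, which are themselves containers for the actual files I want to process. The txt files use SGML tags to mark the boundaries of the individual files. Sometimes a contained file is a binary that has been uuencoded. I have already solved the problem of decoding the uuencoded files, but on reflection I've concluded that my solution isn't general enough. Specifically, I have been using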

if '\nbegin 644 ' in document['document']:

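to test whether a file is uuencoded. I did some searching and came away with a vague understanding of what the 644 means (file permissions), and then found other examples of what a uuencoded file might have: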

if '\nbegin 642 ' in document['document']:

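and even some other variants. So my question is: how do I make sure I capture/identify every sub-container that holds a uuencoded file?

One solution is to test every sub-container: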

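import codecs

uudecode = codecs.getdecoder("uu")

for document in documents:
    try:
        # The decoder returns (decoded_data, length_consumed) and raises
        # ValueError when the input is not valid uuencoded data.
        decoded_document, m = uudecode(document)
    except ValueError:
        decoded_document = ''
    if len(decoded_document) == 0:
        pass  # more stuff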

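This isn't terrible, since CPU cycles are cheap, but I will be processing roughly 8 million documents.

So, is there a more robust way to identify whether a particular string is the result of uuencoding?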

Answer 1

Wikipedia says that every uuencoded file begins with this line:

begin <perm> <name>

So a line matching the regular expression ^begin [0-7]{3} (.*)$ is probably a reliable enough indicator of the start.
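A minimal sketch of that check in Python, assuming each sub-container is available as a str. The names UU_BEGIN and find_uu_header are made up for illustration. Note that a prose line that happens to start with "begin 644 " would also match, so treat this as a cheap filter and not as proof.

```python
import re

# Header line of a uuencoded block: "begin <perm> <name>", where <perm>
# is three octal digits (644, 642, ...). MULTILINE lets ^ and $ match at
# every line, so the header may sit anywhere inside the container text.
UU_BEGIN = re.compile(r"^begin ([0-7]{3}) (.*)$", re.MULTILINE)


def find_uu_header(text):
    """Return (perm, name) for the first uuencode header in text, or None."""
    match = UU_BEGIN.search(text)
    if match is None:
        return None
    # Drop a trailing carriage return in case the text uses CRLF line ends.
    return match.group(1), match.group(2).rstrip("\r")
```

Documents for which find_uu_header returns None can be skipped without attempting a decode, which leaves the full decode only for likely candidates.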

Answer 2

Two ways:

(1) On a Unix-based system, you can reliably use the file command.

http://unixhelp.ed.ac.uk/CGI/man-cgi?file

$ file foo
foo: uuencoded or xxencoded text
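If the documents are already in memory, the same command can be driven from Python. This is only a sketch: the function name file_says_uuencoded is hypothetical, it assumes the file command is on PATH, and it assumes the description contains the word "uuencoded" as in the output above. As far as I know, file looks at the start of its input, so you would likely need to pass the sub-document beginning at its begin line, not the whole container.

```python
import shutil
import subprocess


def file_says_uuencoded(data):
    """Ask file(1) about data (bytes).

    Returns True or False, or None when file(1) is unavailable or fails.
    """
    if shutil.which("file") is None:
        return None
    # "-" makes file read standard input; -b drops the "name:" prefix.
    result = subprocess.run(
        ["file", "-b", "-"], input=data, capture_output=True, check=False
    )
    if result.returncode != 0:
        return None
    return b"uuencoded" in result.stdout
```

Starting one process per document is expensive, so for millions of documents this is probably better suited to spot checks than to the main loop.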

(2) I also found the following (untested) Python code, which looks like it would do what you want (http://ubuntuforums.org/archive/index.php/t-1304548.html).

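#!/usr/bin/env python
import magic
import sys

filename = sys.argv[1]
ms = magic.open(magic.MAGIC_NONE)  # libmagic handle with default flags
ms.load()                          # load the default magic database
ftype = ms.file(filename)          # textual description of the file type
print(ftype)
ms.close()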

Trying to determine whether a file has been encoded
