Question

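Problem: I have extracted a content block from a WARC file. I am writing a filter that checks the mimetype of this content block and then saves the content to a file. In particular, I am only interested in the application/pdf type. The first few lines of the content look like this: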

HTTP/1.1 200 OK^M
Date: Wed, 26 Jun 2013 02:18:37 GMT^M
Server: Apache^M
Last-Modified: Thu, 02 Dec 2010 22:54:07 GMT^M
ETag: "9002f-41fc8-4c94c1c0"^M
Accept-Ranges: bytes^M
Content-Length: 270280^M
Connection: close^M
Content-Type: application/pdf^M
^M
%PDF-1.4
%ÐÔÅØ
1 0 obj
<< /S /GoTo /D [2 0 R  /Fit ] >>
endobj
7 0 obj <<
/Length 297
/Filter /FlateDecode
>>
stream

What I tried: I tried the following approaches, but all of them failed (assume the variable content holds the block above):

(1) StringIO + mimetypes

from StringIO import StringIO
import mimetypes
iocontent = StringIO(content)
print mimetypes.guess_type(iocontent)

It just prints (None, None).

(2) The magic package

import magic
print magic.from_buffer(content)

It prints `ASCII text, with CRLF, LF line terminators`.

(3) subprocess.Popen()

from subprocess import Popen, PIPE,STDOUT

p = Popen('file --mime-type', stdout=PIPE, stdin=PIPE, stderr=STDOUT)
cmd_out = p.communicate(input=content)[0]

The output is an error message:

Traceback (most recent call last):
  File "warc_extract_pdf.py", line 123, in <module>
    run()
  File "warc_extract_pdf.py", line 102, in run
    sys.exit(main(argvs))
  File "warc_extract_pdf.py", line 35, in main
    if extract_pdf(offset,record,outdir,outlog): 
  File "warc_extract_pdf.py", line 61, in extract_pdf
    if not mimetype(record,'application/pdf'): return False
  File "warc_extract_pdf.py", line 75, in mimetype
    p = Popen('file --mime-type', stdin=PIPE, stdout=PIPE, stderr=STDOUT)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory

Any help would be appreciated!

Answer 1

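warc is a Python library for parsing WARC files and getting information out of them. The file is just text until you parse it as an HTTP request. Going by their examples, your use case would look something like this: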

import warc
f = warc.open("test.warc")
for record in f:
    print record.get("Content-Type","text/html")
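If the warc library is not an option, the same idea (treat the block as an HTTP message and read its Content-Type header) can be sketched with the standard library alone. This is a rough Python 3 sketch, not part of the warc API: the function names `split_http_response` and `is_pdf_response` are made up for illustration, and it assumes the block is a single HTTP response whose headers end at the first blank CRLF line, as in the sample from the question.

```python
from email.parser import BytesParser


def split_http_response(block):
    """Split a raw HTTP response (bytes) into (status_line, headers, body)."""
    # Headers and body are separated by the first empty CRLF-terminated line.
    head, _, body = block.partition(b"\r\n\r\n")
    # The first line of the head is the status line, the rest are headers.
    status_line, _, header_bytes = head.partition(b"\r\n")
    headers = BytesParser().parsebytes(header_bytes, headersonly=True)
    return status_line.decode("latin-1"), headers, body


def is_pdf_response(block):
    """True if the server declared application/pdf and the body looks like a PDF."""
    _, headers, body = split_http_response(block)
    declared = headers.get_content_type()  # lowercased, parameters stripped
    # Assumption: we also require the PDF signature, since servers can mislabel.
    return declared == "application/pdf" and body.startswith(b"%PDF-")


# A shortened stand-in for the block shown in the question.
sample = (
    b"HTTP/1.1 200 OK\r\n"
    b"Content-Length: 270280\r\n"
    b"Content-Type: application/pdf\r\n"
    b"\r\n"
    b"%PDF-1.4\n1 0 obj\n"
)
print(is_pdf_response(sample))
```

Once the headers are split off, the `body` part is what would be written to the output file, and it is also the part that a content sniffer should be given instead of the whole block.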

Answer 2

This is an old question, but I figured I could still answer it.

python-magic will work here. Just use .from_buffer(buffer, mime=True):

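import magic
import StringIO

# Read the file into an in-memory buffer (binary mode, since a PDF is not text)
msg_part_io_str = StringIO.StringIO()
with open('./Downloads/test123123.pdf', 'rb') as f:
    msg_part_io_str.write(f.read())

# mime=True returns the MIME type instead of a textual description
d = magic.from_buffer(msg_part_io_str.getvalue(), mime=True)

print d  # prints: application/pdf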

https://github.com/ahupp/python-magic#usage
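As a side note on attempt (3) from the question: the `file` command-line tool uses the same libmagic database, and the `OSError: [Errno 2]` there comes from passing the whole command as one string without `shell=True`, so Popen looks for a program literally named `file --mime-type`. A minimal Python 3 sketch of that approach follows. The helper name `mime_type_via_file` is invented for this example, and it assumes a `file` binary is on the PATH. The trailing `-` tells `file` to read from standard input.

```python
import shutil
import subprocess


def mime_type_via_file(data):
    """Return the MIME type that file(1) reports for a bytes buffer."""
    # Pass the command as a list so each argument is handed over separately;
    # a single string without shell=True is taken as the program name.
    result = subprocess.run(
        ["file", "--brief", "--mime-type", "-"],
        input=data,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        check=True,
    )
    return result.stdout.decode("ascii", "replace").strip()


# Only try it when the file binary is actually installed.
if shutil.which("file"):
    print(mime_type_via_file(b"%PDF-1.4\n1 0 obj\n<< >>\nendobj\n"))
```

As with magic.from_buffer, the buffer handed over should be the body only. If the HTTP headers are still in front of it, the result describes the header text rather than the PDF.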

Checking the mimetype of stored data with Python
