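Question: I have extracted a content block from a WARC file. I am writing a filter that checks the MIME type of this block and then saves the content to a file. In particular, I am only interested in the application/pdf type. The first few lines of the content look like this: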
HTTP/1.1 200 OK^M
Date: Wed, 26 Jun 2013 02:18:37 GMT^M
Server: Apache^M
Last-Modified: Thu, 02 Dec 2010 22:54:07 GMT^M
ETag: "9002f-41fc8-4c94c1c0"^M
Accept-Ranges: bytes^M
Content-Length: 270280^M
Connection: close^M
Content-Type: application/pdf^M
^M
%PDF-1.4
%ÐÔÅØ
1 0 obj
<< /S /GoTo /D [2 0 R /Fit ] >>
endobj
7 0 obj <<
/Length 297
/Filter /FlateDecode
>>
stream
What I tried:
I tried the following approaches, but all of them failed (assume the content variable holds the block above):
(1) StringIO + mimetypes
from StringIO import StringIO
import mimetypes
iocontent = StringIO(content)
print mimetypes.guess_type(iocontent)
This just prints (None, None).
(2) The magic package
import magic
print magic.from_buffer(content)
It prints `ASCII text, with CRLF, LF line terminators`.
(3) subprocess.Popen()
from subprocess import Popen, PIPE,STDOUT
p = Popen('file --mime-type', stdout=PIPE, stdin=PIPE, stderr=STDOUT)
cmd_out = p.communicate(input=content)[0]
The output is an error message:
Traceback (most recent call last):
  File "warc_extract_pdf.py", line 123, in <module>
    run()
  File "warc_extract_pdf.py", line 102, in run
    sys.exit(main(argvs))
  File "warc_extract_pdf.py", line 35, in main
    if extract_pdf(offset,record,outdir,outlog):
  File "warc_extract_pdf.py", line 61, in extract_pdf
    if not mimetype(record,'application/pdf'): return False
  File "warc_extract_pdf.py", line 75, in mimetype
    p = Popen('file --mime-type', stdin=PIPE, stdout=PIPE, stderr=STDOUT)
  File "/usr/lib64/python2.6/subprocess.py", line 642, in __init__
    errread, errwrite)
  File "/usr/lib64/python2.6/subprocess.py", line 1234, in _execute_child
    raise child_exception
OSError: [Errno 2] No such file or directory
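Please help!
Answer 0 (score: 0)
warc is a Python library for parsing WARC files and extracting information from them. The file is just text until you parse it as an HTTP request. Based on their examples, your use case would look something like this: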
import warc
f = warc.open("test.warc")
for record in f:
    print record.get("Content-Type","text/html")
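If installing warc is not an option, the header parsing this answer refers to can be approximated with the standard library. The following is only a minimal sketch, assuming Python 3 and assuming content is a bytes object holding the raw HTTP response exactly as shown in the question (status line, CRLF-terminated headers, blank line, body). The helper name http_content_type is hypothetical and is not part of warc or any other library.

```python
from email import message_from_bytes


def http_content_type(block):
    """Return the Content-Type declared in the HTTP headers of `block`.

    `block` is assumed to be a raw HTTP response as bytes.
    Returns None when no HTTP header section or no Content-Type is found.
    """
    # Headers end at the first blank line (CRLF CRLF).
    head, sep, _body = block.partition(b"\r\n\r\n")
    if not sep:
        return None

    # The first line is the status line, e.g. b"HTTP/1.1 200 OK".
    status_line, _, raw_headers = head.partition(b"\r\n")
    if not status_line.startswith(b"HTTP/"):
        return None

    # HTTP headers use the same "Name: value" syntax as email headers,
    # so the stdlib email parser can read them.
    msg = message_from_bytes(raw_headers + b"\r\n")
    if msg["Content-Type"] is None:
        return None
    # get_content_type() lowercases the value and drops parameters
    # such as "; charset=...".
    return msg.get_content_type()


if __name__ == "__main__":
    sample = (
        b"HTTP/1.1 200 OK\r\n"
        b"Server: Apache\r\n"
        b"Content-Type: application/pdf\r\n"
        b"\r\n"
        b"%PDF-1.4\n"
    )
    print(http_content_type(sample))
```

Note that this reads what the server declared, not what the bytes actually are, so a mislabeled response would still pass the filter.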
Answer 1 (score: 0)
This is an old question, but I figured I could still answer it.
python-magic will work here. Just use .from_buffer(buffer, mime=True):
import magic
import StringIO
msg_part_io_str = StringIO.StringIO()
# Open in binary mode so the PDF bytes are read unmodified.
with open('./Downloads/test123123.pdf', 'rb') as f:
    msg_part_io_str.write(f.read())
d = magic.from_buffer(msg_part_io_str.getvalue(), mime=True)
print d
application/pdf
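One difference from the question is worth spelling out: the example above feeds magic a bare PDF file, whereas the question's content block starts with HTTP headers, which is presumably why libmagic reported ASCII text there. The sketch below separates the headers from the body before any sniffing. It assumes Python 3 and a bytes input, uses only the standard library, and the names split_http_response and looks_like_pdf are hypothetical helpers. The signature check is a deliberately strict stand-in for magic, not a replacement for it.

```python
PDF_SIGNATURE = b"%PDF-"


def split_http_response(block):
    """Split a raw HTTP response into (header_bytes, body_bytes)."""
    # The header section ends at the first blank line (CRLF CRLF).
    head, sep, body = block.partition(b"\r\n\r\n")
    if not sep:
        raise ValueError("no blank line separating headers from body")
    return head, body


def looks_like_pdf(body):
    """Return True if `body` begins with the PDF file signature.

    This is a strict check: it only accepts data whose very first
    bytes are "%PDF-", as in the question's sample.
    """
    return body.startswith(PDF_SIGNATURE)


if __name__ == "__main__":
    block = (
        b"HTTP/1.1 200 OK\r\n"
        b"Content-Type: application/pdf\r\n"
        b"\r\n"
        b"%PDF-1.4\n1 0 obj\n"
    )
    _headers, body = split_http_response(block)
    print(looks_like_pdf(body))
```

With the body isolated, magic.from_buffer(body, mime=True) receives data that begins with %PDF- rather than HTTP/1.1. Separately, the OSError in attempt (3) of the question comes from passing the whole command as one string without shell=True, so Popen looks for an executable literally named 'file --mime-type'. Passing an argument list such as ['file', '--mime-type', '-b', '-'] avoids that, with the trailing '-' telling file to read from stdin.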