Question

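I want to extract all the images contained in a PDF file. I tried several libraries, such as fitz, pdfminer, and minecart, but none of them gave satisfactory results.

Using the pdfminer code from https://denis.papathanasiou.org/archive/2010.08.04.post.pdf, when I print bytes_as_hex it returns "7a7e656a", which does not appear in the list of file signatures (https://en.wikipedia.org/wiki/List_of_file_signatures). How do I go on from here to determine the correct extension for this image and save it?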

from binascii import b2a_hex

def determine_image_type(stream_first_4_bytes):
    """Find out the image file type based on the magic number comparison of the first 4 (or 2) bytes"""
    file_type = None
    # Hex-encode the leading bytes so they can be compared with known signatures
    bytes_as_hex = b2a_hex(stream_first_4_bytes)
    bytes_as_hex = bytes_as_hex.decode('utf-8')
    print(bytes_as_hex)   # output: 7a7e656a
    if bytes_as_hex.startswith('ffd8'):
        file_type = '.jpeg'
    elif bytes_as_hex == '89504e47':
        file_type = '.png'
    elif bytes_as_hex == '47494638':
        file_type = '.gif'
    elif bytes_as_hex.startswith('424d'):
        file_type = '.bmp'

    # None means no known signature matched
    return file_type

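The online tool PDF Candy (https://pdfcandy.com/pdf-ocr.html) is able to extract all the images from the input file, and I would like to get similar output. Reference file containing 4 images: https://drive.google.com/file/d/1A6v-FJXW_ujEBCvY1HTa1TodGZKy5QAo/view?usp=sharing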

Answer 1

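The following command gave the expected output: an XML file containing the coordinates of the images, plus the images themselves saved to the local system. (The default zoom factor is 1.5, so I explicitly passed 1 so that no scaling is applied.)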

pdftohtml -xml -zoom 1 file.pdf
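
As a rough sketch of how the XML output could be read from Python: the snippet below assumes that the XML contains page elements with image children carrying top, left, width, height and src attributes. That matches what Poppler's pdftohtml typically emits, but the answer does not confirm it. The function name list_images and the sample XML are hypothetical.

```python
import xml.etree.ElementTree as ET

def list_images(xml_text):
    """Return (page, src, left, top, width, height) for every <image> element."""
    root = ET.fromstring(xml_text)
    images = []
    for page in root.iter("page"):
        page_no = int(page.get("number", "0"))
        for img in page.iter("image"):
            images.append((
                page_no,
                img.get("src"),
                float(img.get("left")),
                float(img.get("top")),
                float(img.get("width")),
                float(img.get("height")),
            ))
    return images

# Hypothetical sample that mimics the assumed output layout
SAMPLE_XML = """<pdf2xml>
<page number="1" position="absolute" top="0" left="0" height="1188" width="918">
<image top="100" left="50" width="55" height="74" src="1-1_1.png"/>
</page>
</pdf2xml>"""

for entry in list_images(SAMPLE_XML):
    print(entry)
```

With a real document, the text of the XML file written by the command above would be passed to list_images in place of SAMPLE_XML.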

Answer 2

"7a7e656a ..." is not a magic number; these bytes are the actual pixel values of one of the images.
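
A quick check is consistent with this. The sketch below reads the four bytes as ASCII and compares them with the four signatures from the question's code. The helper name match_signature is hypothetical, and the table covers only those four formats.

```python
# Signatures taken from the question's determine_image_type function
KNOWN_PREFIXES = {
    "ffd8": ".jpeg",
    "89504e47": ".png",
    "47494638": ".gif",
    "424d": ".bmp",
}

def match_signature(hex_string):
    """Return the extension whose signature prefixes hex_string, or None."""
    for prefix, extension in KNOWN_PREFIXES.items():
        if hex_string.startswith(prefix):
            return extension
    return None

first_four = bytes.fromhex("7a7e656a")
print(first_four)                         # b'z~ej'
print(match_signature(first_four.hex()))  # None
```

The bytes decode to the characters z~ej, which are the same characters that open the pixel data in the PDF stream.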

To see this for yourself:

Extract the images with pdftohtml:

pdftohtml -zoom 1 -xml 1.pdf
This produces four files:

1-1_1.png 1-1_2.jpg 1-2_1.png 1-2_2.jpg
Convert the PNG to PNM (saved here with a .pbm extension):

pngtopnm 1-1_1.png > 111.pbm
Inspect the result:

od -h 111.pbm | head

0000000 3650 350a 2035 3437 320a 3535 7a0a 657e
0000020 6e6a 6855 556c 6663 6c53 5c6f 6967 655b
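
The dump is easier to read once the 16-bit words are turned back into bytes. The sketch below assumes that od printed little-endian words and that the split tokens in the dump are really the single words 7a0a and 6e6a. The helper name od_words_to_bytes is hypothetical.

```python
def od_words_to_bytes(dump_line):
    """Convert one `od -h` line (offset followed by 16-bit hex words) to bytes."""
    words = dump_line.split()[1:]  # drop the leading offset column
    # Each word is a 2-byte value printed in host order; little-endian is assumed
    return b"".join(int(word, 16).to_bytes(2, "little") for word in words)

line1 = "0000000 3650 350a 2035 3437 320a 3535 7a0a 657e"
line2 = "0000020 6e6a 6855 556c 6663 6c53 5c6f 6967 655b"

data = od_words_to_bytes(line1) + od_words_to_bytes(line2)
print(data)  # b'P6\n55 74\n255\nz~ejnUhlUcfSlo\\gi[e'
```

The file starts with a binary PPM header (P6, width 55, height 74, maximum value 255), and the pixel data that follows begins with z~ej, that is 7a 7e 65 6a.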

If you look at the decompressed FlateDecode stream in the original PDF, you can see the same bytes there. Look for:

/W 55
/H 74
/BPC 8
/CS /RGB
ID
z~ejnUhlUc......
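
Since the stream holds bare RGB samples rather than an encoded file, one way to save it is to prepend a header yourself. This is a minimal sketch, assuming 8 bits per component and three components per pixel as in the dictionary above. The helper raw_rgb_to_ppm is hypothetical, and the pixel data is a placeholder, not the real image.

```python
def raw_rgb_to_ppm(width, height, samples):
    """Wrap raw 8-bit RGB samples in a binary PPM (P6) container."""
    expected = width * height * 3  # three bytes per pixel
    if len(samples) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(samples)}")
    header = f"P6\n{width} {height}\n255\n".encode("ascii")
    return header + samples

# Placeholder pixels: a 55x74 image filled with the single colour 7a 7e 65
pixels = b"z~e" * (55 * 74)
ppm = raw_rgb_to_ppm(55, 74, pixels)
print(ppm[:16])
```

Writing the returned bytes to a file with a .ppm extension gives something most image viewers and converters can open.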

I can't say how pdftohtml is able to recognize it and convert it to PNG.
