我有一个从扫描软件生成的pdf。 pdf每页有1个TIFF图像。我想从每个页面中提取TIFF图像。
我正在使用iTextSharp并且我已成功找到图像并可以从PdfReader.GetStreamBytesRaw
方法获取原始字节。问题是,正如我之前发现的那样,iTextSharp不包含PdfReader.CCITTFaxDecode
方法。
我还知道什么?即使没有iTextSharp,我也可以在记事本中打开pdf并找到/Filter /CCITTFaxDecode
的流,我从/DecodeParams
知道它正在使用CCITTFaxDecode组4。
有没有人知道如何从我的pdf中获取CCITTFaxDecode过滤图像?
干杯, 卡胡
答案 0 :(得分:15)
实际上,vbcrlfuser的回答对我有帮助,但是对于当前版本的BitMiracle.LibTiff.NET,代码并不完全正确,因为我可以下载它。在当前版本中,等效代码如下所示:
using iTextSharp.text.pdf;
using BitMiracle.LibTiff.Classic;
...
Tiff tiff = Tiff.Open("C:\\test.tif", "w");
tiff.SetField(TiffTag.IMAGEWIDTH, UInt32.Parse(pd.Get(PdfName.WIDTH).ToString()));
tiff.SetField(TiffTag.IMAGELENGTH, UInt32.Parse(pd.Get(PdfName.HEIGHT).ToString()));
tiff.SetField(TiffTag.COMPRESSION, Compression.CCITTFAX4);
tiff.SetField(TiffTag.BITSPERSAMPLE, UInt32.Parse(pd.Get(PdfName.BITSPERCOMPONENT).ToString()));
tiff.SetField(TiffTag.SAMPLESPERPIXEL, 1);
tiff.WriteRawStrip(0, raw, raw.Length);
tiff.Close();
使用上面的代码,我终于在C:\ test.tif中获得了一个有效的Tiff文件。谢谢,vbcrlfuser!
答案 1 :(得分:7)
这个库... http://www.bitmiracle.com/libtiff/以及下面这个例子可以让你获得99%的
string filter = pd.Get(PdfName.FILTER).ToString();
string width = pd.Get(PdfName.WIDTH).ToString();
string height = pd.Get(PdfName.HEIGHT).ToString();
string bpp = pd.Get(PdfName.BITSPERCOMPONENT).ToString();
switch (filter)
{
case "/CCITTFaxDecode":
byte[] data = PdfReader.GetStreamBytesRaw((PRStream)pdfStream);
int tiff = TIFFOpen("example.tif", "w");
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEWIDTH,(uint)Int32.Parse(width));
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.IMAGEHEIGHT, (uint)Int32.Parse(height));
TIFFSetField(tiff, (uint)BitMiarcle.LibTiff.Classic.TiffTag.COMPRESSION, (uint)BitMiracle.Libtiff.Classic.Compression.CCITTFAX4);
TIFFSetField(tiff, (uint)BitMiracle.LibTiff.Classic.TiffTag.BITSPERSAMPLE, (uint)Int32.Parse(bpp));
TIFFSetField(tiff, (uint)BitMiarcle.Libtiff.Classic.TiffTag.SAMPLESPERPIXEL,1 );
IntPtr pointer = Marshal.AllocHGlobal(data.length);
Marshal.copy(data, 0, pointer, data.length);
TIFFWriteRawStrip(tiff, 0, pointer, data.length);
TIFFClose(tiff);
break;
break;
}
答案 2 :(得分:6)
这是python实现:
import PyPDF2
import struct
"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""
def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'h'
return struct.pack(tiff_header_struct,
b'II', # Byte order indication: Little indian
42, # Version number (always 42)
8, # Offset to first IFD
8, # Number of tags in IFD
256, 4, 1, width, # ImageWidth, LONG, 1, width
257, 4, 1, height, # ImageLength, LONG, 1, lenght
258, 3, 1, 1, # BitsPerSample, SHORT, 1, 1
259, 3, 1, CCITT_group, # Compression, SHORT, 1, 4 = CCITT Group 4 fax encoding
262, 3, 1, 0, # Threshholding, SHORT, 1, 0 = WhiteIsZero
273, 4, 1, struct.calcsize(tiff_header_struct), # StripOffsets, LONG, 1, len of header
278, 4, 1, height, # RowsPerStrip, LONG, 1, lenght
279, 4, 1, img_size, # StripByteCounts, LONG, 1, size of image
0 # last IFD
)
pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
page = cond_scan_reader.getPage(i)
xObject = page['/Resources']['/XObject'].getObject()
for obj in xObject:
if xObject[obj]['/Subtype'] == '/Image':
"""
The CCITTFaxDecode filter decodes image data that has been encoded using
either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
designed to achieve efficient compression of monochrome (1 bit per pixel) image
data at relatively low resolutions, and so is useful only for bitmap image data, not
for color images, grayscale images, or general data.
K < 0 --- Pure two-dimensional encoding (Group 4)
K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
"""
if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
if xObject[obj]['/DecodeParms']['/K'] == -1:
CCITT_group = 4
else:
CCITT_group = 3
width = xObject[obj]['/Width']
height = xObject[obj]['/Height']
data = xObject[obj]._data # sorry, getData() does not work for CCITTFaxDecode
img_size = len(data)
tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
img_name = obj[1:] + '.tiff'
with open(img_name, 'wb') as img_file:
img_file.write(tiff_header + data)
#
# import io
# from PIL import Image
# im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
答案 3 :(得分:-1)
为此我写了extension。这很简单
PdfDictionary item;
if (item.IsImage()) {
Image image = item.ToImage();
}