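I just need the page count property in PHP, using only built-in functions (no framework and no COM). The input is an "old" .doc file.
Here is what I know so far. I found this thread and hope it helps with the problem:
The SummaryInformation looks like this, and it is encoded into the file's binary content:
I found some C files that show a way to extract the data, but I have a hard time understanding them.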
#include <stdlib.h>
#include <stdio.h>
#include "wv_Base.h"
#include "wv_Common.h"
#include "wv.h"
#include "glib.h"
#include "ms-ole.h"
#include "ms-ole-summary.h"
/*
 * This is a simple example that takes an OLE file and prints some
 * information from the SummaryInformation stream
 */
int main(int argc, char *argv[])
{
    gboolean ret = FALSE;   /* set to TRUE when the property is available */
    long l = 0;
    MsOle *ole = NULL;
    MsOleSummary *summary = NULL;
    if (argc < 2)
    {
        fprintf(stderr, "Usage: wvSummary oledocument\n");
        return 1;
    }
    ms_ole_open(&ole, argv[1]);
    if (!ole)
    {
        fprintf(stderr, "sorry problem with getting ole streams from %s\n", argv[1]);
        return 1;
    }
    summary = ms_ole_summary_open(ole);
    if (!summary)
    {
        fprintf(stderr, "Could not open summary stream\n");
        ms_ole_destroy(&ole);
        return 1;
    }
    /* the page count is a long property, so use the long getter from ms-ole-summary.h */
    l = (long) ms_ole_summary_get_long(summary, MS_OLE_SUMMARY_PAGECOUNT, &ret);
    if (ret)
        printf("PageCount is %ld\n", l);
    else
        printf("no pagecount\n");
    ms_ole_summary_close(summary);
    ms_ole_destroy(&ole);
    return 0;
}
About MS_OLE_SUMMARY_TITLE and the other property IDs:
/**
* ms-ole-summary.h: MS Office OLE support
*
* Author:
* Michael Meeks (michael@imaginator.com)
* From work by:
* Caolan McNamara (Caolan.McNamara@ul.ie)
* Built on work by:
* Somar Software's CPPSUM (http://www.somar.com)
*
* Copyright 1998-2000 Helix Code, Inc., Frank Chiulli, and others.
**/
#ifndef MS_OLE_SUMMARY_H
#define MS_OLE_SUMMARY_H
#include <time.h>
#include <libole2/ms-ole.h>
/*
* MS Ole Property Set IDs
* The SummaryInformation stream contains the SummaryInformation property set.
* The DocumentSummaryInformation stream contains both the
* DocumentSummaryInformation and the UserDefined property sets as sections.
*/
typedef enum {
MS_OLE_PS_SUMMARY_INFO,
MS_OLE_PS_DOCUMENT_SUMMARY_INFO,
MS_OLE_PS_USER_DEFINED_SUMMARY_INFO
} MsOlePropertySetID;
typedef struct {
guint8 class_id[16];
GArray * sections;
GArray * items;
GList * write_items;
gboolean read_mode;
MsOleStream * s;
MsOlePropertySetID ps_id;
} MsOleSummary;
/* Could store the FID, but why bother ? */
typedef struct {
guint32 offset;
guint32 props;
guint32 bytes;
MsOlePropertySetID ps_id;
} MsOleSummarySection;
MsOleSummary *ms_ole_summary_open (MsOle *f);
MsOleSummary *ms_ole_docsummary_open (MsOle *f);
MsOleSummary *ms_ole_summary_open_stream (MsOleStream *stream,
const MsOlePropertySetID psid);
MsOleSummary *ms_ole_summary_create (MsOle *f);
MsOleSummary *ms_ole_docsummary_create (MsOle *f);
MsOleSummary *ms_ole_summary_create_stream (MsOleStream *s,
const MsOlePropertySetID psid);
GArray *ms_ole_summary_get_properties (MsOleSummary *si);
void ms_ole_summary_close (MsOleSummary *si);
/*
* Can be used to interrogate a summary item as to its type
*/
typedef enum {
MS_OLE_SUMMARY_TYPE_STRING = 0x10,
MS_OLE_SUMMARY_TYPE_TIME = 0x20,
MS_OLE_SUMMARY_TYPE_LONG = 0x30,
MS_OLE_SUMMARY_TYPE_SHORT = 0x40,
MS_OLE_SUMMARY_TYPE_BOOLEAN = 0x50,
MS_OLE_SUMMARY_TYPE_OTHER = 0x60
} MsOleSummaryType;
#define MS_OLE_SUMMARY_TYPE(x) ((MsOleSummaryType)((x)>>8))
/* FIXME MS_OLE_SUMMARY_THUMBNAIL is Preview, no Security, isn't it? */
/*
* The MS byte specifies the type, the LS byte is the
* 'standard' MS PID.
*/
typedef enum {
/* SummaryInformation Stream Properties */
/* String properties */
MS_OLE_SUMMARY_TITLE = 0x1002,
MS_OLE_SUMMARY_SUBJECT = 0x1003,
MS_OLE_SUMMARY_AUTHOR = 0x1004,
MS_OLE_SUMMARY_KEYWORDS = 0x1005,
MS_OLE_SUMMARY_COMMENTS = 0x1006,
MS_OLE_SUMMARY_TEMPLATE = 0x1007,
MS_OLE_SUMMARY_LASTAUTHOR = 0x1008,
MS_OLE_SUMMARY_REVNUMBER = 0x1009,
MS_OLE_SUMMARY_APPNAME = 0x1012,
/* Time properties */
MS_OLE_SUMMARY_TOTAL_EDITTIME = 0x200A,
MS_OLE_SUMMARY_LASTPRINTED = 0x200B,
MS_OLE_SUMMARY_CREATED = 0x200C,
MS_OLE_SUMMARY_LASTSAVED = 0x200D,
/* Long integer properties */
MS_OLE_SUMMARY_PAGECOUNT = 0x300E,
MS_OLE_SUMMARY_WORDCOUNT = 0x300F,
MS_OLE_SUMMARY_CHARCOUNT = 0x3010,
MS_OLE_SUMMARY_SECURITY = 0x3013,
/* Short integer properties */
MS_OLE_SUMMARY_CODEPAGE = 0x4001,
/* Security */
MS_OLE_SUMMARY_THUMBNAIL = 0x6011,
/* DocumentSummaryInformation Properties */
/* String properties */
MS_OLE_SUMMARY_CATEGORY = 0x1002,
MS_OLE_SUMMARY_PRESFORMAT = 0x1003,
MS_OLE_SUMMARY_MANAGER = 0x100E,
MS_OLE_SUMMARY_COMPANY = 0x100F,
/* Long integer properties */
MS_OLE_SUMMARY_BYTECOUNT = 0x3004,
MS_OLE_SUMMARY_LINECOUNT = 0x3005,
MS_OLE_SUMMARY_PARCOUNT = 0x3006,
MS_OLE_SUMMARY_SLIDECOUNT = 0x3007,
MS_OLE_SUMMARY_NOTECOUNT = 0x3008,
MS_OLE_SUMMARY_HIDDENCOUNT = 0x3009,
MS_OLE_SUMMARY_MMCLIPCOUNT = 0X300A,
/* Boolean properties */
MS_OLE_SUMMARY_SCALE = 0x500B,
MS_OLE_SUMMARY_LINKSDIRTY = 0x5010
} MsOleSummaryPID;
/* bit masks for security long integer */
#define MsOleSummaryAllSecurityFlagsEqNone 0x00
#define MsOleSummarySecurityPassworded 0x01
#define MsOleSummarySecurityRORecommended 0x02
#define MsOleSummarySecurityRO 0x04
#define MsOleSummarySecurityLockedForAnnotations 0x08
typedef struct {
GTimeVal time;
GDate date;
} MsOleSummaryTime;
typedef struct {
guint32 len;
guint8 *data;
} MsOleSummaryPreview;
gchar * ms_ole_summary_get_string (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
gboolean ms_ole_summary_get_boolean (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
guint16 ms_ole_summary_get_short (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
guint32 ms_ole_summary_get_long (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
GTimeVal ms_ole_summary_get_time (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
MsOleSummaryPreview ms_ole_summary_get_preview (MsOleSummary *si,
MsOleSummaryPID id,
gboolean *available);
void ms_ole_summary_preview_destroy (MsOleSummaryPreview d);
/* FIXME The next comment isn't true, is it?
Return TRUE if write is successful */
void ms_ole_summary_set_string (MsOleSummary *si,
MsOleSummaryPID id,
const gchar *str);
void ms_ole_summary_set_boolean (MsOleSummary *si,
MsOleSummaryPID id,
gboolean value);
void ms_ole_summary_set_short (MsOleSummary *si,
MsOleSummaryPID id,
guint16 i);
void ms_ole_summary_set_long (MsOleSummary *si,
MsOleSummaryPID id,
guint32 i);
void ms_ole_summary_set_time (MsOleSummary *si,
MsOleSummaryPID id,
GTimeVal time);
void ms_ole_summary_set_preview (MsOleSummary *si,
MsOleSummaryPID id,
const
MsOleSummaryPreview *
preview);
#endif /* MS_OLE_SUMMARY_H */
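The header comment says the most significant byte of each MsOleSummaryPID value is the type and the least significant byte is the standard MS property ID. A minimal sketch of that split follows. The helper names `summary_type` and `summary_pid` are hypothetical and not part of libole2; the type half mirrors the header's own `MS_OLE_SUMMARY_TYPE` macro.

```c
#include <stdint.h>
#include <stdio.h>

/* High byte of the combined value: the libole2 type tag (0x30 means "long"). */
unsigned summary_type(uint32_t id)
{
    return (id >> 8) & 0xFFu;
}

/* Low byte of the combined value: the property ID as stored in the file. */
unsigned summary_pid(uint32_t id)
{
    return id & 0xFFu;
}

/* Prints both halves of a combined ID, e.g. for MS_OLE_SUMMARY_PAGECOUNT (0x300E). */
void describe_id(const char *name, uint32_t id)
{
    printf("%s: type=0x%02X pid=0x%02X\n", name, summary_type(id), summary_pid(id));
}
```

For MS_OLE_SUMMARY_PAGECOUNT (0x300E) this gives type 0x30 (MS_OLE_SUMMARY_TYPE_LONG) and property ID 0x0E, which is the number to look for in the SummaryInformation stream itself.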
The MsOle structure:
/**
* Structure describing an OLE file
**/
struct _MsOle {
int ref_count;
gboolean ole_mmap;
guint8 *mem;
guint32 length;
MsOleSysWrappers *syswrap;
char mode;
int file_des;
int dirty;
GArray *bb; /* Big blocks status */
GArray *sb; /* Small block status */
GArray *sbf; /* The small block file */
guint32 num_pps; /* Count of number of property sets */
GList *pps; /* Property Storage -> struct _PPS, always 1 valid entry or NULL */
/* if memory mapped */
GPtrArray *bbattr; /* Pointers to block structures */
/* end if memory mapped */
};
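The `bb` and `sb` arrays in this structure track big and small blocks. In the compound file format those blocks are linked into chains through a FAT, where each entry names the next sector of a stream. A minimal sketch of walking such a chain follows, assuming the FAT has already been loaded into an array of 32-bit entries. The function name `fat_chain_length` is hypothetical, and this is not how libole2 itself is written.

```c
#include <stddef.h>
#include <stdint.h>

/* End-of-chain marker used in compound file FAT entries. */
#define CFB_END_OF_CHAIN 0xFFFFFFFEu

/*
 * Follows a sector chain starting at `start` and returns the number of
 * sectors in it, or -1 if the chain leaves the table or loops forever.
 */
long fat_chain_length(const uint32_t *fat, size_t fat_entries, uint32_t start)
{
    long count = 0;
    uint32_t sector = start;

    while (sector != CFB_END_OF_CHAIN) {
        /* An index outside the table, or more steps than entries, means a corrupt chain. */
        if (sector >= fat_entries || (size_t)count >= fat_entries)
            return -1;
        sector = fat[sector];   /* each entry holds the index of the next sector */
        count++;
    }
    return count;
}
```

Reading a stream then amounts to visiting the sectors in this order and concatenating their contents.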
Other resources:
http://slackware.mirrors.pair.com/slackware-8.1/source/gnome/libole2/libole2-0.2.4.tar.bz2
ftp://ftp.ca.com/caproducts/Opal/jasmine064/framework/include/
Reference: http://wvware.sourceforge.net/libole2/libole2.html
I have tried it this way, but I did not find the page count:
echo("<pre>");
$file = "files/doctest.doc";
if (!is_file($file)) die("File not found.");
// open the file as a binary stream
$handle = fopen($file, "rb");
// read the whole file into a string
$content = fread($handle, filesize($file));
fclose($handle);
$length = strlen($content);
for ($i = 0; $i < $length; $i++) {
    // current byte as a one-character string
    $char = $content[$i];
    // byte value 0-255 (2^8)
    $decimal = ord($char);
    echo($char);
    // decimal value followed by the same value in base 2
    echo sprintf(" %3d %08b", $decimal, $decimal);
    if ($i % 4 == 0) echo("*");
    // 32-bit unsigned integer starting at this byte; OLE files are little-endian,
    // and the bounds check avoids reading past the end of the string
    if ($i + 4 <= $length) {
        $int32 = unpack("V", substr($content, $i, 4))[1];
        echo sprintf("<br><b>%d</b>", $int32);
    }
    echo("<br>");
}
Thanks for your help!
Answer 0 (score: 1)
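A Word document is a file with a very complex format. The document lives in a stream contained in a Windows Compound Binary File.
Working with the specification requires knowing binary (the little-endian stuff), FAT (because a FAT is used inside the format), and all sorts of other things.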
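To give a feel for the little-endian part, here is a minimal sketch that validates the 512-byte compound file header and reads a few fields from it. The offsets follow the published compound file layout as I understand it. The `CfbHeader` struct and the function names are hypothetical, and this covers only the first step of a real parser.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Reads a 16-bit little-endian value. */
static uint16_t le16(const uint8_t *p)
{
    return (uint16_t)(p[0] | (p[1] << 8));
}

/* Reads a 32-bit little-endian value. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Hypothetical container for the few header fields this sketch extracts. */
typedef struct {
    uint32_t sector_size;       /* usually 512 */
    uint32_t mini_sector_size;  /* usually 64 */
    uint32_t first_dir_sector;  /* where the directory (stream list) starts */
} CfbHeader;

/* Returns 0 on success, -1 if the buffer does not look like a compound file. */
int cfb_parse_header(const uint8_t *buf, size_t len, CfbHeader *out)
{
    static const uint8_t magic[8] = {0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1};
    uint16_t shift, mini_shift;

    if (len < 512 || memcmp(buf, magic, sizeof magic) != 0)
        return -1;
    if (le16(buf + 28) != 0xFFFE)       /* byte-order mark: always little-endian */
        return -1;

    shift = le16(buf + 30);             /* sector size is stored as a power of two */
    mini_shift = le16(buf + 32);
    if (shift > 16 || mini_shift > shift)
        return -1;

    out->sector_size = 1u << shift;
    out->mini_sector_size = 1u << mini_shift;
    out->first_dir_sector = le32(buf + 48);
    return 0;
}
```

Everything after this header (the FAT, the directory, the mini stream) has to be resolved before the SummaryInformation stream can even be located, which is where most of the effort goes.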
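Without using COM
I assume you are not on Windows (or you would already be using COM/OLE), so here is a program that can read and manipulate Windows CDF files. It is not a framework but a standalone program that you can call with PHP's built-in system() function, for example system("cdfprogram file.doc").
Another program that reads Word files is here. You install it and call it with system() or any of its equivalent siblings.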
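Why are the Microsoft Office file formats so complicated?
The reasons are given in the following reference:
Reference: http://www.joelonsoftware.com/items/2008/02/19.html
Conclusion
There is no simple way to get the page count from a Word file using only PHP's built-in functions. You would have to read all of Microsoft's specifications and build the parser yourself. That is a project in its own right, and I do not think anyone will do it for you for free.
Why has nobody tried? Probably because nobody wants to invest that much time when libraries and frameworks already do the job. That is my opinion.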
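To show the scale of just the last step, here is a minimal sketch of looking up a 32-bit integer property inside a SummaryInformation stream. It assumes the stream bytes have already been extracted from the compound file (the hard part) and that the first property set in the stream is the one of interest. The function name `summary_get_long` is hypothetical, and the offsets follow the published property set layout as I understand it.

```c
#include <stddef.h>
#include <stdint.h>

#define PIDSI_PAGECOUNT 0x0Eu   /* low byte of MS_OLE_SUMMARY_PAGECOUNT (0x300E) */
#define VT_I4           0x0003u /* 32-bit signed integer property type */

/* Reads a 32-bit little-endian value. */
static uint32_t rd32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 and stores the value if property `pid` exists as a VT_I4, else -1. */
int summary_get_long(const uint8_t *s, size_t len, uint32_t pid, int32_t *value)
{
    uint32_t section, count, i;

    /* 48-byte header: byte order FE FF, version, system id, CLSID, set count, FMTID, offset. */
    if (len < 48 || s[0] != 0xFE || s[1] != 0xFF || rd32(s + 24) < 1)
        return -1;

    section = rd32(s + 44);             /* offset of the first property set */
    if (section > len || len - section < 8)
        return -1;
    count = rd32(s + section + 4);      /* number of properties in the set */

    for (i = 0; i < count; i++) {
        size_t entry = (size_t)section + 8 + (size_t)i * 8;  /* (id, offset) pair */
        size_t at;

        if (entry + 8 > len)
            return -1;
        if (rd32(s + entry) != pid)
            continue;

        at = (size_t)section + rd32(s + entry + 4);          /* offset is section-relative */
        if (at + 8 > len || (rd32(s + at) & 0xFFFFu) != VT_I4)
            return -1;
        *value = (int32_t)rd32(s + at + 4);
        return 0;
    }
    return -1;
}
```

Calling it with `PIDSI_PAGECOUNT` would return the page count as stored in the file's metadata. The same offsets carry over to PHP, where `unpack("V", ...)` reads a little-endian 32-bit value, but the code that locates the stream in the first place is considerably longer.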
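Suggestion
How about creating a web service that runs on a Windows machine, where you have access to the COM library? Your main application posts the Word file to that Windows web service, and the web service returns the page count to your main application. With COM, this is easy.
You can do this asynchronously, so that uploads are not slowed down while waiting for the web service to reply, and an upload can sit in a "pending validation" state in the meantime.
If you use a web service, it does not have to be on the same server as the PHP application itself.
The web service would do something like this: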
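<?php
// requires Windows with MS Word installed and the PHP COM extension enabled
$word = new COM("word.application");
if (!$word) {
    echo ("Could not initialise MS Word object.\n");
    exit(1);
}
$word->Documents->Open(realpath("C:\\Test\\t.doc"));
// 14 is wdPropertyPages in Word's WdBuiltInProperty enumeration
$pages = $word->ActiveDocument->BuiltInDocumentProperties(14);
echo "Number of pages: " . $pages->value;
// close without saving changes, then shut Word down
$word->ActiveDocument->Close(false);
$word->Quit();
$word = null;
unset($word);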