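How can I get the summary information from an MS Word (.doc) file?

Asked: 2012-03-14 14:12:55

Tags: php c

I only need the page count property, in PHP, using only the built-in functions (no frameworks and no COM). The input is an "old" .doc file.

Here is what I have found out about this topic so far; I hope it helps you to help me with this problem:

The SummaryInformation stream looks like this as it is encoded in the file: [image]

I found some C files that show a way to extract the data, but I have a hard time understanding them.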

#include <stdlib.h>
#include <stdio.h>

#include "wv_Base.h"
#include "wv_Common.h"
#include "wv.h"

#include "glib.h"
#include "ms-ole.h"
#include "ms-ole-summary.h"


/*
 * This is a simple example that take an ole file and prints some
 * information from the summaryinformation stream
 */


int main(int argc, char *argv[])
    {
    char *str = NULL;
    int ret = 0;
    short s = 0;
    long l = 0;

    MsOle *ole = NULL;
    MsOleSummary *summary = NULL;

    if (argc < 2)
        {
        fprintf(stderr, "Usage: wvSummary oledocument\n");
        return(1);
        }

    ms_ole_open(&ole, argv[1]);
    if (!ole)
        {
        fprintf(stderr,"sorry problem with getting ole streams from %s\n",argv[1]);
        return 1;
        }

    summary = ms_ole_summary_open(ole);
    if (!summary)
        {
        fprintf(stderr, "Could not open summary stream\n");
        return 1;
        }

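    /* PageCount is a long integer property (see MS_OLE_SUMMARY_PAGECOUNT
       in ms-ole-summary.h), so it is read with the _get_long accessor */
    l = ms_ole_summary_get_long(summary, MS_OLE_SUMMARY_PAGECOUNT, &ret);

    if (ret)
      printf("PageCount is %ld\n", l);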
    else
      printf("no pagecount\n");


    ms_ole_summary_close(summary);
    ms_ole_destroy(&ole);

    return 0;
    }

About MS_OLE_SUMMARY_TITLE and the other property IDs:

/**
 * ms-ole-summary.h: MS Office OLE support
 *
 * Author:
 *    Michael Meeks (michael@imaginator.com)
 * From work by:
 *    Caolan McNamara (Caolan.McNamara@ul.ie)
 * Built on work by:
 *    Somar Software's CPPSUM (http://www.somar.com)
 *
 * Copyright 1998-2000 Helix Code, Inc., Frank Chiulli, and others.
 **/

#ifndef MS_OLE_SUMMARY_H
#define MS_OLE_SUMMARY_H

#include <time.h>
#include <libole2/ms-ole.h>

/*
 * MS Ole Property Set IDs
 * The SummaryInformation stream contains the SummaryInformation property set.
 * The DocumentSummaryInformation stream contains both the
 * DocumentSummaryInformation and the UserDefined property sets as sections.
 */
typedef enum {
    MS_OLE_PS_SUMMARY_INFO,
    MS_OLE_PS_DOCUMENT_SUMMARY_INFO,
    MS_OLE_PS_USER_DEFINED_SUMMARY_INFO
} MsOlePropertySetID;

typedef struct {
    guint8          class_id[16];
    GArray *        sections;
    GArray *        items;
    GList *         write_items;
    gboolean        read_mode;
    MsOleStream *       s;
    MsOlePropertySetID  ps_id;
} MsOleSummary;

/* Could store the FID, but why bother ? */
typedef struct {
    guint32         offset;
    guint32         props;
    guint32         bytes;
    MsOlePropertySetID  ps_id;
} MsOleSummarySection;

MsOleSummary *ms_ole_summary_open       (MsOle *f);
MsOleSummary *ms_ole_docsummary_open        (MsOle *f);
MsOleSummary *ms_ole_summary_open_stream    (MsOleStream *stream,
                         const MsOlePropertySetID psid);
MsOleSummary *ms_ole_summary_create     (MsOle *f);
MsOleSummary *ms_ole_docsummary_create      (MsOle *f);
MsOleSummary *ms_ole_summary_create_stream  (MsOleStream *s,
                         const MsOlePropertySetID psid);
GArray       *ms_ole_summary_get_properties (MsOleSummary *si);
void          ms_ole_summary_close      (MsOleSummary *si);


/*
 * Can be used to interrogate a summary item as to its type
 */
typedef enum {
    MS_OLE_SUMMARY_TYPE_STRING  = 0x10,
    MS_OLE_SUMMARY_TYPE_TIME    = 0x20,
    MS_OLE_SUMMARY_TYPE_LONG    = 0x30,
    MS_OLE_SUMMARY_TYPE_SHORT   = 0x40,
    MS_OLE_SUMMARY_TYPE_BOOLEAN = 0x50,
    MS_OLE_SUMMARY_TYPE_OTHER   = 0x60
} MsOleSummaryType;

#define MS_OLE_SUMMARY_TYPE(x) ((MsOleSummaryType)((x)>>8))

/* FIXME MS_OLE_SUMMARY_THUMBNAIL is Preview, no Security, isn't it? */
/*
 *  The MS byte specifies the type, the LS byte is the
 * 'standard' MS PID.
 */
typedef enum {
/* SummaryInformation Stream Properties */
/* String properties */
    MS_OLE_SUMMARY_TITLE          = 0x1002,
    MS_OLE_SUMMARY_SUBJECT        = 0x1003,
    MS_OLE_SUMMARY_AUTHOR         = 0x1004,
    MS_OLE_SUMMARY_KEYWORDS       = 0x1005,
    MS_OLE_SUMMARY_COMMENTS       = 0x1006,
    MS_OLE_SUMMARY_TEMPLATE       = 0x1007,
    MS_OLE_SUMMARY_LASTAUTHOR     = 0x1008,
    MS_OLE_SUMMARY_REVNUMBER      = 0x1009,
    MS_OLE_SUMMARY_APPNAME        = 0x1012,

/* Time properties */
    MS_OLE_SUMMARY_TOTAL_EDITTIME = 0x200A,
    MS_OLE_SUMMARY_LASTPRINTED    = 0x200B,
    MS_OLE_SUMMARY_CREATED        = 0x200C,
    MS_OLE_SUMMARY_LASTSAVED      = 0x200D,

/* Long integer properties */
    MS_OLE_SUMMARY_PAGECOUNT      = 0x300E,
    MS_OLE_SUMMARY_WORDCOUNT      = 0x300F,
    MS_OLE_SUMMARY_CHARCOUNT      = 0x3010,
    MS_OLE_SUMMARY_SECURITY       = 0x3013,

/* Short integer properties */
    MS_OLE_SUMMARY_CODEPAGE       = 0x4001,

/* Security */  
    MS_OLE_SUMMARY_THUMBNAIL      = 0x6011,


/* DocumentSummaryInformation Properties */
/* String properties */
    MS_OLE_SUMMARY_CATEGORY       = 0x1002,
    MS_OLE_SUMMARY_PRESFORMAT     = 0x1003,
    MS_OLE_SUMMARY_MANAGER        = 0x100E,
    MS_OLE_SUMMARY_COMPANY        = 0x100F,

/* Long integer properties */
    MS_OLE_SUMMARY_BYTECOUNT      = 0x3004,
    MS_OLE_SUMMARY_LINECOUNT      = 0x3005,
    MS_OLE_SUMMARY_PARCOUNT       = 0x3006,
    MS_OLE_SUMMARY_SLIDECOUNT     = 0x3007,
    MS_OLE_SUMMARY_NOTECOUNT      = 0x3008,
    MS_OLE_SUMMARY_HIDDENCOUNT    = 0x3009,
    MS_OLE_SUMMARY_MMCLIPCOUNT    = 0X300A,

/* Boolean properties */
    MS_OLE_SUMMARY_SCALE          = 0x500B,
    MS_OLE_SUMMARY_LINKSDIRTY     = 0x5010
} MsOleSummaryPID;


/* bit masks for security long integer */
#define MsOleSummaryAllSecurityFlagsEqNone        0x00
#define MsOleSummarySecurityPassworded            0x01
#define MsOleSummarySecurityRORecommended         0x02
#define MsOleSummarySecurityRO                    0x04
#define MsOleSummarySecurityLockedForAnnotations  0x08

typedef struct {
    GTimeVal time;
    GDate    date;
} MsOleSummaryTime;

typedef struct {
    guint32 len;
    guint8 *data;
} MsOleSummaryPreview;

gchar *         ms_ole_summary_get_string   (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
gboolean        ms_ole_summary_get_boolean  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
guint16         ms_ole_summary_get_short    (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
guint32         ms_ole_summary_get_long     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
GTimeVal        ms_ole_summary_get_time     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
MsOleSummaryPreview ms_ole_summary_get_preview  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean *available);
void            ms_ole_summary_preview_destroy  (MsOleSummaryPreview d);

/* FIXME The next comment isn't true, is it?
   Return TRUE if write is successful */
void            ms_ole_summary_set_string   (MsOleSummary *si,
                             MsOleSummaryPID id,
                             const gchar *str);
void            ms_ole_summary_set_boolean  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             gboolean value);
void            ms_ole_summary_set_short    (MsOleSummary *si,
                             MsOleSummaryPID id,
                             guint16 i);
void            ms_ole_summary_set_long     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             guint32 i);
void            ms_ole_summary_set_time     (MsOleSummary *si,
                             MsOleSummaryPID id,
                             GTimeVal time);
void            ms_ole_summary_set_preview  (MsOleSummary *si,
                             MsOleSummaryPID id,
                             const
                             MsOleSummaryPreview *
                             preview);

#endif  /* MS_OLE_SUMMARY_H */
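
To make the header comment "the MS byte specifies the type, the LS byte is the 'standard' MS PID" concrete, here is a minimal sketch that splits one of these enum values into its two halves. The function names pid_type, pid_standard and describe_pid are made up for this illustration and are not part of libole2; the constant is copied from the enum above.

```c
#include <assert.h>
#include <stdio.h>

/* Copied from the MsOleSummaryPID enum in ms-ole-summary.h. */
#define MS_OLE_SUMMARY_PAGECOUNT 0x300E

/* High byte: the value type. Same operation as MS_OLE_SUMMARY_TYPE(x). */
unsigned pid_type(unsigned id)
{
    return id >> 8;
}

/* Low byte: the property ID as it is stored in the file. */
unsigned pid_standard(unsigned id)
{
    return id & 0xFF;
}

/* Print both halves of one enum value (hypothetical helper). */
void describe_pid(const char *name, unsigned id)
{
    assert(name != NULL);
    printf("%s: type 0x%02X, file PID %u\n",
           name, pid_type(id), pid_standard(id));
}
```

Calling describe_pid("PAGECOUNT", MS_OLE_SUMMARY_PAGECOUNT) reports type 0x30 (MS_OLE_SUMMARY_TYPE_LONG) and file PID 14. This also explains why MS_OLE_SUMMARY_TITLE and MS_OLE_SUMMARY_CATEGORY can share the value 0x1002: according to the header comments they belong to different property sets (SummaryInformation and DocumentSummaryInformation).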

The MsOle structure:

/**
 * Structure describing an OLE file
 **/
struct _MsOle {
    int               ref_count;
    gboolean          ole_mmap;
    guint8           *mem;
    guint32           length;
    MsOleSysWrappers *syswrap;

    char              mode;
    int               file_des;
    int               dirty;
    GArray           *bb;      /* Big  blocks status  */
    GArray           *sb;      /* Small block status  */
    GArray           *sbf;     /* The small block file */
    guint32           num_pps; /* Count of number of property sets */
    GList            *pps;     /* Property Storage -> struct _PPS, always 1 valid entry or NULL */
/* if memory mapped */
    GPtrArray        *bbattr;  /* Pointers to block structures */
/* end if memory mapped */
};

Other resources:

http://slackware.mirrors.pair.com/slackware-8.1/source/gnome/libole2/libole2-0.2.4.tar.bz2
ftp://ftp.ca.com/caproducts/Opal/jasmine064/framework/include/

Reference: http://wvware.sourceforge.net/libole2/libole2.html

I have tried it this way, but I did not find the page count:

echo("<pre>");
$file = "files/doctest.doc";
if(!is_file($file))die("File not found.");

//helper that the loop below calls: one byte as an 8-character binary string
function b($char) {
    return sprintf("%08b", ord($char));
}

//bind file to a stream.
$handle = fopen($file, "rb");

//read file content
$content = fread($handle, filesize($file));
fclose($handle);

$length = strlen($content);

//stop 3 bytes before the end so the 4-byte window stays inside the file
for ($i = 0; $i + 3 < $length; $i++) {
    //get the current byte
    $char = $content[$i];

    //get the byte value 0-255 (2^8)
    $decimal = ord($char);

    echo($char);

    //the value in decimal, then the same value in base 2
    echo sprintf(" %3d %08b",$decimal,$decimal);
    if($i % 4==0)echo("*");

    //4 bytes joined in file order, i.e. read as a big-endian 32bit int;
    //note that integers in OLE files are stored little-endian
    $bit32 = b($content[$i]).b($content[$i+1]).b($content[$i+2]).b($content[$i+3]);
    echo sprintf("<br><b>%d</b>",base_convert($bit32,2,10)); //32bit int

    echo("<br>");
}

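Thanks for your help!

1 answer:

Answer 0: (score: 1)

A Word document is a very complex file format. The document is stored in streams contained inside a Windows Compound Binary File.

Working with the specification requires knowing binary (the little-endian stuff), FAT (because the format uses a FAT internally) and all sorts of other things.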
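
As a small taste of the "little-endian stuff", here is a sketch in C of the two most basic building blocks: recognising a compound file by its 8-byte signature and reading a 32-bit integer whose least significant byte comes first. The names read_le32 and looks_like_ole are invented for this example; the FAT handling that a real reader needs is not shown.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Signature found in the first 8 bytes of a compound binary file. */
static const uint8_t OLE_MAGIC[8] = {
    0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1
};

/* Read a 32-bit little-endian integer: least significant byte first. */
uint32_t read_le32(const uint8_t *p)
{
    assert(p != NULL);
    return (uint32_t)p[0]
         | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16)
         | ((uint32_t)p[3] << 24);
}

/* Return 1 if the buffer starts with the compound-file signature. */
int looks_like_ole(const uint8_t *buf, size_t len)
{
    return buf != NULL
        && len >= sizeof OLE_MAGIC
        && memcmp(buf, OLE_MAGIC, sizeof OLE_MAGIC) == 0;
}
```

Joining four bytes in file order, as a naive dump would do, yields the big-endian value instead; the bytes 0E 00 00 00 mean 14 here, not 0x0E000000.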

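Without using COM

I assume you are not on Windows (otherwise you would already have used COM/OLE), so here is a program that can read and manipulate Windows CDF files. It is not a framework but a program that you can call with PHP's built-in function system("cdfprogram file.doc").

Another program that reads Word files

Here, you install the program and call it using system() or any of its equivalent siblings.

Why are the Microsoft Office file formats so complicated?

For the following reasons:

  1. They were designed to be fast on very old computers.
  2. They were designed to use libraries.
  3. They were not designed with interoperability in mind.
  4. They have to reflect all the complexity of the applications.
  5. They have to reflect the history of the applications.

Reference: http://www.joelonsoftware.com/items/2008/02/19.html

Conclusion

There is no easy way to get the page count from a Word file using only PHP's built-in functions. You would have to read all of Microsoft's specifications and build a parser yourself. That is a project in its own right. I don't think anyone will do it for you for free.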
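
To give an idea of what one piece of such a parser involves, here is a rough sketch in C of looking up a 32-bit integer property inside a property-set stream. It assumes that the raw bytes of the SummaryInformation stream have already been extracted from the compound file (following the directory and the FAT is the harder part and is not shown), and that the stream follows the usual layout: a 28-byte header, a 20-byte section entry, then a table of (PID, offset) pairs. The names find_long_property and le32 are invented for this illustration; treat it as a starting point, not a complete reader.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define PID_PAGECOUNT 0x0E  /* low byte of MS_OLE_SUMMARY_PAGECOUNT (0x300E) */
#define VT_I4         3     /* property type tag: 32-bit integer */

/* Read a 32-bit little-endian integer. */
static uint32_t le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8)
         | ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Search the first section of a property-set stream for an integer
 * property. Returns 1 and stores the value in *out, or 0 if the
 * property is missing, has another type, or the data is malformed.
 */
int find_long_property(const uint8_t *s, size_t len, uint32_t pid,
                       uint32_t *out)
{
    assert(s != NULL && out != NULL);

    /* Header: byte order (2), version (2), system id (4), class id (16),
       section count (4); then per section: format id (16), offset (4). */
    if (len < 48) return 0;
    if (s[0] != 0xFE || s[1] != 0xFF) return 0;  /* byte-order mark */
    if (le32(s + 24) < 1) return 0;              /* no sections */

    uint32_t sec = le32(s + 44);                 /* start of first section */
    if (sec > len || len - sec < 8) return 0;

    /* Section: size (4), property count (4), then (PID, offset) pairs. */
    uint32_t count = le32(s + sec + 4);
    if (count > (len - sec - 8) / 8) return 0;

    for (uint32_t i = 0; i < count; i++) {
        const uint8_t *entry = s + sec + 8 + 8 * (size_t)i;
        if (le32(entry) != pid) continue;

        uint32_t off = le32(entry + 4);          /* relative to section start */
        if (off > len - sec || len - sec - off < 8) return 0;

        const uint8_t *val = s + sec + off;      /* type tag, then value */
        if ((le32(val) & 0xFFFF) != VT_I4) return 0;
        *out = le32(val + 4);
        return 1;
    }
    return 0;
}
```

With the stream bytes in hand, find_long_property(stream, length, PID_PAGECOUNT, &pages) would fill in the page count if the document stores one. Every offset is bounds-checked because the input is an untrusted upload.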

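Why has nobody tried? Probably because nobody wants to invest that much time when there are libraries and frameworks that already do the job. That is my opinion.

Suggestion

How about creating a web service that runs on a Windows machine, where you have access to the COM library? Your main application posts the Word file to the Windows web service, and the web service returns the page count to your main application. With COM, it is easy.

You can do this asynchronously, so that uploads are not slowed down while waiting for the web service to reply; in the meantime the upload can be in a "pending verification" state.

If you use a web service, it does not have to be on the same server as PHP itself.

The web service would do something like this: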

    <?php
    
    $word = new COM("word.application");
    if (!$word) {
      echo ("Could not initialise MS Word object.\n"); 
      exit(1);
    }
    $word->Documents->Open(realpath("C:\\Test\\t.doc")); 
    
    // 14 is wdPropertyPages, the built-in "number of pages" document property
    $pages = $word->ActiveDocument->BuiltInDocumentProperties(14);
    echo "Number of pages: " . $pages->value;
    
    $word->ActiveDocument->Close(false); 
    $word->Quit(); 
    $word = null; 
    unset($word);