Programmatically accessing the document properties of a Word 2007 document

Time: 2010-04-14 14:49:09

Tags: powershell ms-word

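Is there a way to programmatically access the document properties of a Word 2007 document?

I am open to using any language for this, but ideally it would be a PowerShell script.

My overall aim is to traverse the documents somewhere on a file system, parse some document properties out of them, and then collate all of those properties into a new Word document.

Essentially, I want to automatically create a document that lists all the documents beneath a certain folder of the file system. For each file, the list would contain things such as the Title, Abstract and Author document properties, the CreateDate field, and so on.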

5 answers:

Answer 0 (score: 4)

Answer 1 (score: 2)

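I needed to do this in PowerShell running on a server that does not have the MS Office applications installed. As mentioned above, the trick is to peek inside the Office file and examine the XML files embedded in it.

Here is a function that runs like a cmdlet, meaning that you just save the script in your PowerShell scripts directory and call it from any other PowerShell script.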

# DocumentOfficePropertiesGet
# Example usage
#   From a PowerShell script:
#       $props = & 'c:\PowerShellScriptFolder\DocumentOfficePropertiesGet.ps1' -DocumentFullPathName 'd:\documents\my excel doc.xlsx' -OfficeProperties 'dcterms:created;dcterms:modified'

# Parameters

#    DocumentFullPathName -- full path and name of MS Office document
#    OfficeProperties -- semi-colon delimited string of property names as they
#              appear in the core.xml file. To see these names, rename any
#              MS Office document file to have the extension .zip, then look inside
#              the zip file. In the docProps folder open the core.xml file. The
#              core document properties are nodes under the cp:coreProperties node.

#         Example: dcterms:created;dcterms:modified;cp:lastModifiedBy

# Return value

#   The function returns a hashtable object -- in the above example, $props would contain
#   the name-value pairs for the requested MS Office document properties. In the calling script,
#   to get at the values:

#        $fooProperty = $props.'dcterms:created'
#        $barProperty = $props.'dcterms:modified'

[CmdletBinding()]
    [OutputType([System.Collections.Hashtable])]
    Param
    (
        [Parameter(Position=0,
            Mandatory=$false,
            HelpMessage="Enter the full path name of the document")]
            [ValidateNotNullOrEmpty()]
            [String] $DocumentFullPathName='e:\temp\supplier_List.xlsx',
        [Parameter(Position=1,
            Mandatory=$false,
            HelpMessage="Enter the Office properties semi-colon delimited")]
            [ValidateNotNullOrEmpty()]
            [String] $OfficeProperties='dcterms:created; dcterms:modified ;cp:lastModifiedBy;dc:creator'
    )
# We need the FileSystem assembly
Add-Type -AssemblyName System.IO.Compression.FileSystem

# This function unzips a zip file -- and it works on MS Office files directly: no need to
# rename them from foo.xlsx to foo.zip. It expects the full path name of the zip file
# and the path name for the unzipped files
function Unzip
{
    param([string]$zipfile, [string]$outpath)

    [System.IO.Compression.ZipFile]::ExtractToDirectory($zipfile, $outpath) *>$null
}

# Remove spaces from the OfficeProperties parameter
$OfficeProperties = $OfficeProperties.replace(' ','')

# Compose the name of the folder where we will unzip files
$zipDirectoryName = $env:TEMP + "\" + "TempZip"

# delete the zip directory if present
remove-item $zipDirectoryName -force -recurse -ErrorAction Ignore | out-null

# create the zip directory
New-Item -ItemType directory -Path $zipDirectoryName | out-null

# Unzip the files -- i.e. extract the xml files embedded within the MS Office document
unzip $DocumentFullPathName $zipDirectoryName

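# get the docProps\core.xml file as [xml]. XmlDocument.Load honors the encoding
# declared in the file, which Get-Content may not do for non-ASCII property values
$coreXmlName = $zipDirectoryName + "\docProps\core.xml"
$coreXml = New-Object System.Xml.XmlDocument
$coreXml.Load($coreXmlName)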

# create an array of the requested properties
$requiredProperties = $OfficeProperties -split ";"

# create a hashtable to return the values
$docProperties = @{}

# Now look for each requested property
foreach($requiredProperty in $requiredProperties)
{
    # We will be lazy and ignore the namespaces. We need the local name only
    # (taking the last segment means a name given without a prefix also works)
    $localName = ($requiredProperty -split ":")[-1]
    # Use XPath to fetch the node for this property
    $thisNode = $coreXml.coreProperties.SelectSingleNode("*[local-name(.) = '$localName']")
    if($null -eq $thisNode)
    {
        # To the hashtable, add the requested property name and its value -- null in this case.
        # Index assignment, unlike Add(), does not throw if a name is requested twice
        $docProperties[$requiredProperty] = $null
    }
    else
    {
        # To the hashtable, add the requested property name and its value
        $docProperties[$requiredProperty] = $thisNode.InnerText
    }
}

#clean up
remove-item $zipDirectoryName -force -recurse

# return the properties hashtable. To do this, just write the object to the output stream
$docProperties
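
The temporary folder is not strictly needed: the same core.xml entry can be read straight out of the package. What follows is only a minimal sketch of that variation, not part of the script above. The function name Get-OfficeCoreProperty and the folder in the usage comments are made up for illustration, and it assumes .NET 4.5 or later and that the core properties sit at the usual docProps/core.xml location.

```powershell
# Needs .NET 4.5+ for System.IO.Compression.ZipFile
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper, named for illustration only. Returns the elements under
# cp:coreProperties as an ordered dictionary keyed by qualified name (e.g. dc:title).
function Get-OfficeCoreProperty
{
    param([Parameter(Mandatory=$true)][string] $Path)

    # .NET resolves relative paths against the process directory, not the
    # PowerShell location, so convert to a full path first
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath
    $result = [ordered]@{}

    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try
    {
        # Assumption: the core properties part is at its usual location
        $entry = $zip.GetEntry('docProps/core.xml')
        if ($null -ne $entry)
        {
            $stream = $entry.Open()
            try
            {
                $xml = New-Object System.Xml.XmlDocument
                $xml.Load($stream)
            }
            finally
            {
                $stream.Dispose()
            }

            # Copy each child element into the result, skipping non-element nodes
            foreach ($node in $xml.DocumentElement.ChildNodes)
            {
                if ($node.NodeType -eq [System.Xml.XmlNodeType]::Element)
                {
                    $result[$node.Name] = $node.InnerText
                }
            }
        }
    }
    finally
    {
        $zip.Dispose()
    }

    $result
}

# Possible usage over a folder (the path is a placeholder); the '~$*' filter
# skips the lock files Word creates next to open documents:
#   Get-ChildItem -Path 'D:\documents' -Filter *.docx -Recurse |
#       Where-Object { $_.Name -notlike '~$*' } |
#       ForEach-Object {
#           $p = Get-OfficeCoreProperty -Path $_.FullName
#           [pscustomobject]@{ File = $_.FullName; Title = $p['dc:title']; Author = $p['dc:creator'] }
#       }
```

The objects produced by the usage pipeline could then be piped to Export-Csv or ConvertTo-Html to build the listing, which avoids needing Word on the machine at all.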

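Answer 2 (score: 1)

My guess is that your best bet is VB or C# with the Office Interop Assemblies. I am not aware of a native way (in PowerShell) to do what you want.

That said, if you use VB or C#, you could write a PowerShell cmdlet to do your collation. But at that point, it might be simpler to just write a console app that runs as a scheduled task.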

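Answer 3 (score: 1)

I recently learned from watching a DNRTV episode that Office 2007 documents are just zipped XML. So you can rename "Document.docx" to "Document.docx.zip" and see the XML files inside. You could get at the properties through the interop assemblies in .NET, but it may be more efficient to look directly at the XML (perhaps using LINQ to XML, or some native way I am not aware of).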
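
For what it's worth, the rename is not required to look inside. Here is a rough sketch that lists the parts of a package from PowerShell; the function name Get-OfficePackageEntry is invented for this example, and it assumes .NET 4.5 or later. In a typical Word file the listing includes docProps/core.xml (title, creator, dates) and docProps/app.xml (extended properties such as page and word counts).

```powershell
# Needs .NET 4.5+ for System.IO.Compression.ZipFile
Add-Type -AssemblyName System.IO.Compression.FileSystem

# Hypothetical helper, named for illustration only: emits one object per part
# stored inside an Office Open XML package (.docx, .xlsx, .pptx)
function Get-OfficePackageEntry
{
    param([Parameter(Mandatory=$true)][string] $Path)

    # Convert to a full path, because .NET does not follow the PowerShell location
    $fullPath = (Resolve-Path -LiteralPath $Path).ProviderPath

    $zip = [System.IO.Compression.ZipFile]::OpenRead($fullPath)
    try
    {
        foreach ($entry in $zip.Entries)
        {
            # Length is the uncompressed size of the part in bytes
            [pscustomobject]@{ Name = $entry.FullName; Length = $entry.Length }
        }
    }
    finally
    {
        $zip.Dispose()
    }
}

# Possible usage (the path is a placeholder):
#   Get-OfficePackageEntry -Path 'D:\documents\Document.docx' | Format-Table -AutoSize
```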

Answer 4 (score: 1)

I wrote up how to do this back in the Monad beta days. It should still work, I think.