Question

Using OpenXML, can I read a document's content by page number?

wordDocument.MainDocumentPart.Document.Body gives the content of the complete document.

    // Requires: using System; using DocumentFormat.OpenXml.Packaging;
    //           using DocumentFormat.OpenXml.Wordprocessing;
    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Open a WordprocessingDocument based on a filepath.
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // The stored page count is optional metadata, so guard each step against null.
            int pageCount = 0;
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            for (int i = 1; i <= pageCount; i++)
            {
                // Read the content by page number
            }
        }
    }

MSDN Reference

Update 1:

It looks like a page break is represented as follows:

<w:p w:rsidR="003328B0" w:rsidRDefault="003328B0">
        <w:r>
            <w:br w:type="page" />
        </w:r>
    </w:p>

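So now I need to split the XML on the check above and take the InnerText of each piece, which would give me the text page by page.

Now the question becomes: how do I split the XML on the check above?

Update 2:

The page break element is only present when you insert an explicit page break. If text simply flows from one page onto the next, no page break XML element is set, so it comes back to the same challenge: how to identify where one page ends and the next begins.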

Answer 1

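You cannot reference OOXML content by page number at the OOXML data level alone.

Hard page breaks are not the problem; hard page breaks can be counted.
Soft page breaks are the problem. They are computed by line-breaking and pagination algorithms that are implementation-dependent; they are not an intrinsic feature of the OOXML data. There is nothing to count.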
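Counting the hard page breaks mentioned above could be sketched as follows. This is a minimal illustration, not a definitive implementation: the class name HardPageBreaks is hypothetical, and it only looks at explicit w:br elements of type "page".

```csharp
using System.Linq;
using DocumentFormat.OpenXml.Packaging;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class HardPageBreaks
{
    // Counts explicit <w:br w:type="page"/> elements anywhere in the body.
    // A <w:br/> without a type attribute is a line break and is skipped.
    public static int Count(Body body)
    {
        return body.Descendants<Break>()
                   .Count(b => b.Type != null && b.Type.Value == BreakValues.Page);
    }

    // Convenience overload that opens the file read-only.
    public static int Count(string filepath)
    {
        using (WordprocessingDocument doc = WordprocessingDocument.Open(filepath, false))
        {
            return Count(doc.MainDocumentPart.Document.Body);
        }
    }
}
```

Note that paragraphs marked with w:pageBreakBefore and section breaks can also start a new page; this sketch does not count them.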

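What about w:lastRenderedPageBreak, which records where the soft page breaks fell the last time the document was rendered? No, w:lastRenderedPageBreak does not help in general, because:

By definition, a w:lastRenderedPageBreak position is stale whenever the content has been changed since the document was last opened by a program that paginates its content.
In MS Word's implementation, w:lastRenderedPageBreak is known to be unreliable in various circumstances.

If you are willing to accept a dependence on Word Automation, with all of its inherent licensing and server operation limitations, then you have a chance of determining page boundaries, page numbers, page counts, and so on.

Otherwise, the only real answer is to move beyond page-based referencing frameworks, which depend on proprietary, implementation-specific pagination algorithms.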

Answer 2

This is how I ended up doing it.

    // Requires: using System; using System.Collections.Generic; using System.Text;
    //           using DocumentFormat.OpenXml.Packaging; using DocumentFormat.OpenXml.Wordprocessing;
    public void OpenWordprocessingDocumentReadonly()
    {
        string filepath = @"C:\...\test.docx";
        // Page number -> concatenated text of that page.
        Dictionary<int, string> pageviseContent = new Dictionary<int, string>();
        int pageCount = 0;
        // Open a WordprocessingDocument based on a filepath.
        using (WordprocessingDocument wordDocument =
            WordprocessingDocument.Open(filepath, false))
        {
            // Assign a reference to the existing document body.
            Body body = wordDocument.MainDocumentPart.Document.Body;
            // The stored page count is optional metadata, so guard each step against null.
            string pagesText = wordDocument.ExtendedFilePropertiesPart?.Properties?.Pages?.Text;
            if (pagesText != null)
            {
                pageCount = Convert.ToInt32(pagesText);
            }
            int i = 1;
            StringBuilder pageContentBuilder = new StringBuilder();
            foreach (var element in body.ChildElements)
            {
                if (element.InnerXml.IndexOf("<w:br w:type=\"page\" />", StringComparison.OrdinalIgnoreCase) < 0)
                {
                    // No hard page break in this element: it belongs to the current page.
                    pageContentBuilder.Append(element.InnerText);
                }
                else
                {
                    // Hard page break found: close the current page and start a new one.
                    // Any text inside the element that holds the break is not captured.
                    pageviseContent.Add(i, pageContentBuilder.ToString());
                    i++;
                    pageContentBuilder = new StringBuilder();
                }
                // Flush whatever remains after the last element.
                if (body.LastChild == element && pageContentBuilder.Length > 0)
                {
                    pageviseContent.Add(i, pageContentBuilder.ToString());
                }
            }
        }
    }

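Drawback: this does not work in all cases. It only works when the document contains explicit page breaks. If text simply runs over from page 1 onto page 2, there is no marker that tells you that you are on the second page.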
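A variant of the same idea that matches typed Break elements instead of searching the serialized XML might look like the sketch below. It is only an illustration under the same limitation (hard page breaks only); the class name HardBreakSplitter is hypothetical. Because it walks all descendants in document order, text before and after a break inside the same paragraph lands on the right side of the split.

```csharp
using System.Collections.Generic;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class HardBreakSplitter
{
    // Returns one string per run of text between explicit page breaks.
    // Text is concatenated without paragraph separators, like InnerText.
    public static List<string> SplitText(Body body)
    {
        var pages = new List<string>();
        var current = new StringBuilder();
        foreach (OpenXmlElement element in body.Descendants())
        {
            if (element is Break br && br.Type != null && br.Type.Value == BreakValues.Page)
            {
                // Explicit page break: close the current chunk.
                pages.Add(current.ToString());
                current.Clear();
            }
            else if (element is Text text)
            {
                current.Append(text.Text);
            }
        }
        // The text after the last break (possibly empty) is the final chunk.
        pages.Add(current.ToString());
        return pages;
    }
}
```

The result is a zero-based list, so pages[0] holds the text before the first hard page break. Soft page breaks are still invisible to this approach.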

Answer 3

Rename the .docx file to .zip and open the docProps\app.xml file:

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/extended-properties" xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <Template>Normal</Template>
  <TotalTime>0</TotalTime>
  <Pages>1</Pages>
  <Words>141</Words>
  <Characters>809</Characters>
  <Application>Microsoft Office Word</Application>
  <DocSecurity>0</DocSecurity>
  <Lines>6</Lines>
  <Paragraphs>1</Paragraphs>
  <ScaleCrop>false</ScaleCrop>
  <HeadingPairs>
    <vt:vector size="2" baseType="variant">
      <vt:variant>
        <vt:lpstr>Название</vt:lpstr>
      </vt:variant>
      <vt:variant>
        <vt:i4>1</vt:i4>
      </vt:variant>
    </vt:vector>
  </HeadingPairs>
  <TitlesOfParts>
    <vt:vector size="1" baseType="lpstr">
      <vt:lpstr/>
    </vt:vector>
  </TitlesOfParts>
  <Company/>
  <LinksUpToDate>false</LinksUpToDate>
  <CharactersWithSpaces>949</CharactersWithSpaces>
  <SharedDoc>false</SharedDoc>
  <HyperlinksChanged>false</HyperlinksChanged>
  <AppVersion>14.0000</AppVersion>
</Properties>

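The OpenXML library reads wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text from the <Pages>1</Pages> property. This property is only written by the WinWord application. If the Word document has been changed since then, wordDocument.ExtendedFilePropertiesPart.Properties.Pages.Text is no longer accurate. If the Word document was created programmatically, wordDocument.ExtendedFilePropertiesPart is typically null.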
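The manual "rename to .zip" step can also be done in code. The following is a rough sketch using only the .NET base class library; the class name StoredPageCount is hypothetical, and it assumes the part sits at the conventional docProps/app.xml location rather than resolving it through the package relationships. It returns null when the value is missing, which covers the programmatically created documents described above.

```csharp
using System.IO;
using System.IO.Compression;
using System.Xml.Linq;

// Hypothetical helper for illustration only.
public static class StoredPageCount
{
    private static readonly XNamespace Ep =
        "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties";

    // Extracts <Pages> from app.xml content; null if absent or not a number.
    public static int? FromAppXml(XDocument appXml)
    {
        XElement pages = appXml.Root?.Element(Ep + "Pages");
        return pages != null && int.TryParse(pages.Value, out int n) ? n : (int?)null;
    }

    // Opens the .docx as a zip archive and reads docProps/app.xml (assumed location).
    public static int? FromDocx(string filepath)
    {
        using (ZipArchive zip = ZipFile.OpenRead(filepath))
        {
            ZipArchiveEntry entry = zip.GetEntry("docProps/app.xml");
            if (entry == null) return null;
            using (Stream stream = entry.Open())
            {
                return FromAppXml(XDocument.Load(stream));
            }
        }
    }
}
```

Even when a number comes back, it is only the count recorded by the last application that saved the file, so it should be treated as a hint rather than as the actual page count.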

Answer 4

    List<Paragraph> Allparagraphs = wp.MainDocumentPart.Document.Body.OfType<Paragraph>().ToList();

    List<Paragraph> PageParagraphs = Allparagraphs.Where(x => x.Descendants<LastRenderedPageBreak>().Count() == 1).Select(x => x).Distinct().ToList();
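The two lines above select the paragraphs that contain exactly one w:lastRenderedPageBreak marker. Building on that, a sketch that splits the body text at every such marker could look like this. The class name RenderedBreakSplitter is hypothetical, and the result is only an approximation: as Answer 1 explains, the markers reflect the last rendering and may be stale or missing.

```csharp
using System.Collections.Generic;
using System.Text;
using DocumentFormat.OpenXml;
using DocumentFormat.OpenXml.Wordprocessing;

// Hypothetical helper for illustration; not part of the Open XML SDK.
public static class RenderedBreakSplitter
{
    // Splits the body text wherever the last renderer recorded a page start.
    // Text is concatenated without paragraph separators.
    public static List<string> SplitText(Body body)
    {
        var pages = new List<string>();
        var current = new StringBuilder();
        foreach (OpenXmlElement element in body.Descendants())
        {
            if (element is LastRenderedPageBreak)
            {
                // Recorded soft page break: close the current chunk.
                pages.Add(current.ToString());
                current.Clear();
            }
            else if (element is Text text)
            {
                current.Append(text.Text);
            }
        }
        pages.Add(current.ToString());
        return pages;
    }
}
```

A document that was never rendered by a paginating application has no markers at all, in which case the whole text comes back as a single chunk.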
