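OpenXML tag search

Asked: 2015-02-24 13:57:44

Tags: c# .net ms-word openxml

I am writing a .NET application that should read a .docx file around 200 pages long (via DocumentFormat.OpenXML 2.5) to find all occurrences of certain tags that the document should contain. To be clear, I am not looking for OpenXML tags, but for tags that the document writer puts into the document as placeholders for values I need to fill in during a second stage. Such tags should be in the following format: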

 <!TAG!>

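(where TAG can be an arbitrary sequence of characters). As I said, I have to find all occurrences of such tags and, if possible, the "page" on which each occurrence was found. I found some material on the web, but more than once the basic approach was to dump the whole content of the file into a string and then search that string, regardless of the .docx encoding. This produced either false positives or no match at all (even though the test .docx file contains several tags). Other examples were probably a bit beyond my knowledge of OpenXML. The regex pattern to find such tags should be something like this: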

<!(.)*?!>

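Tags can occur anywhere in the document (inside tables, text, and paragraphs, as well as in headers and footers).

I am coding in Visual Studio 2013 with .NET 4.5, but I can go back to an earlier version if needed. P.S. I would prefer code that does not use the Office Interop API, since the target platform will not have Office installed.

The smallest .docx sample I could produce stores the following in its document part: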

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<w:document xmlns:wpc="http://schemas.microsoft.com/office/word/2010/wordprocessingCanvas" xmlns:mc="http://schemas.openxmlformats.org/markup-compatibility/2006" xmlns:o="urn:schemas-microsoft-com:office:office" xmlns:r="http://schemas.openxmlformats.org/officeDocument/2006/relationships" xmlns:m="http://schemas.openxmlformats.org/officeDocument/2006/math" xmlns:v="urn:schemas-microsoft-com:vml" xmlns:wp14="http://schemas.microsoft.com/office/word/2010/wordprocessingDrawing" xmlns:wp="http://schemas.openxmlformats.org/drawingml/2006/wordprocessingDrawing" xmlns:w10="urn:schemas-microsoft-com:office:word" xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main" xmlns:w14="http://schemas.microsoft.com/office/word/2010/wordml" xmlns:w15="http://schemas.microsoft.com/office/word/2012/wordml" xmlns:wpg="http://schemas.microsoft.com/office/word/2010/wordprocessingGroup" xmlns:wpi="http://schemas.microsoft.com/office/word/2010/wordprocessingInk" xmlns:wne="http://schemas.microsoft.com/office/word/2006/wordml" xmlns:wps="http://schemas.microsoft.com/office/word/2010/wordprocessingShape" mc:Ignorable="w14 w15 wp14">
<w:body>
<w:p w:rsidR="00CA7780" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:proofErr w:type="gramStart"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
  </w:r>
  <w:proofErr w:type="gramEnd"/>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
  </w:r>
</w:p>
<w:p w:rsidR="00815E5D" w:rsidRPr="00815E5D" w:rsidRDefault="00815E5D">
  <w:pPr>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
  </w:pPr>
  <w:r>
    <w:rPr>
      <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>TRY2</w:t>
  </w:r>
  <w:bookmarkStart w:id="0" w:name="_GoBack"/>
  <w:bookmarkEnd w:id="0"/>
</w:p>
<w:sectPr w:rsidR="00815E5D" w:rsidRPr="00815E5D">
  <w:pgSz w:w="11906" w:h="16838"/>
  <w:pgMar w:top="1417" w:right="1134" w:bottom="1134" w:left="1134" w:header="708" w:footer="708" w:gutter="0"/>
  <w:cols w:space="708"/>
  <w:docGrid w:linePitch="360"/>
</w:sectPr>
</w:body>
</w:document>

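Best regards, Mike

3 Answers:

Answer 0 (score: 5)

The problem with trying to find the tags is that words are not always stored in the underlying XML in the form in which they appear in Word. For example, in your sample XML the <!TAG1!> tag is split across multiple runs, like so: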

<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>&lt;!TAG1</w:t>
</w:r>
<w:proofErr w:type="gramEnd"/>
<w:r>
    <w:rPr>
        <w:lang w:val="en-GB"/>
    </w:rPr>
    <w:t>!&gt;</w:t>
</w:r>

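As pointed out in the comments, this is sometimes caused by the spelling and grammar checker, but that is not the only possible cause. For example, applying different styles to parts of the tag can also split it.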
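To make the split concrete, here is a minimal sketch that runs the question's pattern against a cut-down copy of the paragraph above. It uses System.Xml.Linq on the raw XML rather than the OpenXML SDK, purely so that it runs without extra packages. The class and method names (`SplitRunDemo`, `CountPerTextElement`, `CountInJoinedText`) are made up for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;
using System.Xml.Linq;

public static class SplitRunDemo
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Cut-down copy of the second paragraph from the question (hypothetical test input).
    public const string ParagraphXml =
        "<w:p xmlns:w=\"http://schemas.openxmlformats.org/wordprocessingml/2006/main\">" +
        "<w:r><w:t>&lt;!TAG1</w:t></w:r>" +
        "<w:proofErr w:type=\"gramEnd\"/>" +
        "<w:r><w:t>!&gt;</w:t></w:r>" +
        "</w:p>";

    // The pattern from the question.
    static readonly Regex TagPattern = new Regex("<!(.)*?!>");

    // Number of w:t elements whose own text contains a complete tag.
    public static int CountPerTextElement(XElement paragraph)
    {
        return paragraph.Descendants(W + "t").Count(t => TagPattern.IsMatch(t.Value));
    }

    // Number of tags found once the text of all w:t elements is joined.
    public static int CountInJoinedText(XElement paragraph)
    {
        string joined = string.Concat(paragraph.Descendants(W + "t").Select(t => t.Value));
        return TagPattern.Matches(joined).Count;
    }
}
```

For this input, matching each text element on its own finds nothing, because neither `<!TAG1` nor `!>` is a complete tag, while matching the joined paragraph text finds the one tag.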

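One way to handle this is to take the InnerText of each Paragraph and match it against your Regex. The InnerText property returns the plain text of the paragraph, without any formatting or other XML from the underlying document getting in the way.

Once you have your tags, replacing the text is the next problem. For the reasons above, you can't simply replace the InnerText with some new text, because it would not be clear which parts of the text belong to which Run. The simplest way around this is to remove all existing Runs and add a new Run with a Text element containing the new text.

The following code finds the tags and replaces them immediately, rather than using two passes as you suggest in the question. This is just to keep the example simple. It should show everything you need.

// requires: using System.Text.RegularExpressions;
//           using DocumentFormat.OpenXml;
//           using DocumentFormat.OpenXml.Packaging;
//           using DocumentFormat.OpenXml.Wordprocessing;

private static void ReplaceTags(string filename)
{
    Regex regex = new Regex("<!(.)*?!>", RegexOptions.Compiled);

    using (WordprocessingDocument wordDocument = WordprocessingDocument.Open(filename, true))
    {
        //grab the header parts and replace tags there
        foreach (HeaderPart headerPart in wordDocument.MainDocumentPart.HeaderParts)
        {
            ReplaceParagraphParts(headerPart.Header, regex);
        }
        //now do the document
        ReplaceParagraphParts(wordDocument.MainDocumentPart.Document, regex);
        //now replace the footer parts
        foreach (FooterPart footerPart in wordDocument.MainDocumentPart.FooterParts)
        {
            ReplaceParagraphParts(footerPart.Footer, regex);
        }
    }
}

private static void ReplaceParagraphParts(OpenXmlElement element, Regex regex)
{
    foreach (var paragraph in element.Descendants<Paragraph>())
    {
        //only the first tag in each paragraph is handled here
        Match match = regex.Match(paragraph.InnerText);
        if (match.Success)
        {
            //create a new run and set its value to the correct text
            //this must be done before the child runs are removed otherwise
            //paragraph.InnerText will be empty
            Run newRun = new Run();
            newRun.AppendChild(new Text(paragraph.InnerText.Replace(match.Value, "some new value")));
            //remove any child runs
            paragraph.RemoveAllChildren<Run>();
            //add the newly created run
            paragraph.AppendChild(newRun);
        }
    }
}

The only downside of the above approach is that any styles you had will be lost. These can be copied from the existing Runs, but if there are multiple Runs with different properties you will need to work out which ones to copy where. There is nothing to stop you creating multiple Runs in the above code, each with different properties, if that is what you need.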
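As a rough illustration of copying properties from an existing Run, the sketch below collapses a paragraph's runs into one and reuses a copy of the first run's `w:rPr`. It is written against the raw XML with System.Xml.Linq rather than the SDK types used above, so that it can be run on its own. `RunCollapse` and `CollapseRuns` are hypothetical names, and "use the first run's formatting" is an assumption that will not suit every document.

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class RunCollapse
{
    static readonly XNamespace W =
        "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    // Replaces every w:r child of the paragraph with a single run holding newText,
    // reusing a deep copy of the first run's w:rPr if it has one.
    public static void CollapseRuns(XElement paragraph, string newText)
    {
        var runs = paragraph.Elements(W + "r").ToList();
        if (runs.Count == 0) return;

        var newRun = new XElement(W + "r");

        XElement firstProps = runs[0].Element(W + "rPr");
        if (firstProps != null)
        {
            newRun.Add(new XElement(firstProps)); // copy constructor makes a deep copy
        }

        // xml:space="preserve" keeps leading and trailing spaces in the text.
        newRun.Add(new XElement(W + "t",
            new XAttribute(XNamespace.Xml + "space", "preserve"),
            newText));

        // Put the new run where the first old run was, then drop the old runs.
        runs[0].AddBeforeSelf(newRun);
        foreach (XElement run in runs)
        {
            run.Remove();
        }
    }
}
```

Unlike the listing above, which appends the new Run at the end of the paragraph, this sketch inserts it at the position of the first old run. Non-run children such as `w:pPr` and `w:proofErr` are left where they are.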

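Answer 1 (score: 1)

I'm not sure whether the SDK does this better, but the following works and produces a dictionary that maps each tag name to the element on which you can set the new value: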

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Text.RegularExpressions;
using System.Threading.Tasks;
using System.Xml.Linq;

namespace ConsoleApplication8
{
    class Program
    {
        static void Main(string[] args)
        {
            Dictionary<string, XElement> lookupTable = new Dictionary<string, XElement>();
            Regex reg = new Regex(@"\<\!(?<TagName>.*)\!\>");

            XDocument doc = XDocument.Load("document.xml");
            XNamespace ns = doc.Root.GetNamespaceOfPrefix("w");
            IEnumerable<XElement> elements = doc.Root.Descendants(ns + "t").Where(x=> x.Value.StartsWith("<!")).ToArray();
            foreach (var item in elements)
            {
                #region remove the grammar tag
                //before
                XElement grammar = item.Parent.PreviousNode as XElement;
                grammar.Remove();
                //after
                grammar = item.Parent.NextNode as XElement;
                grammar.Remove();
                #endregion
                #region merge the two nodes and insert the name and the XElement to the dictionary
                XElement next = (item.Parent.NextNode as XElement).Element(ns + "t");
                string totalTagName = string.Format("{0}{1}", item.Value, next.Value);
                item.Parent.NextNode.Remove();
                item.Value = totalTagName;
                lookupTable.Add(reg.Match(totalTagName).Groups["TagName"].Value, item);
                #endregion
            }
            foreach (var item in lookupTable)
            {
                Console.WriteLine("The document contains a tag {0}" , item.Key);
                Console.WriteLine(item.Value.ToString());
            }


        }
    }
}

Edit

A more complete example of a possible structure you could build:

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Xml.Linq;
using System.IO.Compression; //you will have to add a reference to System.IO.Compression.FileSystem(.dll)
using System.IO;
using System.Text.RegularExpressions;

namespace ConsoleApplication28
{
    public class MyWordDocument
    {
        #region fields

        private string fileName;
        private XDocument document;
        //todo: create fields for all document xml files that can contain the placeholders

        private Dictionary<string, List<XElement>> lookUpTable;

        #endregion

        #region properties

        public IEnumerable<string> Tags { get { return lookUpTable.Keys; } }

        #endregion

        #region construction

        public MyWordDocument(string fileName)
        {
            this.fileName = fileName;
            ExtractDocument();
            CreateLookUp();
        }

        #endregion
        #region methods

        public void ReplaceTagWithValue(string tagName, string value)
        {
            foreach (var item in lookUpTable[tagName])
            {
                item.Value = item.Value.Replace(string.Format(@"<!{0}!>", tagName),value);
            }
        }

        public void Save(string fileName)
        {
            document.Save(@"temp\word\document.xml");
            //todo: save other parts of document here i.e. footer header or other stuff

            ZipFile.CreateFromDirectory("temp", fileName);
        }

        private void CreateLookUp()
        {
            //todo: make this work for all cases and for all files that can contain the placeholders
            //tip: open the raw document in word and replace the tags,
            //     save the file to different location and extract the xmlfiles of both versions and compare to see what you have to do
            lookUpTable = new Dictionary<string, List<XElement>>();
            Regex reg = new Regex(@"\<\!(?<TagName>.*)\!\>");
            document = XDocument.Load(@"temp\word\document.xml");
            XNamespace ns = document.Root.GetNamespaceOfPrefix("w");
            IEnumerable<XElement> elements = document.Root.Descendants(ns + "t").Where(NodeGotSplitUpIn2PartsDueToGrammarCheck).ToArray();
            foreach (var item in elements)
            {
                XElement grammar = item.Parent.PreviousNode as XElement;
                grammar.Remove();
                grammar = item.Parent.NextNode as XElement;
                grammar.Remove();
                XElement next = (item.Parent.NextNode as XElement).Element(ns + "t");
                string totalTagName = string.Format("{0}{1}", item.Value, next.Value);
                item.Parent.NextNode.Remove();
                item.Value = totalTagName;
                string tagName = reg.Match(totalTagName).Groups["TagName"].Value;
                if (lookUpTable.ContainsKey(tagName))
                {
                    lookUpTable[tagName].Add(item);
                }
                else
                {
                    lookUpTable.Add(tagName, new List<XElement> { item });
                }
            }
        }

        private bool NodeGotSplitUpIn2PartsDueToGrammarCheck(XElement node)
        {
            XNamespace ns = node.Document.Root.GetNamespaceOfPrefix("w");
            return node.Value.StartsWith("<!") && ((XElement)node.Parent.PreviousNode).Name == ns + "proofErr";
        }


        private void ExtractDocument()
        {
            if (!Directory.Exists("temp"))
            {
                Directory.CreateDirectory("temp");
            }
            else
            {
                Directory.Delete("temp",true);
                Directory.CreateDirectory("temp");
            }
            ZipFile.ExtractToDirectory(fileName, "temp");
        }

        #endregion
    }
}

And use it like this:

class Program
{
    static void Main(string[] args)
    {
        MyWordDocument doc = new MyWordDocument("somedoc.docx"); //todo: fix path

        foreach (string name in doc.Tags) //name would be the extracted name from the placeholder
        {
            doc.ReplaceTagWithValue(name, "Example");
        }

        doc.Save("output.docx"); //todo: fix path
    }
}
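One detail worth checking in the listings above: the pattern `\<\!(?<TagName>.*)\!\>` uses a greedy `.*`, so when a single text element holds more than one tag it captures everything up to the last `!>`. The sketch below compares it with a lazy `.*?` variant. `TagNamePatterns`, `GreedyNames` and `LazyNames` are illustrative names, not part of the answer's code.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class TagNamePatterns
{
    // Pattern as used in the listings above (greedy).
    static readonly Regex Greedy = new Regex(@"\<\!(?<TagName>.*)\!\>");

    // Lazy variant: the capture stops at the first "!>".
    static readonly Regex Lazy = new Regex(@"\<\!(?<TagName>.*?)\!\>");

    // Collects the TagName group of every match of the given pattern.
    static List<string> Names(Regex pattern, string text)
    {
        var names = new List<string>();
        foreach (Match m in pattern.Matches(text))
        {
            names.Add(m.Groups["TagName"].Value);
        }
        return names;
    }

    public static List<string> GreedyNames(string text) { return Names(Greedy, text); }

    public static List<string> LazyNames(string text) { return Names(Lazy, text); }
}
```

For the input `<!A!> and <!B!>`, the greedy pattern yields the single name `A!> and <!B`, while the lazy one yields `A` and `B`. If a text element can contain several placeholders, the lazy form is probably what is wanted.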

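Answer 2 (score: 1)

I had the same need, except that I wanted to use ${...} entries instead of <!...!>. You can customize the code below to use your tags, but it would require more states.

The following code works with both XML and OpenXML nodes. I tested the code using plain XML because, when it comes to Word documents, it is hard to control how Word arranges the paragraph, run and text elements. I guess it's not impossible, but this way I have more control: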

static void Main(string[] args)
{
  //FillInValues(FileName("test01.docx"), FileName("test01_out.docx"));

  string[,] tests =
  {
    { "<r><t>${abc</t><t>}$</t><t>{tha}</t></r>", "<r><t>ABC</t><t>THA</t><t></t></r>"},
    { "<r><t>$</t><t>{</t><t>abc</t><t>}</t></r>", "<r><t>ABC</t><t></t></r>"},
    {"<r><t>${abc}</t></r>", "<r><t>ABC</t></r>" },
    {"<r><t>x${abc}</t></r>", "<r><t>xABC</t></r>" },
    {"<r><t>x${abc}y</t></r>", "<r><t>xABCy</t></r>" },
    {"<r><t>x${abc}${tha}z</t></r>", "<r><t>xABCTHAz</t></r>" },
    {"<r><t>x${abc}u${tha}z</t></r>", "<r><t>xABCuTHAz</t></r>" },
    {"<r><t>x${ab</t><t>c}u</t></r>", "<r><t>xABC</t><t>u</t></r>" },
    {"<r><t>x${ab</t><t>yupeekaiiei</t><t>c}u</t></r>", "<r><t>xABYUPEEKAIIEIC</t><t>u</t></r>" },
    {"<r><t>x${ab</t><t>yupeekaiiei</t><t>}</t></r>", "<r><t>xABYUPEEKAIIEI</t><t></t></r>" },

  };


  for (int i = 0; i < tests.GetLength(0); i++)
  {
    string value = tests[i, 0];
    string expectedValue = tests[i, 1];
    string actualValue = Test(value);
    Console.WriteLine($"{value} => {actualValue} == {expectedValue} = {actualValue == expectedValue}");

  }

  Console.WriteLine("Done!");
  Console.ReadLine();
}


public interface ITextReplacer
{
  string ReplaceValue(string value);
}

public class DefaultTextReplacer : ITextReplacer
{
  public string ReplaceValue(string value) { return $"{value.ToUpper()}"; }
}

public interface ITextElement
{
  string Value { get; set; }
  void RemoveFromParent();
}


public class XElementWrapper : ITextElement
{
  private XElement _element;

  public XElementWrapper(XElement element) { _element = element; }

  string ITextElement.Value
  {
    get { return _element.Value; }
    set { _element.Value = value; }
  }

  public XElement Element
  {
    get { return _element; }
    set { _element = value; }
  }

  public void RemoveFromParent()
  {
    _element.Remove();
  }


}

public class OpenXmlTextWrapper : ITextElement
{
  private Text _text;
  public OpenXmlTextWrapper(Text text) { _text = text; }

  public string Value
  {
    get { return _text.Text; }
    set { _text.Text = value; }
  }

  public Text Text
  {
    get { return _text; }
    set { _text = value; }
  }

  public void RemoveFromParent() { _text.Remove(); }
}


private static void FillInValues(string sourceFileName, string destFileName)
{
  File.Copy(sourceFileName, destFileName, true);

  using (WordprocessingDocument doc =
    WordprocessingDocument.Open(destFileName, true))
  {
    var body = doc.MainDocumentPart.Document.Body;
    var paras = body.Descendants<Paragraph>();

    SimpleStateMachine stateMachine = new SimpleStateMachine();

    //stateMachine.TextReplacer = <your implementation object >
    ProcessParagraphs(paras, stateMachine);
  }
}

private static void ProcessParagraphs(IEnumerable<Paragraph> paras, SimpleStateMachine stateMachine)
{
  foreach (var para in paras)
  {
    foreach (var run in para.Elements<Run>())
    {
      //Console.WriteLine("New run:");

      var texts = run.Elements<Text>().ToArray();

      for (int k = 0; k < texts.Length; k++)
      {
        OpenXmlTextWrapper wrapper = new OpenXmlTextWrapper(texts[k]);
        stateMachine.HandleText(wrapper);
      }
    }
  }
}

public class SimpleStateMachine
{
  // 0 - outside - initial state
  // 1 - $ matched
  // 2 - ${ matched
  // 3 - } - final state

  // 0 -> 1 $
  // 0 -> 0 anything other than $
  // 1 -> 2 {
  // 1 -> 0 anything other than {
  // 2 -> 3 }
  // 2 -> 2 anything other than }
  // 3 -> 0

  public ITextReplacer TextReplacer { get; set; } = new DefaultTextReplacer();
  public int State { get; set; } = 0;
  public List<ITextElement> TextsList { get; } = new List<ITextElement>();
  public StringBuilder Buffer { get; } = new StringBuilder();


  /// <summary>
  /// The index inside the Text element where the $ is found
  /// </summary>
  public int Position { get; set; }

  public void Reset()
  {
    State = 0;
    TextsList.Clear();
    Buffer.Clear();
  }

  public void Add(ITextElement text)
  {
    if (TextsList.Count == 0 || TextsList.Last() != text)
    {
      TextsList.Add(text);
    }
  }

  public void HandleText(ITextElement text)
  {
    // Scan the characters

    for (int i = 0; i < text.Value.Length; i++)
    {
      char c = text.Value[i];

      switch (State)
      {
        case 0:
          if (c == '$')
          {
            State = 1;
            Position = i;
            Add(text);
          }
          break;
        case 1:
          if (c == '{')
          {
            State = 2;
            Add(text);
          }
          else
          {
            Reset();
          }
          break;
        case 2:
          if (c == '}')
          {
            Add(text);

            Console.WriteLine("Found: " + Buffer);
            // We are on the final State
            // I will use the first text in the stack and discard the others


            // Here I am going to distinguish between whether I have only one item or more
            if (TextsList.Count == 1)
            {
              // Happy path - we have only one item - set the replacement value and then continue scanning
              string prefix = TextsList[0].Value.Substring(0, Position) + TextReplacer.ReplaceValue(Buffer.ToString());
              // Set the current index to point to the end of the prefix. The program will continue with the next items
              TextsList[0].Value = prefix + TextsList[0].Value.Substring(i + 1);
              i = prefix.Length - 1;
              Reset();
            }
            else
            {
              // We have more than one item - discard the inbetweeners

              for (int j = 1; j < TextsList.Count - 1; j++)
              {
                TextsList[j].RemoveFromParent();
              }

              // I will set the value under the first Text item where the $ was found
              TextsList[0].Value = TextsList[0].Value.Substring(0, Position) + TextReplacer.ReplaceValue(Buffer.ToString());
              // Set the text for the current item to the remaining chars
              text.Value = text.Value.Substring(i + 1);
              i = -1;
              Reset();
            }
          }
          else
          {
            Buffer.Append(c);
            Add(text);
          }
          break;
      }
    }
  }
}

public static string Test(string xml)
{
  XElement root = XElement.Parse(xml);
  SimpleStateMachine stateMachine = new SimpleStateMachine();


  foreach (XElement element in root.Descendants()
    .Where(desc => !desc.Elements().Any()))
  {
    XElementWrapper wrapper = new XElementWrapper(element);
    stateMachine.HandleText(wrapper);
  }

  return root.ToString(SaveOptions.DisableFormatting);
}

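I know my answer is late, but it might be useful to others. Also, make sure you test it. I will do more testing with real files tomorrow. If I find any bugs I will fix the code here, but so far so good.

Update: the code does not work when the ${...} placeholders are placed in a table. This is a problem with the code that scans the document (the FillInValues function).

Update: I changed the code to scan all paragraphs.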
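For the <!...!> format from the question, the extra states mentioned at the top of this answer could look roughly like the sketch below. It only finds tag names across a sequence of text segments and does not replace anything. `AngleTagScanner`, `Feed` and `Found` are hypothetical names and are not part of the code above.

```csharp
using System;
using System.Collections.Generic;
using System.Text;

public class AngleTagScanner
{
    // 0 - outside a tag
    // 1 - '<' matched
    // 2 - "<!" matched, collecting the tag name
    // 3 - '!' seen inside the tag, which may start the closing "!>"
    private int state;
    private readonly StringBuilder buffer = new StringBuilder();
    private readonly List<string> found = new List<string>();

    public IList<string> Found { get { return found; } }

    // Feed the text of one Text element; state is kept between calls,
    // so a tag may be split over any number of segments.
    public void Feed(string segment)
    {
        foreach (char c in segment)
        {
            switch (state)
            {
                case 0:
                    if (c == '<') state = 1;
                    break;
                case 1:
                    if (c == '!') { state = 2; buffer.Clear(); }
                    else if (c != '<') state = 0; // a second '<' keeps us in state 1
                    break;
                case 2:
                    if (c == '!') state = 3;
                    else buffer.Append(c);
                    break;
                case 3:
                    if (c == '>') { found.Add(buffer.ToString()); state = 0; }
                    else if (c == '!') buffer.Append('!'); // earlier '!' was part of the name
                    else { buffer.Append('!').Append(c); state = 2; }
                    break;
            }
        }
    }
}
```

Feeding the three segments `TRY <!TAG1`, `!> and <` and `!B!C!>` in order yields the names `TAG1` and `B!C`. The fourth state is needed because a `!` inside the name is only the start of the closing delimiter if the next character is `>`.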