Question

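I have a problem with my XML.

I need to remove every tag that is not in a list,

but it does not work for the dd tag.

Here is the XML input

Here is my list. It contains the only tags that are allowed in the XML: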

            lstApprove.Add("Styles");
            lstApprove.Add("alto");
            lstApprove.Add("Description");
            lstApprove.Add("MeasurementUnit");
            lstApprove.Add("sourceImageInformation");
            lstApprove.Add("fileName");
            lstApprove.Add("OCRProcessing");
            lstApprove.Add("preProcessingStep");
            lstApprove.Add("processingSoftware");
            lstApprove.Add("softwareCreator");
            lstApprove.Add("softwareName");
            lstApprove.Add("softwareVersion");
            lstApprove.Add("ocrProcessingStep");
            lstApprove.Add("ParagraphStyle");
            lstApprove.Add("Layout");
            lstApprove.Add("Page");
            lstApprove.Add("PrintSpace");
            lstApprove.Add("TextBlock");
            lstApprove.Add("TextLine");
            lstApprove.Add("String");
            lstApprove.Add("SP");
            lstApprove.Add("ComposedBlock");
            lstApprove.Add("GraphicalElement");

Here is the code that removes the tags that are not in the list:

                using (StreamReader reader = new StreamReader(xmlFile))
                {
                    nAlto = reader.ReadToEnd();

                    nAlto = nAlto.Replace("<document xmlns=\"http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">", "<document>");
                    nAlto = nAlto.Replace("<?xml version=\"1.0\" encoding=\"UTF-16\"?>", "");
                }

                XDocument doc = XDocument.Parse(nAlto);
                foreach (var item in doc.Descendants().ToList())
                {
                    if (!lstApprove.Contains(item.Name.ToString()))
                    {
                        if (item.HasElements)
                        {
                            item.ReplaceWith(item.Elements());
                        }
                        else
                        {
                            item.Remove();
                        }
                    }
                }

Here is the output:

http://pastebin.com/XjYBTWci

Here is part of the XML output:

<dd l="2342.29" t="133.12" r="2427.71" b="209.17">
          <TextBlock ID="P1_TB0000001" TAGREFS="LAYOUT_TAG_001" HPOS="2349.17" VPOS="160" WIDTH="71.66" HEIGHT="36.04" STYLEREFS="PAR_LEFT">
            <TextLine ID="P1_TL0000001" HPOS="2362.92" VPOS="160" WIDTH="44.16" HEIGHT="36.04">
              <String ID="P1_ST0000001" HPOS="2362.92" VPOS="160" WIDTH="44.16" HEIGHT="36.04" CONTENT="43" />
            </TextLine>
          </TextBlock>
        </dd>

I still have the dd tag even though it is not in my list. Why? Thanks.

Answer 1

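The problem is that when you replace item with its elements, item is removed from the XDocument, and so are its children. So when you later try to remove a child of the removed element, <dd> in this case, that child is detached from the XDocument and removing it has no effect on the document. To solve this, you need to store the parent of the element you want to delete and then iterate recursively over its children.

Try this function:

    public static void RemoveRecursive(XElement current, List<string> goodNames)
    {
        var parent = current;
        if (!goodNames.Contains(current.Name.ToString()))
        {
            // Remember the parent before unwrapping: after ReplaceWith,
            // current is detached and its children live under parent.
            parent = current.Parent;
            current.ReplaceWith(current.Elements());
        }
        // Recurse over elements that are still attached to the document.
        foreach (var element in parent.Elements())
        {
            RemoveRecursive(element, goodNames);
        }
    }

where current is doc.Root and goodNames is lstApprove.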
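As an alternative to recursion, the same detachment problem can be avoided by visiting elements in reverse document order, so that every child is handled while it is still attached, before its parent is unwrapped. The following is only a minimal sketch under assumptions: the names TagFilter and KeepOnlyApproved and the sample XML are made up for illustration, the root element is deliberately skipped, and names are matched on LocalName.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

// Hypothetical helper, not part of the original question or answers.
public static class TagFilter
{
    // Unwraps (or removes, if childless) every descendant of root
    // whose local name is not in the approved set. The root itself is kept.
    public static void KeepOnlyApproved(XElement root, ISet<string> approved)
    {
        // Reverse document order visits children before their parents, so each
        // element is still attached to the document when it is processed.
        foreach (XElement item in root.Descendants().Reverse().ToList())
        {
            if (!approved.Contains(item.Name.LocalName))
            {
                // With no child elements this simply removes the item.
                item.ReplaceWith(item.Elements());
            }
        }
    }

    public static void Main()
    {
        // Made-up sample: zone and dd are not approved, TextBlock and TextLine are.
        var doc = XDocument.Parse(
            "<document><zone><dd l=\"1\"><TextBlock ID=\"TB1\">" +
            "<TextLine ID=\"TL1\"/></TextBlock></dd></zone></document>");

        KeepOnlyApproved(doc.Root, new HashSet<string> { "TextBlock", "TextLine" });

        // zone and dd are gone; TextBlock now sits directly under document.
        Console.WriteLine(doc);
    }
}
```

Like the code in the question, this drops any text that sits directly inside an unwrapped element, because only child elements are carried over.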

Answer 2

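I think the problem with your code is that you are missing the closing forward slash for "/document" in your replace. Here is code that works: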

using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;

namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            List<string> lstApprove = new List<string>() {
                "Styles", "alto", "Description", "MeasurementUnit",
                "sourceImageInformation",
                "fileName", "OCRProcessing", "preProcessingStep",
                "processingSoftware", "softwareCreator", "softwareName",
                "softwareVersion", "ocrProcessingStep", "ParagraphStyle",
                "Layout", "Page", "PrintSpace", "TextBlock",
                "TextLine", "String", "SP", "ComposedBlock", "GraphicalElement"
            };


            XDocument doc = XDocument.Load(FILENAME);

            List<XElement> elements = doc.Descendants().Where(x => lstApprove.Contains(x.Name.LocalName)).ToList();

            string xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>" +
                         "<document xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"></document>";

            XDocument newDoc = XDocument.Parse(xml);
            XElement document = (XElement)newDoc.FirstNode;
            document.Add(elements);
        }
    }
}
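Note that the code above compares x.Name.LocalName, while the question compares item.Name.ToString() after stripping the xmlns attribute with a string replace. A small sketch of the difference between the two forms, using the namespace URI from the question on a made-up one-element document (NameDemo and DescribeFirstChild are hypothetical names used only for illustration):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

// Hypothetical demo class, not part of the original answers.
public static class NameDemo
{
    // Returns both string forms of the first child element's name.
    public static (string Full, string Local) DescribeFirstChild(string xml)
    {
        XElement child = XDocument.Parse(xml).Root.Elements().First();
        return (child.Name.ToString(), child.Name.LocalName);
    }

    public static void Main()
    {
        var (full, local) = DescribeFirstChild(
            "<document xmlns=\"http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd\">" +
            "<TextBlock/></document>");

        Console.WriteLine(full);  // {http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd}TextBlock
        Console.WriteLine(local); // TextBlock
    }
}
```

Matching on LocalName therefore lets the approved list work without editing the namespace declaration out of the raw XML text.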

XML LINQ cannot remove the element