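I have a problem with my XML.
I need to remove every tag that is not in my list,
but it does not work for the dd tag.
This is the XML input
This is my list. It contains the only tags that are allowed in the XML:
lstApprove.Add("Styles");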
lstApprove.Add("alto");
lstApprove.Add("Description");
lstApprove.Add("MeasurementUnit");
lstApprove.Add("sourceImageInformation");
lstApprove.Add("fileName");
lstApprove.Add("OCRProcessing");
lstApprove.Add("preProcessingStep");
lstApprove.Add("processingSoftware");
lstApprove.Add("softwareCreator");
lstApprove.Add("softwareName");
lstApprove.Add("softwareVersion");
lstApprove.Add("ocrProcessingStep");
lstApprove.Add("ParagraphStyle");
lstApprove.Add("Layout");
lstApprove.Add("Page");
lstApprove.Add("PrintSpace");
lstApprove.Add("TextBlock");
lstApprove.Add("TextLine");
lstApprove.Add("String");
lstApprove.Add("SP");
lstApprove.Add("ComposedBlock");
lstApprove.Add("GraphicalElement");
Here is the code that removes the tags that are not in the list:
using (StreamReader reader = new StreamReader(xmlFile))
{
    nAlto = reader.ReadToEnd();
    nAlto = nAlto.Replace("<document xmlns=\"http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd\" xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\">", "<document>");
    nAlto = nAlto.Replace("<?xml version=\"1.0\" encoding=\"UTF-16\"?>", "");
}
XDocument doc = XDocument.Parse(nAlto);
foreach (var item in doc.Descendants().ToList())
{
    if (!lstApprove.Contains(item.Name.ToString()))
    {
        if (item.HasElements)
        {
            item.ReplaceWith(item.Elements());
        }
        else
        {
            item.Remove();
        }
    }
}
This is part of the XML output:
<dd l="2342.29" t="133.12" r="2427.71" b="209.17">
    <TextBlock ID="P1_TB0000001" TAGREFS="LAYOUT_TAG_001" HPOS="2349.17" VPOS="160" WIDTH="71.66" HEIGHT="36.04" STYLEREFS="PAR_LEFT">
        <TextLine ID="P1_TL0000001" HPOS="2362.92" VPOS="160" WIDTH="44.16" HEIGHT="36.04">
            <String ID="P1_ST0000001" HPOS="2362.92" VPOS="160" WIDTH="44.16" HEIGHT="36.04" CONTENT="43" />
        </TextLine>
    </TextBlock>
</dd>
I still have the dd tag, even though it is not in my list. Why? Thanks.
Answer 0 (score: 1)
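The problem is that when you replace item with its child elements, item is removed from the XDocument, and its original children go with it (ReplaceWith inserts copies of them, because the originals still belong to the removed element). So when the loop later tries to remove a child of an already removed element, such as <dd>, that child is detached from the XDocument and the call has no effect on the document. To fix this, store the parent of the element you want to remove and then iterate over that parent's children recursively.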
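A small sketch of that effect. The sample document, the outer wrapper element and the class name DetachDemo are made up for illustration and do not come from the question:

```csharp
using System;
using System.Xml.Linq;

public static class DetachDemo
{
    public static void Main()
    {
        // Hypothetical input: "outer" stands in for any unapproved wrapper.
        XDocument doc = XDocument.Parse(
            "<root><outer><dd><TextBlock /></dd></outer></root>");
        XElement outer = doc.Root.Element("outer");
        XElement dd = outer.Element("dd");   // reference taken before the replace

        outer.ReplaceWith(outer.Elements());

        // The old reference now lives in the detached outer element.
        Console.WriteLine(dd.Document == null);                          // True
        // The document holds a separate copy of dd.
        Console.WriteLine(ReferenceEquals(dd, doc.Root.Element("dd")));  // False

        // Removing the old reference therefore leaves the document untouched.
        dd.Remove();
        Console.WriteLine(doc.Root.Element("dd") != null);               // True
    }
}
```

This mirrors what happens in the question: the ToList() snapshot holds the old dd reference, while the document holds the copy.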
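Try this function, passing doc.Root as current and your list of approved tags as goodNames:
public static void RemoveRecursive(XElement current, List<string> goodNames)
{
    var parent = current;
    // The root element has no parent element to splice its children into,
    // so it is always kept.
    if (current.Parent != null && !goodNames.Contains(current.Name.ToString()))
    {
        // Remember the parent before current is detached from the tree.
        parent = current.Parent;
        current.ReplaceWith(current.Elements());
    }
    // Continue with elements that are still attached to the document.
    foreach (var element in parent.Elements())
    {
        RemoveRecursive(element, goodNames);
    }
}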
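If recursion is not a requirement, a bottom-up loop may be another way to get the same result. This is only a sketch under a few assumptions: the names TagFilter and UnwrapUnapproved and the zone tag in the sample are made up, the root element is always kept, and names are compared by LocalName so the namespace does not have to be stripped first.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Xml.Linq;

public static class TagFilter
{
    // Unwraps every descendant of root whose local name is not approved.
    // The root element itself is left alone (an assumption of this sketch).
    public static void UnwrapUnapproved(XElement root, ISet<string> approved)
    {
        // Snapshot in reverse document order: each element is visited after
        // all of its descendants, while it is still attached to the tree.
        foreach (XElement item in root.Descendants().Reverse().ToList())
        {
            if (!approved.Contains(item.Name.LocalName))
            {
                // Elements() keeps child elements only; an element without
                // children is simply removed. Text nodes of item are dropped.
                item.ReplaceWith(item.Elements());
            }
        }
    }

    public static void Main()
    {
        var approved = new HashSet<string> { "TextBlock", "TextLine", "String" };
        // Hypothetical sample; "zone" is an invented unapproved wrapper.
        XElement root = XElement.Parse(
            "<document><zone><dd><TextBlock><TextLine>" +
            "<String CONTENT=\"43\" /></TextLine></TextBlock></dd></zone></document>");

        UnwrapUnapproved(root, approved);
        Console.WriteLine(root);
    }
}
```

In the sample, zone and dd are gone afterwards and TextBlock sits directly under document. Because ancestors are handled last, no element is replaced after one of its ancestors has already been detached.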
Answer 1 (score: 0)
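I think the problem with your code is that you left out the forward slash of the closing "/document" tag in the Replace call. Here is code that works: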
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;
using System.Xml;
using System.Xml.Linq;
namespace ConsoleApplication1
{
    class Program
    {
        const string FILENAME = @"c:\temp\test.xml";
        static void Main(string[] args)
        {
            List<string> lstApprove = new List<string>() {
                "Styles", "alto", "Description", "MeasurementUnit",
                "sourceImageInformation",
                "fileName", "OCRProcessing", "preProcessingStep",
                "processingSoftware", "softwareCreator", "softwareName",
                "softwareVersion", "ocrProcessingStep", "ParagraphStyle",
                "Layout", "Page", "PrintSpace", "TextBlock",
                "TextLine", "String", "SP", "ComposedBlock", "GraphicalElement"
            };
            XDocument doc = XDocument.Load(FILENAME);
            List<XElement> elements = doc.Descendants().Where(x => lstApprove.Contains(x.Name.LocalName)).ToList();
            string xml = "<?xml version=\"1.0\" encoding=\"UTF-16\"?>" +
                "<document xmlns:xsi=\"http://www.w3.org/2001/XMLSchema-instance\"></document>";
            XDocument newDoc = XDocument.Parse(xml);
            XElement document = (XElement)newDoc.FirstNode;
            document.Add(elements);
        }
    }
}
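Note that the code above compares x.Name.LocalName, while the question compares item.Name.ToString(). For an element inside a namespace the two differ, which is presumably why the question strips the xmlns attribute with Replace before parsing. A minimal illustration (the class name NameDemo and the tiny sample document are made up; the namespace URI is the one from the question):

```csharp
using System;
using System.Linq;
using System.Xml.Linq;

public static class NameDemo
{
    public static void Main()
    {
        // The default namespace on the root is inherited by TextBlock.
        XElement root = XElement.Parse(
            "<document xmlns=\"http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd\">" +
            "<TextBlock /></document>");
        XElement block = root.Elements().First();

        // ToString() returns the expanded name, including the namespace.
        Console.WriteLine(block.Name.ToString());
        // prints {http://www.scansoft.com/omnipage/xml/ssdoc-schema3.xsd}TextBlock

        // LocalName is the bare tag name and matches the entries in the list.
        Console.WriteLine(block.Name.LocalName);
        // prints TextBlock
    }
}
```

Comparing by LocalName lets the document be loaded as-is, without editing the XML text first.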