How to write a regular expression that finds HTML tags outside CDATA sections in an XML document

Time: 2015-04-22 23:57:58

Tags: html regex xml notepad++

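I am trying to import an ONIX (XML) file that causes import errors because of HTML tags in the descriptive text. In this particular file, some of the descriptive text is wrapped in CDATA sections, but some of it apparently is not.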

How can I write a regular expression that finds HTML tags that are not enclosed in a CDATA section?

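I am using a VB.NET application to import the data into a SQL Server database, but for now I am trying to write the regular expression in Notepad++ to see what is possible. I can incorporate the regex into the VB code later.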

Here is a sample of XML that imports correctly:

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text><![CDATA[More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.]]></Text>
</OtherText>

Here is XML that does not import correctly:

<OtherText>
  <TextTypeCode>01</TextTypeCode>
  <TextFormat>02</TextFormat>
  <Text>More than simply a series of chapters on the theology of John's Gospel, <em>Jesus Is the Christ</em> relates each of John's teachings to his declared aim, expressed in John 20: 30-31: "Jesus did many other signs before his disciples, which have not been written in this book; but these have been written that you may believe that Jesus is the Christ, the Son of God, and that believing you may have life in his name." Indeed, each chapter in Morris's book takes up some facet or aspect of John's expressed aim.<br/><br/>For an age still asking the question "Who is Jesus?" Leon Morris argues convincingly that John's entire Gospel was written to show that the human Jesus is the Christ, or Messiah, as well as the Son of God. But it is Morris's firm conviction that John's purpose was evangelical as well as theological -- that is, John wrote his book so that readers might believe in Christ and as a result have eternal life.</Text>
</OtherText>

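Now, <TextFormat>02</TextFormat> indicates that the content of the <Text> element is HTML, so I can handle that fine. The problem arises when a record is not marked up correctly. I need to find those records so I can correct them.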

1 answer:

Answer 0: (score: 0)

This regular expression might get you somewhere:

<\w+>(?!<!\[CDATA\[)

I ran it in Sublime Text against the samples you provided, and it only matched tags that are not immediately followed by a CDATA section.
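
As a rough way to try the idea outside an editor, here is a minimal Python sketch. It is only an illustration under some assumptions: the names `TEXT_WITHOUT_CDATA`, `HTML_TAG` and `find_unwrapped_html` are hypothetical, the description is assumed to live in a `<Text>` element without attributes, and the literal brackets of `<![CDATA[` are escaped so the regex engine does not read them as a character class. It narrows the lookahead to `<Text>` and then checks whether the captured content contains an HTML-like tag.

```python
import re

# The pattern from the answer, with the literal "[" characters escaped.
TAG_NOT_BEFORE_CDATA = re.compile(r"<\w+>(?!<!\[CDATA\[)")

# Hypothetical narrowing: a <Text> element whose content does not start
# with a CDATA section (optional whitespace allowed before it).
TEXT_WITHOUT_CDATA = re.compile(
    r"<Text>(?!\s*<!\[CDATA\[)(.*?)</Text>", re.DOTALL
)

# Rough check for an HTML-like tag such as <em>, </em> or <br/>.
HTML_TAG = re.compile(r"</?[A-Za-z][A-Za-z0-9]*\s*/?>")


def find_unwrapped_html(xml_text):
    """Return the content of each <Text> element that holds raw HTML
    tags without a leading CDATA section."""
    return [
        match.group(1)
        for match in TEXT_WITHOUT_CDATA.finditer(xml_text)
        if HTML_TAG.search(match.group(1))
    ]


# Shortened stand-ins for the two records in the question.
good = "<Text><![CDATA[A <em>good</em> record.<br/>]]></Text>"
bad = "<Text>A <em>bad</em> record.<br/></Text>"

for content in find_unwrapped_html(good + "\n" + bad):
    print(content)
```

Note that the bare `<\w+>(?!<!\[CDATA\[)` pattern still matches `<em>` inside a CDATA section, because that tag is not itself followed by `<![CDATA[`. Anchoring on `<Text>` avoids this, at the cost of assuming the element name and that CDATA, when present, wraps the whole content.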