How to handle a string of more than one character in a regex replace expression in VB.NET

Asked: 2013-08-23 09:12:34

Tags: xml regex vb.net

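Right, I am trying to strip some quoted references out of an XML file that I downloaded from Wikipedia. So far the text looks like this (ignore the line breaks, they are only there to make it easier to read):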

'''Anarchism''' is a political philosophy that advocates stateless societies based on 
non-hierarchical free associations.<ref name="iaf-ifa.org"/><ref>"That is why 
Anarchy, when it works to destroy authority in all its aspects, when it demands
 the abrogation of laws and the abolition of the mechanism that serves to
 impose them, when it refuses all hierarchical organization and preaches free agreement - at the same time strives to maintain and enlarge the precious kernel of social customs without which
 no human or animal society can exist." Peter Kropotkin. http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin__Anarchism__its_philosophy_and_ideal.html
 Anarchism: its philosophy and ideal</ref><ref>"anarchists are opposed to irrational (e.g., illegitimate) 
authority, in other words, hierarchy - hierarchy being the institutionalisation of authority 
within a society." http://www.theanarchistlibrary.org/HTML/The_Anarchist_FAQ_Editorial_Collective__An_Anarchist_FAQ__03_17_.html#toc2 "B.1 
Why are anarchists against authority and hierarchy?" in An 
Anarchist FAQ</ref><ref>"ANARCHISM, a social philosophy that rejects
 authoritarian government and maintains that voluntary institutions are best
 suited to express man's natural social tendencies." George Woodcock. "Anarchism" at The Encyclopedia of Philosophy</ref><ref>"In a society developed on these lines, the voluntary 
associations which already now begin to cover all the fields of human activity
 would take a still greater extension so as to substitute themselves for the 
state in all its functions." http://www.theanarchistlibrary.org/HTML/Petr_Kropotkin___Anarchism__from_the_Encyclopaedia_Britannica.html
 Peter Kropotkin. "Anarchism" from the Encyclopædia Britannica</ref> Anarchism holds the state
 to be undesirable, unnecessary, or harmful

What I want to get out of this block of text is:

  

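Anarchism is a political philosophy that advocates stateless societies based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary, or harmful.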

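It seems to me that if I remove all the text between "<ref" and "/ref>", I should be able to capture all of the unwanted text and remove it. This is my code so far: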

        Dim temptext As String = newsrt.ToString
        Dim expression As New Regex("(?<=\<ref)[^/ref>]+(?=/ref>)")
        Dim resul As String = expression.Replace(temptext, "")

But this does not seem to work. None of the text between <ref and /ref> is captured and replaced with "".

Any help or advice would be great! Thanks.

1 Answer:

Answer 0: (score: 2)

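That is not how negated character classes work. The class [^/ref>] does not exclude the string /ref>; it excludes each of the single characters /, r, e, f and >, so the match stops at the first one of those it meets. Moreover, you do not want to exclude /ref> at all, because you also want to remove all the refs in between. You can simply use .* instead. You also do not need the lookarounds, because they keep whatever they match out of the match itself, and you want those tags removed as well. So in your case it should simply be: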

"<ref.*/ref>"

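Since * is greedy, this match will run from the first <ref to the last /ref>. That is usually a big problem with greediness, but in your particular case it is exactly what you want.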
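To make the greedy behaviour concrete, here is a minimal sketch that compares .* with its lazy form .*? on a made-up string. The sample text and the module and variable names are assumptions for illustration only, not part of the question's data.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module GreedyVersusLazy
    Sub Main()
        ' Hypothetical sample: two ref blocks with a word between them.
        Dim sample As String = "A<ref>one</ref> B<ref>two</ref> C"

        ' Greedy: .* runs from the first <ref to the last /ref>,
        ' so the " B" between the two blocks is removed as well.
        Dim greedyResult As String = Regex.Replace(sample, "<ref.*/ref>", "")

        ' Lazy: .*? stops at the nearest /ref>, so each block is
        ' removed on its own and the " B" between them survives.
        Dim lazyResult As String = Regex.Replace(sample, "<ref.*?/ref>", "")

        Console.WriteLine(greedyResult) ' A C
        Console.WriteLine(lazyResult)   ' A B C
    End Sub
End Module
```

In the question's text the ref blocks sit directly next to each other, which is why the greedy form is safe there. If ordinary prose could appear between two ref blocks, the lazy form would probably be the better starting point.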

You may want to use RegexOptions.Singleline so that . also matches line breaks (if there are any).
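Putting the pieces together, the following is a rough sketch of the suggested pattern combined with RegexOptions.Singleline. The input string is a shortened, made-up stand-in for the Wikipedia text, with a line break placed inside a ref block on purpose; the module name and the result variable names are likewise assumptions.

```vbnet
Imports System
Imports System.Text.RegularExpressions

Module StripRefs
    Sub Main()
        ' Hypothetical shortened input; the real text comes from the XML dump.
        Dim temptext As String =
            "Anarchism is a philosophy.<ref name=""x""/><ref>""Quoted" &
            Environment.NewLine &
            "text.""</ref> Anarchism holds the state to be harmful"

        ' Without Singleline, . does not match the line feed, so no
        ' <ref ... /ref> span is found and the text comes back unchanged.
        Dim unchanged As String = Regex.Replace(temptext, "<ref.*/ref>", "")

        ' With Singleline, . also matches line breaks, so the match runs
        ' from the first <ref to the last /ref> across the line break.
        Dim expression As New Regex("<ref.*/ref>", RegexOptions.Singleline)
        Dim result As String = expression.Replace(temptext, "")

        Console.WriteLine(unchanged = temptext) ' True
        Console.WriteLine(result)
        ' Anarchism is a philosophy. Anarchism holds the state to be harmful
    End Sub
End Module
```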