Can a regular expression in Java capture and remove this pattern?

Asked: 2011-11-02 00:42:01

Tags: java regex

Suppose I have a few lines of Wikipedia XML that look like this:

  

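[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]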

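I want to remove the part that starts with "[[Image:" and is closed by "observances]]". Several other stretches of the text may also contain brackets, and I don't want to do a greedy search, or I might accidentally remove those other brackets too.

For example, if I just used a greedy \\[\\[Image:.*\\]\\], I believe it would delete everything up to the last closing bracket (the one after Errico Malatesta).

Is there a regular expression that would make this easier?

5 answers:

Answer 0 (score: 2)

Let's see... how about using a lazy repetition instead of a greedy one?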

\[\[Image:.*?observances\]\]
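A minimal sketch of the difference between the greedy pattern from the question and the lazy pattern above. The class name, method names, and the shortened sample string are assumptions made for illustration; only the two patterns come from the thread.

```java
final class LazyVsGreedy {

    // Shortened stand-in for the Wikipedia text in the question (an assumption).
    static final String SAMPLE =
        "[[Image:ChicagoAnarchists.jpg|thumb|An engraving by [[Walter Crane]] "
        + "for [[May Day]] observances]] In 1907, the [[International Anarchist "
        + "Congress of Amsterdam]] met, including [[Errico Malatesta]]";

    // Greedy: .* runs to the end of the line and backtracks to the LAST "]]".
    static String stripGreedy(String s) {
        return s.replaceAll("\\[\\[Image:.*\\]\\]", "");
    }

    // Lazy: .*? stops at the FIRST "observances]]" after "[[Image:".
    static String stripLazy(String s) {
        return s.replaceAll("\\[\\[Image:.*?observances\\]\\]", "");
    }

    public static void main(String[] args) {
        // The sample ends with "]]", so the greedy version removes everything.
        System.out.println("greedy: '" + stripGreedy(SAMPLE) + "'");
        // The lazy version keeps the text after the image block.
        System.out.println("lazy:   '" + stripLazy(SAMPLE) + "'");
    }
}
```

Note that the lazy pattern depends on the literal word "observances" appearing right before the closing brackets, so it only fits this particular block, and `.` does not cross line breaks unless `(?s)` is added.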

Answer 1 (score: 0)

What about this example?

s.replaceAll("(\\[{2}Image:(?:(?:\\[{2}).*\\]{2}|[^\\[])*\\]{2})", "");

It replaces only this text:

  • [[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]

Answer 2 (score: 0)

This works:

str.replaceAll("^\\[\\[([^\\[]*?(\\[\\[[^\\]]*\\]\\])?[^\\[]*?)*?\\]\\]\\s*", "");

Output for the input above:

In 1907, the [[International...

This works because it looks for matched [[ ]] pairs (and the text around them) inside the first pair.

Answer 3 (score: 0)

Maybe something like this:

(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])

I tried it with:

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class My {

    public static void main(String[] args) {
        String foo = "[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]";
        // Group 1 runs lazily up to the first inner [[...]] link that is followed,
        // with no further '[' in between, by another closing ]].
        Matcher m = Pattern.compile("(.*?\\[\\[[^\\[]*?\\]\\][^\\[]*\\]\\])").matcher(foo);
        while (m.find()) {
            System.out.print(m.group(1));
        }
    }
}

It prints:

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed "Anarchists of Chicago" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]]

Hope this helps :D

Answer 4 (score: 0)

Using the following test string (note that I added a [[image:foobar[[foo [baz] bar]]foobar]] to it):

[[Image:ChicagoAnarchists.jpg|thumb|A sympathetic engraving by [[Walter Crane]] of the executed \"Anarchists of Chicago\" after the [[Haymarket affair]]. The Haymarket affair is generally considered the most significant event for the origin of international [[May Day]] observances]] In 1907, the [[International Anarchist Congress of[[image:foobar[[foo [baz] bar]]foobar]] Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

The regular expression:

(?i)\\[\\[image:(?:\\[\\[(?:(?!(?:\\[\\[|]])).)*]]|(?:(?!(?:\\[\\[|]])).)*?)*?]]

testString.replaceAll(<above pattern>, "") will return:

 In 1907, the [[International Anarchist Congress of Amsterdam]] gathered delegates from 14 different countries, among which important figures of the anarchist movement, including [[Errico Malatesta]]

Here is a more detailed explanation of the regular expression:

(?i)                    # Case insensitive flag
\[\[image:              # Match literal characters '[[image:'
(?:                     # Begin non-capturing group
  \[\[                  # Match literal characters '[['
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *                     # Match previous atom zero or more times
  ]]                    # Match literal characters ']]'
  |                     # Match previous atom or next atom
  (?:                   # Begin non-capturing group
    (?!                 # Begin non-capturing negative look-ahead group
      (?:               # Begin non-capturing group
        \[\[            # Match literal characters '[['
        |               # Match previous atom or next atom
        ]]              # Match literal characters ']]'
      )                 # End non-capturing group
    )                   # End non-capturing negative look-ahead group
    .                   # Match any character
  )                     # End non-capturing group
  *?                    # Reluctantly match previous atom zero or more times
)                       # End non-capturing group
*?                      # Reluctantly match previous atom zero or more times
]]                      # Match literal characters ']]'

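This only handles one level of nested [[...]] patterns. As TJR notes in this answer to this question, regular expressions cannot handle unlimited nesting. Consequently, this pattern will not match an [[image:...]] block that contains a doubly nested string such as [[foo[[baz]]bar]].

For an excellent regular expression reference, see Regular-Expressions.info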
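When deeper nesting has to be handled, one possible alternative is to drop the regex and scan the text while counting bracket depth. The sketch below is not from any of the answers; the class and method names are hypothetical, and it assumes an image block starts with "[[image:" (case-insensitive) and ends at the "]]" that brings the depth back to zero.

```java
final class NestedImageStripper {

    private static final String OPEN = "[[";
    private static final String CLOSE = "]]";
    private static final String PREFIX = "[[image:";

    /** Removes every balanced [[image:...]] block, at any nesting depth. */
    static String stripImages(String text) {
        StringBuilder out = new StringBuilder();
        int pos = 0;
        while (pos < text.length()) {
            // regionMatches(true, ...) compares case-insensitively.
            if (text.regionMatches(true, pos, PREFIX, 0, PREFIX.length())) {
                int end = findBlockEnd(text, pos);
                if (end >= 0) {
                    pos = end; // skip the whole block
                    continue;
                }
            }
            // Not an image block (or an unbalanced one): keep the character.
            out.append(text.charAt(pos));
            pos++;
        }
        return out.toString();
    }

    /** Returns the index just past the "]]" closing the "[[" at start, or -1. */
    private static int findBlockEnd(String text, int start) {
        int depth = 0;
        int i = start;
        while (i < text.length()) {
            if (text.startsWith(OPEN, i)) {
                depth++;
                i += OPEN.length();
            } else if (text.startsWith(CLOSE, i)) {
                depth--;
                i += CLOSE.length();
                if (depth == 0) {
                    return i;
                }
            } else {
                i++; // ordinary character, including single '[' or ']'
            }
        }
        return -1; // ran out of text before the block was closed
    }

    public static void main(String[] args) {
        // Two levels of nesting inside the image block; prints " keep [[link]]"
        System.out.println(stripImages("[[Image:a [[b [[c]] d]] e]] keep [[link]]"));
    }
}
```

An unbalanced block such as "[[Image:oops" is left untouched in this sketch rather than deleted to the end of the text; whether that is the right choice depends on how malformed markup should be treated.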