Possible duplicate:
Wikipedia : Java library to remove wikipedia text markup removal
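I have to clean up some content that comes from Confluence. The content is almost clean; however, it still contains a few things such as:
and so on. I need to write a regex that strips all of this, so I did something like: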
String wikiCleanMarkupRegex = "\\[(.*?)[\\|.*?]?\\]|\\*(.*?)\\*|_(.*?)_";
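But this does not clean everything. I mean, if I give it the link from #2, I get:
[link|]
That is not what I want; I want to get "link". So I have to re-parse the string again and again until no further match is found.
This is really slow, because there are millions of records to clean. So, is there a way to do all of this with a regex in a single pass?
Thank you very much.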
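For reference, the repeated re-parsing described above can be sketched roughly as follows. This is only an illustration: the class name WikiFixedPoint and the combined pattern are assumptions, not the regex from the question, and the pattern will also strip underscores or asterisks that are not markup.

```java
import java.util.regex.Pattern;

final class WikiFixedPoint {
    // Hypothetical combined pattern (not the regex from the question):
    // group 1 = link label, group 2 = bold text, group 3 = italic text.
    private static final Pattern MARKUP = Pattern.compile(
            "\\[([^\\]|]*)\\|[^\\]]*\\]|\\*([^*]+)\\*|_([^_]+)_");

    // Re-applies the pattern until a pass leaves the string unchanged.
    static String clean(String input) {
        String current = input;
        String previous;
        do {
            previous = current;
            // Groups that did not take part in a match contribute nothing.
            current = MARKUP.matcher(previous).replaceAll("$1$2$3");
        } while (!current.equals(previous));
        return current;
    }

    public static void main(String[] args) {
        System.out.println(clean("*[test|]* and _*formatting*_"));
    }
}
```

Each pass peels off one level of nesting, and one extra pass is needed just to confirm that nothing changed, which is where the cost comes from on millions of records.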
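Answer 0: (score: 0)
Since it looks like there are basically three kinds of markup here: italic, bold, and LINK,
I would do a three-pass regex replacement.
Based on the input you provided, the order of precedence should be: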
/**
 * FIRST REMOVE ITALICS, THEN BOLD, THEN URL
 */
public static String cleanWikiFormat(CharSequence sequence) {
    return Test.removeUrl(Test.removeBold(Test.removeItalic(sequence)));
}
Here is the sample code:
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Test {

    // Matches the combined form _*text*_ used in the sample input;
    // a plain _text_ is not matched by this pattern.
    private static String removeItalic(CharSequence sequence) {
        Pattern patt = Pattern.compile("_\\*(.+?)\\*_");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Matches *text* and keeps only the inner text.
    private static String removeBold(CharSequence sequence) {
        Pattern patt = Pattern.compile("\\*(.+?)\\*");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    // Matches [label|] and keeps only the label.
    private static String removeUrl(CharSequence sequence) {
        Pattern patt = Pattern.compile("\\[(.+?)\\|\\]");
        Matcher m = patt.matcher(sequence);
        StringBuffer sb = new StringBuffer(sequence.length());
        while (m.find()) {
            String text = m.group(1);
            // ... possibly process 'text' ...
            m.appendReplacement(sb, Matcher.quoteReplacement(text));
        }
        m.appendTail(sb);
        return sb.toString();
    }

    public static String cleanWikiFormat(CharSequence sequence) {
        return Test.removeUrl(Test.removeBold(Test.removeItalic(sequence)));
    }

    public static void main(String[] args) {
        String text = "[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_";
        System.out.println("Original");
        System.out.println(text);
        text = Test.cleanWikiFormat(text);
        System.out.println("CHANGED");
        System.out.println(text);
    }
}
This produces:
Original
[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_
CHANGED
hello this is just a test to clean wiki type and formatting
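Since the question mentions millions of records, one possible refinement is to compile the three patterns once instead of on every call. The following is a minimal sketch under that assumption; the class name WikiCleaner and the constant names are hypothetical, and the patterns are the same three as in the sample code above.

```java
import java.util.regex.Pattern;

final class WikiCleaner {
    // Compiled once and reused for every record (Pattern is thread-safe).
    private static final Pattern BOLD_ITALIC = Pattern.compile("_\\*(.+?)\\*_");
    private static final Pattern BOLD = Pattern.compile("\\*(.+?)\\*");
    private static final Pattern LINK = Pattern.compile("\\[(.+?)\\|\\]");

    // Same three passes, in the same order as the sample code:
    // _*text*_ first, then *text*, then [label|].
    static String clean(CharSequence input) {
        String s = BOLD_ITALIC.matcher(input).replaceAll("$1");
        s = BOLD.matcher(s).replaceAll("$1");
        return LINK.matcher(s).replaceAll("$1");
    }

    public static void main(String[] args) {
        String text = "[hello|] this is just a *[test|]* to clean wiki *type* and _*formatting*_";
        System.out.println(clean(text));
    }
}
```

Here replaceAll("$1") substitutes each match with its first capture group, so the explicit find/appendReplacement loop is only needed if the captured text has to be processed further. Whether this is fast enough for your data set is something you would have to measure.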