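Replace a substring with another string in dynamic content

Date: 2015-04-29 10:55:21

Tags: java regex

I want to remove some specific text from HTML content. I do this in Java by using the replaceAll method to replace that text with "".

My content is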

<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">

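I want to remove lang="-" xml:lang="-". As you can see, the values of lang and xml:lang change dynamically. So I want a regular expression that detects this particular string sequence, which I will then pass to the replaceAll(regex, string) method in Java.

2 answers:

Answer 0: (score: 3)

This answer is based on the assumption that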
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or 
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or
...

means you have HTML structures like
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
   ...
</html>

<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
   ...
</html>

In that case, use an HTML/XML parser such as Jsoup instead of a regular expression. Your code could look like

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

String htmlText = 
        "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">" +
        "   <body>hello</body>" +
        "</html>";

// Use the XML parser if you don't want Jsoup to normalize ("optimize") your HTML
Document doc = Jsoup.parse(htmlText, "", Parser.xmlParser());
Elements htmlTag = doc.select("html");
// Remove these attributes from the selected tag
htmlTag.removeAttr("lang").removeAttr("xml:lang");

String replaced = doc.toString();
System.out.println(replaced);
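
If adding Jsoup as a dependency is not an option, the same parser-based idea can be sketched with the XML parser that ships with the JDK. This is only an illustration and not part of the original answer: the class name `RemoveLangWithDom` and the method `removeLang` are made up for this example, and it assumes the input is well-formed XHTML with no DOCTYPE (a DOCTYPE could make the parser try to fetch the DTD).

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

// Hypothetical helper class, named for this example only.
class RemoveLangWithDom {

    // Parses well-formed XHTML, drops lang and xml:lang from the root element,
    // and serializes the document back to a string.
    static String removeLang(String xhtml) {
        try {
            // The default factory is not namespace-aware, so "xml:lang"
            // is treated as a plain attribute name.
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            Element root = doc.getDocumentElement();
            root.removeAttribute("lang");      // no-op if the attribute is absent
            root.removeAttribute("xml:lang");

            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            // Assumption: malformed input is reported as an unchecked exception.
            throw new IllegalArgumentException("Input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        String htmlText =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">"
                + "<body>hello</body>"
                + "</html>";
        System.out.println(removeLang(htmlText));
    }
}
```

Unlike Jsoup, this parser rejects markup that is not well-formed, so it probably only fits content that is known to be valid XHTML.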

Answer 1: (score: 2)

You can try this:

$strings = <<< LOL
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE">
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
LOL;

$strings = preg_replace('/(lang=".*?"|xml:lang=".*?")/', '', $strings);

echo $strings;

Output:

<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >
<html xmlns="http://www.w3.org/1999/xhtml"  >

Demo:

http://ideone.com/vhtVcW

Regex explanation:

(lang=".*?"|xml:lang=".*?")

Match the regex below and capture its match into backreference number 1 «(lang=".*?"|xml:lang=".*?")»
   Match this alternative «lang=".*?"»
      Match the character string “lang="” literally «lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
   Or match this alternative «xml:lang=".*?"»
      Match the character string “xml:lang="” literally «xml:lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
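
Since the question asks for Java, here is a rough Java counterpart of the PHP snippet. It is a sketch, not the answerer's code: the class name `StripLangAttributes` and the method `stripLang` are invented for this example, and the pattern is a slightly tightened variant of the one above. It requires whitespace before the attribute name, so the leftover spaces are consumed and an attribute such as `hreflang="..."` is not touched, and it uses `[^"]*` in place of the lazy `.*?`.

```java
import java.util.regex.Pattern;

// Hypothetical helper class, named for this example only.
class StripLangAttributes {

    // \s+         leading whitespace (removed together with the attribute)
    // (?:xml:)?   optional "xml:" prefix, so one branch covers both attributes
    // lang="..."  attribute name plus a double-quoted value without quotes inside
    private static final Pattern LANG_ATTR =
            Pattern.compile("\\s+(?:xml:)?lang=\"[^\"]*\"");

    static String stripLang(String html) {
        return LANG_ATTR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">";
        // prints <html xmlns="http://www.w3.org/1999/xhtml">
        System.out.println(stripLang(input));
    }
}
```

The same pattern string also works directly with `String.replaceAll(regex, "")`. Precompiling it only avoids recompiling the regex on every call. Note that the pattern removes every matching attribute anywhere in the input, not just on the `<html>` tag, and it assumes double-quoted attribute values.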