I want to remove some specific text from HTML content. In Java, I am doing this with the replaceAll method, replacing the content with "".
My content is
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR"> or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
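I want to remove
lang="-"
xml:lang="-"
As you can see, the values of lang and xml:lang change dynamically. So I want a regular expression that detects this particular string sequence, which I will then replace in Java using the replaceAll(regex, string) method.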
Answer 0 (score: 3)
This answer is based on the assumption that
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA"> or <html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU"> or ...
means that you have an HTML structure like
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
...
</html>
or
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
...
</html>
In that case, use an HTML/XML parser such as Jsoup instead of a regular expression. Your code could look like
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.parser.Parser;
import org.jsoup.select.Elements;

String htmlText =
"<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">" +
" <body>hello</body>" +
"</html>";
// Use the XML parser if you don't want Jsoup to normalize (restructure) your HTML code
Document doc = Jsoup.parse(htmlText, "", Parser.xmlParser());
Elements htmlTag = doc.select("html");
// Remove both attributes from the selected tag
htmlTag.removeAttr("lang").removeAttr("xml:lang");
String replaced = doc.toString();
System.out.println(replaced);
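If adding Jsoup as a dependency is not an option, the same parser-based idea can be sketched with the XML DOM classes that ship with the JDK. This is a swapped-in technique, not what the answer above uses, and it only works under assumptions: the input must be well-formed XML (XHTML), and it should not carry a DOCTYPE, since the default parser may then try to fetch the DTD. The class name StripLangDom and the method stripLang are hypothetical names chosen for this illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical helper: removes lang and xml:lang from the root element only.
final class StripLangDom {

    static String stripLang(String xhtml) {
        try {
            // Parse the string into a DOM tree (assumes well-formed XHTML, no DOCTYPE).
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));

            // The root element is the <html> tag; removing a missing attribute is a no-op.
            Element root = doc.getDocumentElement();
            root.removeAttribute("lang");
            root.removeAttribute("xml:lang");

            // Serialize the tree back to text as XML, without the <?xml ...?> declaration.
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.METHOD, "xml");
            serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | SAXException
                 | IOException | TransformerException e) {
            throw new IllegalArgumentException("Input could not be processed as XML", e);
        }
    }

    public static void main(String[] args) {
        String htmlText =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">"
                + "<body>hello</body>"
                + "</html>";
        System.out.println(stripLang(htmlText));
    }
}
```

For real-world HTML that is not strictly well-formed, a lenient parser such as Jsoup is likely the more robust choice; the DOM parser rejects anything that is not valid XML.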
Answer 1 (score: 2)
You can try this (the example is in PHP):
$strings = <<< LOL
<html xmlns="http://www.w3.org/1999/xhtml" lang="fr-CA" xml:lang="fr-CA">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-AU" xml:lang="en-AU">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-GB" xml:lang="en-GB">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-IE" xml:lang="en-IE">
<html xmlns="http://www.w3.org/1999/xhtml" lang="es-PR" xml:lang="es-PR">
<html xmlns="http://www.w3.org/1999/xhtml" lang="en-US" xml:lang="en-US">
LOL;
$strings = preg_replace('/(lang=".*?"|xml:lang=".*?")/', '', $strings);
echo $strings;
Output:
<html xmlns="http://www.w3.org/1999/xhtml" >
<html xmlns="http://www.w3.org/1999/xhtml" >
<html xmlns="http://www.w3.org/1999/xhtml" >
<html xmlns="http://www.w3.org/1999/xhtml" >
<html xmlns="http://www.w3.org/1999/xhtml" >
<html xmlns="http://www.w3.org/1999/xhtml" >
Regex explanation:
(lang=".*?"|xml:lang=".*?")
Match the regex below and capture its match into backreference number 1 «(lang=".*?"|xml:lang=".*?")»
   Match this alternative «lang=".*?"»
      Match the character string “lang="” literally «lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
   Or match this alternative «xml:lang=".*?"»
      Match the character string “xml:lang="” literally «xml:lang="»
      Match any single character that is NOT a line break character «.*?»
         Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
      Match the character “"” literally «"»
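Since the question asks about Java, here is a minimal sketch of how the same pattern might be used with replaceAll. The class and method names (StripLangRegex, literalPort, stripLang) are hypothetical. literalPort is a direct port of the PHP pattern above. stripLang is a slightly tightened variant, offered as an assumption about what is wanted: it also consumes the whitespace in front of each attribute and uses [^"]* in place of .*? so a match cannot run past the closing quote.

```java
import java.util.regex.Pattern;

// Hypothetical helper class for illustration only.
final class StripLangRegex {

    // Whitespace, then lang="..." or xml:lang="..." (double-quoted values only).
    private static final Pattern LANG_ATTR =
            Pattern.compile("\\s+(?:xml:)?lang=\"[^\"]*\"");

    // Direct port of the PHP pattern; leaves the surrounding spaces in place.
    static String literalPort(String html) {
        return html.replaceAll("(lang=\".*?\"|xml:lang=\".*?\")", "");
    }

    // Tightened variant; also removes the whitespace before each attribute.
    static String stripLang(String html) {
        return LANG_ATTR.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String input =
                "<html xmlns=\"http://www.w3.org/1999/xhtml\" lang=\"fr-CA\" xml:lang=\"fr-CA\">";
        System.out.println(literalPort(input));
        // prints <html xmlns="http://www.w3.org/1999/xhtml">
        System.out.println(stripLang(input));
    }
}
```

Note that both patterns match these attributes on any tag, not just on html, and neither handles single-quoted attribute values. If that matters for your content, the parser-based approach from the other answer is the safer route.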