Question

I want to strip out all tags, for example turning

<p>hello <namespace:tag : a>hello</namespace:tag></p>

into

 <p> hello hello </p>

If regex has to be used for some reason, what is the best way to do this? Would something like the following help?

(<|</)[:]{1,2}[^</>]>

Edit: added

Answer 1

Definitely use an XML parser. Regex should not be used to parse *ML.
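As a rough illustration of the parser route in Java, here is a minimal sketch using the JDK's built-in DOM parser. It assumes the input is well-formed XML; the tag in the question (`<namespace:tag : a>`) is not, so the sample below uses a cleaned-up attribute (`a="x"`), which is my own substitution. The class name `XmlTextDemo` and the helper `textOf` are hypothetical names chosen for this example.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class XmlTextDemo {
    // Hypothetical helper: parse the string as XML and return the
    // concatenated text of the root element and all its descendants.
    static String textOf(String xml) {
        try {
            // The default factory is not namespace-aware, so an undeclared
            // prefix such as "namespace:" is treated as part of the name.
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            DocumentBuilder builder = factory.newDocumentBuilder();
            Document doc = builder.parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            // Malformed input ends up here instead of producing garbage.
            throw new IllegalArgumentException("not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        // Assumed, well-formed variant of the sample from the question.
        String xml = "<p>hello <namespace:tag a=\"x\">hello</namespace:tag></p>";
        System.out.println(textOf(xml)); // prints "hello hello"
    }
}
```

If the real input is tag soup rather than strict XML, a lenient HTML parser is probably a better fit than a strict XML parser.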

Answer 2

You should not use regex for this; use a parser such as lxml or BeautifulSoup:

>>> import lxml.html as lxht
>>> myString = '<p>hello <namespace:tag : a>hello</namespace:tag></p>'
>>> lxht.fromstring(myString).text_content()
'hello hello'

This is the reason why you should not parse HTML/XML with regular expressions.
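To make the warning concrete, here is a small Java sketch of how a naive tag-stripping regex goes wrong. The pattern `<[^>]*>` is my own stand-in for "a typical attempt", not the pattern from the question, and `RegexStripDemo` / `stripTags` are hypothetical names. It works on trivial input but fails as soon as a `>` appears inside an attribute value or a comment.

```java
class RegexStripDemo {
    // Naive approach: delete everything from '<' up to the next '>'.
    static String stripTags(String markup) {
        return markup.replaceAll("<[^>]*>", "");
    }

    public static void main(String[] args) {
        // Simple case: behaves as hoped.
        System.out.println(stripTags("<p>hello <b>hello</b></p>"));
        // prints: hello hello

        // '>' inside an attribute value ends the match too early,
        // so part of the tag leaks into the output.
        System.out.println(stripTags("<p title=\"a > b\">hi</p>"));
        // prints:  b">hi

        // The same happens with '>' inside a comment.
        System.out.println(stripTags("<p><!-- 1 > 0 -->hi</p>"));
        // prints:  0 -->hi
    }
}
```

A parser knows about quoting, comments, and CDATA, which is why it handles these cases where a single pattern does not.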

Answer 3

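If you are just trying to extract plain text from some simple XML, the best approach (fastest, smallest memory footprint) is to run a for loop over the data: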

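JAVA SKETCH BELOW (originally pseudocode; `data` is whatever you are reading from)

static String stripMarkup(String data) {
    boolean inMarkup = false;
    StringBuilder text = new StringBuilder();
    for (int i = 0; i < data.length(); i++) {
        char c = data.charAt(i);
        if (c == '<') inMarkup = true;                    // entering a tag
        else if (c == '>' && inMarkup) inMarkup = false;  // leaving a tag
        else if (!inMarkup) text.append(c);               // keep text outside tags
    }
    return text.toString();
}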

Note: this will break if it runs into things like CDATA sections, JavaScript, or CSS while scanning.

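So, to summarize: if the input is simple, do something like the above instead of using a regex. If it is not that simple, listen to the others and use a full parser.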

Answer 4

Here is a solution I personally used for the same problem in Java. The library used for this is Jsoup: http://jsoup.org/.

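In my particular case I had to unwrap tags that had an attribute with a specific value. You will see that reflected in this code; it is not the exact solution to this problem, but it may get you on your way.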

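  // Imports assumed for this snippet: StringUtils and Validate are taken to be
  // Apache Commons Lang 3 (the original post does not say which library).
  import org.apache.commons.lang3.StringUtils;
  import org.apache.commons.lang3.Validate;
  import org.jsoup.Jsoup;
  import org.jsoup.nodes.Document;
  import org.jsoup.nodes.Document.OutputSettings;
  import org.jsoup.nodes.Element;
  import org.jsoup.select.Elements;

  // Removes every <tagName> element but keeps its children. If an attribute is
  // given, only elements whose attribute value is blank after removing
  // matchRegEx are unwrapped.
  public static String unWrapTag(String html, String tagName, String attribute, String matchRegEx) {
    Validate.notNull(html, "html must be non null");
    Validate.isTrue(StringUtils.isNotBlank(tagName), "tagName must be non blank");
    if (StringUtils.isNotBlank(attribute)) {
      Validate.notNull(matchRegEx, "matchRegEx must be non null when an attribute is provided");
    }
    Document doc = Jsoup.parse(html);
    // Turn off pretty-printing so the output whitespace matches the input.
    OutputSettings outputSettings = doc.outputSettings();
    outputSettings.prettyPrint(false);
    Elements elements = doc.getElementsByTag(tagName);
    for (Element element : elements) {
      if (StringUtils.isBlank(attribute)) {
        // No attribute filter: unwrap every matching tag.
        element.unwrap();
      } else {
        String attr = element.attr(attribute);
        if (StringUtils.isNotBlank(attr)) {
          // Unwrap only when the regex consumes the whole attribute value.
          String newData = attr.replaceAll(matchRegEx, "");
          if (StringUtils.isBlank(newData)) {
            element.unwrap();
          }
        }
      }
    }
    // Note: this returns the full parsed document, including the html/head/body wrappers.
    return doc.html();
  }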

Java regex or XML parser?

4 answers: