Question

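I have a string that contains some HTML code, and I want to find out whether that HTML represents visible text or an image. I am using Java with the following regular expressions to solve this (I know you can't parse HTML with regexes, but I thought my knowledge of regexes would be enough).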

public static String regex_html_tags_1 = "<\\s*br\\s*[/]?>";
public static String regex_html_tags_2 = "<\\s*([a-zA-Z0-9]+)\\s*([^=/>]+\\s*=\\s*[^/>]+\\s*)*\\s*/>"; 
public static String regex_html_tags_3 = "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>"; 

public static String[] HTMLWhiteSpaces = {"&nbsp;", "&#160;"};

The code that uses these regexes works for strings such as

<h2></h2>

or similar. But a string such as

<img src="someImage.png"></img>

is also considered empty.
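The reason can be reproduced with java.util.regex alone: the third pattern only checks that nothing sits between an opening tag and its matching closing tag, so it cannot tell an empty heading from an image. The sketch below shows this, plus one possible refinement that looks at the captured tag name. The class and method names are hypothetical, and the refinement is only an illustration of the idea, not a general fix.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class EmptyTagRegexDemo {

    // The third pattern from the question: an opening tag, optional attributes,
    // then the matching closing tag with only whitespace in between.
    static final Pattern EMPTY_PAIR = Pattern.compile(
            "<\\s*([a-zA-Z0-9]+)\\s*([^=>]+\\s*=\\s*[^>]+\\s*)*\\s*>\\s*</\\s*\\1\\s*>");

    // True if the whole input is a single tag pair with nothing between the tags.
    static boolean isEmptyPair(String html) {
        return EMPTY_PAIR.matcher(html).matches();
    }

    // Hypothetical refinement: an empty pair counts as invisible only if it is not an <img>.
    static boolean isInvisibleEmptyPair(String html) {
        Matcher m = EMPTY_PAIR.matcher(html);
        return m.matches() && !m.group(1).equalsIgnoreCase("img");
    }

    public static void main(String[] args) {
        System.out.println(isEmptyPair("<h2></h2>"));
        System.out.println(isEmptyPair("<img src=\"someImage.png\"></img>"));
        System.out.println(isInvisibleEmptyPair("<img src=\"someImage.png\"></img>"));
    }
}
```

Every further visible element type (and every way of writing it) would need another special case like this, which is why the pattern-based approach gets hard to maintain.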

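Does anyone know a better way than regexes to find out whether some HTML code actually represents human-readable text when a browser interprets it? Or do you think my approach will eventually succeed?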

Thanks a lot in advance.

Answer 1

Try JSoup. It lets you use CSS selectors (jQuery-style) to query a parsed HTML document.

A very simple example that selects all non-empty elements:

Document doc = Jsoup.connect("http://my.awesome.site.com").get();
Elements nonEmpties = doc.select(":not(:empty)");

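A complete solution will of course need some extra work, such as

iterating over the list of elements,
checking CSS styles (for display, visibility, size, or overlapping elements),
checking the src attribute of images,
etc.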
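As a rough illustration of the style and src checks in the list above, here is a minimal sketch that uses only the Java standard library. Both helpers are hypothetical; they cover only inline style values (which with JSoup could be read via element.attr("style")) and ignore stylesheets, sizes, overlapping elements, and modifiers such as !important.

```java
import java.util.Locale;

public class InlineStyleVisibility {

    // Hypothetical helper: true if an inline style value hides the element
    // through display:none or visibility:hidden.
    static boolean isHiddenByInlineStyle(String style) {
        if (style == null) {
            return false;
        }
        for (String declaration : style.split(";")) {
            // Split "property: value" at the first colon only.
            String[] parts = declaration.split(":", 2);
            if (parts.length != 2) {
                continue;
            }
            String property = parts[0].trim().toLowerCase(Locale.ROOT);
            String value = parts[1].trim().toLowerCase(Locale.ROOT);
            if (property.equals("display") && value.equals("none")) {
                return true;
            }
            if (property.equals("visibility") && value.equals("hidden")) {
                return true;
            }
        }
        return false;
    }

    // Hypothetical helper: an image without a non-blank src has nothing to show.
    static boolean hasUsableSrc(String src) {
        return src != null && !src.trim().isEmpty();
    }
}
```

An element would then count as visible only if neither it nor any of its ancestors is hidden, and, for an image, only if hasUsableSrc also holds.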

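But it is definitely worth it. You will learn a new framework, discover the ways content can be "hidden" in HTML/CSS, and, most importantly, stop using regexes for HTML parsing ;-)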

Answer 2

I came up with the following code, which works fine in my setup, where I do not need to consider invisible elements.

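import org.jsoup.Jsoup;
import org.jsoup.internal.StringUtil; // assumption: recent jsoup; older releases have org.jsoup.helper.StringUtil
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

// HTML white spaces that might occur in between tags; this list probably needs to be extended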
public static String[] HTML_WHITE_SPACES = {"&nbsp;", "&#160;"};

/**
 * check if the given HTML text contains visible text or images
 * 
 * @param htmlText String the text that is checked for visibility
 * @return boolean    (1) true if the htmlText contains some visible elements 
 *                 or (2) false in case (1) does not hold
 */
public static boolean containsVisibleElements(String htmlText) {

    // do not analyze the HTML text if it is blank already
    if (StringUtil.isBlank(htmlText)) {
        return false;
    }

    // the string from which all whitespaces are removed
    String htmlTextRemovedWhiteSpaces = htmlText; 

    // first, remove white spaces from the string
    for (String whiteSpace: HTML_WHITE_SPACES) {
        htmlTextRemovedWhiteSpaces = htmlTextRemovedWhiteSpaces.replace(whiteSpace, "");
    }

    // the HTML text is blank 
    if (StringUtil.isBlank(htmlTextRemovedWhiteSpaces)) {
        return false;
    }

    // parse the HTML text from which the white space have been removed
    Document doc = Jsoup.parse(htmlTextRemovedWhiteSpaces);

    // find real text within the body (and its children)
    String text = doc.body().text(); 

    // there exists visible text
    if (!StringUtil.isBlank(text.trim())) {
        return true;
    }

    // now we know that there does not exist visible text and that the string 
    // htmlTextRemovedWhiteSpaces is not blank

    // look for images as they are visible and not a text ;-)
    Elements images = doc.select("img");

    // no visible text was found, so the result depends only on whether any image elements exist
    return !images.isEmpty();
}

Find out whether HTML code represents visible text/images
