How to get the nth previous or next element in Jsoup

Time: 2017-09-10 17:16:10

Tags: java web-scraping jsoup html-parsing

Is there a way to use jsoup to get the nth previous or next HTML element of a given kind, even when it sits at a different nesting level?

Sample HTML:



<div style="position: relative;">
  <div class="wmd-container">
    <div id="wmd-button-bar-42" class="wmd-button-bar"></div>
    <input id="previousInput" name="communitymode" type="checkbox">
  </div>
</div>

<div class="fl" style="margin-top: 8px; height: 24px;">&nbsp;</div>
<div id="draft-saved-42" class="draft-saved community-option fl" style="margin-top: 8px; height: 24px; display: none;">draft saved
</div>

<div id="draft-discarded-42">draft discarded</div>

<div class="community-option g-row ai-center f-checkbox">
  <div class="g-col -input">
    <input id="NextInput" name="communitymode">
  </div>
  <div class="g-col">
    <label for="communitymode-42">community wiki</label>
  </div>
</div>

For example, in the HTML above, I refer to this element:

<div id="draft-discarded-42">draft discarded</div>

which I select with the following code:

Element elem = doc.select("div[id=draft-discarded-42]").first();

Starting from it, I want the first previous input element:

<input id="previousInput" name="communitymode" type="checkbox">

the second previous div:

<div class="fl" style="margin-top: 8px; height: 24px;">&nbsp;</div>

and the second next div:

<div class="g-col -input">
  <input id="NextInput" name="communitymode">
</div>

1 Answer:

Answer 0: (score: 0)

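If you know the value of the id attribute, or of any other attribute that identifies the element, you should use selector syntax to get the element you need.

However, if you have only a vague idea of the element's attributes (or none at all) but know which occurrence it is relative to a reference element, you can use these functions:

N-th occurrence of an element matching the query before the origin: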

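import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.jsoup.select.Evaluator;
import org.jsoup.select.QueryParser;

// Searches the previous siblings of origin (and their descendants), then
// moves up through the ancestors. Returns null if no such element exists.
public static Element selectNthElementBefore(Element origin, String query, int count) {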
    Element currentElement = origin;
    Evaluator evaluator = QueryParser.parse(query);
    while ((currentElement = currentElement.previousElementSibling()) != null) {
        int val = 0;
        if (currentElement.is(evaluator)) {
            if (--count == 0) {
                return currentElement;
            }
            val++;
        }
        Elements elems = currentElement.select(query);
        if (elems.size() > val) {
            int childCount = elems.size() - val;
            int diff = count - childCount;

            if (diff == 0) {
                Element prevElement = elems.first();
                currentElement = prevElement.children().select(query).first();
                while (currentElement != prevElement) {
                    if (currentElement == null) {
                        return prevElement;
                    }
                    prevElement = currentElement;
                    currentElement = currentElement.children().select(query).first();
                }
            }
            if (diff > 0) {
                count -= childCount;
            }
            if (diff < 0) {
                return elems.get(childCount - count);
            }
        }
    }

    if (origin.parent() != null && currentElement == null) {
        if (origin.parent().is(evaluator)) {
            if (--count == 0) {
                return origin.parent();
            }
        }
        return selectNthElementBefore(origin.parent(), query, count);
    }
    return currentElement;
}

N-th occurrence of an element matching the query after the origin:

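// Searches the following siblings of origin (and their descendants), then
// the following siblings of each ancestor. Returns null if none exists.
public static Element selectNthElementAfter(Element origin, String query, int count) {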
    Element currentElement = origin;
    Evaluator evaluator = QueryParser.parse(query);
    while ((currentElement = currentElement.nextElementSibling()) != null) {
        int val = 0;
        if (currentElement.is(evaluator)) {
            if (--count == 0)
                return currentElement;
            val++;
        }
        Elements elems = currentElement.select(query);
        if (elems.size() > val) {
            int childCount = elems.size() - val;
            int diff = count - childCount;

            if (diff == 0) {
                return elems.last();
            }
            if (diff > 0) {
                count -= childCount;
            }
            if (diff < 0) {
                return elems.get(childCount + diff);
            }
        }
    }
    if (origin.parent() != null && currentElement == null) {
        return selectNthElementAfter(origin.parent(), query, count);
    }
    return currentElement;
}

Usage:

Element elem = doc.getElementById("draft-discarded-42");

Element firstPrevInput = selectNthElementBefore(elem, "input", 1);
Element secPrevDiv = selectNthElementBefore(elem, "div", 2);
Element secNextDiv = selectNthElementAfter(elem, "div", 2);

System.out.println("#### First previous input ####");
System.out.println(firstPrevInput.toString());
System.out.println("##############################\n"); 
System.out.println("#### Second previous div ####");
System.out.println(secPrevDiv.toString());
System.out.println("#############################\n");
System.out.println("#### Second next div ####");
System.out.println(secNextDiv.toString());
System.out.println("#########################");

Output:

#### First previous input ####
<input id="previousInput" name="communitymode" type="checkbox">
##############################

#### Second previous div ####
<div class="fl" style="margin-top: 8px; height: 24px;">
 &nbsp;
</div>
#############################

#### Second next div ####
<div class="g-col -input"> 
    <input id="NextInput" name="communitymode"> 
</div>
#########################
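
A simpler way to think about "nth previous/next" is to flatten the document into document order and count matches while walking away from the origin. The sketch below shows only that counting step using the standard library, so it works on any ordered list. The class name `DocumentOrderSearch` and the method `nthMatch` are hypothetical names made up for this illustration; they are not part of jsoup.

```java
import java.util.List;
import java.util.function.Predicate;

// Hypothetical helper for illustration; not part of jsoup.
public class DocumentOrderSearch {

    /**
     * Returns the n-th item matching {@code matches} before (step = -1) or
     * after (step = +1) {@code origin} in {@code documentOrder}, or null if
     * origin is not in the list or fewer than n matches exist.
     */
    public static <T> T nthMatch(List<T> documentOrder, T origin,
                                 Predicate<? super T> matches, int n, int step) {
        if (n < 1 || (step != 1 && step != -1)) {
            throw new IllegalArgumentException("n must be >= 1 and step must be +1 or -1");
        }
        // Locate origin by identity so that equal-looking items are not confused.
        int start = -1;
        for (int i = 0; i < documentOrder.size(); i++) {
            if (documentOrder.get(i) == origin) {
                start = i;
                break;
            }
        }
        if (start < 0) {
            return null;
        }
        // Walk away from origin, counting matches until the n-th one is reached.
        for (int i = start + step; i >= 0 && i < documentOrder.size(); i += step) {
            T candidate = documentOrder.get(i);
            if (matches.test(candidate) && --n == 0) {
                return candidate;
            }
        }
        return null;
    }
}
```

With jsoup, I would expect this to be called as `nthMatch(doc.getAllElements(), elem, e -> e.is("div"), 2, -1)` for the second previous div, since `getAllElements()` lists elements in document order; I have not run it against jsoup here. Note that walking forward this way also visits the origin's own descendants, which `selectNthElementAfter` above skips, and walking backward visits a sibling's descendants before the sibling itself.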