Get all the text content between tags from a URL?

Time: 2014-10-15 18:03:08

Tags: java html dom html-parsing jsoup

I am given a URL, for example: http://www.engineersireland.ie/home.aspx

I can read the page using either java.net.URL or Jsoup.
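
For reference, fetching the raw HTML with java.net.URL might look roughly like the sketch below. The class name `PageReader` and the method `readAll` are made up for illustration, and UTF-8 is an assumption based on the charset declared in the sample page; other pages may need a different encoding.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URI;
import java.net.URL;
import java.nio.charset.StandardCharsets;

public class PageReader {
    // Reads everything behind the URL as text, one line at a time.
    // Assumption: the content is UTF-8 encoded.
    static String readAll(URL url) throws IOException {
        StringBuilder content = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(url.openStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // readLine() strips the line terminator, so add one back.
                content.append(line).append('\n');
            }
        }
        return content.toString();
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.err.println("usage: java PageReader <url>");
            return;
        }
        System.out.println(readAll(URI.create(args[0]).toURL()));
    }
}
```

This only returns the markup as one string; separating the text from the tags still needs an HTML parser.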

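Then I need to extract all the text content found between the tags on the page.

There will be tags nested inside other tags. All I need is the text in between.

For example: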

    <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
     <head id="head"><title>
        Engineers Ireland - Home
     </title><meta http-equiv="content-type" content="text/html; charset=UTF-8" /> 
    <meta http-equiv="pragma" content="no-cache" /> 
    <meta http-equiv="content-style-type" content="text/css" /> 
    <meta http-equiv="content-script-type" content="text/javascript" /> 

    <link href="/favicon.ico" type="image/x-icon" rel="shortcut icon"/> 
    <link href="/favicon.ico" type="image/x-icon" rel="icon"/>
    <body>
    <div class="module-content">

        <p id="1">Members can login for access to exclusive content, event booking, shop discounts and more...</p>

            <fieldset>
                <legend>Your Login Details</legend>
                <div class="formline">
                    <label for="1" id="1">Your Membership Number</label>
                    <input name="1" type="text" id="1" title="Your Membership Number" class="login-username clearlabel" />
                    <span id="1e" class="ErrorLabel" style="display:none;">Enter your membership number</span>
                </div>
                <div class="formline">
                    <label for="1" id="adasdasd">Password</label>
                    <input name="asdasd" type="password" id="dfbsdf" title="Password" class="login-password clearlabel" />
                    <span id="drthd" class="ErrorLabel" style="display:none;">Enter your password</span>
                </div>
                <div class="formline">
                    <input name="aseresrr" type="checkbox" id="bstg" class="login-remember" />
                    <label for="ryjmf" id="asrats" class="remember">Remember Me</label>

                    <div class="button grey">
                        <input type="submit" name="fgn" value="LOGIN" onclick="sdf;, false, false))" id="sdfsdf" />
                    </div>
                </div>

            </fieldset>
        <ul class="arrow">
            <li><a href="/site/reset-password.aspx">Forgot your password?</a></li>
            <li><a href="/membership/apply.aspx">Haven't registered yet?</a></li>
        </ul>
    </div>
    </body>
    </html>

From this HTML, all I need is:

Your Membership Number
Enter your membership number
Password
Enter your password
Remember Me

and so on.

Keep in mind that the tag names and the number of tags vary from page to page.

Any help, using either Jsoup or plain Java? Thanks.

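2 answers:

Answer 0 (score: 2)

With the code below, you can choose which part of the document to extract text from by passing a suitable CSS query to the getStringsFromUrl method. Pass null to search the whole <body>.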

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Node;
    import org.jsoup.nodes.TextNode;
    import org.jsoup.select.Elements;
    import org.jsoup.select.NodeVisitor;

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;

    public class JSoupTest {
        /*
         Outputs:
            Members can login for access to exclusive content, event booking, shop discounts and more...
            Your Login Details
            Your Membership Number
            Enter your membership number
            Password
            Enter your password
            Remember Me
            Forgot your password?
            Haven't registered yet?
         */
        public static void main(String[] args) throws IOException {
            String url = "http://localhost/test.html";
            List<String> strings = getStringsFromUrl(url, null);
            for (String string : strings) {
                System.out.println(string);
            }
        }

        // Collects every non-blank text node under the elements matched by cssQuery.
        // A null or blank query falls back to the <body> element.
        private static List<String> getStringsFromUrl(String url, String cssQuery) throws IOException {
            Document document = Jsoup.connect(url).get();
            Elements elements = isBlank(cssQuery)
                    ? document.getElementsByTag("body")
                    : document.select(cssQuery);

            List<String> strings = new ArrayList<>();
            elements.traverse(new TextNodeExtractor(strings));
            return strings;
        }

        // Local blank check: jsoup's StringUtil lives in an internal package
        // (org.jsoup.internal) in later releases, so it is not relied on here.
        private static boolean isBlank(String value) {
            return value == null || value.trim().isEmpty();
        }

        // Visitor that records the text of each TextNode it encounters.
        private static class TextNodeExtractor implements NodeVisitor {
            private final List<String> strings;

            public TextNodeExtractor(List<String> strings) {
                this.strings = strings;
            }

            @Override
            public void head(Node node, int depth) {
                if (node instanceof TextNode) {
                    TextNode textNode = (TextNode) node;
                    String text = textNode.getWholeText();
                    // Skip whitespace-only nodes produced by indentation between tags.
                    if (!isBlank(text)) {
                        strings.add(text);
                    }
                }
            }

            @Override
            public void tail(Node node, int depth) {
                // Nothing to do when leaving a node.
            }
        }
    }
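
If only some elements matter, as with the label and error-message texts listed in the question, a narrower CSS query may be enough on its own. The following is a minimal sketch under stated assumptions: jsoup is on the classpath, the class name `LabelTextExample` and the method `extractTexts` are hypothetical, and the selector `label, span.ErrorLabel` is chosen from the sample HTML rather than taken from the answer above. It parses an HTML string instead of connecting to a URL so it can run offline.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

import java.util.ArrayList;
import java.util.List;

public class LabelTextExample {
    // Returns the normalized text of every element matched by cssQuery,
    // in document order, skipping elements that contain no text.
    static List<String> extractTexts(String html, String cssQuery) {
        Document document = Jsoup.parse(html);
        List<String> texts = new ArrayList<>();
        for (Element element : document.select(cssQuery)) {
            String text = element.text();
            if (!text.isEmpty()) {
                texts.add(text);
            }
        }
        return texts;
    }

    public static void main(String[] args) {
        // Shortened fragment modeled on the question's sample page.
        String html = "<div class=\"formline\">"
                + "<label>Your Membership Number</label>"
                + "<input type=\"text\" title=\"Your Membership Number\" />"
                + "<span class=\"ErrorLabel\" style=\"display:none;\">Enter your membership number</span>"
                + "</div>"
                + "<ul><li><a href=\"/site/reset-password.aspx\">Forgot your password?</a></li></ul>";
        for (String text : extractTexts(html, "label, span.ErrorLabel")) {
            System.out.println(text);
        }
    }
}
```

Run against this fragment, the sketch prints the label and error-message texts and leaves out the link text. Because tag names differ from page to page, the selector would need to be adapted for each site.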

Answer 1 (score: 0)

Use the HtmlUnit library in Java; it lets you find the content of the tags you select.

See the following link:

http://htmlunit.sourceforge.net/gettingStarted.html