Question

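I am writing an Android application that fetches data from a website and presents it to the user (HTML scraping). The application downloads the page source and parses it, looking for the relevant data to store in objects. I actually built a parser with JSoup, but my application turned out to be very slow. Also, these libraries tend to be quite large, and I want my application to be lightweight.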

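The web pages I am trying to parse all have a similar structure, and I know exactly which tags I am looking for. So I figured I could just download the source and read it line by line, looking for the relevant data with String.equals. For example, if the HTML looks like this: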

<textTag class="text">I want this text</textTag>

I would parse it with the following method:

private void interpretHtml(String s){
    if(s.startsWith("<textTag class=\"text\"")){
        // 22 = length of the opening tag, 10 = length of </textTag>
        String text = s.substring(22, s.length() - 10);
    }
}

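However, I know very little about setting up a connection (I have seen people use HttpGet, but I am not entirely sure how to get the data out of it). I have searched for a long time for information on how to parse this way, but most people use libraries such as JSoup or SAX.

Does anyone happen to have information on how to parse like this, perhaps an example? Or is parsing the source this way a bad idea? Please give me your opinion.

Thank you for your time.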

Answer 1

To fetch a web page in Java, see the code at the bottom of this answer.

You can use regular expressions.

Here is a good reference:

android regex
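
As a rough illustration of the regular-expression route, here is a minimal sketch that pulls the text out of the tag from the question. The class name TagTextExtractor and the exact pattern are assumptions made for this example; it only uses java.util.regex, and it will be fragile if the markup varies (extra attributes in a different order, nested tags, and so on).

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagTextExtractor {
    // Matches <textTag class="text" ...>...</textTag> and captures the inner text.
    // The tag name and class come from the question's example; the non-greedy
    // group stops at the first closing tag, and DOTALL lets it span line breaks.
    private static final Pattern TEXT_TAG = Pattern.compile(
            "<textTag\\s+class=\"text\"[^>]*>(.*?)</textTag>", Pattern.DOTALL);

    /** Returns the captured text, or null when the tag is not present. */
    static String extract(String html) {
        Matcher m = TEXT_TAG.matcher(html);
        return m.find() ? m.group(1).trim() : null;
    }

    public static void main(String[] args) {
        String html = "<div><textTag class=\"text\">I want this text</textTag></div>";
        System.out.println(extract(html)); // prints "I want this text"
    }
}
```

Compiling the pattern once and reusing it avoids recompiling it for every page.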

However, if the HTML is well-formed, you could also try Yahoo's YQL. It outputs JSON or XML, so you can easily grab what you need.

yahoo yql console

Personally, I parse pages with Python or PHP, because I am more comfortable with those languages.

Fetching the web page. How to use it:

Get_Webpage obj = new Get_Webpage("http://your_url_here");
String source = obj.get_webpage_source();

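import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;

import org.apache.http.HttpResponse;
import org.apache.http.client.HttpClient;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.impl.client.DefaultHttpClient;

public class Get_Webpage {

    public String parsing_url = "";

    public Get_Webpage(String url_2_get){
        parsing_url = url_2_get;
    }

    // Downloads the page and returns its source, or whatever was read
    // before a failure (possibly an empty string).
    public String get_webpage_source(){

        HttpClient client = new DefaultHttpClient();
        HttpGet request = new HttpGet(parsing_url);
        StringBuilder str = new StringBuilder();
        BufferedReader reader = null;
        try {
            HttpResponse response = client.execute(request);
            InputStream in = response.getEntity().getContent();
            reader = new BufferedReader(new InputStreamReader(in));
            String line;
            while((line = reader.readLine()) != null)
            {
                // readLine() strips the line break, so put it back.
                str.append(line).append('\n');
            }
        } catch (IOException | IllegalStateException e) {
            // ClientProtocolException is an IOException, so it is covered here.
            e.printStackTrace();
        } finally {
            if (reader != null) {
                try {
                    reader.close(); // also closes the underlying stream
                } catch (IOException ignored) {
                }
            }
        }

        return str.toString();
    }

}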

Answer 2

Here is how I would do it:

    // The method name is illustrative; the original snippet omitted the signature.
    private String fetchAndInterpret() {
        StringBuilder text = new StringBuilder();
        HttpURLConnection conn = null;
        InputStreamReader in = null;
        BufferedReader buff = null;
        try {
            URL page = new URL(
                    "http://example.com/");
            // When passing parameters that may contain spaces or symbols, encode them
            // with URLEncoder.encode(param, "UTF-8"); otherwise the request may fail
            // (for example with a 404).
            conn = (HttpURLConnection) page.openConnection();
            conn.connect();
            /* use this if you need to
            int responseCode = conn.getResponseCode();

            if (responseCode == 401 || responseCode == 403) {
                // Authorization Error
                Log.e(Standards.tag, "Authorization Error");
                throw new Exception("Authorization Error");
            }

            if (responseCode >= 500 && responseCode <= 504) {
                // Server Error
                Log.e(Standards.tag, "Internal Server Error");
                throw new Exception("Internal Server Error");
            }*/
            in = new InputStreamReader(conn.getInputStream());
            buff = new BufferedReader(in);
            String line;
            // Check for null before using the line, so interpretHtml never sees null.
            while ((line = buff.readLine()) != null) {
                // Remove the next three lines if you need to load the whole HTML document.
                String found = interpretHtml(line);
                if (null != found)
                    return found;
                text.append(line).append("\n");
            }
        } catch (Exception e) {
            Log.e(Standards.tag,
                    "Exception while getting html from website, exception: "
                            + e.toString() + ", cause: " + e.getCause()
                            + ", message: " + e.getMessage());
        } finally {
            if (null != buff) {
                try {
                    buff.close();
                } catch (IOException e1) {
                }
                buff = null;
            }
            if (null != in) {
                try {
                    in.close();
                } catch (IOException e1) {
                }
                in = null;
            }
            if (null != conn) {
                conn.disconnect();
                conn = null;
            }
        }
        if (text.length() > 0) {
            // Reached only when the whole page was loaded without an early match.
            return interpretHtml(text.toString());
        } else return null;
    }

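private String interpretHtml(String s){
    // Expects a single line of the form <textTag class="text">...</textTag>.
    if(s == null){
        return null;
    }
    s = s.trim();
    if(s.startsWith("<textTag class=\"text\">") && s.endsWith("</textTag>")){
        // 22 = length of the opening tag, 10 = length of the closing tag.
        return s.substring(22, s.length() - 10);
    }
    return null;
}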
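
The early-exit idea above can be tried without a network connection. The following is only a sketch under a few assumptions: the class name LineScanner and its method names are made up for this example, the tag is the one from the question, and it uses indexOf instead of fixed offsets so that leading whitespace or surrounding markup on the line does not break it. It takes any Reader, so on a device you could pass an InputStreamReader wrapped around the connection's stream.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

class LineScanner {
    private static final String OPEN = "<textTag class=\"text\">";
    private static final String CLOSE = "</textTag>";

    /** Returns the text between OPEN and CLOSE on this line, or null if absent. */
    static String interpretLine(String line) {
        int start = line.indexOf(OPEN);
        if (start < 0) return null;
        start += OPEN.length();
        int end = line.indexOf(CLOSE, start);
        return end < 0 ? null : line.substring(start, end);
    }

    /** Reads the source line by line and stops at the first match. */
    static String findFirst(Reader source) throws IOException {
        // try-with-resources closes the reader even when we return early.
        try (BufferedReader reader = new BufferedReader(source)) {
            String line;
            while ((line = reader.readLine()) != null) {
                String found = interpretLine(line);
                if (found != null) return found;
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        String page = "<html>\n  <textTag class=\"text\">I want this text</textTag>\n</html>";
        System.out.println(findFirst(new StringReader(page))); // prints "I want this text"
    }
}
```

Because the scan returns as soon as a line matches, the rest of the page is never read, which is where most of the saving over a full parse would come from.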

Answer 3

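I would say that parsing HTML on the device is probably a bad idea if you are running into performance problems. Have you considered creating a web application that the device app pulls its data from?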

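If the data comes from a single source (that is, one web page rather than many), I would build a web application that prefetches the site, parses out the relevant data, and caches it for later use on the device.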
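
To make the caching part concrete, here is a minimal sketch, assuming the web application is written in Java. Everything in it is hypothetical: the class name CachedValue, the loader supplier (which stands in for "fetch the site and parse out the relevant data"), and the explicit nowMillis parameter, which is passed in only to keep the example easy to test. A real service would more likely read the clock itself and persist the result.

```java
import java.util.function.Supplier;

class CachedValue {
    private final Supplier<String> loader; // hypothetical: fetches and parses the page
    private final long ttlMillis;          // how long a loaded value stays fresh
    private String value;
    private long loadedAt;
    private boolean loaded;

    CachedValue(Supplier<String> loader, long ttlMillis) {
        this.loader = loader;
        this.ttlMillis = ttlMillis;
    }

    /** Returns the cached value, reloading it once it is ttlMillis old or older. */
    synchronized String get(long nowMillis) {
        if (!loaded || nowMillis - loadedAt >= ttlMillis) {
            value = loader.get();
            loadedAt = nowMillis;
            loaded = true;
        }
        return value;
    }

    public static void main(String[] args) {
        // The lambda stands in for the expensive fetch-and-parse step.
        CachedValue cache = new CachedValue(() -> "I want this text", 60_000L);
        System.out.println(cache.get(System.currentTimeMillis()));
        System.out.println(cache.get(System.currentTimeMillis())); // served from the cache
    }
}
```

With something like this on the server, the device only downloads the already extracted value, so no HTML parsing happens on the phone at all.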

Parsing HTML in Java for an Android application
