Jsoup.parse takes 10 times longer on Android devices running API level 19 and above

Time: 2013-12-29 21:14:22

Tags: android web-scraping jsoup android-4.4-kitkat

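For some reason, Jsoup.parse takes 10 times longer on KitKat devices than on older devices. At first I thought it was related to the ART runtime, but switching back to Dalvik did not help.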

Here is the code I am using:

        downloadedHtml = NetworkHelper.downloadString("https://en.m.wikipedia.org/wiki/Dusseldorf");

        AppLog.i("Downloaded data, Jsoup is parsing the html");

        hDoc = Jsoup.parse(downloadedHtml);

        Element htmlElement = hDoc.select("html").first();

        String langCode = htmlElement.attributes().get("lang");

        ArticleInfo articleInfo = new ArticleInfo(getWikiLanguage(langCode), langCode, href);

        article = new Article(articleInfo, href);

        String title = hDoc.getElementById("section_0").text();

        article.set_title(title);

        Document documentNode = hDoc.ownerDocument(); 

        Elements contents = documentNode.getElementsByClass("content");

        if (contents == null || contents.isEmpty())
            throw new IllegalArgumentException("content");

        Element content = contents.first();

        Elements imgElements = content.select("img");

        Element htmlNode;

        for (int i = 0; i < imgElements.size(); i++)
        {
            htmlNode = imgElements.get(i);

            if (!htmlNode.hasAttr("src"))
                continue;

            String src = htmlNode.attr("src");

            if (src.startsWith("//"))
                htmlNode.attr("src", String.format("http:%s", src));
            //else
            //throw new UnsupportedOperationException();
        }

        //get section headings

        Elements headlines = documentNode.getElementsByClass("mw-headline");

        if (headlines != null)
        {
            Element headline;

            for (int i = 0; i < headlines.size(); i++)
            {
                headline = headlines.get(i);

                String headline_link = headline.id();
                String headline_title = headline.text();

                SectionHeadline sectionHeadline = new SectionHeadline(headline_title, headline_link);
                article.get_sectionHeadlines().add(sectionHeadline);
            }
        }

        article.set_html(content.outerHtml());

        //get languages
        //language list

        Element languageSection = content.getElementById("mw-mf-language-section");

        if (languageSection != null)
        {
            Elements languageLinks = languageSection.select("li");

            Element languageLink;

            for (int i = 0; i < languageLinks.size(); i++)
            {
                languageLink = languageLinks.get(i);

                Element link = null;
                Elements ls = languageLink.select("a");

                if (ls == null || ls.size() == 0)
                    continue;

                link = ls.first();

                if (!link.hasAttr("href"))
                    continue;

                String linkHref = link.attr("href");

                if (linkHref != null && link.text() != null)
                {
                    String languageCode = link.attr("lang");

                    if (linkHref.startsWith("//"))
                        linkHref = String.format("http:%s", linkHref);

                    ArticleInfo languageInfo = new ArticleInfo(getWikiLanguage(languageCode), languageCode, linkHref);

                    // compare string contents; == would compare object references
                    if ("Unknown".equals(languageInfo.get_language()))
                        continue;

                    article.get_languages().add(languageInfo);
                }
            }

        }

Any ideas what the problem might be?

1 Answer:

Answer 0: (score: 0)

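The code in the question selects part of the document, saves it to a variable, selects part of that variable, saves it to a new variable, and so on. Another possible implementation is to make more use of the selector syntax to select only the elements you need, rather than saving these intermediate steps in new objects.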

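The following code executes in 2 seconds on my machine. A similar excerpt of the code above executes in about 4 seconds. On subsequent runs the times were closer, with a difference of about 50 ms, so take this with a grain of salt.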
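
Because the first run and later runs differed so much, it may help to time the same work several times and look at the first (cold) run separately from the rest. The following is only a minimal sketch of that idea: `RepeatedTiming` and `timeRuns` are hypothetical names, the workload is a stand-in rather than the real fetch-and-select code, and it assumes Java 8 or later (on an older Android toolchain the lambda would become an anonymous `Runnable`).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: times the same task several times so the first
// (cold) run can be compared with later (warm) runs.
final class RepeatedTiming {

    // Runs the task `runs` times and returns each run's duration in milliseconds.
    static List<Long> timeRuns(Runnable task, int runs) {
        List<Long> millis = new ArrayList<>();
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            millis.add((System.nanoTime() - start) / 1_000_000);
        }
        return millis;
    }

    public static void main(String[] args) {
        // Stand-in workload; in practice this would be the parsing code under test.
        Runnable task = () -> {
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < 100_000; i++) {
                sb.append(i);
            }
        };

        List<Long> millis = timeRuns(task, 5);
        System.out.println("first run (ms): " + millis.get(0));
        System.out.println("later runs (ms): " + millis.subList(1, millis.size()));
    }
}
```

If only the first run is slow, the cost is more likely one-time setup (class loading, connection setup, JIT warm-up) than the selector logic itself.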

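I don't know whether there is a performance problem on KitKat. You may find that adding timers to both the KitKat and Dalvik versions helps isolate whether a performance bottleneck exists and where it is.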
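
One way to add such timers is to wrap each stage (download, parse, select) in a small helper that records its duration under a label. This is a rough sketch, not a definitive implementation: `StageTimer` is a hypothetical class, not part of jsoup or Android, the stages in `main` are stand-ins for the real calls, and it assumes Java 8 or later for `Supplier` and lambdas.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Supplier;

// Hypothetical helper (not part of jsoup or Android): records how long each
// named stage takes so the slow step can be identified.
final class StageTimer {
    private final Map<String, Long> elapsedMillis = new LinkedHashMap<>();

    // Runs the work, stores its wall-clock duration under the label,
    // and returns whatever the work produced.
    <T> T time(String label, Supplier<T> work) {
        long start = System.nanoTime();
        try {
            return work.get();
        } finally {
            elapsedMillis.put(label, (System.nanoTime() - start) / 1_000_000);
        }
    }

    // Durations in milliseconds, in the order the stages were run.
    Map<String, Long> results() {
        return elapsedMillis;
    }
}

final class StageTimerDemo {
    public static void main(String[] args) {
        StageTimer timer = new StageTimer();

        // Stand-ins for the real stages. In the question's code these would be
        // NetworkHelper.downloadString(...), Jsoup.parse(...) and the select calls.
        String html = timer.time("download", () -> "<html lang=\"en\"><body></body></html>");
        int length = timer.time("parse", () -> html.length());
        System.out.println("stand-in result: " + length);

        for (Map.Entry<String, Long> entry : timer.results().entrySet()) {
            System.out.println(entry.getKey() + " took " + entry.getValue() + "ms");
        }
    }
}
```

Comparing the per-stage numbers from an API 19 device with those from an older device should show whether the extra time is spent in the download, in Jsoup.parse itself, or in the later selection code.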

Here is my code:

long start = System.currentTimeMillis();
Document hDoc = Jsoup.
    connect("https://en.m.wikipedia.org/wiki/Dusseldorf").
    userAgent("Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.17 (KHTML, like Gecko) Chrome/24.0.1312.57 Safari/537.17").
    get();

//select the first html element, then take the value of the lang attribute
String langCode = hDoc.select("html:eq(0)").attr("lang");
String title = hDoc.getElementById("section_0").text();

Document documentNode = hDoc.ownerDocument();

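//select all the image elements having the attribute src which are
//descended from an element with the content class; note that :eq(0)
//matches on sibling index (the element is the first child of its parent),
//not on the first match in the document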
Elements imgElementsHavingSrcAttr = documentNode.select("*.content:eq(0) img[src]");
//for each img element, turn a protocol-relative src into an absolute http URL
for (Element img : imgElementsHavingSrcAttr)
{
    String src = img.attr("src");

    if (src.startsWith("//"))
    {
        img.attr("src", String.format("http:%s", src));
    }
}
System.out.println("Function took " + (System.currentTimeMillis()-start) + "ms");