How to extract the required data from this HTML using Jsoup

Time: 2016-04-24 16:37:36

Tags: android html jsoup

I am new to the Jsoup library. Parsing this HTML is giving me hell.

Remote HTML

//Skipped the meta and header because I don't need it.
...
<body class="sin">
<div class="ks">
    <div class="wrap">

        <div class="content-right-sidebar-wrap">
            <main class="content">

                //A lot of unneeded tags

                <article class="post-1989009 post type-post post" itemscope="" itemtype="http://schema.org/CreativeWork">
                    <header class="post-header">
                        <h1 class="post-title" itemprop="headline">Tyh RGB  Marco to habits gtr</h1>
                        <img src="https://ohniee.com/wp-content/uploads/avatars/1/djsy8933e89ufio8389e8-author-img.jpg" class="avatar user-1-avatar avatar-40 photo" width="40" height="40" alt="Profile photo of Johnnie Adams">

                        <div class="entry-meta" style="padding-top:3px; margin-left: 50px">
                        " Written by "<a href="/authors/johnnie"><span class="entry-author" itemprop="author" itemscope="" itemtype="http://schema.org/Person"><span class="entry-author-name" itemprop="name">Johnnie Adams</span></span></a> <script>
                        document.write(" on April 23rd, 2002 11:28 PM")</script>" on April 23rd, 2002 11:28 PM  .  "<span class="entry-comments-link"><a href="https://johniee.com/2002/04/thalo-in-American-film-industryk.html#comments">1 Comment</a></span>
                        </div>
                    </header>

                    //A lot of unneeded tags

                   ...

This is how I am parsing it:

String post_authordate = document.select("div.entry-meta").first().text();
        postAuthorDate.setText(post_authordate);

        Elements img = document.select("img[class=avater]");
        String author_image = img.attr("src");
        postAuthorUrl.setText(author_image);

This is what I get

  • Written by Johnnie Adams . 1 Comment
  • postAuthorUrl is not showing anything.

What I want

My code

private void loadPost() {
        Log.d(TAG, "loadPost called");

        final ProgressBar progressBar;
        progressBar = (ProgressBar) findViewById(R.id.progress_circle);
        progressBar.setVisibility(View.VISIBLE);


        String news_id = getIntent().getStringExtra("PostId");
        Log.d(TAG, "You clicked post id " + news_id);

        StringRequest stringRequest = new StringRequest(news_id,
                new Response.Listener<String>() {
                    @Override
                    public void onResponse(String response) {
                        //Log.d("Debug", response.toString());
                        if (progressBar != null) {
                            progressBar.setVisibility(View.GONE);
                        }
                        parseHtml(response);
                        postData = response;


                    }
                },
                new Response.ErrorListener() {
                    @Override
                    public void onErrorResponse(VolleyError error) {
                        VolleyLog.d("", "Error: " + error.getMessage());

                        if (progressBar != null) {
                            progressBar.setVisibility(View.GONE);
                        }

                        final  AlertDialog.Builder sthWrongAlert = new AlertDialog.Builder(PostDetails.this);
                        sthWrongAlert.setCancelable(false);
                        sthWrongAlert.setMessage(R.string.sth_wrongme_det);
                        sthWrongAlert.setPositiveButton(R.string.alert_retry, new DialogInterface.OnClickListener() {
                            @Override
                            public void onClick(DialogInterface dialog, int which) {
                                if (!NetworkCheck.isAvailableAndConnected(PostDetails.this)) {
                                    internetDialog.show();
                                } else {
                                    loadPost();
                                }

                            }
                        });
                        sthWrongAlert.setNegativeButton(R.string.alert_cancel, new DialogInterface.OnClickListener() {
                            @Override
                            public void onClick(DialogInterface dialog, int which) {
                                finish();
                            }
                        });
                        sthWrongAlert.show();
                    }
                });

        //Creating requestqueue
        RequestQueue requestQueue = Volley.newRequestQueue(this);

        //Adding request queue
        requestQueue.add(stringRequest);


    }

    private void parseHtml(String response) {
        Log.d(TAG, "parsinghtml");
        Document document = Jsoup.parse(response);


        String post_authordate = document.select("div.entry-meta").get(0).text();

        String img = document.select("img.avatar").get(0).attr("src");

        postAuthorDate.setText(post_authordate);

    }

2 answers:

Answer 0: (score: 1)

Try using img[class~=avatar user-(\d+)-avatar avatar-40 photo] instead of img[class=avater]
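To see why the misspelled img[class=avater] selects nothing while the regex form (or the img.avatar selector used in the question's parseHtml method) can match, here is a minimal sketch in plain Java. It does not call jsoup. It only imitates the matching rules as I understand them: [class=value] compares the whole attribute value, [class~=regex] looks for the regex anywhere in the value, and .name compares against each whitespace-separated class name. The class ClassSelectorDemo and its methods are hypothetical names introduced for illustration.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical stand-ins for three jsoup selector forms; this is not jsoup code.
class ClassSelectorDemo {
    // class attribute of the avatar <img> in the question's HTML
    static final String CLASS_VALUE = "avatar user-1-avatar avatar-40 photo";

    // img[class=value]: assumed to require the whole attribute value to equal value
    static boolean matchesExact(String attrValue, String wanted) {
        return attrValue.trim().equalsIgnoreCase(wanted);
    }

    // img[class~=regex]: assumed to require the regex to be found somewhere in the value
    static boolean matchesRegex(String attrValue, String regex) {
        return Pattern.compile(regex).matcher(attrValue).find();
    }

    // img.name: assumed to require one whitespace-separated class name to equal name
    static boolean matchesClassName(String attrValue, String name) {
        return Arrays.stream(attrValue.trim().split("\\s+")).anyMatch(name::equalsIgnoreCase);
    }

    public static void main(String[] args) {
        // The selector from the question: misspelled, and not the full value either
        System.out.println(matchesExact(CLASS_VALUE, "avater")); // false
        // The regex suggested in this answer
        System.out.println(matchesRegex(CLASS_VALUE,
                "avatar user-(\\d+)-avatar avatar-40 photo")); // true
        // The class selector from the question's parseHtml method
        System.out.println(matchesClassName(CLASS_VALUE, "avatar")); // true
    }
}
```

Under these assumptions, even a correctly spelled img[class=avatar] would fail, because the attribute holds four class names rather than one.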

The date in the HTML source is 2002. Did you want 2016?

How to get rid of "1 Comment"

System.out.println(entryMetaText.replaceAll("\\d+ Comment", ""));

Earlier suggestion (struck out): System.out.println(entryMetaText.substring(0, entryMetaText.length() - 9));
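As a runnable version of the regex idea above, here is a small sketch. The class EntryMetaCleaner and the method stripCommentCount are hypothetical names, and the pattern is a slightly broader variation of the one in this answer: it also accepts the plural "Comments" and only matches at the end of the text. A fixed-length substring would break as soon as the counter reached two digits.

```java
import java.util.regex.Pattern;

// Hypothetical helper for illustration; not part of the original answer.
class EntryMetaCleaner {
    // Trailing comment counter such as " 1 Comment" or " 12 Comments"
    private static final Pattern COMMENT_COUNT =
            Pattern.compile("\\s*\\d+ Comments?\\s*$");

    // Returns the text without a trailing comment counter, or unchanged if there is none.
    static String stripCommentCount(String entryMetaText) {
        return COMMENT_COUNT.matcher(entryMetaText).replaceAll("");
    }

    public static void main(String[] args) {
        // Sample input modeled on the text reported in the question
        System.out.println(stripCommentCount("Written by Johnnie Adams . 1 Comment"));
        // prints: Written by Johnnie Adams .
    }
}
```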

Answer 1: (score: 1)

Try this

This is how I read and parse the file with the HTML content

Output (edited for the new example HTML):

" Written by "Johnnie Adams " on April 23rd, 2002 11:28 PM
https://ohniee.com/wp-content/uploads/avatars/1/djsy8933e89ufio8389e8-author-img.jpg

The first line of the output has everything after PM stripped, so the extra period and quotation mark at the end are missing.
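The code for this answer did not survive, so the following is only a sketch of the stripping step described above, not the answerer's actual implementation. It assumes the timestamp ends in a pattern like "11:28 PM" (or AM) and cuts the text right after it. PostDateTrimmer and keepThroughTime are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper for illustration; the original answer's code is not available.
class PostDateTrimmer {
    // A clock time followed by AM or PM, e.g. "11:28 PM"
    private static final Pattern TIME = Pattern.compile("\\d{1,2}:\\d{2} [AP]M");

    // Keeps everything up to and including the first time stamp; unchanged if none is found.
    static String keepThroughTime(String entryMetaText) {
        Matcher m = TIME.matcher(entryMetaText);
        return m.find() ? entryMetaText.substring(0, m.end()) : entryMetaText;
    }

    public static void main(String[] args) {
        // Sample input modeled on the entry-meta text of the example HTML
        String text = "\" Written by \"Johnnie Adams \" on April 23rd, 2002 11:28 PM . \"1 Comment";
        System.out.println(keepThroughTime(text));
        // prints: " Written by "Johnnie Adams " on April 23rd, 2002 11:28 PM
    }
}
```

Matching on the time pattern rather than on the bare letters "PM" avoids cutting inside a name or title that happens to contain those letters.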