使用jsoup从html中获取URL

时间:2015-01-06 08:19:41

标签: java android jsoup

我正在尝试使用jsoup获取一个url,以便从该url中下载一个imad,因为它无法正常工作。

我首先想找到的地方" div class =" rg_di" "第一次出现在html文件中, 而不是获取之后的网址:

a href="http://www.google.co.il/imgres?imgurl=http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg&imgrefurl=http://michellepicker.wordpress.com/2011/04/25/grilled-chicken-mexican-style-black-beans-guacamole/&h=522&w=700&tbnid=4hXCtCfljxmJXM:&zoom=1&docid=ajIrwZMUrP5_GM&ei=iVOqVPmDDYrnaJzYgIAM&tbm=isch"

这是html的网址:

视图源:https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955

这是我试过的代码:

try 
        {
            doc = Jsoup.connect(url).get();
            Element link = doc.select("div.rg_di").first();
            Element link2 = link.select("a").first();
            String relHref = link2.attr("href"); // == "/"
            String absHref = link.attr("abs:href");
            tmpResult = absHref;



        } 
        catch (Exception e) 
        {
            Log.e("Error", e.getMessage());
            e.printStackTrace();
        }

完整活动代码:

package com.androidbegin.parselogintutorial;

import com.androidbegin.parselogintutorial.SingleRecipe.urlTask;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.koushikdutta.urlimageviewhelper.sample.UrlImageViewHelperSample;
import com.parse.GetCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import com.parse.ParseUser;
public class Bla extends Activity
{
    ImageView iv,bm;
    TextView recipeTitle;
    String urlForImage = "";
    @Override
    protected void onCreate(Bundle savedInstanceState) 
    {
        // TODO Auto-generated method stub
        super.onCreate(savedInstanceState);
        setContentView(R.layout.bla_layout);
        new urlTask("grilled mexican chicken").execute("grilled mexican chicken");
        //new DownloadImageTask((ImageView)findViewById(R.id.RecipeImage)).execute(urlForImage);
    }
    public class DownloadImageTask extends AsyncTask<String, Void, Bitmap> 
    {
        ImageView bmImage;
        public DownloadImageTask(ImageView bmImage) {
            this.bmImage = bmImage;
        }
        protected Bitmap doInBackground(String... urls) 
        {
            String urldisplay = urls[0];
            Bitmap mIcon11 = null;
            try 
            {
                InputStream in = new java.net.URL(urldisplay).openStream();
                mIcon11 = BitmapFactory.decodeStream(in);
                in.close();
            } 
            catch (Exception e) 
            {
                Log.e("Error", e.getMessage());
                e.printStackTrace();
            }
            return mIcon11;
        }
        protected void onPostExecute(Bitmap result) 
        {
            bmImage.setImageBitmap(result);
        }   
    }
    public class urlTask extends AsyncTask<String, Void, String> 
    {
        String str;
        public urlTask(String str)
        {
            this.str = str;
        }
        String tmpResult = str;
        Document doc;
        protected String doInBackground(String... urls) 
        {
            String urldisplay = urls[0];
            String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";
            WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
            HtmlPage page = null;
            try 
            {
                page = webClient.getPage(url);
            } catch (FailingHttpStatusCodeException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (MalformedURLException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (IOException e1) 
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            } 
            try 
            {
                Document doc = Jsoup.parse(page.asXml());
                Elements divs = doc.select(".rg_di");
                for(Element div : divs)
                {
                    Element img = div.select("a").get(0);
                    String link  = img.attr("href");
                    System.out.println(link);
                }

            }
            catch (Exception e) 
            {
                 e.printStackTrace();
            }
            return tmpResult;
        }
        protected void onPostExecute(String result) 
        {
            result = tmpResult;
            urlForImage = tmpResult;
        }   
    }
}

感谢您的帮助

1 个答案:

答案 0 :(得分:4)

我编辑了你的代码以摆脱错误403

而不是:

doc = Jsoup.connect(url).get();

写下这个:

doc = Jsoup.connect(url).userAgent("Mozilla").get();

link似乎是动态生成的。 Jsoup提取不包含 .rg_di 类的html,因此

doc.select("div.rg_di").first();

返回null,我们得到nullpointerexception。

jsoup

下载的html片段
<img height="104" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT" width="140">

我们所做的最好是获取每个img代码并对其进行迭代,我们会获得图标链接列表

Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements imgs = doc.select("img");
for(Element img : imgs){
    String link  = img.attr("src");
    System.out.println(link);
}

/textinputassistant/tia.png
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQMq354p43ddqPcpV9-q_05YkmY7XUPgv6Sl2oQLqFxQ5-IkpGAAuFTLMM
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTW-RinkkW_fBdlHzTJn6vNmR85TR58geQgfjQnEJmOqzjq0Oi-z-8zXjg
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRUXLzKi3UyQ6mF9JD20Z1jYNhVxQz7tkhJIEGOL3kua8ptoQrvo8-Nco_X
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTverQlzF_hauCabscWF4wHLb_q7g9M_UDKO6LaldSRHhsTj7CxtVF2yvc
...

有许多解析动态内容的解决方案。 link

编辑1

我实施 htmlunit 来呈现网页

import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;


public class Main {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";
        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
        HtmlPage page = webClient.getPage(url); 
        try {
            Document doc = Jsoup.parse(page.asXml());
            Elements divs = doc.select(".rg_di");
            for(Element div : divs){
                Element img = div.select("a").get(0);
                String link  = img.attr("href");
                System.out.println(link);
            }
        } catch (Exception e) {
             e.printStackTrace();
        }
    }
}

htmlunit 有自己的html解析api,但我会坚持使用更直观的 jsoup

编辑2

只要你的目标是在Android设备上呈现和解析HTML页面,HTMLUnit不是一个好的选择source

  

HtmlUnit使用Android上不可用的Java类。   最重要的是,HtmlUnit使用了许多其他库,其中一些库可能对这些库有自己的依赖关系。因此,和HmlUnit一样棒,我认为让它在Android上运行可能不是一件容易的事。

  • 您可以尝试this种解决方案。或
  • 你可以折磨自己并尝试this解决方案(你最好不要)。或
  • 如果考虑到this家伙的经验,那么重新设计软件架构会更好:
    1. 创建呈现网页并解析它的java服务器。 HTMLUnit + Jsoup
    2. 以JSON格式将解析后的数据保存在服务器的文件系统中。 GSON
    3. 创建在Android应用程序请求时发送JSON文件的servlet。