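I am trying to use jsoup to extract a URL so that I can download an image from it, but it isn't working.
I first want to find the place where <div class="rg_di"> first appears in the HTML file, and then get the URL that follows it: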
a href="http://www.google.co.il/imgres?imgurl=http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg&imgrefurl=http://michellepicker.wordpress.com/2011/04/25/grilled-chicken-mexican-style-black-beans-guacamole/&h=522&w=700&tbnid=4hXCtCfljxmJXM:&zoom=1&docid=ajIrwZMUrP5_GM&ei=iVOqVPmDDYrnaJzYgIAM&tbm=isch"
This is the URL of the HTML page:
This is the code I tried:
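try
{
    doc = Jsoup.connect(url).get();
    // First <div class="rg_di"> in the document (null if there is none)
    Element link = doc.select("div.rg_di").first();
    // First <a> inside that div
    Element link2 = link.select("a").first();
    String relHref = link2.attr("href"); // == "/"
    // Read abs:href from the <a> element (link2), not from the div
    String absHref = link2.attr("abs:href");
    tmpResult = absHref;
}
catch (Exception e)
{
    // Pass the exception itself; e.getMessage() can be null
    Log.e("Error", "Failed to extract the link", e);
}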
Full Activity code:
package com.androidbegin.parselogintutorial;

import java.io.IOException;
import java.io.InputStream;
import java.net.MalformedURLException;

import android.app.Activity;
import android.graphics.Bitmap;
import android.graphics.BitmapFactory;
import android.os.AsyncTask;
import android.os.Bundle;
import android.util.Log;
import android.widget.ImageView;
import android.widget.TextView;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.androidbegin.parselogintutorial.SingleRecipe.urlTask;
import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;
import com.koushikdutta.urlimageviewhelper.sample.UrlImageViewHelperSample;
import com.parse.GetCallback;
import com.parse.ParseException;
import com.parse.ParseObject;
import com.parse.ParseQuery;
import com.parse.ParseUser;

public class Bla extends Activity
{
    ImageView iv, bm;
    TextView recipeTitle;
    String urlForImage = "";

    @Override
    protected void onCreate(Bundle savedInstanceState)
    {
        // TODO Auto-generated method stub
        super.onCreate(savedInstanceState);
        setContentView(R.layout.bla_layout);
        new urlTask("grilled mexican chicken").execute("grilled mexican chicken");
        //new DownloadImageTask((ImageView)findViewById(R.id.RecipeImage)).execute(urlForImage);
    }

    // Downloads a bitmap in the background and shows it in the given ImageView
    public class DownloadImageTask extends AsyncTask<String, Void, Bitmap>
    {
        ImageView bmImage;

        public DownloadImageTask(ImageView bmImage) {
            this.bmImage = bmImage;
        }

        protected Bitmap doInBackground(String... urls)
        {
            String urldisplay = urls[0];
            Bitmap mIcon11 = null;
            try
            {
                InputStream in = new java.net.URL(urldisplay).openStream();
                mIcon11 = BitmapFactory.decodeStream(in);
                in.close();
            }
            catch (Exception e)
            {
                // Pass the exception itself; e.getMessage() can be null
                Log.e("Error", "Failed to download the image", e);
            }
            return mIcon11;
        }

        protected void onPostExecute(Bitmap result)
        {
            bmImage.setImageBitmap(result);
        }
    }

    // Loads the search page with HtmlUnit and prints the link found in each .rg_di div
    public class urlTask extends AsyncTask<String, Void, String>
    {
        String str;
        String tmpResult;
        Document doc;

        public urlTask(String str)
        {
            this.str = str;
            // Assigned here because a field initializer would run before str is set
            this.tmpResult = str;
        }

        protected String doInBackground(String... urls)
        {
            String urldisplay = urls[0];
            String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";
            WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
            HtmlPage page = null;
            try
            {
                page = webClient.getPage(url);
            }
            catch (FailingHttpStatusCodeException e1)
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (MalformedURLException e1)
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            catch (IOException e1)
            {
                // TODO Auto-generated catch block
                e1.printStackTrace();
            }
            try
            {
                // Hand the rendered page to jsoup for parsing
                Document doc = Jsoup.parse(page.asXml());
                Elements divs = doc.select(".rg_di");
                for (Element div : divs)
                {
                    Element img = div.select("a").get(0);
                    String link = img.attr("href");
                    System.out.println(link);
                }
            }
            catch (Exception e)
            {
                e.printStackTrace();
            }
            return tmpResult;
        }

        protected void onPostExecute(String result)
        {
            result = tmpResult;
            urlForImage = tmpResult;
        }
    }
}
Thanks for your help.
Answer 0 (score: 4)
I edited your code to get rid of the 403 error.
Instead of:
doc = Jsoup.connect(url).get();
write this:
doc = Jsoup.connect(url).userAgent("Mozilla").get();
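For context, userAgent("Mozilla") sets the User-Agent request header on the outgoing request. Below is a rough standard-library sketch of the same idea. The class and method names (UserAgentRequest, prepare) are made up for illustration, and nothing is sent over the network until the connection is actually opened.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;

// Hypothetical helper for illustration; not part of jsoup.
final class UserAgentRequest {

    // Creates a connection object with the User-Agent header set.
    // The request is not sent until connect() or getInputStream() is called.
    static HttpURLConnection prepare(String url, String userAgent) {
        try {
            HttpURLConnection conn =
                    (HttpURLConnection) URI.create(url).toURL().openConnection();
            conn.setRequestProperty("User-Agent", userAgent);
            return conn;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        HttpURLConnection conn =
                prepare("https://www.google.co.il/search?q=grilled+mexican+chicken&tbm=isch", "Mozilla");
        System.out.println(conn.getRequestProperty("User-Agent"));
    }
}
```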
The link appears to be generated dynamically. The HTML that jsoup fetches does not contain the .rg_di class, so
doc.select("div.rg_di").first();
returns null, and we get a NullPointerException.
A fragment of the HTML downloaded by jsoup:
<img height="104" src="https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT" width="140">
The best we can do is fetch every img tag and iterate over them, which gives us a list of thumbnail links:
Document doc = Jsoup.connect(url).userAgent("Mozilla").get();
Elements imgs = doc.select("img");
for (Element img : imgs) {
    String link = img.attr("src");
    System.out.println(link);
}
/textinputassistant/tia.png
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcT-pctOxpuUcdq118aFU3s2miRfUa6Ev8eF-UsxARHV-vbcOUV8byEtt2YT
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcQMq354p43ddqPcpV9-q_05YkmY7XUPgv6Sl2oQLqFxQ5-IkpGAAuFTLMM
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTW-RinkkW_fBdlHzTJn6vNmR85TR58geQgfjQnEJmOqzjq0Oi-z-8zXjg
https://encrypted-tbn2.gstatic.com/images?q=tbn:ANd9GcRUXLzKi3UyQ6mF9JD20Z1jYNhVxQz7tkhJIEGOL3kua8ptoQrvo8-Nco_X
https://encrypted-tbn3.gstatic.com/images?q=tbn:ANd9GcTverQlzF_hauCabscWF4wHLb_q7g9M_UDKO6LaldSRHhsTj7CxtVF2yvc
...
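Note that the first entry in that list is a relative path, so it cannot be downloaded as-is. jsoup can resolve it against the page address if you read img.attr("abs:src") instead of img.attr("src"). For illustration, here is a minimal sketch of the same resolution using only java.net.URI. The helper name toAbsolute is hypothetical, and the base URL is a shortened form of the search URL from the question.

```java
import java.net.URI;

// Hypothetical helper for illustration; not part of jsoup.
final class AbsoluteSrc {

    // Resolves a possibly relative src value against the URL of the page it came from.
    // Values that are already absolute are returned unchanged.
    static String toAbsolute(String pageUrl, String src) {
        return URI.create(pageUrl).resolve(src).toString();
    }

    public static void main(String[] args) {
        String page = "https://www.google.co.il/search?q=grilled+mexican+chicken&tbm=isch";
        // prints https://www.google.co.il/textinputassistant/tia.png
        System.out.println(toAbsolute(page, "/textinputassistant/tia.png"));
    }
}
```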
There are many solutions for parsing dynamic content.
I used HtmlUnit to render the web page:
import java.io.IOException;
import java.net.MalformedURLException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

import com.gargoylesoftware.htmlunit.BrowserVersion;
import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class Main {
    public static void main(String[] args) throws FailingHttpStatusCodeException, MalformedURLException, IOException {
        String url = "https://www.google.co.il/search?q=grilled+mexican+chicken&es_sm=93&source=lnms&tbm=isch&sa=X&ei=h1OqVOH6B5bjaqGogvAP&ved=0CAgQ_AUoAQ&biw=1920&bih=955";

        WebClient webClient = new WebClient(BrowserVersion.FIREFOX_24); // Chrome not working
        HtmlPage page = webClient.getPage(url);

        try {
            // Hand the rendered page to jsoup for parsing
            Document doc = Jsoup.parse(page.asXml());
            Elements divs = doc.select(".rg_di");
            for (Element div : divs) {
                // First <a> inside each result div carries the link
                Element img = div.select("a").get(0);
                String link = img.attr("href");
                System.out.println(link);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
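Each href printed by this loop should look like the one quoted in the question, where the address of the full-size image is carried in the imgurl query parameter. Assuming that format holds (it is Google's markup and may change at any time), a minimal standard-library sketch for pulling the image address out could look like this. ImgUrlExtractor and extractImgUrl are hypothetical names, not part of jsoup or HtmlUnit.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

// Hypothetical helper for illustration; not part of jsoup or HtmlUnit.
final class ImgUrlExtractor {

    // Returns the decoded value of the "imgurl" query parameter, or null if it is absent.
    static String extractImgUrl(String href) {
        int q = href.indexOf('?');
        if (q < 0) {
            return null; // no query string at all
        }
        for (String pair : href.substring(q + 1).split("&")) {
            int eq = pair.indexOf('=');
            if (eq > 0 && pair.substring(0, eq).equals("imgurl")) {
                try {
                    // Undo any percent-encoding in the parameter value
                    return URLDecoder.decode(pair.substring(eq + 1), "UTF-8");
                } catch (UnsupportedEncodingException e) {
                    throw new IllegalStateException("UTF-8 is always supported", e);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String href = "http://www.google.co.il/imgres?imgurl=http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg&imgrefurl=http://michellepicker.wordpress.com/2011/04/25/grilled-chicken-mexican-style-black-beans-guacamole/&h=522&w=700";
        // prints http://michellepicker.files.wordpress.com/2011/03/grilled-chicken-mexican-style.jpg
        System.out.println(extractImgUrl(href));
    }
}
```

The returned string could then be passed to something like the DownloadImageTask from the question.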
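HtmlUnit has its own HTML parsing API, but I will stick with jsoup, which I find more intuitive.
If your goal is to render and parse HTML pages on an Android device, HtmlUnit is not a good choice. Quoting the source:
HtmlUnit uses Java classes that are not available on Android. On top of that, HtmlUnit uses many other libraries, some of which may have their own dependencies on those classes. So, as great as HtmlUnit is, I think getting it to run on Android may not be an easy task.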