Question

Here is a sample Amazon link whose image width and height I am trying to scrape:

http://images.amazon.com/images/P/0099441365.01.SCLZZZZZZZ.jpg

I am using jsoup, and here is my code:

import java.io.*;
import org.jsoup.*;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;
public class Crawler_main {

/**
 * @param args
 */
public static void main(String[] args) {
    // TODO Auto-generated method stub
    String filepath = "C:/imagelinks.txt";
    try (BufferedReader br = new BufferedReader(new FileReader(filepath))) {
        String line;
        String width;
        //String height;
        while ((line = br.readLine()) != null) {
           // process the line.
            System.out.println(line);
            Document doc = Jsoup.connect(line).ignoreContentType(true).get();
            //System.out.println(doc.toString());
            Elements jpg = doc.getElementsByTag("img");
            width = jpg.attr("width");
            System.out.println(width);
            //String title = doc.title();
        }
    }
    catch (FileNotFoundException ex){
        System.out.println("File not found");
    }
    catch(IOException ex){
        System.out.println("Unable to read line");
    }
    catch (Exception ex){
        System.out.println("Exception occured");
    }
}

}

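The HTML is fetched, but when I extract the width attribute it comes back empty. When I print the fetched HTML, it contains garbage characters (I guess that is the actual image data, which I am calling garbage characters). For example:

I can't even paste the document.toString() result into this editor. Help!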

Answer 1

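The problem is that you are fetching a jpg file, not any HTML. The call to ignoreContentType(true) is a clue, because its documentation states:

Ignore the document's Content-Type when parsing the response. By default this is false, an unrecognised content-type will cause an IOException to be thrown. (This is to prevent producing garbage by attempting to parse a JPEG binary image, for example.)

If you want the width of the actual jpg file, you can use this SO answer: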

BufferedImage bimg = ImageIO.read(new File(filename));
int width          = bimg.getWidth();
int height         = bimg.getHeight();
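
The snippet above reads from a local file, while the question starts from a URL. As a minimal sketch (not part of the original answer), the same ImageIO approach can be pointed at a stream opened from the link, so jsoup is not needed for the image itself. The class name ImageSize and the helper readSize are hypothetical names chosen for illustration; the sketch assumes the URLs are passed as program arguments rather than read from imagelinks.txt.

```java
import java.awt.image.BufferedImage;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import javax.imageio.ImageIO;

// Hypothetical helper class, named here only for illustration.
public class ImageSize {

    /** Decodes an image from the stream and returns {width, height}. */
    static int[] readSize(InputStream in) throws IOException {
        BufferedImage img = ImageIO.read(in);
        if (img == null) {
            // ImageIO.read returns null when no registered reader can decode the data
            throw new IOException("Not a supported image format");
        }
        return new int[] { img.getWidth(), img.getHeight() };
    }

    /** Opens the URL and measures the image it points to. */
    static int[] readSize(String imageUrl) throws IOException {
        try (InputStream in = URI.create(imageUrl).toURL().openStream()) {
            return readSize(in);
        }
    }

    public static void main(String[] args) throws IOException {
        // Each argument is assumed to be a direct link to an image file.
        for (String url : args) {
            int[] size = readSize(url);
            System.out.println(url + ": " + size[0] + " x " + size[1]);
        }
    }
}
```

Running it with the link from the question as the only argument should print that image's dimensions, provided the server still serves the file. If jsoup has to stay in the pipeline for other reasons, the bytes of a response can also be wrapped in a ByteArrayInputStream and passed to the same readSize(InputStream) helper.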

jsoup: scraping image width and height from an amazon.com link
