Question

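Here is my problem. I have a text file named "sites.txt" in which I typed random website URLs. My goal is to save the first image from each site. I tried filtering the server response for the img tag, and it works for some websites but not for others.

The sites where it works have an img src that starts with http://. The sites where it doesn't work have a src that starts with anything else.

I also tried prepending http:// to the img src values that don't have it, but I still get the same error: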

    Exception in thread "main" java.net.MalformedURLException: no protocol:
    at java.net.URL.<init>(Unknown Source)

My current code is:

    public static void main(String[] args) throws IOException{
    try {
        File file = new File ("sites.txt");
        Scanner scanner = new Scanner (file);
        String url;
        int counter = 0;
            while(scanner.hasNext()) 
                {   
                    url=scanner.nextLine();
                    URL page = new URL(url);
                    URLConnection yc = page.openConnection();
                       BufferedReader in = new BufferedReader(new InputStreamReader(yc.getInputStream()));
                       String inputLine = in.readLine();
                       while (!inputLine.toLowerCase().contains("img"))inputLine = in.readLine();
                       in.close();
                       String[] parts = inputLine.split(" ");
                       int i=0;
                       while(!parts[i].contains("src"))i++;
                       String destinationFile = "image"+(counter++)+".jpg";
                       saveImage(parts[i].substring(5,parts[i].length()-1), destinationFile);
                       String tmp=scanner.nextLine();
                       System.out.println(url);

                }
        scanner.close();
        }
            catch (FileNotFoundException e) 
            {
                System.out.println ("File not found!");
                System.exit (0);
            }

}

public static void saveImage(String imageUrl, String destinationFile) throws IOException {
    // TODO Auto-generated method stub
    URL url = new URL(imageUrl);
    String fileName = url.getFile();
    String destName = fileName.substring(fileName.lastIndexOf("/"));
    System.out.println(destName);
    InputStream is = url.openStream();
    OutputStream os = new FileOutputStream(destinationFile);

    byte[] b = new byte[2048];
    int length;

    while ((length = is.read(b)) != -1) {
        os.write(b, 0, length);
    }

    is.close();
    os.close();
}

I was also given a tip to use the Apache Jakarta HTTP client library, but I have no idea how to use it. I would appreciate any help.

Answer 1

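A URL (which is a type of URI) needs a scheme to be valid. In this case, http.

When you type www.google.com into a browser, the browser infers that you mean http:// and prepends it for you automatically. Java does not do this, hence your exception.

Make sure you always have http://. You can easily fix this with a regular expression: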

String fixedUrl = stringUrl.replaceAll("^((?!http://).{7})", "http://$1");

or

if(!stringUrl.startsWith("http://"))
    stringUrl = "http://" + stringUrl;
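
Note that both snippets above only look for http://, so a string that already starts with https:// would get a second scheme prepended. A minimal sketch of a slightly more forgiving check follows. The class name UrlSchemes and the method ensureScheme are made up for illustration, and treating a leading // as "add the default scheme" is an assumption about what you want:

```java
import java.util.Locale;

final class UrlSchemes {
    private UrlSchemes() {}

    /**
     * Returns the input unchanged if it already starts with http:// or https://;
     * otherwise prepends the given default scheme (hypothetical helper).
     */
    static String ensureScheme(String raw, String defaultScheme) {
        String s = raw.trim();
        String lower = s.toLowerCase(Locale.ROOT);
        if (lower.startsWith("http://") || lower.startsWith("https://")) {
            return s; // already has a scheme we recognise
        }
        if (s.startsWith("//")) {
            return defaultScheme + ":" + s; // protocol-relative form
        }
        return defaultScheme + "://" + s; // bare host or host/path
    }

    public static void main(String[] args) {
        System.out.println(ensureScheme("www.google.com", "http"));
        System.out.println(ensureScheme("https://example.com/a.png", "http"));
        System.out.println(ensureScheme("//cdn.example.com/x.png", "https"));
    }
}
```

This only helps for values that are really a host or host plus path. A src such as /images/logo.png still needs the site's host in front of it.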

Answer 2

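Another solution

Just try ImageIO, which contains static convenience methods for locating ImageReaders and ImageWriters and for performing simple encoding and decoding.

Sample code: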

// read an image from the URL
// I used the URL of your profile picture on Stack Overflow
BufferedImage image = ImageIO
        .read(new URL(
                "https://www.gravatar.com/avatar/3935223a285ab35a1b21f31248f1e721?s=32&d=identicon&r=PG&f=1"));

// save the image
ImageIO.write(image, "jpg", new File("resources/avatar.jpg"));
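
Two things can go wrong with the snippet above: ImageIO.read returns null when no registered reader can decode the data, and ImageIO.write returns false when no suitable writer is found. A rough sketch that checks both follows. The class ImageSaver and its methods are hypothetical names, and writing PNG instead of JPEG is my own choice here, since the JPEG writer may reject images that carry an alpha channel:

```java
import java.awt.image.BufferedImage;
import java.io.File;
import java.io.IOException;
import java.net.URL;
import javax.imageio.ImageIO;

final class ImageSaver {
    private ImageSaver() {}

    /** Writes the image as PNG, creating the parent directory if needed. */
    static void save(BufferedImage image, File target) throws IOException {
        if (image == null) {
            // ImageIO.read returns null when the content is not a decodable image
            throw new IOException("Content could not be decoded as an image");
        }
        File parent = target.getAbsoluteFile().getParentFile();
        if (parent != null) {
            parent.mkdirs(); // e.g. the "resources" directory may not exist yet
        }
        if (!ImageIO.write(image, "png", target)) {
            throw new IOException("No writer available for format png");
        }
    }

    /** Downloads and saves in one step; needs network access. */
    static void download(URL source, File target) throws IOException {
        save(ImageIO.read(source), target);
    }
}
```

With that in place, a call like ImageSaver.download(new URL(imageUrl), new File("resources/avatar.png")) would replace the two ImageIO lines.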

Answer 3

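When you scrape a site for HTML image elements and their src attributes, you will run into several different representations of the URL.

Some examples are: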

resource = https://google.com/images/srpr/logo9w.png
resource = google.com/images/srpr/logo9w.png
resource = //google.com/images/srpr/logo9w.png
resource = /images/srpr/logo9w.png
resource = images/srpr/logo9w.png

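For the second through fifth, you need to build the rest of the URL.

The second may be harder to tell apart from the fourth and fifth, but I'm sure there are workarounds. The URL Standard leads me to believe you won't see it very often, because I don't think it is technically valid.

The third case is pretty simple. If the resource variable starts with //, you only need to add the protocol/scheme to it. You can do this with the site object: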

url = site.getProtocol() + ":" + resource

For the fourth and fifth cases, you need to prepend the site's full URL to the resource.
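
As an alternative to string concatenation, the JDK's java.net.URI.resolve applies the standard relative-reference rules, which covers the third, fourth, and fifth forms and also pages that live in a subdirectory. A minimal sketch, with ResourceUrls as a made-up class name. It assumes the page URI has a non-empty path (for example a trailing /), and it will treat the second form as a relative path rather than as a host:

```java
import java.net.URI;

final class ResourceUrls {
    private ResourceUrls() {}

    /**
     * Resolves an img src value against the URI of the page it came from.
     * Throws IllegalArgumentException if the src is not a syntactically valid URI.
     */
    static URI resolve(URI page, String resource) {
        return page.resolve(resource.trim());
    }

    public static void main(String[] args) {
        URI page = URI.create("https://google.com/");
        String[] samples = {
            "https://google.com/images/srpr/logo9w.png", // already absolute
            "//google.com/images/srpr/logo9w.png",       // scheme taken from the page
            "/images/srpr/logo9w.png",                   // relative to the site root
            "images/srpr/logo9w.png"                     // relative to the page's directory
        };
        for (String sample : samples) {
            // each of these prints https://google.com/images/srpr/logo9w.png
            System.out.println(resolve(page, sample));
        }
    }
}
```

Call toURL() on the result if the rest of your code wants a java.net.URL.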

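Here is a sample application that uses jsoup to parse the HTML, along with a simple utility method that builds the resource URL. The buildResourceUrl method is the part you are interested in. Note that it does not handle the second case; I'll leave that to you.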

import java.io.*;
import java.net.*;
import org.jsoup.*;
import org.jsoup.nodes.*;
import org.jsoup.select.*;

public class SiteScraper {

    public static void main(String[] args) throws IOException {
        URL site = new URL("https://google.com/");
        Document doc = Jsoup.connect(site.toString()).get();
        Elements images = doc.select("img");
        for (Element image : images) {
            String src = image.attr("src");
            System.out.println(buildResourceUrl(site, src));
        }
    }

    static URL buildResourceUrl(URL site, String resource) 
            throws MalformedURLException {
        if (!resource.matches("^(http|https|ftp)://.*$")) {
            if (resource.startsWith("//")) {
                return new URL(site.getProtocol() + ":" + resource);
            } else {
                return new URL(site.getProtocol() + "://" + site.getHost() + "/" 
                        + resource.replaceAll("^/", ""));
            }
        }
        return new URL(resource);
    }
}

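This obviously won't cover everything, but it's a start. You may run into problems when the URL you are trying to access is in a subdirectory of the site root (i.e. http://some.place/under/the/rainbow.html). You may even encounter base64-encoded data URIs in the src attribute... It really depends on the specific case and how far you are willing to go.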
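
If you do decide to handle data URIs, one possible approach is to detect the data: prefix and decode the payload yourself instead of opening a connection. This is only a sketch: DataUris and its methods are hypothetical names, and it assumes the payload is plain base64 with no whitespace or percent-encoding:

```java
import java.util.Base64;
import java.util.Locale;

final class DataUris {
    private DataUris() {}

    /** True if the src value uses the data: scheme (case-insensitive). */
    static boolean isDataUri(String src) {
        return src.regionMatches(true, 0, "data:", 0, 5);
    }

    /** Decodes the payload of a data URI such as data:image/png;base64,.... */
    static byte[] decodeBase64(String src) {
        int comma = src.indexOf(',');
        if (!isDataUri(src) || comma < 0) {
            throw new IllegalArgumentException("Not a data URI");
        }
        // the part between "data:" and the comma holds the media type and flags
        String header = src.substring(5, comma).toLowerCase(Locale.ROOT);
        if (!header.endsWith(";base64")) {
            throw new IllegalArgumentException("Payload is not base64-encoded");
        }
        return Base64.getDecoder().decode(src.substring(comma + 1));
    }
}
```

The returned bytes can be written straight to a file, in the same way the saveImage loop writes what it reads from the stream.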

Saving the first image from a URL