Question

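I am trying to use Java to retrieve the HTML of a Google search query result. That is, if I search for a particular phrase on Google.com, I want to retrieve the HTML of the resulting web page (the page containing the possibly matching links along with their descriptions, URLs, and so on).

I tried to do this with the following code, which I found in a related post: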

import java.io.*;
import java.net.*;
import java.util.*;

public class Main {

    public static void main (String args[]) {

        URL url;
        InputStream is = null;
        DataInputStream dis;
        String line;

        try {
            url = new URL("https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            is = url.openStream();  // throws an IOException
            dis = new DataInputStream(new BufferedInputStream(is));

            while ((line = dis.readLine()) != null) {
                System.out.println(line);
            }
        } catch (MalformedURLException mue) {
             mue.printStackTrace();
        } catch (IOException ioe) {
             ioe.printStackTrace();
        } finally {
            try {
                if (is != null) {   // openStream() may have failed before 'is' was assigned
                    is.close();
                }
            } catch (IOException ioe ) {
                // nothing to see here
            }
        }
    }
}

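From: How do you Programmatically Download a Webpage in Java

The URL used in this code was obtained by running a Google search query from the Google homepage. For some reason I don't understand, if I instead type the phrase I want to search for into my web browser's URL bar and then use the URL of the resulting search results page in the code, I get a 403 error.

However, this code did not return the HTML of the search results page. Instead, it returned the source code of the Google homepage.

After further research, I noticed that if you view the source of a Google search results page (by right-clicking the background of the results page and selecting "View page source") and compare it with the source of the Google homepage, the two are identical.

If, instead of viewing the page source, I save the HTML of the search results page (by pressing Ctrl+S), I do get the HTML I am looking for.

Is there a way to retrieve the HTML of the search results page using Java?

Thanks!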

Answer 1

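Rather than parsing the HTML page produced by a standard Google search, you may be better off looking at the official Custom Search API, which returns Google's results in a more usable format. The API is definitely the way to go; otherwise your code could break whenever Google changes something in the google.com front-end HTML. The API is designed for developers, and your code will be less fragile.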
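
As a rough illustration of the API route, the sketch below only builds the request URL for the Custom Search JSON API. The `key` and `cx` values are placeholders (assumptions, not values from this thread); real ones come from your own Google account. The class and method names are hypothetical. Sending the request and parsing the JSON response are left out.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class CustomSearchUrl {

    // Endpoint of the Custom Search JSON API.
    private static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    /**
     * Builds the request URL for one query.
     * apiKey and engineId are placeholders supplied by the caller.
     */
    public static String buildUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + URLEncoder.encode(apiKey, StandardCharsets.UTF_8)
                + "&cx=" + URLEncoder.encode(engineId, StandardCharsets.UTF_8)
                + "&q=" + URLEncoder.encode(query, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Placeholder credentials; replace with your own before sending the request.
        System.out.println(buildUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "UCF football"));
    }
}
```

The response to such a request is JSON rather than HTML, so it can be read with any JSON library instead of an HTML parser.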

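To answer your question: we can't really help you based on the information you have provided. Your code appears to retrieve the HTML of stackoverflow; it is copied and pasted exactly from the question you linked. Did you try changing the code? What URL did you actually use when trying to retrieve Google search results?

I tried running your code with url = new URL("http://www.google.com/search?q=test"); and I personally got an HTTP 403 Forbidden error. A quick search suggests this happens when the web request carries no User-Agent header, but if you are actually getting HTML back, that doesn't exactly help you. If you want specific help you will have to provide more information, although switching to the Custom Search API will probably solve your problem.

Edit: new information was provided in the original question; it can now be answered directly!

After capturing the web request that Java sends and applying some basic debugging, I found your problem... let's take a look!

Here is the web request Java sends for the example URL you provided: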

GET / HTTP/1.1
User-Agent: Java/1.6.0_30
Host: www.google.com
Accept: text/html, image/gif, image/jpeg, *; q=.2, */*; q=.2
Connection: keep-alive

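Note that the request appears to ignore most of the URL, leaving only "GET /". That is strange, so I had to look into it.

According to the documentation of Java's URL class (and this is standard for all web pages): "A URL may have appended to it a "fragment", also known as a "ref" or a "reference". The fragment is indicated by the sharp sign character "#" followed by more characters ... This fragment is not technically part of the URL."

Let's look at your example URL...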

https://www.google.com/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951

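Notice that "#" is the first character of the file path? Java simply ignores everything after the "#", because the sharp sign is only used by the client (the web browser). That leaves you with the URL https://www.google.com/. Hey, at least it works as intended!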
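
The behaviour described above can be checked without any network access. The following is a small sketch, using a shortened version of the example URL, that prints which part of an address would end up on the HTTP request line. It uses java.net.URI (`URL.getRef()` exposes the same fragment); the class and method names are made up for illustration.

```java
import java.net.URI;

public class FragmentDemo {

    /** Returns path plus query, i.e. what is sent to the server; the fragment never is. */
    public static String requestTarget(String address) {
        URI uri = URI.create(address);
        String path = uri.getRawPath();
        if (path == null || path.isEmpty()) {
            path = "/";
        }
        String query = uri.getRawQuery();
        return query == null ? path : path + "?" + query;
    }

    public static void main(String[] args) {
        // Shortened form of the URL from the question: everything sits after '#'.
        String fragmentUrl = "https://www.google.com/#hl=en&output=search&q=UCF";
        // A URL that carries the search term in the query string instead.
        String queryUrl = "https://www.google.com/search?q=UCF";

        System.out.println("fragment:       " + URI.create(fragmentUrl).getFragment());
        System.out.println("request target: " + requestTarget(fragmentUrl));
        System.out.println("request target: " + requestTarget(queryUrl));
    }
}
```

For the first URL the request target is just "/", which matches the captured "GET /" request; only the second form puts the search term where the server can see it.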

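I can't tell you exactly what Google is doing, but the sharp-sign URL definitely means that Google returns the query results through some client-side (AJAX / JavaScript) script. I would be willing to bet that any query you send directly to the server (i.e. without the "#" sign) and without the proper headers will return a 403 Forbidden error. It looks like they are encouraging you to use the API :)

edit2: Following Tengji Zhang's answer to this question, here is working code that returns the results of the Google query "test":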

    // Needs: import java.io.*; import java.net.*; import java.nio.charset.StandardCharsets;
    try {
        URL url = new URL("https://www.google.com/search?q=test");
        URLConnection c = url.openConnection();
        // Send a browser-like User-Agent instead of Java's default one.
        c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
        // try-with-resources closes the stream even if reading fails, and
        // BufferedReader replaces the deprecated DataInputStream.readLine().
        // Decoding as UTF-8 is an assumption about the response charset.
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                System.out.println(line);
            }
        }
    } catch (MalformedURLException mue) {
        mue.printStackTrace();
    } catch (IOException ioe) {
        ioe.printStackTrace();
    }
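
The query in the code above is hard-coded to "test". For an arbitrary phrase, the text has to be URL-encoded before it goes into the q parameter. The sketch below shows one way to do that and to prepare the connection with the same User-Agent string; the class and method names are hypothetical, and whether Google answers with 200 or 403 for a given request is not something this sketch can guarantee.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

public class SearchRequest {

    // Same browser-like User-Agent string used in the answer's code.
    public static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168";

    /** Puts the phrase into the query string (not the fragment), URL-encoded. */
    public static String searchUrl(String phrase) {
        return "https://www.google.com/search?q=" + URLEncoder.encode(phrase, StandardCharsets.UTF_8);
    }

    /** Creates the connection and sets the header; nothing is sent until the response is requested. */
    public static HttpURLConnection prepare(String phrase) {
        try {
            HttpURLConnection c =
                    (HttpURLConnection) URI.create(searchUrl(phrase)).toURL().openConnection();
            c.setRequestProperty("User-Agent", USER_AGENT);
            return c;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        HttpURLConnection c = prepare("UCF football schedule");
        System.out.println(c.getURL());
        // getResponseCode() actually sends the request; a 403 here would mean it was rejected.
        System.out.println("HTTP " + c.getResponseCode());
    }
}
```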

Answer 2

I suggest you try http://seleniumhq.org/

There is a good Google search tutorial:

http://code.google.com/p/selenium/wiki/GettingStarted

Answer 3

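You are not setting the User-Agent in your code.

c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");   // c is the URLConnection returned by url.openConnection()

Alternatively, you can read "http://www.google.com/robots.txt". This file tells you which URLs the Google servers allow crawlers to request.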
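
To make the robots.txt suggestion concrete, here is a deliberately simplified sketch of how such a file could be checked for a given path. It only looks at the "User-agent: *" group, ignores wildcards, and applies the longest matching prefix. The sample rules in main are illustrative and are not a copy of Google's actual file; the class and method names are hypothetical.

```java
public class RobotsCheck {

    /**
     * Simplified check: reads the Allow/Disallow rules that follow a "User-agent: *" line
     * and applies the longest matching path prefix. Wildcards (*, $) and groups with
     * several User-agent lines are not handled.
     */
    public static boolean isAllowed(String robotsTxt, String path) {
        boolean inStarGroup = false;
        int bestLength = -1;
        boolean allowed = true;   // no matching rule means the path is allowed
        for (String raw : robotsTxt.split("\\R")) {
            String line = raw.replaceAll("#.*", "").trim();   // drop comments
            int colon = line.indexOf(':');
            if (colon < 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase();
            String value = line.substring(colon + 1).trim();
            if (field.equals("user-agent")) {
                inStarGroup = value.equals("*");
            } else if (inStarGroup && (field.equals("allow") || field.equals("disallow"))) {
                if (!value.isEmpty() && path.startsWith(value) && value.length() > bestLength) {
                    bestLength = value.length();
                    allowed = field.equals("allow");
                }
            }
        }
        return allowed;
    }

    public static void main(String[] args) {
        // Illustrative rules only.
        String sample = "User-agent: *\n"
                + "Disallow: /search\n"
                + "Allow: /search/about\n";
        System.out.println(isAllowed(sample, "/search?q=test"));
        System.out.println(isAllowed(sample, "/search/about"));
        System.out.println(isAllowed(sample, "/maps"));
    }
}
```

With these sample rules, a path under /search is reported as disallowed unless a longer Allow rule matches it.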

The following code succeeds.

package org.test.stackoverflow;

import java.io.*;
import java.net.*;
import java.nio.charset.StandardCharsets;

public class SearcherRetriver {
    public static void main (String args[]) {
        try {
            URL url = new URL("https://www.google.com.hk/#hl=en&output=search&sclient=psy-ab&q=UCF&oq=UCF&aq=f&aqi=g4&aql=&gs_l=hp.3..0l4.1066.1471.0.1862.3.3.0.0.0.0.382.1028.2-1j2.3.0...0.0.OxbV2LOXcaY&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=579625c09319dd01&biw=944&bih=951");
            URLConnection c = url.openConnection();
            // Send a browser-like User-Agent instead of Java's default one.
            c.setRequestProperty("User-Agent", "Mozilla/5.0 (Windows NT 6.1) AppleWebKit/535.19 (KHTML, like Gecko) Chrome/18.0.1025.168");
            // try-with-resources closes the stream even if reading fails, and
            // BufferedReader replaces the deprecated DataInputStream.readLine().
            // Decoding as UTF-8 is an assumption about the response charset.
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(c.getInputStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    System.out.println(line);
                }
            }
        } catch (MalformedURLException mue) {
            mue.printStackTrace();
        } catch (IOException ioe) {
            ioe.printStackTrace();
        }
    }
}

How do I retrieve the HTML of a search engine query result?
