Why does a page fetched programmatically differ from the normal Google page?

Asked: 2013-06-21 11:27:08

Tags: java parsing html-parsing google-search

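We want to fetch the current Google results page programmatically. We have tried several programming languages and techniques, but none of them returns the correct (current) Google page.

Java code example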

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.io.InputStream;
    import java.net.URL;
    import java.net.URLConnection;
    import java.net.URLEncoder;

    public class GoogleParser {

        public static void main(String[] args) {
            GoogleParser googleParser = new GoogleParser();
            googleParser.execute();
        }

        public void execute() {
            String[] params = {"ankara nüfusu"};
            final URL url = encodeGoogleQuery(params);

            System.out.println("Downloading [" + url + "]...\n\n\n\n\n");
            try {
                final String html = downloadString(url);
                System.out.println(html);
            } catch (IOException e) {
                e.printStackTrace();
            }
        }

        // Reads the whole stream and decodes it with the platform default charset.
        private static String downloadString(final InputStream stream) throws IOException {
            final ByteArrayOutputStream out = new ByteArrayOutputStream();
            int ch;
            while (-1 != (ch = stream.read()))
                out.write(ch);
            return out.toString();
        }

        // Requests the URL with a custom User-Agent header and returns the response body.
        private static String downloadString(final URL url) throws IOException {
            final String agent = "Mozilla/21.0 (Windows; U; Windows 7; en-US)";
            final URLConnection connection = url.openConnection();
            connection.setRequestProperty("User-Agent", agent);
            // Close the stream once the body has been read.
            try (InputStream stream = connection.getInputStream()) {
                return downloadString(stream);
            }
        }

        // Builds http://www.google.com/search?q=... from the URL-encoded arguments.
        private static URL encodeGoogleQuery(final String[] args) {
            try {
                final StringBuilder localAddress = new StringBuilder();
                localAddress.append("/search?q=");

                for (int i = 0; i < args.length; i++) {
                    final String encoding = URLEncoder.encode(args[i], "UTF-8");
                    localAddress.append(encoding);
                    if (i + 1 < args.length)
                        localAddress.append("+");
                }

                return new URL("http", "www.google.com", localAddress.toString());

            } catch (final IOException e) {
                // Errors should not occur under normal circumstances.
                throw new RuntimeException(
                        "An error occurred while encoding the query arguments.", e);
            }
        }
    }

First image: the result page that the Java code receives
Second image: the current Google page as shown in a browser

The HTML page that the Java code gets from Google differs from the current Google page:

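  1. The results are different
  2. It does not contain the Google Instant result (the "4,551 milyon (2011)" part)
  3. It does not contain the Google Graph result (the Ankara information on the right)
  4. It is an older version of the page than the current one
  5. The navigation items (Web, Images, Videos) are on the left; normally they are below the search bar

Do you know how to get the current (latest) Google page programmatically? Solutions in other languages are also valuable to us.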

Thank you for your replies.

1 answer:

Answer 0 (score: 0)

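Google is very good at detecting who is sending a request:

  1. Make sure you send the same cookies as your browser does
  2. Make sure you send the same, or at least a valid, browser User-Agent string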
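
The two points above could be sketched roughly as follows. This is only a minimal illustration, not a guaranteed way to receive the same page as a browser. The class name `BrowserLikeFetcher`, the helper `browserLikeHeaders`, the `Accept-Language` value, and the example User-Agent (the format Firefox 21 on Windows 7 typically sends) are all assumptions made for this example; in practice you would copy the exact User-Agent and Cookie values from your own browser's request. Note that the `Mozilla/21.0 (...)` string in the question does not look like one a real browser sends.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URL;
import java.net.URLConnection;
import java.util.LinkedHashMap;
import java.util.Map;

class BrowserLikeFetcher {

    // Assumption: an example desktop User-Agent; replace it with the one your browser sends.
    static final String USER_AGENT =
            "Mozilla/5.0 (Windows NT 6.1; WOW64; rv:21.0) Gecko/20100101 Firefox/21.0";

    // Builds the request headers. cookieHeader is the raw Cookie value copied
    // from the browser (hypothetical input); pass null to send no cookies.
    static Map<String, String> browserLikeHeaders(String cookieHeader) {
        Map<String, String> headers = new LinkedHashMap<>();
        headers.put("User-Agent", USER_AGENT);
        headers.put("Accept-Language", "en-US,en;q=0.8"); // assumption: example value
        if (cookieHeader != null && !cookieHeader.isEmpty()) {
            headers.put("Cookie", cookieHeader);
        }
        return headers;
    }

    // Sends a GET request with the given headers and returns the body decoded as UTF-8.
    static String fetch(URL url, Map<String, String> headers) throws IOException {
        URLConnection connection = url.openConnection();
        for (Map.Entry<String, String> header : headers.entrySet()) {
            connection.setRequestProperty(header.getKey(), header.getValue());
        }
        try (InputStream in = connection.getInputStream()) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = in.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toString("UTF-8");
        }
    }

    public static void main(String[] args) throws IOException {
        // Optional first argument: the Cookie header value copied from the browser.
        String cookie = args.length > 0 ? args[0] : null;
        URL url = new URL("http://www.google.com/search?q=ankara+n%C3%BCfusu");
        System.out.println(fetch(url, browserLikeHeaders(cookie)));
    }
}
```

Even with matching headers the response may still differ from what a browser shows, since parts of the page may be filled in by JavaScript after loading; the sketch only covers the two checks listed in the answer.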