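This program reads a text file of search queries, queries Google with each one, and writes all the resulting links to another file. It works for a few hundred queries, then suddenly stops working and reports errors.
(I will edit this post and add which lines of my program the errors come from.)
Any ideas what might be going on?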
import java.io.*;
import java.net.URL;
import java.net.URLConnection;
import java.util.ArrayList;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import java.util.Scanner;
public class GoogleSearcher {
    public static void main(String [] args) throws Exception {
        Scanner in = new Scanner (System.in);
        System.out.println("Input list of queries to search:");
        String loc = in.nextLine();
        loc = loc.replace("\\", "");
        System.out.println("Where to write file?");
        String writeLoc = in.nextLine();
        writeLoc = writeLoc.replace("\\", " ");
        FileInputStream fstream = new FileInputStream(loc);
        BufferedReader br = new BufferedReader(new InputStreamReader(fstream));
        String line;
        PrintWriter pw = new PrintWriter(new FileWriter(writeLoc + "Google Search Results.txt"));
        while ((line = br.readLine()) != null) {
            System.out.println("Searching: \"" + line + "\"");
            ArrayList<String> t = googleSearch(line);
            if (t != null){
                for (int a = 0; a < t.size(); a++){
                    pw.write(t.get(a) + System.lineSeparator());
                }
            }
        }
        br.close();
        pw.close();
    }

    public static ArrayList<String> googleSearch(String search) throws Exception {
        try {
            String query = "https://www.google.com/search?q=" + search.replace(" ", "%20");
            String page = getSearchContent(query);
            ArrayList<String> links = parseLinks(page);
            return formatLinks(links);
        } catch (Exception e) {
            e.printStackTrace();
            System.out.println("Error... Trying next search");
            return null;
        }
    }

    public static ArrayList<String> formatLinks(ArrayList a){
        ArrayList<String> formatted = new ArrayList<String>();
        for (int i = 0; i < a.size(); i++){
            String t = (String)a.get(i);
            t = t.replace("%3F", "?");
            t = t.replace("%3D", "=");
            formatted.add(t);
        }
        return formatted;
    }

    public static String getString(InputStream is) {
        StringBuilder sb = new StringBuilder();
        BufferedReader br = new BufferedReader(new InputStreamReader(is));
        String line;
        try {
            while ((line = br.readLine()) != null) {
                sb.append(line);
            }
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        return sb.toString();
    }

    public static String getSearchContent(String path) throws Exception {
        final String agent = "Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)";
        URL url = new URL(path);
        final URLConnection connection = url.openConnection();
        connection.setRequestProperty("User-Agent", agent);
        final InputStream stream = connection.getInputStream();
        return getString(stream);
    }

    public static ArrayList<String> parseLinks(final String html) throws Exception {
        ArrayList<String> result = new ArrayList<String>();
        String pattern1 = "<h3 class=\"r\"><a href=\"/url?q=";
        String pattern2 = "\">";
        Pattern p = Pattern.compile(Pattern.quote(pattern1) + "(.*?)" + Pattern.quote(pattern2));
        Matcher m = p.matcher(html);
        while (m.find()) {
            String domainName = m.group(0).trim();
            // remove unwanted text
            domainName = domainName.substring(domainName.indexOf("/url?q=") + 7);
            domainName = domainName.substring(0, domainName.indexOf("&"));
            result.add(domainName);
        }
        return result;
    }
}
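Answer 0 (score: 2)
That is by design. Whenever Google detects that some kind of automated software is fetching its results, it asks for human verification and shows a CAPTCHA.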
See this answer from support.google.com.
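"Unusual traffic from your computer network"
You might see "Our systems have detected unusual traffic from your computer network" if it looks like a computer or phone on your network is sending automated traffic to Google.
What Google considers automated traffic
- Sending searches from a robot, computer program, automated service, or search scraper
- Using software that sends searches to Google to see how a website or webpage ranks on Google
What to do when you see this message
The error page will most likely show a CAPTCHA (a squiggly word with a box below it). To continue using Google, type the squiggly word into the box. This is how we know you are a human, not a robot. After you type the CAPTCHA correctly, the message will go away and you can use Google again.
If you want to use Google search in your own site, you can use Google Custom Search, which was created for exactly this purpose.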
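As a rough illustration of what switching to Custom Search could look like, here is a minimal sketch that only builds the request URL for the Custom Search JSON API. The class and method names are made up for this example, and "YOUR_API_KEY" and "YOUR_ENGINE_ID" are placeholders: you would need your own API key and search engine ID, and the service has its own quotas and terms that you should check before relying on it.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper, not part of the original program.
class CustomSearchUrl {
    // Endpoint of the Custom Search JSON API.
    static final String ENDPOINT = "https://www.googleapis.com/customsearch/v1";

    // Percent-encodes one query-string value (a space becomes '+').
    private static String enc(String value) {
        try {
            return URLEncoder.encode(value, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    // Builds the request URL; apiKey and engineId are values you supply yourself.
    static String buildRequestUrl(String apiKey, String engineId, String query) {
        return ENDPOINT
                + "?key=" + enc(apiKey)
                + "&cx=" + enc(engineId)
                + "&q=" + enc(query);
    }

    public static void main(String[] args) {
        // Placeholder credentials: this prints the URL only and sends nothing.
        System.out.println(buildRequestUrl("YOUR_API_KEY", "YOUR_ENGINE_ID", "autoradiograph film"));
    }
}
```

Encoding the whole query with URLEncoder also handles characters such as "&" or "#", which a plain replace(" ", "%20") would leave untouched. The response is JSON, so the regex-based parseLinks step would have to be replaced by a JSON parser.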
Answer 1 (score: 1)
OK, after running the program a few more times, I got the following error.
Error... Trying next search
Searching: "autoradiograph"
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Daustria&q=EgTLe7ahGOKSrcMFIhkA8aeDSylzciRE9l0cz9fUg6u2MeGh-muxMgNyY24
at sun.net.www.protocol.http.HttpURLConnection.getInputStream0(HttpURLConnection.java:1876)
at sun.net.www.protocol.http.HttpURLConnection.getInputStream(HttpURLConnection.java:1474)
at sun.net.www.protocol.https.HttpsURLConnectionImpl.getInputStream(HttpsURLConnectionImpl.java:254)
at application.GoogleSearcher.getSearchContent(GoogleSearcher.java:90)
at application.GoogleSearcher.googleSearch(GoogleSearcher.java:45)
at application.GoogleSearcher.main(GoogleSearcher.java:32)
java.io.IOException: Server returned HTTP response code: 503 for URL: https://ipv4.google.com/sorry/index?continue=https://www.google.com/search%3Fq%3Dautoradiograph&q=EgTLe7ahGOKSrcMFIhkA8aeDS_cQehdQreptc4cInLKEPYpprweeMgNyY24
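This is happening because Google blocks automated searches to guard against Denial of Service attacks on its servers.
Google probably does not allow you to perform automated searches. Here is a link to their support page. Below is an excerpt from that page.
Automated queries
Google's Terms of Service do not allow the sending of automated queries of any sort to our system without express permission in advance from Google. Sending automated queries consumes resources and includes using any software (such as WebPosition Gold) to send automated queries to Google to determine how a website or webpage ranks in Google search results for various queries. In addition to rank checking, other types of automated access to Google without permission are also a violation of our Webmaster Guidelines and Terms of Service.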
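Since every query after the block fails the same way, the program could at least recognise the block page and stop instead of printing a stack trace per query. The following is only a sketch based on what the trace above shows (status 503 and a URL containing "/sorry/"). The class and method names are hypothetical, and Google may signal a block in other ways that this check does not cover.

```java
import java.io.IOException;
import java.net.HttpURLConnection;

// Hypothetical helper, not part of the original program.
class BlockDetector {
    // True when the status code or address matches the block seen in the trace:
    // HTTP 503, or a URL on the "/sorry/" path.
    static boolean looksBlocked(int status, String finalUrl) {
        return status == 503 || finalUrl.contains("/sorry/");
    }

    // Same check on a live connection. getResponseCode() sends the request if needed;
    // getURL() is assumed here to reflect the address after redirects were followed.
    static boolean looksBlocked(HttpURLConnection connection) throws IOException {
        return looksBlocked(connection.getResponseCode(), connection.getURL().toString());
    }

    public static void main(String[] args) {
        // Offline demonstration with values shaped like those in the trace.
        System.out.println(looksBlocked(503, "https://ipv4.google.com/sorry/index?continue=..."));
        System.out.println(looksBlocked(200, "https://www.google.com/search?q=austria"));
    }
}
```

In getSearchContent, the URLConnection could be cast to HttpURLConnection and checked this way before calling getInputStream(), so the main loop can break out as soon as the block appears.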