How to filter out the search results (links) from a Google search in Java

Time: 2014-10-04 23:40:57

Tags: java

I want to filter out all website links from a Google search. When I search for something, I want to get all the links to websites that Google shows.

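First, I want to read the complete HTML content. After that, I want to filter out all the relevant URLs. For example, if I enter "buy shoes" into Google, I want to get links such as "www.amazon.in/Shoes" and so on.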

When I run my program, I get only a few URLs, and all of them are Google's own pages, for example "google.de/intl/de/options/".

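PS: I checked the page source for the same query ("buy+shoes") in Chrome and in Firefox, and noticed that Chrome delivers far more content than Firefox. My guess is that I get so few results because the Java connection behaves like the Firefox browser, doesn't it? How can I get all of the links that Google shows?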

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.net.URLConnection;
public class findEveryUrl {
public static void main( String[] args ) throws IOException
{

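    // The keyword must go into the query string: everything after '#' is a URL
    // fragment, which is never sent to the server, so "/#q=" only fetches the home page.
    String gInput = "https://www.google.de/search?q=";
    // setKeyWord asks you to enter the keyword into the console; encode it for use in a URL
    String fullUrl = gInput + java.net.URLEncoder.encode(setKeyWord(), "UTF-8");
    // fullUrl is used for the InputStream and "www." is the string which is used for splitting
    findAllSubs( fullUrl, "www.");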
    //System.out.println("given url: " + fullUrl);
}



/**
 * Reads the page at urlString, splits every line at splitphrase and prints the pieces.
 * @param urlString has to be the full URL.
 * @param splitphrase is the String which is used for splitting.
 */
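static void findAllSubs( String urlString, String splitphrase )
{
    try
    {
        URL     url     = new URL( urlString );
        URLConnection yc = url.openConnection();
        // try-with-resources closes the reader even if reading fails
        try (BufferedReader in = new BufferedReader(new InputStreamReader(
                yc.getInputStream()))) {
            String inputLine;
            String[] array;

            // readLine() is called once per iteration; a second call here would
            // skip every other line and append "null" at the end of the stream
            while ((inputLine = in.readLine()) != null){
                // split() takes a regex, so quote the phrase to match it literally
                array = inputLine.split(java.util.regex.Pattern.quote(splitphrase));
                arrayToConsol(array);
            }
        }
    }catch (IOException e) {
        e.printStackTrace();
    }

}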



/**
 * setKeyWord() asks you for the search keyword for the google query
 * @return returns the keyword, which you wrote into the console
 */
public static String setKeyWord(){
    BufferedReader console = new BufferedReader(new InputStreamReader(System.in));
    System.out.print("Enter a KeyWord: ");

    String keyWord = null;
    try {
        keyWord = console.readLine();
    } catch (IOException e) {
        // shouldn't happen
        e.printStackTrace();
    }

    return keyWord;
}

public static void arrayToConsol(String[] array){
    for (String item : array) {
        System.out.println(item);
    }
}

public static void searchQueryToConsole(String url) throws IOException{
    URL googleSearch = new URL(url);
    URLConnection yc = googleSearch.openConnection();
    // try-with-resources closes the reader even if reading fails
    try (BufferedReader in = new BufferedReader(new InputStreamReader(
            yc.getInputStream()))) {
        String inputLine;
        while ((inputLine = in.readLine()) != null)
            System.out.println(inputLine);
    }
}
}

1 Answer:

Answer 0 (score: 0)

Here is a simple and easy solution.

http://www.programcreek.com/2012/05/call-google-search-api-in-java-program/
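
A detail worth knowing when building a search URL by hand: everything after `#` in a URL is a fragment, which stays on the client and is not sent to the server. A URL such as `https://www.google.de/#q=buy+shoes` therefore requests only the home page, and the keyword has to go into the query string. The sketch below illustrates this with `java.net.URI`. The class name `SearchUrlDemo` and the helper `buildSearchUrl` are hypothetical names of my own, not taken from the linked article.

```java
import java.io.UnsupportedEncodingException;
import java.net.URI;
import java.net.URLEncoder;

class SearchUrlDemo {

    // Hypothetical helper: puts the URL-encoded keyword into the query string.
    static String buildSearchUrl(String keyword) {
        try {
            return "https://www.google.de/search?q=" + URLEncoder.encode(keyword, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Keyword placed after '#': it lands in the fragment, not in the request.
        URI fragmentUri = URI.create("https://www.google.de/#q=buy+shoes");
        System.out.println("path:     " + fragmentUri.getPath());
        System.out.println("query:    " + fragmentUri.getQuery());
        System.out.println("fragment: " + fragmentUri.getFragment());

        // Keyword placed in the query string: this part is sent to the server.
        URI queryUri = URI.create(buildSearchUrl("buy shoes"));
        System.out.println("path:     " + queryUri.getPath());
        System.out.println("query:    " + queryUri.getRawQuery());
    }
}
```

Running `main` shows a `null` query for the first URI and `q=buy+shoes` for the second, which is the difference the server sees.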

If you want to parse other pages with CSS selectors instead, JSoup is an excellent library for that.

Document doc = Jsoup.connect("http://en.wikipedia.org/").get();
Elements newsHeadlines = doc.select("#mp-itn b a");
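
If adding a library is not an option, the filtering step itself (keep the outside links, drop Google's own) can be roughed out with `java.util.regex`. This is only a sketch under assumptions: the class `LinkFilterDemo`, the method `externalLinks`, and the sample HTML are made up for illustration, the host check is deliberately crude, and the markup a real result page returns may structure or wrap its links differently. A regex is also no substitute for a real HTML parser such as JSoup.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LinkFilterDemo {

    // Group 1: the absolute link inside href="..."; group 2: its host part.
    private static final Pattern HREF = Pattern.compile(
            "href=\"(https?://([^/\"?#]+)[^\"]*)\"", Pattern.CASE_INSENSITIVE);

    // Hypothetical helper: returns all absolute links whose host does not contain "google.".
    static List<String> externalLinks(String html) {
        List<String> links = new ArrayList<>();
        Matcher m = HREF.matcher(html);
        while (m.find()) {
            String host = m.group(2).toLowerCase();
            // Crude check (assumption): treat any host containing "google." as Google's own.
            if (!host.contains("google.")) {
                links.add(m.group(1));
            }
        }
        return links;
    }

    public static void main(String[] args) {
        // Made-up sample markup, not an actual Google response.
        String html =
                "<a href=\"https://www.google.de/intl/de/options/\">Options</a>"
              + "<a href=\"http://www.amazon.in/Shoes\">Shoes</a>"
              + "<a href=\"/relative\">Relative link</a>"
              + "<a href=\"https://example.com/shoes\">More shoes</a>";

        for (String link : externalLinks(html)) {
            System.out.println(link);
        }
    }
}
```

For the sample markup this prints the `amazon.in` and `example.com` links and skips both the Google link and the relative one.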