Extracting snippets from Google search results

Time: 2018-02-20 23:58:10

Tags: java html json google-search

I want to extract the snippets from Google search results. I am using the following code to parse the Google results page:

    Scanner scanner = new Scanner(System.in);
    System.out.println("Please enter the search term.");
    String searchTerm = scanner.nextLine();
    System.out.println("Please enter the number of results. Example: 5 10 20");
    int num = scanner.nextInt();
    scanner.close();

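    // URL-encode the search term so spaces and characters such as & or + survive in the query string
    String searchURL = GOOGLE_SEARCH_URL + "?q=" + java.net.URLEncoder.encode(searchTerm, "UTF-8") + "&num=" + num;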

    Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();

    Elements results = doc.select("//div//div//span[contains(@class, 'st')]/text()");

    for (Element result : results) {
        String linkText = result.text();
        System.out.println("Text::" + linkText);
    }

It extracts the result URLs and titles. The problem is that the snippets sit in "lower-level" HTML tags, as shown in the attached image (not reproduced here).

So how can I extract them?

1 answer:

Answer 0 (score: 0)

Use one of the following queries:

'//em[.="Stack Overflow"]/following-sibling::text()'

'//em[text()="Stack Overflow"]/following-sibling::text()'
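
As a minimal sketch of how the second query behaves, the following runs it with the JDK's built-in `javax.xml.xpath` API against a small, well-formed stand-in for one result snippet. The markup, the class name `SnippetXPathDemo`, and the helper `textAfterEm` are assumptions made for illustration only. Real Google result pages are not well-formed XML and their markup changes over time. Note also that jsoup's `select` expects CSS selectors rather than XPath, so a query like this needs an XPath-capable API.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class SnippetXPathDemo {

    // Hypothetical helper: returns the trimmed text nodes that follow
    // an <em> element whose text equals `term`.
    static List<String> textAfterEm(String xml, String term) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            // Same shape as the query in the answer.
            // Assumption: `term` contains no double quotes.
            String query = "//em[text()=\"" + term + "\"]/following-sibling::text()";
            NodeList nodes = (NodeList) xpath.evaluate(query, doc, XPathConstants.NODESET);

            List<String> texts = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                texts.add(nodes.item(i).getNodeValue().trim());
            }
            return texts;
        } catch (Exception e) {
            // Parsing or query errors are wrapped to keep the sketch short.
            throw new IllegalArgumentException("Could not parse or query the input", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical, simplified markup standing in for one result snippet.
        String xml = "<div><span class=\"st\"><em>Stack Overflow</em>"
                + " is a question and answer site.</span></div>";
        for (String text : textAfterEm(xml, "Stack Overflow")) {
            System.out.println("Text::" + text);
        }
    }
}
```

The query only returns text that is a sibling of the matched `<em>`, so any snippet text that comes before the highlighted term, or that sits inside other nested tags, would need a broader query.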