I want to extract the snippets from Google search results. I use the following code to parse the Google search results page:
// Requires: java.net.URLEncoder, java.util.Scanner, org.jsoup.Jsoup,
// org.jsoup.nodes.Document, org.jsoup.nodes.Element, org.jsoup.select.Elements
Scanner scanner = new Scanner(System.in);
System.out.println("Please enter the search term.");
String searchTerm = scanner.nextLine();
System.out.println("Please enter the number of results. Example: 5 10 20");
int num = scanner.nextInt();
scanner.close();
// URL-encode the term so spaces and special characters survive in the query string
String searchURL = GOOGLE_SEARCH_URL + "?q=" + URLEncoder.encode(searchTerm, "UTF-8") + "&num=" + num;
Document doc = Jsoup.connect(searchURL).userAgent("Mozilla/5.0").get();
// Jsoup's select() takes a CSS selector, not XPath; "span.st" matches <span> elements with class "st"
Elements results = doc.select("span.st");
for (Element result : results) {
    String linkText = result.text();
    System.out.println("Text::" + linkText);
}
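It extracts the result URLs and titles. The problem is that the snippets are nested in "lower-level" HTML tags, as shown in the attached image:
So how can I extract them?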
Answer 0 (score: 0)
Use an XPath query:
'//em[.="Stack Overflow"]/following-sibling::text()'
or
'//em[text()="Stack Overflow"]/following-sibling::text()'
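To see what the first query returns, here is a minimal sketch that runs it with the JDK's built-in javax.xml.xpath on a small, well-formed sample. The class name SnippetXPathDemo, the helper textAfterEm, and the sample markup are hypothetical and only for illustration. A real Google results page is not well-formed XML, so in practice it would have to go through an HTML parser first.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

class SnippetXPathDemo {

    // Hypothetical helper: returns the text nodes that follow, as siblings,
    // any <em> whose string value equals emText.
    // Assumes emText contains no double quote, since it is pasted into the query.
    static List<String> textAfterEm(String xml, String emText) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        String query = "//em[.=\"" + emText + "\"]/following-sibling::text()";

        // NODESET asks the evaluator for every matching node, not just the first
        NodeList nodes = (NodeList) xpath.evaluate(query, doc, XPathConstants.NODESET);

        List<String> texts = new ArrayList<>();
        for (int i = 0; i < nodes.getLength(); i++) {
            texts.add(nodes.item(i).getNodeValue());
        }
        return texts;
    }

    public static void main(String[] args) throws Exception {
        // Made-up sample that mimics a snippet span; not real Google markup
        String sample = "<div><span class=\"st\">"
                + "<em>Stack Overflow</em> is a question and answer site"
                + "</span></div>";
        for (String text : textAfterEm(sample, "Stack Overflow")) {
            System.out.println("Text::" + text);
        }
    }
}
```

The two queries differ slightly: `.="Stack Overflow"` compares the whole string value of the em element, while `text()="Stack Overflow"` compares its direct text-node children. For an em that contains only plain text, both match the same element.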