Question

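I am trying to extract the (first 5) URLs from a Google search results page, using the Selenium WebDriver. Firefox opens and the page loads, but my regular expression does not match the URLs on the page. How can I get the URLs?

This is the code I have so far: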

import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.openqa.selenium.WebDriver;
import org.openqa.selenium.firefox.FirefoxDriver;

public class Weburlext {

    public static void main(String[] args) {

        String line = null;
        WebDriver driver = new FirefoxDriver();
        driver.get("http://www.google.co.in/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=sample%20data");

        String regex = "@^(http\\:\\/\\/|https\\:\\/\\/)?([a-z0-9][a-z0-9\\-]*\\.)+[a-z0-9][a-z0-9\\-]*$@i";
        Pattern p = Pattern.compile(regex, Pattern.CASE_INSENSITIVE | Pattern.DOTALL);
        Matcher m = p.matcher(line);

        System.out.print(line);

        driver.quit();
    }
}

Answer 1

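I'm curious why you are using a regular expression to match http patterns in the page source. The proper way to get the first 5 results with Selenium is to locate the result elements and then read their "href" attribute. See the following code: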

driver.get("https://www.google.com.ph/webhp?sourceid=chrome-instant&ion=1&espv=2&ie=UTF-8#q=sample%20data");

List<WebElement> results = driver.findElements(By.cssSelector("div[class='rc'] > h3 > a"));
results.forEach(e -> System.out.println(e.getAttribute("href")));
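
If a regular expression over the page source is still wanted, a few things in the question's code would need to change. The pattern is wrapped in PHP-style delimiters (@...@i), which Java treats as literal characters. The ^ and $ anchors only allow a match when the whole input is a single URL. And line is never assigned; driver.getPageSource() would be the usual way to fill it. Below is a minimal sketch of an unanchored pattern that collects at most five href values. The class name, method name, and sample markup are made up for illustration, and a raw regex will also pick up links that are not search results, which is why locating the result elements is the more reliable route.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of Selenium.
class HrefRegexSketch {

    // Unanchored pattern: captures the value of each href="http(s)://..." attribute.
    private static final Pattern HREF =
            Pattern.compile("href=\"(https?://[^\"]+)\"", Pattern.CASE_INSENSITIVE);

    // Returns at most `limit` absolute URLs, in the order they appear in the markup.
    static List<String> firstLinks(String pageSource, int limit) {
        List<String> urls = new ArrayList<>();
        Matcher m = HREF.matcher(pageSource);
        while (urls.size() < limit && m.find()) {
            urls.add(m.group(1));
        }
        return urls;
    }

    public static void main(String[] args) {
        // Made-up markup standing in for the string returned by driver.getPageSource().
        String sample = "<a href=\"http://example.com/a\">A</a>"
                + "<a href=\"/relative\">skipped, not absolute</a>"
                + "<a href=\"https://example.org/b\">B</a>";
        firstLinks(sample, 5).forEach(System.out::println);
    }
}
```

In the question's program, the call would look like firstLinks(driver.getPageSource(), 5), made before driver.quit().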

Extracting URLs from a Google search page
