I am trying to crawl URLs in order to extract further URLs from each page. To do this, I read the page's HTML code, go through it line by line, match each line against a pattern, and extract the part I need, like this:
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleCrawler {
    static String pattern = "https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";
    static Pattern UrlPattern = Pattern.compile(pattern);
    static Matcher UrlMatcher;

    public static void main(String[] args) {
        try {
            URL url = new URL("https://stackoverflow.com/");
            BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
            String line;
            while ((line = br.readLine()) != null) {
                UrlMatcher = UrlPattern.matcher(line);
                if (UrlMatcher.find()) {
                    String extractedPath = UrlMatcher.group(1);
                    String extractedPath2 = UrlMatcher.group(2);
                    System.out.println("http://www." + extractedPath + ".com" + extractedPath2);
                }
            }
        } catch (Exception ex) {
            ex.printStackTrace();
        }
    }
}
However, there are a few problems with it that I want to solve:

1. How can I make http and www, or even both of them, optional? I have come across many cases where a link is missing one or both of those parts, so the regex will not match (see the sketch after this list).
2. My pattern captures everything between the scheme and the domain extension as the first group, and everything after it as the second. This, however, causes two sub-problems:
2.1 Since the input is HTML code, other HTML markup following the URL may get extracted along with it.
2.2 In System.out.println("http://www." + extractedPath + ".com" + extractedPath2); I cannot be sure the printed URL is correct (the previous issues aside), because I do not know which domain extension was actually matched.
3. How should I deal with both http and https?
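For problems 1 and 3, wrapping those parts in optional non-capturing groups is roughly what I have in mind; this is a sketch only, not a solution to problem 2 (the TLD list stays hardcoded):

// Sketch: "(?:https?://)?" makes the scheme optional and accepts https too;
// "(?:www\\.)?" makes www optional; capturing the TLD in group 2 would let
// the printed URL echo back whatever extension was actually matched.
static String pattern =
        "(?:https?://)?(?:www\\.)?([^&/]+)\\.(com|net|org)(/[^&\\s]*)?";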
Answer 0 (score: 1)
How about:
String subjectString = "https://stackoverflow.com/"; // the candidate URL to validate
try {
    boolean foundMatch = subjectString.matches(
        "(?imx)^\n" +
        "(# Scheme\n" +
        " [a-z][a-z0-9+\\-.]*:\n" +
        " (# Authority & path\n" +
        " //\n" +
        " ([a-z0-9\\-._~%!$&'()*+,;=]+@)? # User\n" +
        " ([a-z0-9\\-._~%]+ # Named host\n" +
        " |\\[[a-f0-9:.]+\\] # IPv6 host\n" +
        " |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\]) # IPvFuture host\n" +
        " (:[0-9]+)? # Port\n" +
        " (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Path\n" +
        " |# Path without authority\n" +
        " (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
        " )\n" +
        "|# Relative URL (no scheme or authority)\n" +
        " ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/? # Relative path\n" +
        " |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?) # Absolute path\n" +
        ")\n" +
        "# Query\n" +
        "(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "# Fragment\n" +
        "(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "$");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
Answer 1 (score: 0)
There are libraries for this. I have used HtmlCleaner; it does the job.
You can find it at: http://htmlcleaner.sourceforge.net/javause.php
Another example, using jsoup (untested): http://jsoup.org/cookbook/extracting-data/example-list-links
Quite readable.
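A minimal sketch of what the jsoup version could look like, following that cookbook page (likewise untested here; the URL is just an example):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class JsoupLinks {
    public static void main(String[] args) throws Exception {
        Document doc = Jsoup.connect("https://stackoverflow.com/").get();
        // jsoup normalizes tag names while parsing, so <A HREF=...> is covered too
        for (Element link : doc.select("a[href]")) {
            System.out.println(link.attr("abs:href")); // resolved against the base URL
        }
    }
}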
You can enhance the code below to select only <A> tags or other tags, HREF attributes, and so on...
Handling the mixed-case variants (HreF, HRef, ...) more precisely is left as an exercise.
import java.util.Map;
import java.util.Vector;

import org.htmlcleaner.*;

public class HtmlCleanerLinks {

    public static Vector<String> HTML2URLS(String _source) {
        Vector<String> result = new Vector<String>();
        HtmlCleaner cleaner = new HtmlCleaner();
        // Root node of the cleaned document
        TagNode node = cleaner.clean(_source);
        // All nodes, traversed recursively
        TagNode[] myNodes = node.getAllElements(true);
        int s = myNodes.length;
        for (int pos = 0; pos < s; pos++) {
            TagNode tn = myNodes[pos];
            // All attributes of this tag
            Map<String, String> mss = tn.getAttributes();
            // Name of the tag
            String name = tn.getName();
            // Is there an href?
            String href = "";
            if (mss.containsKey("href")) href = mss.get("href");
            if (mss.containsKey("HREF")) href = mss.get("HREF");
            if (name.equals("a")) result.add(href);
            if (name.equals("A")) result.add(href);
        }
        return result;
    }
}
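A hypothetical usage sketch: read the page source the same way as the crawler in the question, then hand it to HTML2URLS. The wrapper class name HtmlCleanerLinks is my own label for the method above:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;

public class HtmlCleanerDemo {
    public static void main(String[] args) throws Exception {
        // Read the raw page source, as in the question's crawler
        URL url = new URL("https://stackoverflow.com/");
        StringBuilder html = new StringBuilder();
        BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
        String line;
        while ((line = br.readLine()) != null) {
            html.append(line).append('\n');
        }
        br.close();
        for (String href : HtmlCleanerLinks.HTML2URLS(html.toString())) {
            System.out.println(href);
        }
    }
}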