Question

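I'm trying to crawl a URL in order to extract the other URLs found in each page. To do that, I read the page's HTML line by line, match each line against a pattern, and then extract the part I need, as shown below: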

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    public class SimpleCrawler {
        static String pattern="https://www\\.([^&]+)\\.(?:com|net|org|)/([^&]+)";

        static Pattern UrlPattern = Pattern.compile(pattern);
        static Matcher UrlMatcher;

        public static void main(String[] args) {
            try {
                URL url = new URL("https://stackoverflow.com/");
                BufferedReader br = new BufferedReader(new InputStreamReader(url.openStream()));
                // Java does not allow a declaration inside the loop condition,
                // so the variable is declared before the loop
                String line;
                while ((line = br.readLine()) != null) {
                    UrlMatcher = UrlPattern.matcher(line);

                    // find() is called once, so at most one URL per line is reported
                    if (UrlMatcher.find()) {
                        String extractedPath = UrlMatcher.group(1);
                        String extractedPath2 = UrlMatcher.group(2);

                        System.out.println("http://www."+extractedPath+".com"+extractedPath2);
                    }
                }
            } catch (Exception ex) {
                ex.printStackTrace();
            }
        }
    }

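However, I would like to fix a few problems with it:

1. How can I make http and www, or even both, optional? I have run into many links that lack one or both of these parts, so the regex does not match them.
2. In my code I create two groups: one from after http up to the domain extension, and a second for everything that follows it. However, this leads to two sub-problems:
   2.1 Because the input is HTML, other HTML markup that follows the URL may get extracted along with it.
   2.2 In System.out.println("http://www."+extractedPath+".com"+extractedPath2); I cannot be sure the printed URL is correct (regardless of the previous problems), because I don't know which domain extension was matched.
3. Last but not least, how can I match both http and https?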

Answer 1

How about:

try {
    boolean foundMatch = subjectString.matches(
        "(?imx)^\n" +
        "(# Scheme\n" +
        " [a-z][a-z0-9+\\-.]*:\n" +
        " (# Authority & path\n" +
        "  //\n" +
        "  ([a-z0-9\\-._~%!$&'()*+,;=]+@)?              # User\n" +
        "  ([a-z0-9\\-._~%]+                            # Named host\n" +
        "  |\\[[a-f0-9:.]+\\]                            # IPv6 host\n" +
        "  |\\[v[a-f0-9][a-z0-9\\-._~%!$&'()*+,;=:]+\\])  # IPvFuture host\n" +
        "  (:[0-9]+)?                                  # Port\n" +
        "  (/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?          # Path\n" +
        " |# Path without authority\n" +
        "  (/?[a-z0-9\\-._~%!$&'()*+,;=:@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?)?\n" +
        " )\n" +
        "|# Relative URL (no scheme or authority)\n" +
        " ([a-z0-9\\-._~%!$&'()*+,;=@]+(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)*/?  # Relative path\n" +
        " |(/[a-z0-9\\-._~%!$&'()*+,;=:@]+)+/?)                            # Absolute path\n" +
        ")\n" +
        "# Query\n" +
        "(\\?[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "# Fragment\n" +
        "(\\#[a-z0-9\\-._~%!$&'()*+,;=:@/?]*)?\n" +
        "$");
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
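
Note that matches() tests a whole string against the pattern. For pulling links out of lines of HTML, as the question does, a smaller pattern combined with a find() loop may be enough. The following is only a minimal sketch, not a complete URL grammar: the class name UrlExtractor, the method extract, and the restriction to the com/net/org extensions are assumptions made for illustration. It makes the scheme and www optional, accepts both http and https, captures the extension in its own group, and stops the path at quotes, whitespace and angle brackets so that trailing HTML is not swallowed.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class UrlExtractor {
    // (?:https?://)?   optional scheme, either http or https
    // (?:www\.)?       optional "www." prefix
    // group 1          host name without the extension
    // group 2          the extension that actually matched (com, net or org)
    // group 3          optional path; ends at whitespace, quotes, '<' or '>'
    private static final Pattern URL_PATTERN = Pattern.compile(
        "(?i)\\b(?:https?://)?(?:www\\.)?"
        + "([a-z0-9-]+(?:\\.[a-z0-9-]+)*)\\.(com|net|org)\\b"
        + "(/[^\\s\"'<>]*)?");

    // Returns every URL-like substring found in one line of text.
    static List<String> extract(String line) {
        List<String> urls = new ArrayList<>();
        Matcher m = URL_PATTERN.matcher(line);
        // while, not if: a single line may contain several links
        while (m.find()) {
            urls.add(m.group());
        }
        return urls;
    }

    public static void main(String[] args) {
        String line = "<a href=\"https://www.example.org/questions/1?tab=new\">q</a>";
        Matcher m = URL_PATTERN.matcher(line);
        while (m.find()) {
            // group(2) tells which extension matched, so nothing has to be guessed
            System.out.println(m.group() + "  (extension: " + m.group(2) + ")");
        }
    }
}
```

Because the scheme is optional, this sketch also picks up bare host names such as the domain part of an e-mail address, so some filtering afterwards may still be needed.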

Answer 2

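There is a library for that. I have used HtmlCleaner. It does the job.

You can find it at: http://htmlcleaner.sourceforge.net/javause.php

Another example, using jsoup (untested): http://jsoup.org/cookbook/extracting-data/example-list-links

It is quite readable.

You can extend it to select <A> tags or others, HREF attributes, etc...

or handle letter case more precisely (HreF, HRef, ...): left as an exercise.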

import java.util.Map;
import java.util.Vector;

import org.htmlcleaner.*;


public static Vector<String> HTML2URLS(String _source)
{
    Vector<String> result = new Vector<String>();

    HtmlCleaner cleaner = new HtmlCleaner();

    // Root node of the cleaned document
    TagNode node = cleaner.clean(_source);

    // All nodes, collected recursively
    TagNode[] myNodes = node.getAllElements(true);

    for (TagNode tn : myNodes)
    {
        // All attributes of this tag
        Map<String,String> mss = tn.getAttributes();

        // Name of the tag
        String name = tn.getName();

        // Is there an href?
        String href = "";
        if (mss.containsKey("href")) href = mss.get("href");
        if (mss.containsKey("HREF")) href = mss.get("HREF");

        // Keep the href of <a> / <A> tags; anchors without an href are skipped
        if (name.equalsIgnoreCase("a") && !href.isEmpty()) result.add(href);
    }
    return result;
}
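
The href values collected this way are often relative (for example /tags), so a crawler usually has to turn them into absolute URLs before following them. The following is a minimal sketch using only java.net.URI from the standard library. The class name LinkResolver and the method toAbsolute are hypothetical, and the sketch assumes the base URL includes a path (for example, ends in a slash).

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.List;

final class LinkResolver {
    // Resolves each href against the page's URL and keeps only http/https links.
    static List<String> toAbsolute(String baseUrl, List<String> hrefs) {
        URI base = URI.create(baseUrl);
        List<String> result = new ArrayList<>();
        for (String href : hrefs) {
            if (href == null || href.trim().isEmpty()) {
                continue; // anchor without a usable href
            }
            try {
                // Relative references are resolved against the base;
                // absolute URLs come back unchanged
                URI resolved = base.resolve(href.trim());
                String scheme = resolved.getScheme();
                if ("http".equalsIgnoreCase(scheme) || "https".equalsIgnoreCase(scheme)) {
                    result.add(resolved.toString());
                }
            } catch (IllegalArgumentException ex) {
                // href is not a syntactically valid URI (e.g. contains spaces): skip it
            }
        }
        return result;
    }
}
```

For example, toAbsolute("https://stackoverflow.com/questions/", HTML2URLS(html)) would pass on only the links that are worth fetching next, dropping mailto: and javascript: entries.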

Crawl a URL to extract all the other URLs in that page
