Using Jsoup, how do I get every piece of information from each link?

Asked: 2012-12-08 18:07:51

Tags: java jsoup

    package com.muthu;

    import java.io.BufferedWriter;
    import java.io.File;
    import java.io.FileWriter;
    import java.io.IOException;

    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;

    public class TestingTool
    {
        public static void main(String[] args) throws IOException
        {
            String url = "http://www.stackoverflow.com/";
            print("Fetching %s...", url);

            Document doc = Jsoup.connect(url).get();
            Elements links = doc.select("a[href]");
            System.out.println(doc.text());
            System.out.println();

            for (Element link : links)
            {
                print("  %s  (%s)", link.attr("abs:href"), trim(link.text(), 35));
            }

            BufferedWriter bw = new BufferedWriter(
                    new FileWriter(new File("C:/tool/linknames.txt")));
            for (Element link : links)
            {
                bw.write("Link: " + link.text().trim());
                bw.write(System.getProperty("line.separator"));
            }
            bw.flush();
            bw.close();
        }

        private static void print(String msg, Object... args) {
            System.out.println(String.format(msg, args));
        }

        private static String trim(String s, int width) {
            if (s.length() > width)
                return s.substring(0, width - 1) + ".";
            else
                return s;
        }
    }

1 Answer:

Answer 0 (score: 3)

If you connect to a URL, it will only parse the current page. But you can 1.) connect to a URL, 2.) parse the information you need, 3.) select all further links, 4.) connect to them, and 5.) continue as long as there are new links.

Consider:

  • You need a list (or similar) in which you store the links you have already parsed
  • You have to decide whether you want only links of this page or external links as well
  • You have to skip pages like "About", "Contact", etc.
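On the first point, a small design note: a `HashSet` is usually a better fit than the `ArrayList` used below, since `contains()` on a list is a linear scan while a set lookup is constant time, and `Set.add()` already tells you whether the URL was new. A minimal sketch (the class name and example URLs are mine, not from the answer):

```java
import java.util.HashSet;
import java.util.Set;

public class VisitedDemo {
    public static void main(String[] args) {
        // A HashSet gives O(1) membership checks, unlike ArrayList.contains()
        Set<String> visited = new HashSet<>();

        String[] found = {
            "http://example.com/a",
            "http://Example.com/A",   // same page, different case
            "http://example.com/b"
        };

        for (String url : found) {
            String key = url.toLowerCase(); // normalize before the lookup
            if (visited.add(key)) {         // add() returns false if already present
                System.out.println("visit " + key);
            } else {
                System.out.println("skip  " + key);
            }
        }
    }
}
```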

Edit:
(Note: you have to add some changes / error handling code)

List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited


public void visitUrl(String url) throws IOException
{
    url = url.toLowerCase(); // now it's case-insensitive

    if( !visitedUrls.contains(url) ) // Do this only if not visited yet
    {
        Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document

        /* ... Select your Data here ... */

        Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!

        for( Element next : nextLinks ) // Iterate over all Links
        {
            visitUrl(next.absUrl("href")); // Recursive call for all next Links
        }
    }
}

You have to add more restrictions/checks at the part where the next links are selected (maybe you want to skip/ignore some); and some error handling.
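One common restriction is to follow only links on the same host as the start page, so the crawler does not wander off to external sites. A hedged sketch using only `java.net.URI` (the `sameHost` helper is my own name, not something the answer or jsoup provides):

```java
import java.net.URI;

public class HostFilter {
    // Returns true if both URLs point at the same host (case-insensitive).
    // Hypothetical helper; add it before the recursive visitUrl() call.
    static boolean sameHost(String startUrl, String candidate) {
        try {
            String a = URI.create(startUrl).getHost();
            String b = URI.create(candidate).getHost();
            return a != null && a.equalsIgnoreCase(b);
        } catch (IllegalArgumentException e) {
            return false; // malformed URL -> do not follow it
        }
    }

    public static void main(String[] args) {
        String start = "http://www.stackoverflow.com/";
        System.out.println(sameHost(start, "http://www.stackoverflow.com/questions"));
        System.out.println(sameHost(start, "http://twitter.com/share"));
    }
}
```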


编辑2:

To skip ignored links you can do this:

  1. Create a Set / List / whatever in which you store the ignored keywords
  2. Fill it with those keywords
  3. Before you call the visitUrl() method with the newly parsed link, check whether this new URL contains any of the ignored keywords. If it contains at least one, it will be skipped.
  4. I modified the example a bit to do this (but it's not tested yet!).

    List<String> visitedUrls = new ArrayList<>(); // Store all links you've already visited
    Set<String> ignore = new HashSet<>(); // Store all keywords you want ignore
    
    // ...
    
    
    /*
     * Add keywords to the ignorelist. Each link that contains one of this
     * words will be skipped.
     * 
     * Do this in eg. constructor, static block or a init method.
     */
    ignore.add(".twitter.com");
    
    // ...
    
    
    public void visitUrl(String url) throws IOException
    {
        url = url.toLowerCase(); // Now it's case-insensitive
    
        if( !visitedUrls.contains(url) ) // Do this only if not visited yet
        {
            Document doc = Jsoup.connect(url).get(); // Connect to Url and parse Document
    
            /* ... Select your Data here ... */
    
            Elements nextLinks = doc.select("a[href]"); // Select next links - add more restriction!
    
            for( Element next : nextLinks ) // Iterate over all Links
            {
                boolean skip = false; // If false: parse the url, if true: skip it
                final String href = next.absUrl("href"); // Select the 'href' attribute -> next link to parse
    
                for( String s : ignore ) // Iterate over all ignored keywords - maybe there's a better solution for this
                {
                    if( href.contains(s) ) // If the url contains ignored keywords it will be skipped
                    {
                        skip = true;
                        break;
                    }
                }
    
                if( !skip )
                    visitUrl(next.absUrl("href")); // Recursive call for all next Links
            }
        }
    }
    

The next link is parsed here:

    final String href = next.absUrl("href");
    /* ... */
    visitUrl(next.absUrl("href"));
    

But probably you should add some stop condition for this part.
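One possible stop condition is a depth limit on the recursion. The sketch below demonstrates the idea without jsoup by replacing the network fetch with an in-memory link map; the `MAX_DEPTH` value, the `linkGraph` stub, and all page names are my assumptions, not part of the answer:

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class DepthLimitDemo {
    static final int MAX_DEPTH = 2;               // assumed limit, tune as needed
    static Set<String> visited = new HashSet<>();

    // Stub standing in for Jsoup.connect(url).get() + doc.select("a[href]")
    static Map<String, List<String>> linkGraph = Map.of(
        "a", List.of("b", "c"),
        "b", List.of("d"),
        "c", List.of(),
        "d", List.of("e"),
        "e", List.of()
    );

    static void visitUrl(String url, int depth) {
        if (depth > MAX_DEPTH) return;            // stop condition
        if (!visited.add(url)) return;            // already seen -> stop this branch
        for (String next : linkGraph.getOrDefault(url, List.of())) {
            visitUrl(next, depth + 1);
        }
    }

    public static void main(String[] args) {
        visitUrl("a", 0);
        System.out.println(visited);              // "e" lies beyond depth 2, so it is skipped
    }
}
```

In the real crawler you would thread the `depth` parameter through `visitUrl()` the same way, incrementing it on each recursive call.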