Parsing a URL with Jsoup gives me duplicate URLs

Date: 2014-09-15 15:49:51

Tags: java jsoup

I parse a given URL and try to save all internal URLs (same domain) in the allInnerLinks ArrayList and all external URLs in the allExternalLinks ArrayList.

public void go() {
    Document doc;
    baseUrl = CountLinks.result3;
    try {

        // need http protocol

        doc = Jsoup
                .connect(url)
                .userAgent(
                        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:25.0) Gecko/20100101 Firefox/25.0")
                .referrer("http://www.google.com").timeout(1000 * 5)
                .ignoreContentType(true).get();
        // get page title
        String title = doc.title();

        // get all links
        Elements links = doc.select("a[href]");

        for (Element link : links) {
            // resolve href against the base URI to get an absolute URL
            // (equivalent to link.absUrl("href"))
            String absUrl = link.attr("abs:href");

            // classify by whether the absolute URL is on the same domain
            if (absUrl.contains(baseUrl)
                    && !(absUrl.contains("mailto"))) {
                allInnerLinks.add(absUrl);
                allInnerLinksCounter++;
            } else {
                allExternalLinks.add(absUrl);
                allExternalLinksCounter++;
            }

        }

    } catch (NullPointerException e) {
        // TODO Auto-generated catch block
        e.printStackTrace();
    } catch (HttpStatusException e) {
        e.printStackTrace();
        System.out.println(e.getUrl());
    } catch (IOException e) {
        e.printStackTrace();
    }
}

But in the end I get duplicate elements. The same URL shows up again with a number sign (#) appended, and I can't understand where it comes from:

PAGEURL                                                     EXTERNAL URLS       
----------------------------------------------------------------------------------------
http://hostingmaks.com/category/news/                       https://meetings.webex.com/
http://hostingmaks.com/category/news/#                      https://meetings.webex.com/

What is causing this?
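For context: anchor-only links such as `<a href="#">` resolve, via `abs:href`, to the current page's URL with a trailing `#`, which is why the same page appears twice. A minimal sketch of normalizing such URLs before collecting them (the `stripFragment` helper name is hypothetical, not part of the question's code):

```java
public class UrlNormalizer {

    // Drop everything from the first "#" onward so that anchor variants
    // of a page collapse to the same string before they are collected.
    static String stripFragment(String url) {
        int hash = url.indexOf('#');
        return hash == -1 ? url : url.substring(0, hash);
    }

    public static void main(String[] args) {
        // prints http://hostingmaks.com/category/news/
        System.out.println(
                stripFragment("http://hostingmaks.com/category/news/#"));
    }
}
```

Applying this to each `absUrl` before adding it to either list would make both rows in the table above identical.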

1 answer:

Answer 0 (score: 0)

I wrote the simple method below that checks whether a URL ends in a trailing hashtag / number sign fragment, returning a boolean.

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public boolean hasHashTag(String url) {
    int index = url.lastIndexOf("#");
    if (index == -1) {
        return false;
    } else {
        // Treat the URL as having a hashtag fragment only if the part
        // after "#" contains nothing but letters, digits, and spaces
        // (this also matches an empty fragment, i.e. a bare trailing "#").
        Pattern p = Pattern.compile("[^a-z0-9 ]", Pattern.CASE_INSENSITIVE);
        Matcher m = p.matcher(url.substring(index + 1));
        return !m.find();
    }
}

You can now use this method to filter out the duplicates.

if(hasHashTag(URLHERE)) {
    //don't add to urls to search
} else {
    //add url to search
}
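Another option is to avoid the duplicates at collection time altogether: a sketch, assuming the question's `allInnerLinks` `ArrayList` is swapped for a `LinkedHashSet` (the variable names mirror the question; the `found` array stands in for the crawl loop):

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class LinkCollector {
    public static void main(String[] args) {
        // A LinkedHashSet keeps insertion order but rejects duplicates,
        // so "page/" and "page/#" collapse to one entry after trimming.
        Set<String> allInnerLinks = new LinkedHashSet<>();

        String[] found = {
            "http://hostingmaks.com/category/news/",
            "http://hostingmaks.com/category/news/#"
        };

        for (String absUrl : found) {
            // trim a bare trailing "#" before adding
            String cleaned = absUrl.endsWith("#")
                    ? absUrl.substring(0, absUrl.length() - 1)
                    : absUrl;
            allInnerLinks.add(cleaned);
        }

        System.out.println(allInnerLinks.size()); // 1
    }
}
```

With a `Set`, the separate `allInnerLinksCounter` field from the question also becomes unnecessary, since `allInnerLinks.size()` already counts distinct URLs.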