Pattern Matching for URL Classification

Asked: 2012-04-06 16:18:16

Tags: string algorithm pattern-matching

As part of a project, some others and I are working on a URL classifier. What we are trying to implement is actually quite simple: we just look at a URL, find the relevant keywords in it, and classify the page accordingly.

For example, if the URL is http://cnnworld/sports/abcd, we would classify it under the "Sports" category.

To implement this, we have a database in the format: keyword -> category.

What we do right now is, for each URL, read every entry in the database and use String.find() to check whether the keyword appears in the URL. Once a match is found, we stop.
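A rough sketch of what this linear scan amounts to (in Java, with contains() standing in for find(), and a hypothetical in-memory map standing in for the database table):

import java.util.Map;

public class NaiveClassifier {
    /**
     * Sketch of the current approach: walk every (keyword, category) entry
     * and stop at the first keyword that occurs anywhere in the URL.
     * keywordToCategory is a hypothetical stand-in for the database.
     */
    public static String classify(String url, Map<String, String> keywordToCategory) {
        for (Map.Entry<String, String> entry : keywordToCategory.entrySet()) {
            if (url.contains(entry.getKey())) {
                return entry.getValue(); // first match wins; later categories are never seen
            }
        }
        return null; // no keyword matched
    }
}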

But this approach has a few problems, mainly:

(i) Our database is very large, and repeatedly querying it like this is very slow.

(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to ensure this is to keep querying the database even after a category match is found, but that only makes things slower still.

I have been thinking about alternatives and wondering whether the reverse could be done: parse the URL, find the words that occur in it, and then query the database only for those words.

A naive algorithm for this would run in O(n^2): querying the database for every substring that occurs in the URL.

I was wondering whether there is a better way to achieve this. Any ideas? Thanks in advance :)

4 Answers:

Answer 0 (Score: 2):

In our commercial classifier we have a database of 4m keywords :) and we also search the body of the HTML. There are a number of ways to approach this:

  1. Use Aho-Corasick. We used a modified version of the algorithm tailored to handling web content; for example, it treats tab, space, \r and \n as whitespace, and only one of them, so two consecutive spaces are treated as a single space, and it also ignores case.
  2. Another option is to put all the keywords in a tree (e.g. std::map), which makes the search very fast. The downside is that this needs memory, and a lot of it, but if it runs on a server you won't feel it. (A minimal sketch of this option follows below.)
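A minimal Java sketch of the second option, assuming the keyword table fits in memory (a HashMap stands in for the std::map mentioned above, and the tokenizing regex is an assumption for illustration):

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

public class KeywordMapLookup {
    // Hypothetical in-memory index: keyword -> category, loaded once at startup
    private final Map<String, String> keywordToCategory = new HashMap<>();

    public void addKeyword(String keyword, String category) {
        keywordToCategory.put(keyword.toLowerCase(), category);
    }

    /** Tokenize the URL and look each token up directly instead of scanning every keyword. */
    public Set<String> categorize(String url) {
        Set<String> categories = new HashSet<>();
        for (String token : url.toLowerCase().split("[^a-z0-9]+")) {
            String category = keywordToCategory.get(token);
            if (category != null) {
                categories.add(category);
            }
        }
        return categories;
    }
}

Lookups then cost one hash probe per URL token instead of one pass over the whole keyword set, and collecting every matching category (rather than stopping at the first) comes for free.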

Answer 1 (Score: 1):

I think your suggestion of splitting the URL apart to find the useful bits and then querying for just those items sounds like a decent way to go.

I threw together some Java that might help illustrate what I think this would look like code-wise. The most valuable parts are probably the regexes, but I hope the general algorithm of it helps some as well:

import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser 
{
    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List<String> of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {

        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it's very implementation specific,
        // and you already know how to deal with databases.


        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters
     * Extend this to do other url cleaning/pre-processing
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url)
    {
        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch(UnsupportedEncodingException uee)
        {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

        // Remove the protocol, http:// ftp:// or similar from the front
        if (normalizedUrl.contains("://"))
        {
            normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
        }

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes apart the url into the pieces that make at least some sense
     * This doesn't guarantee that each token is a potentially valid keyword, however
     * because that would require actually iterating over them again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url)
    {
        // I assume that we're going to use the whole URL to find tokens in
        // If you want to just look in the GET parameters, or you want to ignore the domain
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches
        // but this is subject to your (?:in)?ability to actually write the regex you want

        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords)
    {
        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords)
        {
            // Reject invalid keywords
            if (isKeywordValid(keyword))
            {
                // If we need an OR
                if (whereItems.length() > 0)
                {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query
                // Yields something like "keyword='thisKeyword'"
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0)
        {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword)
    {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args)
    {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}

The query generated for the SO link looks like this:

SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'

There is plenty of room for optimization, but I would imagine it is much faster than checking the string against every possible keyword.
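One way to tighten this up, and to avoid concatenating raw keyword strings into the SQL, is to bind the keywords as parameters. A sketch using a JDBC PreparedStatement against the same KeywordCategoryMap table (this part is my own addition, not from the answer above):

import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class CategoryQuery {
    /** Runs the keyword query with bound parameters instead of string concatenation. */
    public static List<String> queryCategories(Connection connection, List<String> keywords)
            throws SQLException {
        List<String> categories = new ArrayList<>();
        if (keywords.isEmpty()) {
            return categories;
        }

        // Build "keyword IN (?, ?, ...)" with one placeholder per keyword
        StringBuilder placeholders = new StringBuilder();
        for (int i = 0; i < keywords.size(); i++) {
            placeholders.append(i == 0 ? "?" : ", ?");
        }
        String sql = "SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword IN ("
                + placeholders + ")";

        try (PreparedStatement statement = connection.prepareStatement(sql)) {
            for (int i = 0; i < keywords.size(); i++) {
                statement.setString(i + 1, keywords.get(i));
            }
            try (ResultSet results = statement.executeQuery()) {
                while (results.next()) {
                    categories.add(results.getString(1)); // the single category column
                }
            }
        }
        return categories;
    }
}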

Answer 2 (Score: 1):

The Aho-Corasick algorithm is best suited for finding embedded strings in a single traversal. You can build a tree (an Aho-Corasick trie) of your keywords, where the final node of each keyword stores a number mapped to that particular keyword.

Now you just need to run the URL string through the tree. When you reach a node carrying a number (acting as a flag in our scenario), it means a mapped category was hit. Look that number up in a hash map to find the corresponding category for further use.
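A self-contained Java sketch of that idea (the node layout, the case folding, and the use of integer category ids are my own simplifications for illustration, not prescribed by the answer):

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Queue;

/** Minimal Aho-Corasick sketch: every keyword carries the id of its category. */
class AhoCorasickMatcher {
    private static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;                                     // failure link
        List<Integer> categoryIds = new ArrayList<>(); // categories of keywords ending here
    }

    private final Node root = new Node();

    /** Insert one keyword together with the id of the category it maps to. */
    void addKeyword(String keyword, int categoryId) {
        Node node = root;
        for (char c : keyword.toLowerCase().toCharArray()) {
            node = node.next.computeIfAbsent(c, k -> new Node());
        }
        node.categoryIds.add(categoryId);
    }

    /** Build the failure links breadth-first; call once after all keywords are added. */
    void build() {
        Queue<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node current = queue.poll();
            for (Map.Entry<Character, Node> entry : current.next.entrySet()) {
                char c = entry.getKey();
                Node child = entry.getValue();
                Node fallback = current.fail;
                while (fallback != null && !fallback.next.containsKey(c)) {
                    fallback = fallback.fail;
                }
                child.fail = (fallback == null) ? root : fallback.next.get(c);
                child.categoryIds.addAll(child.fail.categoryIds); // inherit shorter matches
                queue.add(child);
            }
        }
    }

    /** Scan the text once; returns the category id of every keyword that occurs. */
    List<Integer> match(String text) {
        List<Integer> hits = new ArrayList<>();
        Node node = root;
        for (char c : text.toLowerCase().toCharArray()) {
            while (node != root && !node.next.containsKey(c)) {
                node = node.fail;
            }
            node = node.next.getOrDefault(c, root);
            hits.addAll(node.categoryIds);
        }
        return hits;
    }
}

The URL (or page body) is then scanned in a single pass, and every keyword occurrence is reported, so a page that belongs to several categories is handled at no extra cost.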

I think this will help you. Have a look at this link: good animation of aho-corasick by ivan

Answer 3 (Score: 0):

If you have (many) fewer categories than keywords, you could create one regex per category that matches any of that category's keywords. Then you would run your URL against each category's regex. This would also address the problem of matching multiple categories.
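A small Java sketch of this approach (the per-category keyword data and the word-boundary regex are assumptions for illustration):

import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Pattern;

public class CategoryRegexMatcher {
    // One compiled pattern per category, each matching any of that category's keywords
    private final Map<String, Pattern> categoryPatterns = new LinkedHashMap<>();

    public CategoryRegexMatcher(Map<String, List<String>> keywordsByCategory) {
        for (Map.Entry<String, List<String>> entry : keywordsByCategory.entrySet()) {
            // Build an alternation such as \b(?:sports|football|nba)\b
            StringBuilder alternation = new StringBuilder();
            for (String keyword : entry.getValue()) {
                if (alternation.length() > 0) {
                    alternation.append('|');
                }
                alternation.append(Pattern.quote(keyword));
            }
            categoryPatterns.put(entry.getKey(),
                    Pattern.compile("\\b(?:" + alternation + ")\\b", Pattern.CASE_INSENSITIVE));
        }
    }

    /** Returns every category whose pattern matches somewhere in the URL. */
    public List<String> categorize(String url) {
        List<String> matches = new ArrayList<>();
        for (Map.Entry<String, Pattern> entry : categoryPatterns.entrySet()) {
            if (entry.getValue().matcher(url).find()) {
                matches.add(entry.getKey());
            }
        }
        return matches;
    }
}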