As part of a project, some colleagues and I are working on a URL classifier. What we are trying to achieve is quite simple: we look at a URL, find the relevant keywords in it, and classify the page accordingly.
For example, for the URL http://cnnworld/sports/abcd, we would classify the page under the category "Sports".
To accomplish this, we have a database with entries of the form: keyword -> category.
What we currently do is, for each URL, read every entry in the database and use String.find() to check whether the keyword occurs in the URL. Once a match is found, we stop.
But this approach has some problems, mainly:
(i) Our database is very large, so this repeated querying runs extremely slowly.
(ii) A page may belong to more than one category, and our approach does not handle such cases. Of course, one simple way to ensure this would be to keep querying the database even after a category match is found, but that just makes things even slower.
I was considering alternatives, and wondered whether we could do the reverse: parse the URL, find the words that occur in it, and then query the database for only those words.
A naive algorithm for this would run in O(n^2): query the database for every substring that occurs in the URL.
I was wondering whether there is a better way to do this. Any ideas? Thanks in advance :)
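For concreteness, the current approach described in the question can be sketched roughly like this (the in-memory map and the sample keywords are made-up stand-ins for the real database):

```java
import java.util.*;

public class NaiveClassifierDemo {
    // Scans every database entry and returns the first category whose keyword
    // occurs in the URL, or null if no keyword matches.
    static String classify(String url, Map<String, String> keywordToCategory) {
        for (Map.Entry<String, String> entry : keywordToCategory.entrySet()) {
            if (url.contains(entry.getKey())) {  // analogous to String.find()
                return entry.getValue();         // first match wins; other categories are missed
            }
        }
        return null;
    }

    public static void main(String[] args) {
        // Hypothetical in-memory stand-in for the keyword -> category database
        Map<String, String> db = new LinkedHashMap<>();
        db.put("sports", "Sports");
        db.put("finance", "Finance");
        db.put("cnn", "News");

        System.out.println(classify("http://cnnworld/sports/abcd", db));
        // → Sports (the URL also contains "cnn", but the scan stops at the first hit)
    }
}
```

Note how the early return means the "cnn" keyword never gets a chance to contribute the "News" category, which is exactly problem (ii).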
Answer 0 (score: 2)
In our commercial classifier we have a database of 4M keywords :) We also search the body of the HTML; there are a number of ways to solve this:
Answer 1 (score: 1)
I think your suggestion of splitting the URL to find useful bits and then querying the database for just those items sounds like a decent way to go.
I threw together some Java that might help illustrate what I think this would entail, code-wise. The most valuable parts are probably the regexes, but I hope the general algorithm helps as well:
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.List;

public class CategoryParser
{
    /** The db field that keywords should be checked against */
    private static final String DB_KEYWORD_FIELD_NAME = "keyword";

    /** The db field that categories should be pulled from */
    private static final String DB_CATEGORY_FIELD_NAME = "category";

    /** The name of the table to query */
    private static final String DB_TABLE_NAME = "KeywordCategoryMap";

    /**
     * This method takes a URL and from that text alone determines what categories that URL belongs in.
     * @param url - String URL to categorize
     * @return categories - A List<String> of categories the URL seemingly belongs in
     */
    public static List<String> getCategoriesFromUrl(String url) {
        // Clean the URL to remove useless bits and encoding artifacts
        String normalizedUrl = normalizeURL(url);

        // Break the url apart and get the good stuff
        String[] keywords = tokenizeURL(normalizedUrl);

        // Construct the query we can query the database with
        String query = constructKeywordCategoryQuery(keywords);

        System.out.println("Generated Query: " + query);

        // At this point, you'd need to fire this query off to your database,
        // and the results you'd get back should each be a valid category
        // for your URL. This code is not provided because it's very implementation specific,
        // and you already know how to deal with databases.

        // Returning null to make this compile, even though you'd obviously want to return the
        // actual List of Strings
        return null;
    }

    /**
     * Removes the protocol, if it exists, from the front and
     * removes any random encoding characters.
     * Extend this to do other url cleaning/pre-processing.
     * @param url - The String URL to normalize
     * @return normalizedUrl - The String URL that has no junk or surprises
     */
    private static String normalizeURL(String url)
    {
        // Decode URL to remove any %20 type stuff
        String normalizedUrl = url;
        try {
            // I've used a URLDecoder that's part of Java here,
            // but this functionality exists in most modern languages
            // and is universally called url decoding
            normalizedUrl = URLDecoder.decode(url, "UTF-8");
        }
        catch (UnsupportedEncodingException uee)
        {
            System.err.println("Unable to Decode URL. Decoding skipped.");
            uee.printStackTrace();
        }

        // Remove the protocol, http:// ftp:// or similar from the front
        if (normalizedUrl.contains("://"))
        {
            normalizedUrl = normalizedUrl.split(":\\/\\/")[1];
        }

        // Room here to do more pre-processing

        return normalizedUrl;
    }

    /**
     * Takes apart the url into the pieces that make at least some sense.
     * This doesn't guarantee that each token is a potentially valid keyword, however,
     * because that would require actually iterating over them again, which might be
     * seen as a waste.
     * @param url - Url to be tokenized
     * @return tokens - A String array of all the tokens
     */
    private static String[] tokenizeURL(String url)
    {
        // I assume that we're going to use the whole URL to find tokens in.
        // If you want to just look in the GET parameters, or you want to ignore the domain,
        // or you want to use the domain as a token itself, that would have to be
        // processed above the next line, and only the remaining parts split.
        String[] tokens = url.split("\\b|_");

        // One could alternatively use a more complex regex to remove more invalid matches,
        // but this is subject to your (?:in)?ability to actually write the regex you want.
        // These next two get rid of tokens that are too short, also.

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 character long
        //String[] tokens = url.split("(?:[\\W_]+\\w)*[\\W_]+");

        // Destroys anything that's not alphanumeric and things that are
        // alphanumeric but only 1 or 2 characters long
        //String[] tokens = url.split("(?:[\\W_]+\\w{1,2})*[\\W_]+");

        return tokens;
    }

    private static String constructKeywordCategoryQuery(String[] keywords)
    {
        // This will hold our WHERE body, keyword OR keyword2 OR keyword3
        StringBuilder whereItems = new StringBuilder();

        // Potential query, if we find anything valid
        String query = null;

        // Iterate over every found token
        for (String keyword : keywords)
        {
            // Reject invalid keywords
            if (isKeywordValid(keyword))
            {
                // If we need an OR
                if (whereItems.length() > 0)
                {
                    whereItems.append(" OR ");
                }

                // Simply append this item to the query.
                // Yields something like "keyword='thisKeyword'".
                // NOTE: concatenating values like this is for illustration only;
                // a real implementation should use a PreparedStatement with
                // placeholders to avoid SQL injection.
                whereItems.append(DB_KEYWORD_FIELD_NAME);
                whereItems.append("='");
                whereItems.append(keyword);
                whereItems.append("'");
            }
        }

        // If a valid keyword actually made it into the query
        if (whereItems.length() > 0)
        {
            query = "SELECT DISTINCT(" + DB_CATEGORY_FIELD_NAME + ") FROM " + DB_TABLE_NAME
                    + " WHERE " + whereItems.toString() + ";";
        }

        return query;
    }

    private static boolean isKeywordValid(String keyword)
    {
        // Keywords better be at least 2 characters long
        return keyword.length() > 1
                // And they better be only composed of letters and numbers
                && keyword.matches("\\w+")
                // And they better not be *just* numbers
                // && !keyword.matches("\\d+") // If you want this
                ;
    }

    // How this would be used
    public static void main(String[] args)
    {
        List<String> soQuestionUrlClassifications = getCategoriesFromUrl("http://stackoverflow.com/questions/10046178/pattern-matching-for-url-classification");
        List<String> googleQueryURLClassifications = getCategoriesFromUrl("https://www.google.com/search?sugexp=chrome,mod=18&sourceid=chrome&ie=UTF-8&q=spring+is+a+new+service+instance+created#hl=en&sugexp=ciatsh&gs_nf=1&gs_mss=spring%20is%20a%20new%20bean%20instance%20created&tok=lnAt2g0iy8CWkY65Te75sg&pq=spring%20is%20a%20new%20bean%20instance%20created&cp=6&gs_id=1l&xhr=t&q=urlencode&pf=p&safe=off&sclient=psy-ab&oq=url+en&gs_l=&pbx=1&bav=on.2,or.r_gc.r_pw.r_cp.r_qf.,cf.osb&fp=2176d1af1be1f17d&biw=1680&bih=965");
    }
}
The query generated for the SO link looks like this:
SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword='stackoverflow' OR keyword='com' OR keyword='questions' OR keyword='10046178' OR keyword='pattern' OR keyword='matching' OR keyword='for' OR keyword='url' OR keyword='classification'
There is plenty of room for optimization, but I would guess it is much faster than checking the string for every possible keyword.
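One caveat worth adding: building the WHERE clause by concatenating tokens leaves the query open to SQL injection if anything unexpected survives validation. A safer variant (a sketch only; table and field names are taken from the code above, and the JDBC execution step is shown as comments since it depends on your connection setup) builds a parameterized IN-clause and binds the keywords as parameters:

```java
import java.util.*;

public class ParameterizedQueryDemo {
    // Builds a parameterized IN-clause query with one placeholder per keyword;
    // the actual values are bound later via PreparedStatement.setString rather
    // than being concatenated into the SQL text.
    static String buildQuery(int keywordCount) {
        StringJoiner placeholders = new StringJoiner(", ");
        for (int i = 0; i < keywordCount; i++) {
            placeholders.add("?");
        }
        return "SELECT DISTINCT(category) FROM KeywordCategoryMap"
                + " WHERE keyword IN (" + placeholders + ");";
    }

    public static void main(String[] args) {
        String sql = buildQuery(3);
        System.out.println(sql);
        // → SELECT DISTINCT(category) FROM KeywordCategoryMap WHERE keyword IN (?, ?, ?);

        // With JDBC one would then do something like (assuming an open Connection conn):
        // PreparedStatement ps = conn.prepareStatement(sql);
        // ps.setString(1, "stackoverflow");
        // ps.setString(2, "questions");
        // ps.setString(3, "classification");
        // ResultSet rs = ps.executeQuery();
    }
}
```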
Answer 2 (score: 1)
The Aho-Corasick algorithm is the best fit here: it finds all keyword occurrences within a string in a single pass. Build a trie of your keywords (the Aho-Corasick automaton), and have each terminal node store a number that maps to a particular keyword.
Now simply run the URL string through the trie. Whenever you reach a node carrying such a number (acting as a flag in our scenario), you have found a mapped keyword; look that number up in a hash map to find the corresponding category for further use.
I think this will help you. See this link: good animation of aho-corasick by ivan
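A minimal Aho-Corasick sketch in Java to illustrate the idea (class and method names are invented for this example, and a production system would more likely map keywords to category identifiers and use an existing library):

```java
import java.util.*;

public class AhoCorasickDemo {
    // One trie node: children, failure link, and the keywords that end here
    // (including those inherited from shorter suffixes via the failure link).
    static class Node {
        Map<Character, Node> next = new HashMap<>();
        Node fail;
        List<String> output = new ArrayList<>();
    }

    // Builds the keyword trie, then wires up failure links breadth-first.
    static Node buildAutomaton(List<String> keywords) {
        Node root = new Node();
        for (String kw : keywords) {
            Node cur = root;
            for (char c : kw.toCharArray()) {
                cur = cur.next.computeIfAbsent(c, k -> new Node());
            }
            cur.output.add(kw);
        }
        Deque<Node> queue = new ArrayDeque<>();
        for (Node child : root.next.values()) {
            child.fail = root;
            queue.add(child);
        }
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            for (Map.Entry<Character, Node> e : node.next.entrySet()) {
                char c = e.getKey();
                Node child = e.getValue();
                Node f = node.fail;
                while (f != null && !f.next.containsKey(c)) f = f.fail;
                child.fail = (f == null) ? root : f.next.get(c);
                child.output.addAll(child.fail.output);  // inherit suffix matches
                queue.add(child);
            }
        }
        return root;
    }

    // Single pass over the text: follow edges, fall back on mismatch,
    // and collect every keyword that ends at the current position.
    static Set<String> matchKeywords(Node root, String text) {
        Set<String> found = new LinkedHashSet<>();
        Node cur = root;
        for (char c : text.toCharArray()) {
            while (cur != root && !cur.next.containsKey(c)) cur = cur.fail;
            cur = cur.next.getOrDefault(c, root);
            found.addAll(cur.output);
        }
        return found;
    }

    public static void main(String[] args) {
        Node root = buildAutomaton(Arrays.asList("sports", "cnn", "world"));
        System.out.println(matchKeywords(root, "http://cnnworld/sports/abcd"));
        // → [cnn, world, sports]
    }
}
```

Each matched keyword would then be looked up in the keyword-to-category map, which also handles URLs belonging to multiple categories in one pass.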
Answer 3 (score: 0)
If you have (many) fewer categories than keywords, you could create one regex per category that matches any of that category's keywords. You would then run your URL against each category's regex. This also handles the case of a URL matching multiple categories.
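A rough sketch of this per-category regex idea (the category names and keyword lists are invented for the example):

```java
import java.util.*;
import java.util.regex.*;

public class CategoryRegexDemo {
    // Returns every category whose keyword alternation matches somewhere in the URL.
    static List<String> categorize(String url, Map<String, List<String>> categories) {
        List<String> matched = new ArrayList<>();
        for (Map.Entry<String, List<String>> e : categories.entrySet()) {
            // Build one alternation per category: kw1|kw2|kw3, with each
            // keyword quoted so regex metacharacters are treated literally.
            StringJoiner alternation = new StringJoiner("|");
            for (String kw : e.getValue()) {
                alternation.add(Pattern.quote(kw));
            }
            Pattern p = Pattern.compile(alternation.toString(), Pattern.CASE_INSENSITIVE);
            if (p.matcher(url).find()) {
                matched.add(e.getKey());
            }
        }
        return matched;
    }

    public static void main(String[] args) {
        // Hypothetical category -> keyword lists; in practice these would
        // come from the keyword -> category database, grouped by category.
        Map<String, List<String>> categories = new LinkedHashMap<>();
        categories.put("Sports", Arrays.asList("sports", "football", "nba"));
        categories.put("News", Arrays.asList("cnn", "world", "breaking"));

        System.out.println(categorize("http://cnnworld/sports/abcd", categories));
        // → [Sports, News]
    }
}
```

In real use the patterns would be compiled once up front rather than per URL, since regex compilation dominates the cost for large keyword sets.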