Grouping similar URLs / finding common URL patterns (Python)

Time: 2016-07-14 22:30:57

Tags: python parsing url

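I have about 100,000 URLs, each labelled as positive or negative. I want to see which types of URL correspond to the positive label (and likewise for the negative one).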

I started by grouping the URLs by subdomain and identifying the most common positive and negative subdomains.
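The grouping step described above could look roughly like the following. This is only a sketch: the function name `count_labels_by_host`, the `(url, label)` tuple input, and the labels attached to the sample URLs are assumptions made for illustration, not the asker's actual code or data.

```python
from collections import Counter, defaultdict
from urllib.parse import urlsplit


def count_labels_by_host(labelled_urls):
    """Return {hostname: Counter of labels} for (url, label) pairs."""
    counts = defaultdict(Counter)
    for url, label in labelled_urls:
        # hostname is None for malformed URLs, so fall back to an empty string
        host = urlsplit(url).hostname or ""
        counts[host][label] += 1
    return counts


# Hypothetical sample; the labels are invented for the example.
sample = [
    ("http://www.clarin.com/politica/", "positive"),
    ("http://www.clarin.com/buscador?q=protesta", "negative"),
    ("http://deportes.example.com/futbol/", "negative"),
]

by_host = count_labels_by_host(sample)
for host, c in sorted(by_host.items()):
    total = c["positive"] + c["negative"]
    print(host, c["positive"], c["negative"], 100.0 * c["negative"] / total)
```

Hosts whose negative share sits near 50% are the ones worth dissecting further by path.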

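Now, for subdomains with a roughly equal ratio of positive to negative URLs, I want to dig further and look for patterns. Example patterns: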

http://www.clarin.com/politica/ (pattern: domain/section)
http://www.clarin.com/tema/manifestaciones.html (pattern: domain/tag/tag_name)
http://www.clarin.com/buscador?q=protesta (pattern: domain/search?q=search_term)

The links are not limited to clarin.com.

Any suggestions on how to discover such patterns?

1 answer:

Answer 0: (score: 0)

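Solved this. I ended up taking a hint from the "finding the largest common substring" problem.

The solution builds a parse tree from the characters of each URL. Each node in the tree stores positive, negative, and total counts. Finally, the tree is pruned to return the most common patterns. In the code, the tree is stored implicitly as a dictionary keyed by lowercased URL prefix.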

Code:

def find_patterns(incoming_urls):
    urls = {}
    # make the tree
    for entry in incoming_urls:
        # each entry is assumed to look like "<url>____<class>", where class is 'positive' or 'negative'
        url, atype = entry.strip().split("____")
        # use at most the first 100 characters to avoid building a sparse tree
        bound = min(len(url), 100) + 1
        for x in range(1, bound):
            if url[:x].lower() not in urls:
                urls[url[:x].lower()] = {'positive': 0, 'negative': 0, 'total': 0}
            urls[url[:x].lower()][atype] += 1
            urls[url[:x].lower()]['total'] += 1

    new_urls = {}
    # prune the tree
    for url in urls:
        if urls[url]['total'] < 5:  # For something to be called as common pattern, there should be at least 5 occurrences of it.
            continue
        urls[url]['negative_percentage'] = (float(urls[url]['negative']) * 100) / urls[url]['total']
        if urls[url]['negative_percentage'] < 85.0: # Assuming I am interested in finding url patterns for negative class
            continue
        length = len(url)
        found = False
        # check whether any longer prefix extends this one with the same total count
        for second in urls:
            if len(second) <= length:
                continue
            if url == second[:length] and urls[url]['total'] == urls[second]['total']:
                found = True
                break
        # keep only maximal prefixes longer than 20 characters
        if not found and len(url) > 20:
            new_urls[url] = urls[url]

    # parenthesised single-argument print works on both Python 2 and Python 3
    print("URL Pattern; Positive; Negative; Total; Negative (%)")
    for url in new_urls:
        print("%s; %d; %d; %d; %.2f" % (
            url, new_urls[url]['positive'], new_urls[url]['negative'], new_urls[url]['total'],
            new_urls[url]['negative_percentage']))
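
The answer describes a tree, while the code above keeps every prefix as a dictionary key and prunes by comparing all pairs of prefixes. As an alternative, here is a minimal sketch of an explicit character trie holding the same counts. The names `Node`, `build_trie`, and `maximal_prefixes`, the `(url, label)` tuple input, and the example.com URLs are assumptions for illustration; the thresholds mirror the ones in the answer (at least 5 occurrences, at least 85% negative, more than 20 characters).

```python
class Node:
    """One character of a URL prefix, with label counts for that prefix."""
    __slots__ = ("children", "positive", "negative")

    def __init__(self):
        self.children = {}
        self.positive = 0
        self.negative = 0

    @property
    def total(self):
        return self.positive + self.negative


def build_trie(labelled_urls, max_depth=100):
    """Insert the first max_depth lowercased characters of each URL."""
    root = Node()
    for url, label in labelled_urls:
        node = root
        for ch in url.lower()[:max_depth]:
            node = node.children.setdefault(ch, Node())
            if label == "positive":
                node.positive += 1
            else:
                node.negative += 1
    return root


def maximal_prefixes(root, min_total=5, min_negative_pct=85.0, min_length=21):
    """Yield (prefix, node) for prefixes that pass the thresholds and are not
    extended by a longer prefix with the same total count."""
    stack = [("", root)]
    while stack:
        prefix, node = stack.pop()
        for ch, child in node.children.items():
            stack.append((prefix + ch, child))
        if node is root or node.total < min_total or len(prefix) < min_length:
            continue
        if 100.0 * node.negative / node.total < min_negative_pct:
            continue
        # A child with the same total means every URL here continues the same
        # way, so the longer prefix is the better representative.
        if any(child.total == node.total for child in node.children.values()):
            continue
        yield prefix, node


# Hypothetical data: six negative search URLs that differ only in the query term.
data = [("http://www.example.com/search?q=" + t, "negative") for t in "abcdef"]
root = build_trie(data)
patterns = sorted(prefix for prefix, _ in maximal_prefixes(root))
print(patterns)
```

Because counts can only shrink along a path from the root, checking the direct children is enough to tell whether a longer prefix with the same total exists, which avoids the quadratic scan over all prefixes.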