将分层关键字与文档匹配

时间:2012-05-16 10:17:51

标签: python hierarchical-data

我有一个分层关键字树,表示为元组列表,其中第一个参数是“路径”,第二个是相应的关键字:

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]

列出连接'路径'和相应的文档(一个文档可以有多个'路径':

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]

我希望将每个文档与关键字匹配,并产生如下结果:

docdict={doc1:[('key1','key2','key3'),('key1','key4','key5')],doc2:[('key1','key2')]}

我的问题是如何最有效地获取所有(父)关键字?提前谢谢!

4 个答案:

答案 0 :(得分:1)

这几乎可以满足您的需求:

>>> docdict = {doc[-1]:[key[-1] for key in keys if doc[0].startswith(key[0])] for doc in docs}
>>> docdict
{'doc2': ['key1', 'key2'], 'doc1': ['key1', 'key4', 'key5']}

这正是您指定的内容:

>>> docdict = {}
>>> for doc in docs:
    docdict.setdefault(doc[-1],[]).append(tuple(key[-1] for key in keys if doc[0].startswith(key[0])))  
>>> docdict
{'doc2': [('key1', 'key2')], 'doc1': [('key1', 'key2', 'key3'), ('key1', 'key4', 'key5')]}

两者都是O(n*m)

答案 1 :(得分:1)

一个更易读的答案,如果你有很多这些答案,可能会更好地扩展。

docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]

keydict = dict(keys)
resultDict = {}

for doc in docs:
    (path, docname) = doc
    pathList = path.split(',')
    keyPath = []
    for i in range(0, len(pathList)):
        aPath = ','.join(pathList[:i+1])
        keyPath.append(keydict[aPath])

    if docname not in resultDict :
        resultDict[docname] = []
    resultDict[docname].append(tuple(keyPath))

print resultDict  

答案 2 :(得分:1)

这也是另一种解决方案:

keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]

def get_path(p):
    # tuples so that you can use them as dict keys
    return tuple(p.split(','))

# we need to find the keys based on the paths, so make the path the dict's key
keypaths = {get_path(p): key for p, key in keys}

docdict = {}
for p, doc in docs:
    path = get_path(p) # we need the path as a tuple or list, so that you can get the parents via slicing
    # get all parents of the path and the path itself.
    # we remove one part of the path at a time and keep the original path also
    all_paths = [path]+[path[:-i] for i in range(1,len(path))]
    # you need to keep each doc/path combination alone, so you need a list to store it
    if doc not in docdict:
        docdict[doc] = []
    # add the keys whose path is in one of the parents or the path itself
    docdict[doc].append([keypaths[path] for path in all_paths if path in keypaths])

print docdict # now, you see what you expect. :)
坦率地说,对于所有这些单行代码,代码变得不可读。所以,如果你同意,你应该更喜欢这个解决方案。

答案 3 :(得分:0)

编辑:我首先创建了一个从关键字获取直接父级的方法,但这不符合要求。根据我的理解,从路径中获取所有父关键字要好得多:

>>> def get_parents(keys, path):
    """ Get all parent keywords """
    # Get parent path (remove after last ',')
    parent_paths = [path]
    while ',' in path:
        path = ','.join(path.split(',')[:-1])
        parent_paths.append(path)
    return [key for path, key in keys if path in parent_paths]

>>> get_parents(keys, '0,2')
['key1', 'key4']
>>> from collections import defaultdict
>>> def create_doc_dict(keys, docs):
    """ Create the required dict """
    docdict = defaultdict(list)
    for path, doc in docs:
        docdict[doc].append(get_parents(keys, path))
    return docdict

>>> create_doc_dict(keys, docs)
defaultdict(<type 'list'>, {'doc2': [['key1', 'key2']], 'doc1': [['key1', 'key2', 'key3'], ['key1', 'key4', 'key5']]})