我有一个分层关键字树,表示为元组列表,其中第一个参数是“路径”,第二个是相应的关键字:
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]
列出连接'路径'和相应的文档(一个文档可以有多个'路径':
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]
我希望将每个文档与关键字匹配,并产生如下结果:
docdict={doc1:[('key1','key2','key3'),('key1','key4','key5')],doc2:[('key1','key2')]}
我的问题是如何最有效地获取所有(父)关键字?提前谢谢!
答案 0 :(得分:1)
这几乎可以满足您的需求:
>>> docdict = {doc[-1]:[key[-1] for key in keys if doc[0].startswith(key[0])] for doc in docs}
>>> docdict
{'doc2': ['key1', 'key2'], 'doc1': ['key1', 'key4', 'key5']}
这正是您指定的内容:
>>> docdict = {}
>>> for doc in docs:
docdict.setdefault(doc[-1],[]).append(tuple(key[-1] for key in keys if doc[0].startswith(key[0])))
>>> docdict
{'doc2': [('key1', 'key2')], 'doc1': [('key1', 'key2', 'key3'), ('key1', 'key4', 'key5')]}
两者都是O(n*m)
。
答案 1 :(得分:1)
一个更易读的答案,如果你有很多这些答案,可能会更好地扩展。
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]
keydict = dict(keys)
resultDict = {}
for doc in docs:
(path, docname) = doc
pathList = path.split(',')
keyPath = []
for i in range(0, len(pathList)):
aPath = ','.join(pathList[:i+1])
keyPath.append(keydict[aPath])
if docname not in resultDict :
resultDict[docname] = []
resultDict[docname].append(tuple(keyPath))
print resultDict
答案 2 :(得分:1)
这也是另一种解决方案:
keys = [('0','key1'),('0,1','key2'),('0,1,12','key3'),('0,2','key4'),('0,2,30','key5')]
docs = [('0,1,12','doc1'),('0,2,30','doc1'),('0,1','doc2')]
def get_path(p):
# tuples so that you can use them as dict keys
return tuple(p.split(','))
# we need to find the keys based on the paths, so make the path the dict's key
keypaths = {get_path(p): key for p, key in keys}
docdict = {}
for p, doc in docs:
path = get_path(p) # we need the path as a tuple or list, so that you can get the parents via slicing
# get all parents of the path and the path itself.
# we remove one part of the path at a time and keep the original path also
all_paths = [path]+[path[:-i] for i in range(1,len(path))]
# you need to keep each doc/path combination alone, so you need a list to store it
if doc not in docdict:
docdict[doc] = []
# add the keys whose path is in one of the parents or the path itself
docdict[doc].append([keypaths[path] for path in all_paths if path in keypaths])
print docdict # now, you see what you expect. :)
坦率地说,对于所有这些单行代码,代码变得不可读。所以,如果你同意,你应该更喜欢这个解决方案。
答案 3 :(得分:0)
编辑:我首先创建了一个从关键字获取直接父级的方法,但这不符合要求。根据我的理解,从路径中获取所有父关键字要好得多:
>>> def get_parents(keys, path):
""" Get all parent keywords """
# Get parent path (remove after last ',')
parent_paths = [path]
while ',' in path:
path = ','.join(path.split(',')[:-1])
parent_paths.append(path)
return [key for path, key in keys if path in parent_paths]
>>> get_parents(keys, '0,2')
['key1', 'key4']
>>> from collections import defaultdict
>>> def create_doc_dict(keys, docs):
""" Create the required dict """
docdict = defaultdict(list)
for path, doc in docs:
docdict[doc].append(get_parents(keys, path))
return docdict
>>> create_doc_dict(keys, docs)
defaultdict(<type 'list'>, {'doc2': [['key1', 'key2']], 'doc1': [['key1', 'key2', 'key3'], ['key1', 'key4', 'key5']]})