Question

I have several hundred thousand endpoint URLs that I want to generate statistics for. For example, I have:

/a/b/c
/a/b/d
/a/c/d
/b/c/d
/b/d/e
/a/b/c
/b/c/d

I want to create a dictionary that looks like this:

{
    'a': {
        'b': {'c': 2, 'd': 1},
        'c': {'d': 1}
    },
    'b': {
        'c': {'d': 2},
        'd': {'e': 1}
    }
}

Is there a clever way to do this?

Edit

I should have mentioned that the paths do not always have 3 parts. There might be /a/b/c/d/e/f/g/h ... and so on.

Answer 1

If the paths all look like the ones in your example, this will work:

counts = {}
for p in paths:
   parts = p.split('/')
   branch = counts
   for part in parts[1:-1]:
      branch = branch.setdefault(part, {})
   branch[parts[-1]] = 1 + branch.get(parts[-1], 0)

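This uses dictionary methods such as setdefault() and get() to avoid writing a lot of if statements.

Note that this does not work if a path that has subdirectories can also appear on its own. In that case it is unclear whether the corresponding part of counts should hold a number or another dictionary. It is then probably best to use a tuple or a custom class that stores both a count and a dict for each node.

The basic algorithm stays the same: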
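To make the limitation concrete, here is a small sketch that wraps the loop above in a function and runs it on the sample paths from the question, then on two inputs where /a/b is both an endpoint and a prefix of /a/b/c. The names count_paths and fails are made up for this illustration, and which exception you hit depends on the order of the paths.

```python
def count_paths(paths):
    """Nested-dict counting, same logic as the loop above."""
    counts = {}
    for p in paths:
        parts = p.split('/')
        branch = counts
        for part in parts[1:-1]:
            branch = branch.setdefault(part, {})
        branch[parts[-1]] = 1 + branch.get(parts[-1], 0)
    return counts

sample = ['/a/b/c', '/a/b/d', '/a/c/d', '/b/c/d', '/b/d/e', '/a/b/c', '/b/c/d']
result = count_paths(sample)
# result == {'a': {'b': {'c': 2, 'd': 1}, 'c': {'d': 1}},
#            'b': {'c': {'d': 2}, 'd': {'e': 1}}}

def fails(paths):
    """Return the name of the exception raised by count_paths, or None."""
    try:
        count_paths(paths)
    except (TypeError, AttributeError) as exc:
        return type(exc).__name__
    return None

# Longer path first: 'b' already holds a dict, so 1 + dict raises TypeError.
print(fails(['/a/b/c', '/a/b']))
# Shorter path first: 'b' already holds an int, so int.get raises AttributeError.
print(fails(['/a/b', '/a/b/c']))
```

Either way, the slot for b would have to hold a number and a dictionary at the same time, which is why a per-node count is needed.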

class Stats(object):
   def __init__(self):
      self.count = 0
      self.subdirs = {}

counts = Stats()
for p in paths:
   parts = p.split('/')
   branch = counts
   for part in parts[1:]:
      branch = branch.subdirs.setdefault(part, Stats())
   branch.count += 1

With a bit of pretty-printing, you get:

def printstats(stats, indent=''):
   # print() with a single argument behaves the same on Python 2 and 3
   print(indent + str(stats.count) + ' times')
   for (d, s) in stats.subdirs.items():
       print(indent + d + ':')
       printstats(s, indent + '  ')

>>> printstats(counts)
0 times
a:
  0 times
  c:
    0 times
    d:
      1 times
  b:
    0 times
    c:
      2 times
    d:
      1 times
...
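If you need plain nested dictionaries at the end (for example to serialize the result), one possible sketch is to flatten the Stats tree afterwards and keep each node's own count under a reserved key. The helper names build and to_dict and the key '_count' are assumptions made for this example; it only works if no real path component is called '_count'.

```python
class Stats(object):
    def __init__(self):
        self.count = 0
        self.subdirs = {}

def build(paths):
    """Build the Stats tree with the same loop as above."""
    root = Stats()
    for p in paths:
        branch = root
        for part in p.split('/')[1:]:
            branch = branch.subdirs.setdefault(part, Stats())
        branch.count += 1
    return root

def to_dict(stats, count_key='_count'):
    """Convert a Stats tree to nested dicts; count_key is a made-up name."""
    out = {name: to_dict(child, count_key)
           for name, child in stats.subdirs.items()}
    if stats.count:
        # Only nodes that were actually hit as an endpoint carry a count.
        out[count_key] = stats.count
    return out

# '/a/b' is both an endpoint and a prefix of '/a/b/c'.
tree = build(['/a/b', '/a/b/c', '/a/b/c', '/a/d'])
flat = to_dict(tree)
# flat == {'a': {'b': {'_count': 1, 'c': {'_count': 2}}, 'd': {'_count': 1}}}
```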

Answer 2

Edit

I modified my code to fit your comment above (there is no complex data structure now).

def dictizeString(string, dictionary):
    # Strip any leading slashes
    while string.startswith('/'):
        string = string[1:]
    parts = string.split('/', 1)
    if len(parts) > 1:
        # More components follow: descend into (or create) the sub-dictionary
        branch = dictionary.setdefault(parts[0], {})
        dictizeString(parts[1], branch)
    else:
        if parts[0] in dictionary:
            # If there's an addition error here, it's because invalid data was added
            dictionary[parts[0]] += 1
        else:
            dictionary[parts[0]] = 1

Each final path component stores its frequency as a plain integer, and each intermediate component stores a nested dictionary.

Test case

>>> d = {}
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/c/d', d)
>>> dictizeString('/a/b/e', d)
>>> dictizeString('/c', d)
>>> d
{'a': {'b': {'c': {'d': 4}, 'e': 1}}, 'c': 1}

Answer 3

This is an old question, but it is still near the top of the Google results, so I'll update it: you can use dpath-python for this.

$ easy_install dpath
>>> import dpath.util
>>> result = {}
>>> for path in my_list_of_paths:
...     # new() creates missing intermediate keys; set() only touches existing ones
...     dpath.util.new(result, path, SOME_VALUE)

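... and that's it. I don't understand the math you're using to precompute the values at the terminus (1, 2, etc.), but you could precompute them and use a dictionary of path-to-value instead of a bare list: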

>>> x = {'path/name': 0, 'other/path/name': 1}
>>> for (path, value) in x.items():
...     dpath.util.new(result, path, value)

Something like that would work.
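Since the values in the question are just how often each path occurs, the path-to-value dictionary can be precomputed with collections.Counter. The sketch below uses only the standard library; set_by_path is a hypothetical stand-in for a set-by-path helper and is not part of dpath. It assumes every path has at least one component and that no path is a prefix of another.

```python
from collections import Counter

def set_by_path(tree, path, value):
    """Hypothetical helper: store value at the nested location named by path."""
    parts = [p for p in path.split('/') if p]
    node = tree
    for part in parts[:-1]:
        # Create intermediate dictionaries as needed.
        node = node.setdefault(part, {})
    node[parts[-1]] = value

paths = ['/a/b/c', '/a/b/d', '/a/b/c']
precomputed = Counter(paths)  # maps each full path to its frequency

result = {}
for path, value in precomputed.items():
    set_by_path(result, path, value)
# result == {'a': {'b': {'c': 2, 'd': 1}}}
```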

Answer 4

Here is my attempt:

class Result(object):
    def __init__(self):
        self.count = 0
        self._sub_results = {}

    def __getitem__(self, key):
        if key not in self._sub_results:
            self._sub_results[key] = Result()
        return self._sub_results[key]

    def __str__(self):
        return "(%s, %s)" % (self.count, self._sub_results)

    def __repr__(self):
        return str(self)

def process_paths(paths):
    path_result = Result()
    for path in paths:
        components = path[1:].split("/")
        local_result = path_result
        for component in components:
            local_result = local_result[component]
        local_result.count += 1
    return path_result

I have wrapped some of the logic in the Result class to try to make the algorithm itself clearer.
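The auto-creating __getitem__ above behaves much like collections.defaultdict, so a similar tree can be sketched without a custom class. This is an alternative take rather than the code from this answer; make_node and count_paths are names invented for the example, and it strips slashes with strip('/') instead of path[1:].

```python
from collections import defaultdict

def make_node():
    # Each node keeps its own count plus children that are created on first access.
    return {"count": 0, "children": defaultdict(make_node)}

def count_paths(paths):
    root = make_node()
    for path in paths:
        node = root
        for component in path.strip("/").split("/"):
            node = node["children"][component]
        node["count"] += 1
    return root

root = count_paths(["/a/b", "/a/b/c", "/a/b"])
b = root["children"]["a"]["children"]["b"]
# b["count"] == 2 and b["children"]["c"]["count"] == 1
```

Because every node has a count of its own, a path that is both an endpoint and a prefix of a longer path is handled without any special casing.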

Python: recursively create a dictionary from paths
