从字符串或列表创建字典

时间:2020-02-11 12:58:41

标签: python dictionary hashtable

背景

我想为给定的字符串或给定的列表生成哈希表。哈希表将元素视为key,将显示时间视为value。例如:

s = 'ababcd'
s = ['a', 'b', 'a', 'b', 'c', 'd']
dict_I_want = {'a':2,'b':2, 'c':1, 'd':1}

我的尝试

# method 1
from collections import Counter
s = 'ababcd' 
hash_table1 = Counter(s)

# method 2
s = 'ababdc' 
hash_table2 = dict()
for i in s:
    if hash_table2.get(i) == None:
        hash_table2[i] = 1
    else:
        hash_table2[i] += 1
hash_table1 == hash_table2

通常,我使用上面的2种方法。一种来自标准库,但在某些代码练习网站中是不允许的。另一个是从头开始编写的,但我认为它太长了。如果我使用dict理解,我会提出另外两种方法:

{i:s.count(i) for i in set(s)}
{i:s.count(i) for i in s}

问题

我想知道是否还有其他方法可以清楚或有效地从列表字符串初始化哈希表?

提到的我的4种方法的速度比较

from collections import Counter
import random,string,numpy,perfplot

def from_set(s):
    return {i:s.count(i) for i in set(s)}

def from_string(s):
    return {i:s.count(i) for i in s}

def handy(s):
    hash_table2 = dict()
    for i in s:
        if hash_table2.get(i) == None:
            hash_table2[i] = 1
        else:
            hash_table2[i] += 1
    return hash_table2

def counter(s):
    return Counter(s)

perfplot.show(
    setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set,from_string,handy,counter],
    labels=['set','string','handy','counter'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
)

Speed comparation of methods

2 个答案:

答案 0 :(得分:2)

最好的方法是使用内置计数器,否则,您可以使用defualtdict,这与您的第二次尝试非常相似

from collections import defualtdict

d = defualtdict(int) # this makes every value 0 by defualt
for letter in string:
    d[letter] +=1

答案 1 :(得分:1)

我通常使用 Counter defaultdict 来创建出现频率。

令人惊讶的是,发帖人的from_set方法在大多数情况下都表现出色。

观察

  1. from_set(标记为“ set”)的总体效果最好
  2. 各种字典方法仅适用于较小的字符串长度(即 <100)
  3. Counter方法仅适用于一小段字符串长度。
  4. 对于大字符串,
  5. from_set比defaultdict快2.3倍,比Counter快1.5倍

算法

from collections import Counter
from collections import defaultdict

import random,string,numpy,perfplot

def from_set(s):
    " Use builtin count function for each item in set "
    return {i:s.count(i) for i in set(s)}

def counter(s):
    " Uses counter module "
    return Counter(s)

def normal_dic(s):
  " Update dictionary by checking if item in it or not "
  d = {}
  for i in s:
    if i in d:
      d[i] += 1
    else:
      d[i] = 1

  return d

def setdefault_dic(s):
  " Use setdefault to preset unknown keys "
  d = {}
  for i in s:
    d.setdefault(i, 0)
    d[i] += 1

  return d

def default_dic(s):
    " Used defaultdict from collections module "
    d = defaultdict(int)
    for i in s:
        d[i] += 1
    return d

def try_dic(s):
    " Use try/except to check if item in dictionary "
    d = {}
    for i in s:
        try:
            d[i] += 1
        except:
            d[i] = 1

    return d

测试代码

Uses Perfplot Module

out = perfplot.bench(
   setup=lambda n: ''.join(random.choices(string.ascii_uppercase + string.digits, k=n)),  # or simply setup=numpy.random.rand
    kernels=[from_set, counter, setdefault_dic, default_dic, try_dic],
    labels=['set', 'counter', 'setdefault', 'defaultdict', 'try_dic'],
    n_range=[2 ** k for k in range(17)],
    xlabel="len(string)",
    equality_check= None
    # More optional arguments with their default values:
    # title=None,
    # logx="auto",  # set to True or False to force scaling
    # logy="auto",
    # equality_check=numpy.allclose,  # set to None to disable "correctness" assertion
    # automatic_order=True,
    # colors=None,
    # target_time_per_measurement=1.0,
    # time_unit="s",  # set to one of ("auto", "s", "ms", "us", or "ns") to force plot units
    # relative_to=1,  # plot the timings relative to one of the measurements
    # flops=lambda n: 3*n,  # FLOPS plots
    )
out.show()
#out.save("perf.png")
out

图表

绝对值

图中的

from_set标签'set'。在下面的相对图中,相对性能要比此绝对图容易。

Absolute Values

相对值

图中的

from_set标签'set'。

from_set方法是水平线。对于更大的值,包括Counter和defaultdict在内的所有其他方法都在上面(耗时)。

Relative Values

表格

实际时间

       n  setdefault     try_dic  defaultdict    counter    from_set
     1.0       799.0       899.0       1299.0     6099.0     1399.0
     2.0      1099.0      1199.0       1599.0     6299.0     1699.0
     4.0      1699.0      1699.0       2199.0     6299.0     2399.0
     8.0      3199.0      3099.0       3499.0     6899.0     3699.0
    16.0      6099.0      5499.0       5899.0     7899.0     5900.0
    32.0     10899.0      9299.0       9899.0     8999.0    10299.0
    64.0     20799.0     15599.0      15999.0    11899.0    15099.0
   128.0     38499.0     25499.0      25899.0    16599.0    21899.0
   256.0     73100.0     44099.0      42700.0    26299.0    30299.0
   512.0    137999.0     77099.0      72699.0    43199.0    46699.0
  1024.0    286599.0    154500.0     144099.0    85700.0    79699.0
  2048.0    549700.0    289999.0     266799.0   157499.0   145699.0
  4096.0   1103899.0    577399.0     535499.0   309399.0   278999.0
  8192.0   2200099.0   1151500.0    1051799.0   606999.0   542499.0
 16384.0   4658199.0   2534399.0    2295300.0  1414199.0  1087799.0
 32768.0   9509200.0   5270200.0    4838000.0  3066600.0  2177200.0
 65536.0  19539500.0  10806300.0    9942100.0  6503299.0  4337599.0