I'm iterating over a list of words to find the most commonly used characters between words (i.e. in the list [hello, hank], 'h' counts as occurring twice, while 'l' counts as occurring once). A Python list works fine, but I'm also looking into NumPy (dtype arrays?) and Pandas. It looks like NumPy may be the way to go, but are there other packages to consider? How can I make this function faster?
Code for the question:
from collections import Counter

def mostCommon(guessed, li):
    count = Counter()
    for words in li:
        for letters in set(words):
            count[letters] += 1
    return count.most_common()[:10]
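For reference, a minimal, self-contained sketch of the intended counting semantics on the example list from the question (the function name and example list here are illustrative only):

```python
from collections import Counter

def most_common_chars(words, n=10):
    # Each character counts at most once per word it appears in
    count = Counter()
    for word in words:
        count.update(set(word))
    return count.most_common(n)

print(most_common_chars(['hello', 'hank']))
# 'h' appears in both words; 'l' only in 'hello'
```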
Thanks.
Answer 0 (score: 3)
Option 1
def pir1(li):
    sets = [set(s) for s in li]
    ul = np.array(list(set.union(*sets)))
    us = np.apply_along_axis(set, 1, ul[:, None])
    # compare each singleton {char} against every word-set, then count supersets
    c = (sets >= us[:, None]).sum(1)
    a = c.argsort()[:-11:-1]
    return ul[a]
Option 2
from collections import Counter
from itertools import chain

def pir2(li):
    return Counter(chain.from_iterable([list(set(i)) for i in li])).most_common(10)
Assume a list of words li
import pandas as pd
import numpy as np
from string import ascii_lowercase

li = pd.DataFrame(
    np.random.choice(list(ascii_lowercase), (1000, 10))
).sum(1).tolist()
Including Divakar's and the OP's functions
def tabulate_occurrences(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars != '']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char': unqchars, 'count': count})

def topNchars(a, N=10):
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s != 0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]

def mostCommon(li):
    count = Counter()
    for words in li:
        for letters in set(words):
            count[letters] += 1
    return count.most_common()[:10]
Testing
import pandas as pd
import numpy as np
from string import ascii_lowercase
from timeit import timeit

results = pd.DataFrame(
    index=pd.RangeIndex(5, 405, 5, name='No. Words'),
    columns=pd.Index('pir1 pir2 mostCommon topNchars'.split(), name='Method'),
)

np.random.seed([3, 1415])
for i in results.index:
    li = pd.DataFrame(
        np.random.choice(list(ascii_lowercase), (i, 10))
    ).sum(1).tolist()
    for j in results.columns:
        v = timeit(
            '{}(li)'.format(j),
            'from __main__ import {}, li'.format(j),
            number=100
        )
        results.set_value(i, j, v)  # on pandas >= 1.0 use: results.loc[i, j] = v

ax = results.plot(title='Time Testing')
ax.set_ylabel('Time of 100 iterations')
Answer 1 (score: 3)
Here's one approach with NumPy, using its views-concept -
def tabulate_occurrences(a):  # Case sensitive
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars != '']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char': unqchars, 'count': count})

def topNchars(a, N=10):  # Case insensitive
    s = np.core.defchararray.lower(a).view('uint8')
    unq, count = np.unique(s[s != 0], return_counts=1)
    sidx = count.argsort()[-N:][::-1]
    h = unq[sidx]
    return [str(chr(i)) for i in h]  # chr, not unichr, on Python 3
Sample run -
In [322]: a = ['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']
In [323]: tabulate_occurrences(a) # Case sensitive
Out[323]:
char count
0 I 3
1 S 2
2 e 2
3 i 1
4 o 1
5 r 2
6 s 2
7 u 1
8 y 1
In [533]: topNchars(a, 5) # Case insensitive
Out[533]: ['s', 'i', 'r', 'e', 'y']
In [534]: topNchars(a, 10) # Case insensitive
Out[534]: ['s', 'i', 'r', 'e', 'y', 'u', 'o']
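The uint8 view in topNchars works because viewing a fixed-width string array as integers exposes the raw character codes. A minimal sketch of that trick in isolation (the sample strings are illustrative; note that topNchars views a unicode array, which stores four bytes per character padded with zeros, hence its s != 0 filter, while this sketch uses a bytes array at one byte per character for simplicity):

```python
import numpy as np

a = np.array(['ab', 'cd'], dtype='S2')  # fixed-width bytes: one byte per character
codes = a.view('uint8')                 # reinterpret the same buffer as character codes
print(codes)                            # ASCII codes for a, b, c, d
```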
Answer 2 (score: 2)
Assuming you only want the single most common character, with each character counted only once per word:
>>> from itertools import chain
>>> l = ['hello', 'hank']
>>> chars = list(chain.from_iterable([list(set(word)) for word in l]))
>>> max(chars, key=chars.count)
'h'
Using max with list.count can be much faster than using Counter due to the C-level implementation.
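That claim can be sanity-checked with a rough sketch like the one below (the word list is illustrative; actual timings depend heavily on list size, since chars.count inside max rescans the whole list for every element, making that approach quadratic overall):

```python
from collections import Counter
from itertools import chain
from timeit import timeit

words = ['hello', 'hank'] * 50
chars = list(chain.from_iterable(set(w) for w in words))

t_max = timeit(lambda: max(chars, key=chars.count), number=100)
t_counter = timeit(lambda: Counter(chars).most_common(1)[0][0], number=100)
print('max/list.count: %.4fs  Counter: %.4fs' % (t_max, t_counter))
```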
Answer 3 (score: 0)
This already looks very fast and runs in O(n). The only real opportunity for improvement I see is to parallelize this process by splitting li into multiple parts.
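A sketch of that split-and-merge structure (not from the answer; threads are used here only to keep the sketch self-contained and avoid pickling concerns, but because of the GIL a real speedup for this CPU-bound work would need multiprocessing.Pool with the same chunking shape):

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def count_chunk(words):
    # Count each character once per word within one chunk
    c = Counter()
    for w in words:
        c.update(set(w))
    return c

def parallel_most_common(li, n_chunks=4):
    # Split li into n_chunks slices, count each independently, merge the partial Counters
    chunks = [li[i::n_chunks] for i in range(n_chunks)]
    with ThreadPoolExecutor(max_workers=n_chunks) as ex:
        parts = ex.map(count_chunk, chunks)
    total = Counter()
    for part in parts:
        total += part
    return total.most_common(10)

print(parallel_most_common(['hello', 'hank'] * 100))
```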
Answer 4 (score: 0)
Here is a pure Python solution that uniquifies each string, joins the sets, then counts the result (using Divakar's example list):
>>> li=['er', 'IS' , 'you', 'Is', 'is', 'er', 'IS']
>>> Counter(e for sl in map(list, map(set, li)) for e in sl)
Counter({'I': 3, 'e': 2, 's': 2, 'S': 2, 'r': 2, 'o': 1, 'i': 1, 'u': 1, 'y': 1})
如果您希望将大写和小写统计为相同的字母:
>>> Counter(e for sl in map(list, map(set, [s.lower() for s in li])) for e in sl)
Counter({'i': 4, 's': 4, 'e': 2, 'r': 2, 'o': 1, 'u': 1, 'y': 1})
Now the timings:
from __future__ import print_function
from collections import Counter
import numpy as np
import pandas as pd

def dawg(li):
    return Counter(e for sl in map(list, map(set, li)) for e in sl)

def nump(a):
    chars = np.asarray(a).view('S1')
    valid_chars = chars[chars != '']
    unqchars, count = np.unique(valid_chars, return_counts=1)
    return pd.DataFrame({'char': unqchars, 'count': count})

if __name__ == '__main__':
    import timeit
    li = ['er', 'IS', 'you', 'Is', 'is', 'er', 'IS']
    for f in (dawg, nump):
        print(" ", f.__name__, timeit.timeit("f(li)", setup="from __main__ import f, li", number=100))
Results:
dawg 0.00134205818176
nump 0.0347728729248
The Python solution is significantly faster in this case.
Answer 5 (score: -1)
Just do
counter = Counter(''.join(li))
most_common = counter.most_common()
and you're done.
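A caveat worth noting (not stated in the answer): joining the strings counts every occurrence of a character, including repeats within a single word, which differs from the once-per-word counting the question describes:

```python
from collections import Counter

li = ['hello', 'hank']

per_occurrence = Counter(''.join(li))              # counts repeats within a word
per_word = Counter(c for w in li for c in set(w))  # counts each char once per word

print(per_occurrence['l'], per_word['l'])  # prints: 2 1
```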