Given big.txt from norvig.com/big.txt, my goal is to count bigrams really fast (imagine that I have to repeat the counting 100,000 times).
According to Fast/Optimize N-gram implementations in python, extracting bigrams like this would be optimal:

_bigrams = zip(*[text[i:] for i in range(2)])

And if I'm using Python 3, the generator won't be evaluated until I materialize it with list(_bigrams) or some other function.
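As a quick sanity check (a toy example, not from the question), the zip-based extraction is indeed lazy in Python 3 and only does work when consumed:

```python
text = "cheese"  # hypothetical toy input
_bigrams = zip(*[text[i:] for i in range(2)])  # a zip object; no work done yet
print(list(_bigrams))  # [('c', 'h'), ('h', 'e'), ('e', 'e'), ('e', 's'), ('s', 'e')]
```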
import io
from collections import Counter
import time

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(_bigrams).most_common(100)
    # Do some manipulation to text and repeat the counting.
    text = manipulate(text, top100)
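For what it's worth, an equivalent extraction (a sketch, not from the question) pairs the string with itself shifted by one, avoiding the intermediate list of slices:

```python
from collections import Counter

# Hypothetical toy input; the question feeds in the full big.txt string.
text = "banana"

# zip(text, text[1:]) yields the same character pairs as
# zip(*[text[i:] for i in range(2)]), without building a list of slices.
bigram_counts = Counter(zip(text, text[1:]))
print(bigram_counts.most_common(1))  # [(('a', 'n'), 2)]
```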
But each iteration takes about 1+ seconds, and 100,000 iterations would be too long.

I've also tried sklearn's CountVectorizer, but the time to extract, count, and get the top 100 bigrams is comparable to native Python.

Then I experimented with some multiprocessing, using a slightly modified version of Python multiprocessing and a shared counter and http://eli.thegreenplace.net/2012/01/04/shared-counter-with-pythons-multiprocessing:
from multiprocessing import Process, Manager, Lock
import time

class MultiProcCounter(object):
    def __init__(self):
        self.dictionary = Manager().dict()
        self.lock = Lock()

    def increment(self, item):
        with self.lock:
            self.dictionary[item] = self.dictionary.get(item, 0) + 1

def func(counter, item):
    counter.increment(item)

def multiproc_count(inputs):
    counter = MultiProcCounter()
    procs = [Process(target=func, args=(counter, _in)) for _in in inputs]
    for p in procs: p.start()
    for p in procs: p.join()
    return counter.dictionary

inputs = [1, 1, 1, 1, 2, 2, 3, 4, 4, 5, 2, 2, 3, 1, 2]
print(multiproc_count(inputs))
But using MultiProcCounter in the bigram counting takes even more than 1 second per iteration. I have no idea why that is; on the dummy list-of-int example, multiproc_count works perfectly.

I've tried:
import io
from collections import Counter
import time

with io.open('big.txt', 'r', encoding='utf8') as fin:
    text = fin.read().lower().replace(u' ', u"\uE000")

while True:
    _bigrams = zip(*[text[i:] for i in range(2)])
    start = time.time()
    top100 = Counter(multiproc_count(_bigrams)).most_common(100)
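A plausible reason this version is so slow: it spawns one Process per bigram and serializes every increment through a Manager proxy and a lock. A common alternative (a sketch, not from the question) is to split the text into chunks, count each chunk with a local Counter in a Pool worker, and merge the results once at the end:

```python
from collections import Counter
from multiprocessing import Pool

def count_chunk(args):
    """Count bigrams in text[start:end]; the +1 overlap on the first
    slice keeps the bigram that straddles the chunk boundary."""
    text, start, end = args
    return Counter(zip(text[start:end + 1], text[start + 1:end + 1]))

def parallel_bigram_count(text, workers=4):
    step = len(text) // workers
    spans = [(text, i * step,
              (i + 1) * step if i < workers - 1 else len(text))
             for i in range(workers)]
    with Pool(workers) as pool:
        counts = pool.map(count_chunk, spans)
    total = Counter()
    for c in counts:
        total.update(c)  # merge per-chunk counts
    return total

if __name__ == '__main__':
    print(parallel_bigram_count('abababab').most_common(2))
```

Each worker builds an in-process Counter with no locking, so the only synchronization cost is shipping one dict per chunk back to the parent; whether this beats single-process Counter depends on the text size and process start-up overhead.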
Is there any way to count bigrams really fast in Python?
Answer 0 (score: 1)
import os, thread

text = 'I really like cheese' # just load whatever you want here, this is just an example
CORE_NUMBER = os.cpu_count() # may not be available, just replace with how many cores you have if it crashes
ready = []
bigrams = []

def extract_bigrams(cores):
    global ready, bigrams
    bigrams = []
    ready = []
    for a in xrange(cores): # xrange is best for performance
        bigrams.append(0)
        ready.append(0)
    cpnt = 0 # current point
    iterator = int(len(text)/cores)
    for a in xrange(cores-1):
        thread.start_new(extract_bigrams2, (cpnt, cpnt+iterator+1, a)) # overlap is intentional
        cpnt += iterator
    thread.start_new(extract_bigrams2, (cpnt, len(text), a+1))
    while 0 in ready:
        pass

def extract_bigrams2(startpoint, endpoint, threadnum):
    global ready, bigrams
    ready[threadnum] = 0
    bigrams[threadnum] = zip(*[text[startpoint+i:endpoint] for i in xrange(2)])
    ready[threadnum] = 1

extract_bigrams(CORE_NUMBER)

thebigrams = []
for a in bigrams:
    thebigrams += a

print thebigrams
There are some issues with this program, e.g. it doesn't filter out whitespace or punctuation, but I made it to show what you should be shooting for. You can easily edit it to fit your needs.

This program automatically detects how many cores your computer has and creates that number of threads, trying to evenly distribute the areas in which to look for bigrams. I was only able to test this code in an online browser on a school-owned computer, so I can't be sure it works entirely. If there are any problems or questions, please leave them in the comments.
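The answer's code targets Python 2 (`thread`, `xrange`, the `print` statement). A rough Python 3 equivalent of the same chunk-and-overlap idea using `threading` might look like this (a sketch; note that CPython's GIL means threads won't speed up this CPU-bound work, which is one reason the multiprocessing route in the question exists):

```python
import os
import threading

def extract_chunk(text, start, end, out, idx):
    # The second slice reaches one past `end` so the bigram that
    # straddles the chunk boundary is not lost.
    out[idx] = list(zip(text[start:end], text[start + 1:end + 1]))

def extract_bigrams_threaded(text, cores=None):
    cores = cores or os.cpu_count() or 2
    step = max(1, len(text) // cores)
    results = [[] for _ in range(cores)]
    threads = []
    for i in range(cores):
        start = i * step
        end = len(text) if i == cores - 1 else (i + 1) * step
        t = threading.Thread(target=extract_chunk,
                             args=(text, start, end, results, i))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # Flatten the per-thread chunks, preserving text order.
    return [pair for chunk in results for pair in chunk]
```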
Answer 1 (score: 0)
My suggestion:
Text = "The Project Gutenberg EBook of The Adventures of Sherlock Holmes" \
       "by Sir Arthur Conan Doyle"

# Counters
Counts = [[0 for x in range(128)] for y in range(128)]

# Perform the counting
R = ord(Text[0])
for i in range(1, len(Text)):
    L = R; R = ord(Text[i])
    Counts[L][R] += 1

# Output the results
for i in range(ord('A'), ord('{')):
    if i < ord('[') or i >= ord('a'):
        for j in range(ord('A'), ord('{')):
            if (j < ord('[') or j >= ord('a')) and Counts[i][j] > 0:
                print chr(i) + chr(j), Counts[i][j]
Ad 1
Bo 1
EB 1
Gu 1
Ho 1
Pr 1
Sh 1
Th 2
be 1
ck 1
ct 1
dv 1
ec 1
en 2
er 2
es 2
he 3
je 1
lm 1
lo 1
me 1
nb 1
nt 1
oc 1
of 2
oj 1
ok 1
ol 1
oo 1
re 1
rg 1
rl 1
ro 1
te 1
tu 1
ur 1
ut 1
ve 1
This version is case-sensitive; it's probably best to lowercase the whole text first.
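For reference, a rough Python 3 adaptation of the same array-based idea (my sketch of the answer's approach, lowercasing first as its final note suggests):

```python
def count_bigrams_ascii(text):
    # 128x128 table indexed by character code points, as in the answer,
    # but lowercased up front and skipping non-ASCII characters.
    counts = [[0] * 128 for _ in range(128)]
    text = text.lower()
    for left, right in zip(text, text[1:]):
        l, r = ord(left), ord(right)
        if l < 128 and r < 128:
            counts[l][r] += 1
    return counts

counts = count_bigrams_ascii("The the")
print(counts[ord('t')][ord('h')])  # 2
```

Indexing a flat list by code point avoids hashing every pair, which is the main reason this can beat a dict-based Counter on very long ASCII texts.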