I wrote some code that is supposed to perform some operations on the sentences of a file and on the elements of two lists, keywords and keywords2. Here it is:
import os
keywords=['a','b']
keywords2=['c','d mvb']
def foo(sentence,k2):
    gs_list=[] #####
    for k in keywords: #####
        if k in sentence: #####
            gs_list.append(k) #####
    for k in gs_list:
        if (k in sentence) and (k2 in sentence):
            print 'a match'
            return 4
for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences=open(file).readlines();
        for sentence in sentences:
            if sentence.startswith('!series_title'):
                for k2 in keywords2:
                    foo(sentence,k2)
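I have marked the part of the code in question with #####. This piece (let's call it BETA) basically builds the list of keywords that occur in the selected sentence, so that all later operations only have to be performed with those keywords.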
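This code takes about 47 seconds to run over 100 files. Now I am trying to think of ways to speed it up. keywords2 has about 50 elements, so I figured I am basically running BETA 50 times per sentence inside the function foo, while all it needs is the list keywords and sentence. I already have both of those in the main code, so I moved this piece into the main part of the code: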
import os
keywords=['a','b']
keywords2=['c','d mvb']
def foo(sentence,k2):
    for k in gs_list:
        if (k in sentence) and (k2 in sentence):
            print 'a match'
            return 4
for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences=open(file).readlines();
        for sentence in sentences:
            if sentence.startswith('!series_title'):
                gs_list=[] #####
                for k in keywords: #####
                    if k in sentence: #####
                        gs_list.append(k) #####
                for k2 in keywords2:
                    foo(sentence,k2)
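My thinking was that this would ensure the list-building step happens only once per sentence instead of 50 times as before, which should surely speed the code up. Instead, this version actually took 89 seconds for the same 100 files.
I can't understand why this takes more time than the previous code. Any ideas?
Full code: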
import os
import re
import time

start_time = time.time()
a = open('F:\M.Tech\patterns for gmk_down.txt','r').readlines()
a1 = open('F:\M.Tech\patterns for gmk_up.txt','r').readlines()
keywords2=a+a1
ri2 = open(r'F:\M.Tech\for assigning cl\rules occurence\s\induced two.txt', 'w')
keywords = open('F:\M.Tech\mouse_gs_small_simple_reduced.txt','r').readlines() # this has the new small GS
keystripped = [k.rstrip().lower() for k in keywords]
c=0

def foo(s, gmk):
    if gmk in s: # checking if gmk is in the line
        l = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|[()]|-', s) # split the line by comma, semicolon and space to check for gmks and gs.
        l = filter(None, l) # remove empty elements in the list (the result has to be assigned back)
        #gs_list = [k for k in keystripped if k in s] # <-------- PIECE IN QUESTION --------
        for gs in gs_list: # gene symbols
            gs1 = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|-', gs)
            gs1=filter(None, gs1)
            gmk1 = re.split('\s|(?<!\d)[,.]|[,.](?!\d)|;|-', gmk)
            gmk1=filter(None, gmk1)
            if any(l[i:i+len(gs1)]==gs1 for i in xrange(len(l)-len(gs1)+1)) and (any(l[i:i+len(gmk1)]==gmk1 for i in xrange(len(l)-len(gmk1)+1))): # this ensures that both gs and gmk are in l, as a unit(i.e. and in order) otherwise it was detecting things like 'beta c' from beta cells
                # UPTO THIS POINT WE HAVE ESTABLISHED THAT THE GMK AND GS ARE INDEED IN THE LINE
                k1 = '_MKKEYWORD_1_'
                k2 = '_SKEYWORD_2_'
                #print gmk
                text = re.sub(re.escape(gmk), k1, s, flags=re.I) # because of this replacement, we dont have the problem of counting r from behind etc.
                text = re.sub(r'(\b%s\b)' % (re.escape(gs)), k2, text, flags=re.I)
                lt = text.split()
                d_idx = {k1:[], k2:[]}
                for k,v in enumerate(lt):
                    if k1 in v:
                        d_idx[k1].append(k)
                    if k2 in v:
                        d_idx[k2].append(k)
                distance = 8
                data = []
                for idx1 in d_idx[k1]:
                    for idx2 in d_idx[k2]:
                        d = abs(idx1 - idx2)
                        if d<=distance:
                            data.append((d,idx1,idx2))
                data.sort(key=lambda x: x[0])
                for i in range (0, len(data)):
                    aq = data[i]
                    loq = min(aq[1], aq[2])
                    hiq = max(aq[1], aq[2])
                    brrq = lt[max(0, loq-6):hiq+6]
                    brq = " ".join(brrq)
                if data:
                    cl(s, gmk, gs, gs_list, data)

def cl(s1, gmk1, gs1, gs_list1, data1): # output will be the confidence level
    if gmk1 == 'induced':
        # NOTE: br0 is not defined anywhere in the posted code
        if re.search(r'(%s.*?-induced)' %gs1, br0, re.I|re.S):
            ri2.write('good')
            return 4

c=0
for path, dirs, files in os.walk(r'F:\M.Tech\for assigning cl\selected\random 100'):
    for file in files:
        sentences = open(os.path.join(path,file),'r').readlines();
        print("--- %s seconds ---" % (time.time() - start_time))
        for s in sentences:
            if s.startswith('!series_title'):
                gs_list = [k for k in keystripped if k in s] #<------- PIECE IN QUESTION --------
                for k2 in keywords2:
                    k2 = k2.rstrip().lower()
                    foo(s, k2)
ri2.close()
print("--- %s seconds ---" % (time.time() - start_time))
Answer 0 (score: 1)
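You are not passing gs_list to foo, so foo reads it as a global variable. Using a global variable may be what is slowing your script down.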
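A minimal sketch of what passing the list explicitly could look like, based on the simplified code from the question. The extra gs_list parameter, the scan helper, and returning the results instead of printing them are assumptions made for illustration, and the sketch is written to run on both Python 2 and Python 3.

```python
# Sketch only: foo receives gs_list as an argument instead of reading a global.
keywords = ['a', 'b']
keywords2 = ['c', 'd mvb']


def foo(sentence, k2, gs_list):
    # gs_list is now a local name; every k in it is already known
    # to occur in sentence, so only k2 still has to be checked.
    for k in gs_list:
        if k2 in sentence:
            return 4
    return None


def scan(sentences):
    """Hypothetical driver: collect (k2, result) pairs for the title lines."""
    results = []
    for sentence in sentences:
        if sentence.startswith('!series_title'):
            # BETA: built once per sentence, then handed to foo.
            gs_list = [k for k in keywords if k in sentence]
            for k2 in keywords2:
                results.append((k2, foo(sentence, k2, gs_list)))
    return results
```

Calling scan(open(path).readlines()) for each file would replace the innermost loops of the original script. Whether this changes the running time noticeably is something to measure rather than assume.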
Also, consider turning BETA into a list comprehension. This should be all you need:
gs_list = [k for k in keywords if k in sentence]
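To check which variant is actually faster, it may help to time the two placements of BETA in isolation with the standard timeit module, since the wall-clock time of the whole script also includes reading the files. The following is a rough sketch under stated assumptions: the keyword lists, the sentence, and the function names matches_inside, matches_hoisted and compare are all made up for illustration, and the numbers it prints will differ from machine to machine.

```python
import timeit

# Made-up stand-ins for the real keyword files (assumption, not the asker's data).
keywords = ['kw%d' % i for i in range(200)]
keywords2 = ['pat%d' % i for i in range(50)]
sentence = '!series_title kw3 and kw17 with pat5 and pat42'


def matches_inside(sentence):
    """BETA rebuilt for every k2, as in the first version of the question."""
    count = 0
    for k2 in keywords2:
        gs_list = [k for k in keywords if k in sentence]
        if gs_list and k2 in sentence:
            count += 1
    return count


def matches_hoisted(sentence):
    """BETA built once per sentence, as in the second version."""
    count = 0
    gs_list = [k for k in keywords if k in sentence]
    for k2 in keywords2:
        if gs_list and k2 in sentence:
            count += 1
    return count


def compare(number=200):
    """Return the total seconds each variant needs for `number` calls."""
    t_inside = timeit.timeit(lambda: matches_inside(sentence), number=number)
    t_hoisted = timeit.timeit(lambda: matches_hoisted(sentence), number=number)
    return t_inside, t_hoisted


if __name__ == '__main__':
    t_inside, t_hoisted = compare()
    print('inside: %.4fs  hoisted: %.4fs' % (t_inside, t_hoisted))
```

Both functions return the same count, so any difference in the two timings comes only from where the list is built. If the hoisted variant wins here but the full script is still slower, the extra time is probably being spent somewhere other than BETA.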