I have a large number of sentences
and a large number of concepts. I want to identify the concepts in the sentences in the order in which they occur. I am using for loops and multithreading as follows to perform this task.

This is the most efficient code I have so far. However, it is still very slow on my real data set.
My code:

import queue
import threading
sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
'interdisciplinary subfield', 'information', 'knowledge discovery',
'methods', 'machine learning', 'patterns', 'process']
def func(sentence):
    sentence_tokens = []
    for item in concepts:
        index = sentence.find(item)
        if index >= 0:
            sentence_tokens.append((index, item))
    sentence_tokens = [e[1] for e in sorted(sentence_tokens, key=lambda x: x[0])]
    return sentence_tokens

def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(func(sentence))
        q_in.task_done()
# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []
counting = 0
# any reasonable number of workers
num_threads = 4
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()
# put all the input on the queue
for s in sentences:
    sentences_q.put(s)
    counting = counting + 1
    print(counting)
# wait for the entire queue to be processed
sentences_q.join()
print(output)
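A side note on the code above (my observation, not part of the original post): func is CPU-bound string matching, and in CPython the GIL allows only one thread to execute Python bytecode at a time, so the four worker threads mostly run one after another. A hedged sketch of the same fan-out using a process pool instead, with the sample data from above:

```python
from multiprocessing import Pool

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']
concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

def func(sentence):
    # same matching logic as above: record (position, concept) pairs,
    # then keep only the concept strings, sorted by position
    found = [(sentence.find(item), item) for item in concepts
             if sentence.find(item) >= 0]
    return [item for _, item in sorted(found)]

if __name__ == '__main__':
    with Pool(processes=4) as pool:
        # unlike the thread/queue version, map() preserves input order
        output = pool.map(func, sentences)
    print(output)
```

pool.map also removes the need for the shared output list and the queue bookkeeping; whether 4 processes is the right number depends on the machine.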
The concepts list is sorted alphabetically. So I was wondering whether Python has any indexing mechanism that would let me search only a part of the concepts list, using the characters of the first word of the sentence, instead of searching the entire concepts list.
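For what it's worth, one standard-library way to exploit the alphabetical ordering is bisect: binary-search the sorted concepts list for the range of entries sharing a given prefix. This is a minimal sketch, not a drop-in replacement — it only narrows the candidates to concepts starting with a given word:

```python
import bisect

concepts = sorted(['data mining', 'database systems', 'databases process',
                   'interdisciplinary subfield', 'information', 'knowledge discovery',
                   'methods', 'machine learning', 'patterns', 'process'])

def candidates_for_prefix(prefix):
    # left edge: first concept >= prefix
    lo = bisect.bisect_left(concepts, prefix)
    # right edge: appending a character larger than any expected one gives an
    # upper bound just past every concept that starts with this prefix
    hi = bisect.bisect_right(concepts, prefix + '\uffff')
    return concepts[lo:hi]
```

For example, candidates_for_prefix('data') returns only the three entries beginning with 'data', so each sentence word costs O(log n) to locate plus comparisons against a small slice, instead of a scan over the whole concepts list.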
My main concern is time complexity (based on my current timing estimates, the full run would take almost 1.5 weeks). Space complexity is not an issue.
I am happy to provide more details if needed.