Question

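I have a list of sentences containing about 500,000 sentences, and a list of concepts containing about 13,000,000 concepts. For each sentence, I want to extract the concepts it contains, in the order they appear in the sentence, and write them to the output.

For example, my Python program looks like this.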

import re

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', 
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process', 
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

output = []
counting = 0

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall

for sentence in sentences:
    output.append(find_all_concepts(sentence))

print(output)

The output is:

[['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process']]

However, the order of the output is not important to me. That is, my output could also look like either of the following (in other words, the lists inside output can be shuffled).

[['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

[['data mining', 'knowledge discovery', 'databases process'], ['data mining', 'interdisciplinary subfield', 'information', 'information'], ['data mining', 'process', 'patterns', 'methods', 'machine learning', 'database systems']]

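However, because of the length of my sentences and concepts lists, this program is still slow.

Is it possible to further improve the performance (in terms of time) using multithreading in Python?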

Answer 1

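Whether multithreading yields an actual performance gain depends not only on the Python implementation and the amount of data, but also on the hardware executing the program. In some cases, where the hardware offers no advantage, multithreading may end up slowing things down because of the added overhead.

However, assuming you are running on a modern standard PC or better, you may see some improvement with multithreading. The problem then is to set up a number of workers, hand the work to them, and collect the results.

Staying close to your example's structure, implementation and naming: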

import re
import queue
import threading

sentences = ['data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
             'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
             'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, l_out):
    while True:
        sentence = q_in.get()
        l_out.append(find_all_concepts(sentence))
        q_in.task_done()


# Queue with default maxsize of 0, infinite queue size
sentences_q = queue.Queue()
output = []

# any reasonable number of workers
num_threads = 2
for i in range(num_threads):
    worker = threading.Thread(target=do_find_all_concepts, args=(sentences_q, output))
    # once there's nothing but daemon threads left, Python exits the program
    worker.daemon = True
    worker.start()

# put all the input on the queue
for s in sentences:
    sentences_q.put(s)

# wait for the entire queue to be processed
sentences_q.join()
print(output)

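User @wwii pointed out that multiple threads do not really improve performance for CPU-bound problems. Instead of multiple threads accessing the same output variable, you could also use multiple processes accessing a shared output queue, like this: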

import re
import queue
import multiprocessing

sentences = [
    'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems',
    'data mining is an interdisciplinary subfield of computer science and statistics with an overall goal to extract information from a data set and transform the information into a comprehensible structure for further use',
    'data mining is the analysis step of the knowledge discovery in databases process or kdd']

concepts = ['data mining', 'database systems', 'databases process',
            'interdisciplinary subfield', 'information', 'knowledge discovery',
            'methods', 'machine learning', 'patterns', 'process']

re_concepts = [re.escape(t) for t in concepts]

find_all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL).findall


def do_find_all_concepts(q_in, q_out):
    # keep working until the None sentinel signals that the input is exhausted
    for sentence in iter(q_in.get, None):
        q_out.put(find_all_concepts(sentence))


if __name__ == '__main__':
    # default maxsize of 0, infinite queue size
    sentences_q = multiprocessing.Queue()
    output_q = multiprocessing.Queue()

    # any reasonable number of workers
    num_processes = 2
    workers = [multiprocessing.Process(target=do_find_all_concepts,
                                       args=(sentences_q, output_q))
               for _ in range(num_processes)]
    for worker in workers:
        worker.start()

    # put all the input on the queue, followed by one sentinel per worker
    for s in sentences:
        sentences_q.put(s)
    for _ in workers:
        sentences_q.put(None)

    # collect one result per sentence before joining,
    # so that no worker is left blocked on unread output
    output = [output_q.get() for _ in sentences]
    for worker in workers:
        worker.join()
    print(output)

This adds even more overhead, but it also uses the CPU resources available on other cores.

Answer 2

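Here are two solutions using concurrent.futures.ProcessPoolExecutor, which distributes the tasks to different processes. Your task appears to be CPU-bound rather than I/O-bound, so threads probably won't help.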

import re
import concurrent.futures

# using the lists in your example

re_concepts = [re.escape(t) for t in concepts]
all_concepts = re.compile('|'.join(re_concepts), flags=re.DOTALL)

def f(sequence, regex=all_concepts):
    result = regex.findall(sequence)
    return result

if __name__ == '__main__':

    out1 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        futures = [executor.submit(f, s) for s in sentences]
        for future in concurrent.futures.as_completed(futures):
            try:
                result = future.result()
            except Exception as e:
                print(e)
            else:
                #print(result)
                out1.append(result)   

    out2 = []
    with concurrent.futures.ProcessPoolExecutor() as executor:
        for result in executor.map(f, sentences):
            #print(result)
            out2.append(result)

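Executor.map() has a chunksize parameter; the docs say that sending chunks of more than one item of the iterable can be beneficial. The function would need to be refactored to account for that. I tested this with a function that just returns what it was sent, but regardless of the chunksize I specified, the test function only ever returned single items. Go figure?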

def h(sequence):
    return sequence
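
A minimal sketch of passing chunksize, using a small stand-in concept list rather than the real data. As far as the documented behaviour goes, chunksize only controls how many items are batched into one task sent to a worker process; the mapped function is still called once per item, which would be consistent with the single items seen in the test above. The names find_concepts and extract_all are made up for this illustration.

```python
import re
import concurrent.futures

# small stand-in data; the real lists are far larger
concepts = ['data mining', 'machine learning', 'patterns']
sentences = ['data mining finds patterns',
             'machine learning and data mining',
             'no match here'] * 4

all_concepts = re.compile('|'.join(re.escape(c) for c in concepts))


def find_concepts(sentence):
    # called once per sentence, whatever chunksize is
    return all_concepts.findall(sentence)


def extract_all(sentences, chunksize=4):
    # chunksize batches several sentences into one inter-process message;
    # map() still yields one result per sentence, in input order
    with concurrent.futures.ProcessPoolExecutor(max_workers=2) as executor:
        return list(executor.map(find_concepts, sentences, chunksize=chunksize))


if __name__ == '__main__':
    results = extract_all(sentences)
    print(results[:3])
```

A larger chunksize reduces the number of inter-process messages, so the best value is something to measure on the real data rather than assume.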

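One drawback of multiprocessing is that the data must be serialized (pickled) to be sent to the process, which takes time and might be significant for a compiled regular expression that large. It could defeat the gains from using multiple processes.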
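
A rough way to measure that cost is to time a pickle round trip directly. The sketch below uses a made-up list of 2,000 concepts as a stand-in, so the absolute numbers say little about the real 13-million-entry pattern. It calls re.purge() before loading on the assumption that a compiled pattern is rebuilt from its pattern string when unpickled, so that the load is not served from the regex cache of the current process.

```python
import pickle
import re
import time

# stand-in concept list; scale the range up to approximate the real data
concepts = ['concept number %d' % i for i in range(2000)]
pattern = re.compile('|'.join(re.escape(c) for c in concepts))

start = time.perf_counter()
payload = pickle.dumps(pattern)
dump_seconds = time.perf_counter() - start

# clear the regex cache so loading behaves more like a fresh worker process
re.purge()

start = time.perf_counter()
restored = pickle.loads(payload)
load_seconds = time.perf_counter() - start

print('payload bytes:', len(payload))
print('dumps: %.6f s, loads: %.6f s' % (dump_seconds, load_seconds))
```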

I made a set of 13e6 random strings with 20 characters each to approximate your compiled regex.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

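Pickling to an io.BytesIO stream takes about 7.5 seconds, and unpickling from an io.BytesIO stream takes 9 seconds. If you use a multiprocessing solution, it may be beneficial to pickle the concepts object (in whatever form) to the hard drive and then have each process unpickle it from the hard drive, rather than pickling and unpickling on each side of the IPC every time a new process is created. It is definitely worth testing, YMMV. The pickled set is 380 MB on my hard drive.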

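When I tried some experiments with concurrent.futures.ProcessPoolExecutor, I kept blowing up my computer, because each process needed its own copy of the set and my computer just doesn't have enough memory.

I'll post another answer dealing with the method of testing for concepts in sentences.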

Answer 3

This answer addresses improving performance without using concurrency.

The way you have structured your search, you are looking for 13 million unique things in each sentence. You said it takes 3-5 minutes for each sentence, and that the concepts range from one to ten words in length.

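I think you can improve the search time by making a set of concepts (either when it is first constructed or from your list), then splitting each sentence into strings of one to ten (consecutive) words and testing each for membership in the set.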
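
To get a feel for how much work this is per sentence, the number of candidate strings can be computed directly: a sentence of w words contains w - n + 1 runs of n consecutive words. The helper name count_candidates below is made up for this illustration.

```python
def count_candidates(num_words, max_n=10):
    # a sentence of num_words words has (num_words - n + 1) runs of n consecutive words
    return sum(max(num_words - n + 1, 0) for n in range(1, max_n + 1))


print(count_candidates(100))  # 955 candidate strings for a 100-word sentence
print(count_candidates(24))   # 195 candidate strings for a 24-word sentence
```

So even a long sentence needs only around a thousand set lookups, regardless of how many concepts there are.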

Example of a sentence split into 4-word strings:

'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems'
# becomes
[('data', 'mining', 'is', 'the'),
 ('mining', 'is', 'the', 'process'),
 ('is', 'the', 'process', 'of'),
 ('the', 'process', 'of', 'discovering'),
 ('process', 'of', 'discovering', 'patterns'),
 ('of', 'discovering', 'patterns', 'in'),
 ('discovering', 'patterns', 'in', 'large'),
 ('patterns', 'in', 'large', 'data'),
 ('in', 'large', 'data', 'sets'),
 ('large', 'data', 'sets', 'involving'),
 ('data', 'sets', 'involving', 'methods'),
 ('sets', 'involving', 'methods', 'at'),
 ('involving', 'methods', 'at', 'the'),
 ('methods', 'at', 'the', 'intersection'),
 ('at', 'the', 'intersection', 'of'),
 ('the', 'intersection', 'of', 'machine'),
 ('intersection', 'of', 'machine', 'learning'),
 ('of', 'machine', 'learning', 'statistics'),
 ('machine', 'learning', 'statistics', 'and'),
 ('learning', 'statistics', 'and', 'database'),
 ('statistics', 'and', 'database', 'systems')]

Process:

concepts = set(concepts)
sentence = sentence.split()
found = []
# one word
for meme in sentence:
    if meme in concepts:
        found.append(meme)
# two words
for meme in zip(sentence, sentence[1:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))
# three words
for meme in zip(sentence, sentence[1:], sentence[2:]):
    if ' '.join(meme) in concepts:
        found.append(' '.join(meme))

Adapting an itertools recipe (pairwise), you can automate the process of making n-word strings from a sentence:

from itertools import tee
def nwise(iterable, n=2):
    "s -> (s0,s1), (s1,s2), (s2, s3), ... for n=2"
    iterables = tee(iterable, n)
    # advance each iterable to the appropriate starting point
    for i, thing in enumerate(iterables[1:], 1):
        for _ in range(i):
            next(thing, None)
    return zip(*iterables)

Testing each sentence looks like this:

sentence = sentence.strip().split()
found = []
for n in [1,2,3,4,5,6,7,8,9,10]:
    for meme in nwise(sentence, n):
        if ' '.join(meme) in concepts:
            found.append(' '.join(meme))

I made a set of 13e6 random strings with 20 characters each to approximate concepts.

import random, string
data = set(''.join(random.choice(string.printable) for _ in range(20)) for _ in range(13000000))

Testing a four- or forty-character string for membership in data takes about 60 nanoseconds. A one-hundred-word sentence contains 955 one- to ten-word strings, so searching that sentence should take about 60 microseconds.

The first sentence from your example, 'data mining is the process of discovering patterns in large data sets involving methods at the intersection of machine learning statistics and database systems', has 195 possible concepts (one- to ten-word strings). The timing for the following two functions is about the same: about 140 microseconds for f and about 150 microseconds for g:

def f(sentence, data=data, nwise=nwise):
    '''iterate over memes in sentence and see if they are in data'''
    sentence = sentence.strip().split()
    found = []
    for n in [1,2,3,4,5,6,7,8,9,10]:
        for meme in nwise(sentence, n):
            meme = ' '.join(meme)
            if meme in data:
                found.append(meme)
    return found

def g(sentence, data=data, nwise=nwise):
    '''make a set of the memes in sentence then find its intersection with data'''
    sentence = sentence.strip().split()
    test_strings = set(' '.join(meme) for n in range(1,11) for meme in nwise(sentence, n))
    found = test_strings.intersection(data)
    return found

So these are only approximations, since I'm not using your actual data, but it should speed things up quite a bit.

After testing with your example data, I found that g won't work if a concept appears twice in a sentence.

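So here it is all together, with the concepts listed in the order they are found in each sentence. The new version will take longer, but the added time should be relatively small. If possible, would you post a comment letting me know how much longer it takes than the original? (I'm curious.)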
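
A minimal sketch of one way to keep both the order and the repeats, not necessarily the author's version. It walks the sentence by starting word and tries every length from one word up to a limit at each position. The names find_in_order and max_words are made up for this illustration.

```python
def find_in_order(sentence, concepts, max_words=10):
    """Return every concept in sentence, ordered by starting word, repeats kept."""
    words = sentence.strip().split()
    found = []
    for start in range(len(words)):
        # try each length from one word up to max_words at this position
        for n in range(1, max_words + 1):
            if start + n > len(words):
                break
            candidate = ' '.join(words[start:start + n])
            if candidate in concepts:
                found.append(candidate)
    return found


concepts = {'data mining', 'interdisciplinary subfield', 'information'}
sentence = ('data mining is an interdisciplinary subfield of computer science '
            'and statistics with an overall goal to extract information from '
            'a data set and transform the information into a comprehensible '
            'structure for further use')
print(find_in_order(sentence, concepts))
# ['data mining', 'interdisciplinary subfield', 'information', 'information']
```

Unlike the regex findall approach, this reports overlapping matches too: a sentence containing 'databases process' would yield both 'databases process' and 'process' if both are in the set.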
