Question

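I am trying to generate a report that contains data for different "groupings". For each grouping I have to query postgres in a different way and apply different logic, which can take quite a long time (about 1 hour).

To improve performance, I created one thread per task, each with its own connection, because psycopg2 executes queries sequentially per connection. I use numpy to compute the median and mean of a portion of the data (this step is common to every grouping).

A short example of my code follows: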

# -*- coding: utf-8 -*-

import numpy
from postgres import Connection
from lookup import Lookup
from queries import QUERY1, QUERY2
from threading import Thread

class Report(object):

    def __init__(self, **credentials):
        self.conn = self.__get_conn(**credentials)
        self._lookup = Lookup(self.conn)
        self.data = {}

    def __get_conn(self, **credentials):
        return Connection(**credentials)

    def _get_averages(self, data):
        return {
            'mean' : numpy.mean(data),
            'median' : numpy.median(data)
        }

    def method1(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY1)

        for row in data:
            # Logic specific to the results returned by the query.
            row['arg1'] = self._lookup.find_data_by_method_1(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']

        return data

    def method2(self):
        conn = self.__get_conn()
        cursor = conn.get_cursor()
        data = cursor.execute(QUERY2)

        for row in data:
            # Logic specific to the results returned by the query.
            row['arg2'] = self._lookup.find_data_by_method_2(row)
            avgs = self._get_averages(row['data'])
            row['mean'] = avgs['mean']
            row['median'] = avgs['median']

        return data

    def lookup(self, arg):

        methods = {
            'arg1' : self.method1,
            'arg2' : self.method2
        }

        # Look up the bound method for this grouping (dict indexing, not a call).
        method = methods[arg]
        self.data[arg] = method()

    def lookup_args(self):
        return self._lookup.find_args()

    def do_something_with_data(self):
        print self.data

def main():

    creds = {
        'host':'host',
        'user':'postgres',
        'database':'mydatabase',
        'password':'mypassword'
    }
    reporter = Report(**creds)

    args = reporter.lookup_args()
    threads = []
    for arg in args:
        thread = Thread(target=reporter.lookup, args=(arg,))
        threads.append(thread)

    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()

    reporter.do_something_with_data()

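The imported Connection class is a simple wrapper around psycopg2 that makes it easier to create cursors and to connect to multiple postgres databases.

The imported Lookup class accepts a Connection instance and is used to run short queries that look up related data which would drastically degrade performance if merged into the larger queries.

The data accepted by the example method _get_averages is a list of decimal.Decimal objects.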
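
For reference, numpy has no native decimal type, so a list of decimal.Decimal values becomes an array of dtype object, and sorting such an array compares the elements as Python objects. The sketch below only illustrates that point. The sample values and the helper name averages_as_float are made up for illustration, and converting to float64 assumes that losing Decimal precision is acceptable for the report.

```python
from decimal import Decimal

import numpy

# Hypothetical sample standing in for row['data'].
values = [Decimal('1.5'), Decimal('2.5'), Decimal('4.0')]

# numpy stores Decimal instances as generic Python objects.
as_objects = numpy.asarray(values)
print(as_objects.dtype)  # object


def averages_as_float(data):
    """Compute mean and median on a float64 copy of the data.

    Assumption: float precision is good enough for the report.
    """
    arr = numpy.array([float(v) for v in data], dtype=numpy.float64)
    return {
        'mean': float(numpy.mean(arr)),
        'median': float(numpy.median(arr)),
    }


result = averages_as_float(values)
```

With a float64 array, numpy sorts native numbers instead of comparing Python objects, which may be worth trying as an experiment. It is not a confirmed fix for the crash described here.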

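When I run all the threads at the same time, I get a segfault. If I run each thread on its own, the script completes successfully.

Using gdb I found that numpy is the culprit: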
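
One way to test whether concurrency inside numpy is involved is to serialize only the numpy calls with a lock, while the queries still run in parallel. This is a minimal diagnostic sketch, not a fix. The names _numpy_lock, locked_averages and worker are hypothetical, and plain floats stand in for the real Decimal data.

```python
import threading

import numpy

# One lock shared by all worker threads (hypothetical name).
_numpy_lock = threading.Lock()


def locked_averages(data):
    """Return the same keys as _get_averages, one thread at a time."""
    with _numpy_lock:
        return {
            'mean': numpy.mean(data),
            'median': numpy.median(data),
        }


def worker(name, data, results):
    # Each thread writes to its own key, as Report.lookup does.
    results[name] = locked_averages(data)


results = {}
# Made-up stand-ins for the per-grouping data.
samples = {
    'arg1': [1.0, 2.0, 3.0],
    'arg2': [10.0, 20.0, 30.0, 40.0],
}

threads = [
    threading.Thread(target=worker, args=(name, data, results))
    for name, data in samples.items()
]
for thread in threads:
    thread.start()
for thread in threads:
    thread.join()
```

If the segfault disappears with the lock in place, that would suggest (but not prove) that concurrent execution inside numpy is the trigger.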

Program received signal SIGSEGV, Segmentation fault.
[Switching to Thread 0x7fffedc8c700 (LWP 10997)]
0x00007ffff2ac33b7 in sortCompare (a=0x2e956298, b=0x2e956390) at numpy/core/src/multiarray/item_selection.c:1045
1045    numpy/core/src/multiarray/item_selection.c: No such file or directory.
        in numpy/core/src/multiarray/item_selection.c

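I am aware of a known bug in numpy, but it appears to affect only sorting lists that mix class instances with other numeric types. The objects in my list are guaranteed to be decimal.Decimal instances. (Yes, I verified this.)

What could cause numpy to segfault when used inside threads, but behave as expected otherwise?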

Numpy multithreading causes segmentation fault

0 Answers: