Speeding up Python code for graph compression with numba

Posted: 2019-05-21 22:51:22

Tags: python parallel-processing

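I am trying to speed up some Python code I am currently using. The code is meant to compress social graphs. The sample works as expected on my data; the main problem is the time it takes. Without any parallelization, it takes about 4000 s on Slashdot0902 (https://snap.stanford.edu/data/soc-Slashdot0902.html) and about 3800 s on soc-epinions1 (https://snap.stanford.edu/data/soc-Epinions1.html) to produce the compressed representation, which is much longer than I expected.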

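I profiled the code, and most of the time is spent in the main compression loop. I am looking for a way to parallelize it, essentially running the compression algorithm for every element (row) at the same time, since the compression of one element is completely independent of the others.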
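Because each row is compressed independently, the per-row work can be pulled out into a standalone function, which is the natural unit to hand to any parallel backend. A minimal pure NumPy sketch, assuming `matrix` is a square dense 0/1 NumPy array as in the question; `tree_geometry`, `compress_row` and `compress_matrix` are hypothetical names. It visits only the columns that hold a 1 (via `np.flatnonzero`), in ascending order like the original loop, instead of testing all `len_row` columns. Since `(cur - 1) // 2` differs from the original `int((index-1)/2)` only at index 0, the root is handled explicitly.

```python
import math

import numpy as np


def tree_geometry(len_row):
    # Same tree size and leaf offset as the question's code.
    height = int(math.log2(len_row)) + 1
    n = 2 ** height - 1
    return n, n - len_row


def compress_row(cols, n, start_index):
    # cols: ascending column indices where this row holds a 1.
    temp = np.full(n, -1, dtype=np.int8)
    for j in cols:
        cur = start_index + int(j)
        while True:
            temp[cur] = 1
            if cur == 0:  # root: the original stops here because parent(0) == 0
                break
            parent = (cur - 1) // 2  # equals int((cur-1)/2) for cur > 0
            if temp[parent] == 1:  # everything above is already marked
                break
            temp[parent] = 1
            sibling = cur + 1 if cur % 2 == 1 else cur - 1
            temp[sibling] = 0
            cur = parent
    # Drop the -1 placeholders, like temp_array[np.where(temp_array != -1)].
    return temp[temp != -1]


def compress_matrix(matrix):
    # Every row is independent, so this comprehension is the part to parallelize.
    n, start_index = tree_geometry(matrix.shape[0])
    return [compress_row(np.flatnonzero(row == 1), n, start_index) for row in matrix]
```

On a sparse social graph the number of 1s per row is far smaller than `len_row`, so skipping the zero columns alone may cut much of the serial cost before any parallelization.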

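Trying to get this speedup with a multiprocessing setup did not give the expected improvement. I also tried, unsuccessfully, to parallelize it with numba. I know that passing the matrix (a 2-D array) as input is not the way it should be done, but I am stuck at this point.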
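A likely reason the numba attempt fails: `@vectorize` compiles a scalar function into a NumPy ufunc, so each call sees single elements and cannot loop over a row or return a whole array; in addition, `parent` and `sibling` are declared with `@cuda.jit(device=True)`, so they can only be called from CUDA kernels. A CPU alternative is `@njit(parallel=True)` with `prange` over rows. The sketch below is a hedged example, not a drop-in replacement: it passes the graph as CSR arrays (`indptr`, `indices`: the column indices of the 1s in each row) instead of the dense matrix, processes rows in chunks so the `(chunk, n)` int8 buffer stays small, and compacts each row afterwards. All function names are hypothetical, and if numba is not installed the fallback decorators make it run as slow plain Python.

```python
import math

import numpy as np

try:
    from numba import njit, prange
except ImportError:  # fallback: run as plain (slow) Python if numba is missing
    prange = range

    def njit(*args, **kwargs):
        if args and callable(args[0]):
            return args[0]
        return lambda f: f


@njit
def _mark_row(temp, cols, start_index):
    # Same leaf-to-root walk as the question's inner loop, for one row.
    for k in range(cols.shape[0]):
        cur = start_index + cols[k]
        while True:
            temp[cur] = 1
            if cur == 0:  # root reached
                break
            p = (cur - 1) // 2  # parent
            if temp[p] == 1:  # path above is already marked
                break
            temp[p] = 1
            if cur % 2 == 1:  # the sibling gets 0
                temp[cur + 1] = 0
            else:
                temp[cur - 1] = 0
            cur = p


@njit(parallel=True)
def compress_rows(indptr, indices, row_lo, row_hi, n, start_index):
    # One output row per input row; -1 marks untouched tree slots.
    out = np.full((row_hi - row_lo, n), -1, np.int8)
    for r in prange(row_hi - row_lo):
        i = row_lo + r
        _mark_row(out[r], indices[indptr[i]:indptr[i + 1]], start_index)
    return out


def dense_to_csr(matrix):
    # Column indices of the 1s, row by row (ascending, like the original loop).
    rows, cols = np.nonzero(matrix == 1)
    indptr = np.zeros(matrix.shape[0] + 1, dtype=np.int64)
    np.cumsum(np.bincount(rows, minlength=matrix.shape[0]), out=indptr[1:])
    return indptr, cols.astype(np.int64)


def compress_all(matrix, chunk=1024):
    len_row = matrix.shape[0]
    height = int(math.log2(len_row)) + 1
    n = 2 ** height - 1
    start_index = n - len_row
    indptr, indices = dense_to_csr(matrix)
    result = []
    for lo in range(0, len_row, chunk):
        hi = min(lo + chunk, len_row)
        block = compress_rows(indptr, indices, lo, hi, n, start_index)
        result.extend(row[row != -1] for row in block)
    return result
```

Each `prange` iteration writes only to its own row of `out`, so there is no shared state between threads; the chunk size trades memory (`chunk * n` bytes) against per-call overhead.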

Here is the code sample (full code here: https://pastebin.com/8yihNs2Y):

# Imports
import math
import numpy as np
from numba import cuda, vectorize

#Load adjacency matrix into a variable called matrix
# Creates a list to store the final output values before writing into a file
list1 = []

# Finds the parent of current node
@cuda.jit(device=True)
def parent(index):
    return int((index-1) / 2)

# Finds the sibling of current node left or right sibling based on index
@cuda.jit(device=True)
def sibling(index):
    if(index % 2 == 1):
        return index+1
    else:
        return index-1

len_row = matrix.shape[0]
# Find the height of the binary tree and use it to find n i.e the no.of elements in the array
height = int(math.log2(len_row)) + 1
n = (2 ** height) - 1
start_index = n-len_row
temp_array = np.full(n, -1, dtype=np.int8)


@vectorize(['int32(int32,int32,int32,int32,int32)'], target='cuda')
def compress(input_array, temp_array, n, start_index, len_row):
    for i in range(len_row):
        if(input_array[i] == 1):
            current_index = start_index+i
            dcn_reached = False
            while dcn_reached == False:
                temp_array[current_index] = 1
                if(temp_array[parent(current_index)] != 1):
                    temp_array[parent(current_index)] = 1
                    temp_array[sibling(current_index)] = 0
                    current_index = parent(current_index)
                else:
                    dcn_reached = True
    return temp_array


list1.append(compress(matrix, temp_array, n, start_index, len_row))

If this worked, I would expect a list of temp_array values. I would then have to post-process temp_array to get the final array, because np.where(temp_array != -1) does not work on the GPU.
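Since `np.where` cannot be used inside a CUDA kernel, one common pattern is to let the kernel fill a fixed-size 2-D buffer (one row per graph row, -1 for untouched slots), copy it back to the host (for a numba device array, with `copy_to_host()`), and compact it there with NumPy. A minimal host-side sketch, assuming the results are already in a NumPy int8 array `out` of shape `(rows, n)`; `compact_rows` and `get_row` are hypothetical names. Keeping one flat array plus row offsets, instead of a list of many small arrays, is also usually cheaper to pickle.

```python
import numpy as np


def compact_rows(out):
    # out: 2-D int8 array copied back from the device; -1 marks untouched slots.
    mask = out != -1
    values = out[mask]  # row-major order, so each row's values stay contiguous
    offsets = np.zeros(out.shape[0] + 1, dtype=np.int64)
    np.cumsum(mask.sum(axis=1), dtype=np.int64, out=offsets[1:])
    return values, offsets


def get_row(values, offsets, i):
    # Compressed output of graph row i (same as out[i][out[i] != -1]).
    return values[offsets[i]:offsets[i + 1]]
```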

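Here is the main part of the non-parallelized code (full code here: https://pastebin.com/BNvVUzV3). Note that for the code to work you may need to change sep=' ' in the pd.read_csv call to match your file: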

# Load edgelist into a 2d array called matrix
list1 = []

# Two functions parent and sibling to return the index of parent and sibling node in a binary tree

start = time.time()
len_row = matrix.shape[0]
# Find the height of the binary tree and use it to find n i.e the no.of elements in the array
height = int(math.log2(len_row)) + 1
n = (2 ** height) - 1

# Loop over all the elements (rows) of the matrix to generate the compressed format of each row
for i in range(len_row):
    print("Element "+str(i))
    # input_array = i'th row of the matrix
    input_array = matrix[i]
    # Initialize temp array with -1 values
    temp_array = np.full(n, -1, dtype=np.int8)
    start_index = n-len(input_array)
    # Use j for the column index so it does not shadow the row index i
    for j in range(len_row):
        if(input_array[j] == 1):
            current_index = start_index+j
            dcn_reached = False
            while dcn_reached == False:
                temp_array[current_index] = 1
                if(temp_array[parent(current_index)] != 1):
                    temp_array[parent(current_index)] = 1
                    temp_array[sibling(current_index)] = 0
                    current_index = parent(current_index)
                else:
                    dcn_reached = True
    # Keep only the tree slots that were written (drop the -1 placeholders)
    kept = np.where(temp_array != -1)
    output = temp_array[kept]
    list1.append(output)

# Pass this list through bz2 compression and pickle dump it to a bin file
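For the last step in the comment above, the pickle can be written straight into a bz2 stream with the standard library, instead of calling `pickle.dumps` and `bz2.compress` separately. A minimal sketch; `save_compressed`, `load_compressed` and any file path are hypothetical.

```python
import bz2
import pickle


def save_compressed(list1, path):
    # Pickle the list of per-row arrays directly into a bz2-compressed file.
    with bz2.open(path, "wb") as f:
        pickle.dump(list1, f, protocol=pickle.HIGHEST_PROTOCOL)


def load_compressed(path):
    # Inverse of save_compressed: decompress and unpickle in one pass.
    with bz2.open(path, "rb") as f:
        return pickle.load(f)
```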

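I would like to achieve at least one of the following:

1. Get the numba code working
2. Suggestions for other approaches to parallelizing the code above
3. Suggestions for further optimization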
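One possible further optimization, sketched under assumptions: a dense adjacency matrix needs at least `len_row**2` bytes even as int8, and the original inner loop scans all `len_row` columns of every row although a sparse social graph has only a few 1s per row. Building compressed sparse row (CSR) arrays straight from the edge list avoids both. This assumes the SNAP text format (`#` comment lines, then one pair of integer node IDs per line) and that node IDs already run from 0 to N-1; otherwise they would need remapping first, e.g. with `np.unique(..., return_inverse=True)`. Function names are hypothetical.

```python
import numpy as np


def load_edges(path):
    # Two integer columns (source, target); lines starting with '#' are skipped.
    edges = np.loadtxt(path, comments="#", dtype=np.int64, ndmin=2)
    return edges[:, 0], edges[:, 1]


def edges_to_csr(src, dst, num_nodes):
    # Sort by source, then target, so each row's columns come out ascending,
    # which is the order in which the original loop visits them.
    order = np.lexsort((dst, src))
    indices = dst[order]
    indptr = np.zeros(num_nodes + 1, dtype=np.int64)
    np.cumsum(np.bincount(src, minlength=num_nodes), out=indptr[1:])
    return indptr, indices


def row_columns(indptr, indices, i):
    # Columns holding a 1 in row i of the (never materialized) dense matrix.
    return indices[indptr[i]:indptr[i + 1]]
```

With contiguous IDs, `num_nodes` could be taken as `int(max(src.max(), dst.max())) + 1`, and each `row_columns(...)` result can feed the per-row tree walk directly.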
