Question

I am converting a regular C program to CUDA and wanted to implement a simple wrapper for malloc that just allocates from a large pool.

I have 5000 threads. My block size is 1024.

Here is the buffer structure I am using to keep track of each thread's memory pool.

typedef struct buffer_t
{
    unsigned long size;
    char* current_index;
    char pool[];
} buffer_t;

As you can imagine I use:

cudaMalloc(&memptr, 262144*5000);

to do the allocation, where each thread is supposed to create a buffer in its own 262144 bytes.

Here are the functions I am using to do the allocations:

__device__ buffer_t* buffer_constructor(size_t size, void* memptr)
{
    buffer_t* buffer = (buffer_t*)memptr;
    buffer->size = size - sizeof(unsigned long) - sizeof(char*);
    buffer->current_index = buffer->pool;
    return buffer;
}
__device__ void* buffer_malloc(buffer_t* buffer, size_t size)
{
    if(size > buffer->size - (buffer->current_index - buffer->pool))
    {
        return NULL;
    }

    void* ptr = buffer->current_index;
    buffer->current_index += size;
    return ptr;
}

Each thread calls:

buffer_t* buffer = buffer_constructor(size, memptr+(tid * size));

When I run the code, the kernel just returns at some point. When I run it under the debugger, I get this error:

Program received signal CUDA_EXCEPTION_6, Warp Misaligned Address.
[Switching focus to CUDA kernel 0, grid 1, block (2,0,0), thread (768,0,0), device 0, sm 10, warp 24, lane 0]
0x0000000000b48428 in device_matrix_list_constructor (buffer=<optimized out>, num=<optimized out>)
    at device_matrix_list.cu:8
8               return list;

When I run memcheck, I get a few of these errors across a couple of blocks:

Invalid __global__ write of size 8
=========     at 0x00000258 in /home/crafton.b/cuda_nn/device_matrix_list.cu:7:device_matrix_list_constructor(buffer_t*, unsigned int)
=========     by thread (897,0,0) in block (4,0,0)
=========     Address 0x235202a0fc is misaligned
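For reference, the reported address ends in 0xfc, so it is a multiple of 4 but not of 8, while the failing write is 8 bytes wide. The arithmetic can be checked with a short sketch. The helper is_aligned and the constant kReported are hypothetical names introduced only for this illustration.

```cpp
#include <cstddef>
#include <cstdint>

// Lets the same sketch compile with nvcc or with a plain host C++ compiler.
#ifndef __CUDACC__
#define __host__
#define __device__
#endif

// True when `address` is a multiple of `alignment`.
__host__ __device__ inline bool is_aligned(std::uint64_t address, std::size_t alignment) {
    return address % alignment == 0;
}

// Address copied from the memcheck report above.
constexpr std::uint64_t kReported = 0x235202a0fcULL;
```

Here is_aligned(kReported, 4) is true and is_aligned(kReported, 8) is false, which would be consistent with a pointer that was advanced by a multiple of 4 bytes and then used for an 8-byte store.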

Any help is really appreciated; I have been struggling with this for a while now.

CUDA problems using a shared buffer for simulated memory allocation
