Question

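I'm having a strange problem on Linux when using wait conditions and mutexes from pthreads to synchronize between processes. Note that this is between processes, not just between threads within one process.

My use case is that a producer creates resources (images in my case), saves them to a shared memory area, updates some information about the resources, and then signals a waiting consumer. The shared memory and metadata parts work fine, so I'll leave them out; the problem is that the signaling does not work reliably. The use case is simple: it doesn't matter if the consumer misses an image or two, because the producer basically just overwrites an old image if the consumer has not had time to read it yet. So the wait condition only needs to handle waking up the consumer; I don't need any resource counting or other data.

Both the producer and the consumer have a struct like this: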

struct EventData {
    pthread_mutex_t mutexHandle;
    pthread_cond_t  conditionHandle;
};

A thread in the consumer process sits and waits for something to happen:

pthread_mutex_lock( &eventData->mutexHandle );
pthread_cond_wait( &eventData->conditionHandle, &eventData->mutexHandle );
pthread_mutex_unlock( &eventData->mutexHandle );

The producer process does this once it has created an image, saved it to shared memory, and is ready for the consumer to grab it:

pthread_mutex_lock( &eventData->mutexHandle );
pthread_cond_signal( &eventData->conditionHandle );

// also tried:
//pthread_cond_broadcast( &eventData->conditionHandle );
pthread_mutex_unlock( &eventData->mutexHandle );

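This looks perfectly fine to me, and it works to some extent. The producer can signal the consumer about 100-1000 times without any problem; the consumer wakes up, grabs the image and displays it, and as a result I can see moving video. At some point, usually after a few hundred frames, the consumer freezes in pthread_cond_wait() and never returns. The producer still happily creates images, calls pthread_cond_signal() and carries on without problems. The consumer is not completely frozen: only the thread that calls pthread_cond_wait() is stuck, and the rest of the application runs without problems.

So something causes the signal to get lost on its way from a thread in one process to a thread in another process. It usually takes 5-20 seconds before the consumer freezes, and the number of successful wakeups also varies between 100 and 1000 (based on the values seen so far).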
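
For reference, POSIX allows pthread_cond_wait() to return spuriously, and a signal sent while no thread is waiting is not queued, so a condition variable is normally paired with a predicate that is checked in a loop. The following is a minimal sketch of that pattern, not the code from this question: the FrameEvent struct, its frameCounter field and the two helper functions are hypothetical names introduced for illustration, and the struct is assumed to live in the shared mapping and to have been initialized with process-shared attributes.

```c
#include <pthread.h>
#include <stdint.h>

/* Hypothetical variant of EventData with a predicate to wait on. */
typedef struct FrameEvent {
    pthread_mutex_t mutexHandle;
    pthread_cond_t  conditionHandle;
    uint64_t        frameCounter;   /* incremented once per published image */
} FrameEvent;

/* Producer side: bump the counter under the lock, then wake the consumer. */
void publishFrame(FrameEvent *ev)
{
    pthread_mutex_lock(&ev->mutexHandle);
    ev->frameCounter++;
    pthread_cond_signal(&ev->conditionHandle);
    pthread_mutex_unlock(&ev->mutexHandle);
}

/* Consumer side: block until the counter differs from the last value seen.
 * The loop re-checks the predicate after every wakeup, so spurious wakeups
 * are harmless and a frame published before the call is not missed. */
uint64_t waitForNewFrame(FrameEvent *ev, uint64_t lastSeen)
{
    pthread_mutex_lock(&ev->mutexHandle);
    while (ev->frameCounter == lastSeen)
        pthread_cond_wait(&ev->conditionHandle, &ev->mutexHandle);
    uint64_t current = ev->frameCounter;
    pthread_mutex_unlock(&ev->mutexHandle);
    return current;
}
```

The consumer passes in the last counter value it saw, so a frame published while it was still busy displaying the previous image is noticed immediately instead of only at the next signal.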

Since mutexes and wait conditions cannot be shared between processes by default, I use this setup to create the primitives:

    EventData * eventData;

    int fd = open( tmpnam(NULL), O_RDWR | O_CREAT | O_EXCL, 0666);
    if (fd < 0) {
        // failed to open file for event
    }

    if ( ftruncate(fd, sizeof (eventData )) < 0 ) {
        // failed to truncate file
    }

    // setup attributes to allow sharing between processes
    pthread_condattr_t  conditionAttribute;
    pthread_mutexattr_t mutexAttribute;
    pthread_condattr_init( &conditionAttribute );
    pthread_condattr_setpshared( &conditionAttribute, PTHREAD_PROCESS_SHARED );
    pthread_mutexattr_init( &mutexAttribute );
    pthread_mutexattr_setpshared( &mutexAttribute, PTHREAD_PROCESS_SHARED );

    // map memory for the event struct
    eventData = (EventData *) mmap(NULL, sizeof(EventData), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close (fd);

    // finally initialize the memory
    pthread_mutex_init( &eventData->mutexHandle, &mutexAttribute );
    pthread_cond_init( &eventData->conditionHandle, &conditionAttribute );

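The above is done by the side that creates the mutex and wait condition. The name of the file, i.e. the result of tmpnam(NULL), is in reality saved and passed to the other process, which opens it: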

    int fd = open( nameOfEventFile, O_RDWR, 0666 );
    if (fd < 0) {
        // failed to open file for event
    }

    eventData = (EventData *) mmap( NULL, sizeof(EventData), PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0 );
    close( fd );

I can't see any mistake here and would like some hints about what could be going wrong, especially since it works for a random amount of time.

Answer 1

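Once I had written 95% of the question, the bug caught my eye... I still decided to post it here together with the fix, in case someone else stumbles on something similar. The part that creates the mutex and wait condition looks like this: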

EventData * eventData;

int fd = open( tmpnam(NULL), O_RDWR | O_CREAT | O_EXCL, 0666);
if (fd < 0) {
    // failed to open file for event
}

if ( ftruncate(fd, sizeof (eventData )) < 0 ) {
    // failed to truncate file
}

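If you look closely, you'll see that ftruncate() truncates the file to the size of the eventData pointer, not to the size of struct EventData. So the one-character fix needed here is: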

if ( ftruncate(fd, sizeof (EventData )) < 0 ) {
    // failed to truncate file
}

A stupid mistake indeed.

Thread waiting on a condition is not woken up in a multi-process setup