Question

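I have a number-crunching application written in C. It is essentially a main loop that, for each value of an increasing index "i", calls a function that performs some calculations. I have read about multithreading and I am considering learning a bit about it in C. I wonder whether general code like mine could somehow be multithreaded automatically, and how.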

Thanks

P.S. To give an idea of my code, let's say it looks something like this:

main(...)
for(i=0;i<=ntimes;i++)get_result(x[i],y[i],&result[i]);

...

void get_result(float x,float y,float *result){
  /* result is passed by pointer so that the caller sees the computed value */
  *result=sqrt(log (x) + log (y) + cos (exp (x + y)));
  /* (and some more similar mathematical operations) */
}

Answer 1

If the task is highly parallelizable and your compiler is modern, you could try OpenMP. http://en.wikipedia.org/wiki/OpenMP
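As a rough sketch of what that could look like for the loop in the question, assuming the iterations are independent of each other: a single pragma in front of the loop is enough. The helper name compute_all and the cheap formula inside get_result are placeholders of mine, not the asker's real code.

```c
#include <assert.h>
#include <stddef.h>

/* Stand-in for the question's get_result(); the formula is a placeholder. */
static float get_result(float x, float y)
{
    return x * y + x;
}

/* Hypothetical helper: fills result[0..n-1], spreading iterations over threads.
 * Built with OpenMP enabled (e.g. gcc -fopenmp) the loop runs in parallel;
 * without it the pragma is ignored and the loop simply runs serially. */
void compute_all(const float *x, const float *y, float *result, int n)
{
    assert(x != NULL && y != NULL && result != NULL && n >= 0);

    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        result[i] = get_result(x[i], y[i]);   /* each i writes its own slot */
}
```

Because every iteration only writes result[i], no locking is needed here; a loop that accumulates into one shared variable would need a reduction clause or similar.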

Answer 2

An alternative way to multithread your code is to use pthreads (which gives more precise control than OpenMP).

Assuming x, y & result are global arrays,

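#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>

...

void *get_result(void *param)  // param carries the index i of the element to process
{
    long i = (long)(intptr_t)param;
...
    return NULL;
}

int main()
{
...
pthread_t *tid = malloc( ntimes * sizeof(pthread_t) );

// one thread per element; each thread is told which index it owns
for( i=0; i<ntimes; i++ ) 
    pthread_create( &tid[i], NULL, get_result, (void *)(intptr_t)i );

... // do some tasks unrelated to result    

// wait for all threads before reading result
for( i=0; i<ntimes; i++ ) 
    pthread_join( tid[i], NULL );
free( tid );
...
}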

(Compile the code with gcc prog.c -lpthread)

Answer 3

You should have a look at OpenMP. The C/C++ example on this page is similar to your code: https://computing.llnl.gov/tutorials/openMP/#SECTIONS

#include <omp.h>
#define N     1000

int main(void)
{

int i;
float a[N], b[N], c[N], d[N];

/* Some initializations */
for (i=0; i < N; i++) {
  a[i] = i * 1.5;
  b[i] = i + 22.35;
  }

#pragma omp parallel shared(a,b,c,d) private(i)
  {

  #pragma omp sections nowait
    {

    #pragma omp section
    for (i=0; i < N; i++)
      c[i] = a[i] + b[i];

    #pragma omp section
    for (i=0; i < N; i++)
      d[i] = a[i] * b[i];

    }  /* end of sections */

  }  /* end of parallel section */

}

If you don't want to use OpenMP, you can use pthreads or clone/wait directly.

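Whichever route you choose, you are just dividing your arrays into chunks that each thread will process. If all of your processing is purely computational (as your example function suggests), then you should have only as many threads as you have logical processors.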
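Here is a minimal sketch of that chunking idea with pthreads. The names struct chunk, worker and compute_chunked are made up for illustration, and get_result uses a placeholder formula rather than the math from the question.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* One contiguous slice of the arrays, handed to one thread. */
struct chunk {
    const float *x, *y;
    float *result;
    size_t begin, end;              /* half-open range [begin, end) */
};

/* Stand-in for the per-element work; the formula is a placeholder. */
static float get_result(float x, float y) { return x * y + x; }

static void *worker(void *arg)
{
    struct chunk *c = arg;
    for (size_t i = c->begin; i < c->end; i++)
        c->result[i] = get_result(c->x[i], c->y[i]);
    return NULL;
}

/* Hypothetical helper: splits n elements into nthreads chunks.
 * Returns 0 on success, -1 if memory or a thread could not be obtained. */
int compute_chunked(const float *x, const float *y, float *result,
                    size_t n, size_t nthreads)
{
    assert(nthreads > 0);
    pthread_t *tid = malloc(nthreads * sizeof *tid);
    struct chunk *chunks = malloc(nthreads * sizeof *chunks);
    if (!tid || !chunks) { free(tid); free(chunks); return -1; }

    size_t started = 0;
    int status = 0;
    for (size_t t = 0; t < nthreads; t++) {
        /* Boundaries t*n/nthreads spread any remainder over the chunks. */
        chunks[t] = (struct chunk){ x, y, result,
                                    t * n / nthreads, (t + 1) * n / nthreads };
        if (pthread_create(&tid[t], NULL, worker, &chunks[t]) != 0) {
            status = -1;
            break;
        }
        started++;
    }
    /* Join only the threads that were actually started. */
    for (size_t t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    free(tid);
    free(chunks);
    return status;
}
```

On Linux and macOS the thread count could be taken from sysconf(_SC_NPROCESSORS_ONLN), which is a common extension rather than something POSIX guarantees.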

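Adding threads for parallel processing has some overhead, so make sure you give each thread enough work to make up for it. Usually you will, but if each thread ends up with only one computation to do, and the computation is not that hard, you may actually slow things down. If that is the case, you can always use fewer threads than you have processors.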

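If you do have some IO going on in your processing, then you may find that having more threads than processors is a win, because while one thread is blocked waiting for some IO to complete, another thread can be doing its computations. You have to be careful when several threads do IO on the same file, though.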

Answer 4

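If you are hoping to provide concurrency for a single loop in some kind of scientific computing or similar, OpenMP, as @Novikov says, really is your best bet; this is what it was designed for.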

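If you are looking to learn the more classical approach that you would typically see in an application written in C, then on POSIX you want pthread_create() and friends. I am not sure what your background with concurrency in other languages is, but before going too deep into that, you will want to know your synchronization primitives (mutexes, semaphores, etc.) quite well, and understand when you need to use them. That topic could be a whole book or a set of SO questions in itself.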

Answer 5

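Depending on the OS, you could use POSIX threads. You could instead implement stackless multithreading using state machines. There is a really good book by Keith E. Curtis called "Embedded Multitasking". It is just a neatly crafted set of switch-case statements. It works great; I have used it on everything from Apple Macs and Rabbit Semiconductor to AVR and PC.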

Vali

Answer 6

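A nice exercise for learning concurrent programming in any language is to work on a thread pool implementation. In this pattern you create some threads in advance. Those threads are treated as a resource. A thread pool object/structure is used to assign user-defined tasks to those threads for execution. When a task is finished, you can collect its results. You can use the thread pool as a general-purpose design pattern for concurrency. The main idea could look similar to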

#define number_of_threads_to_be_created 42
// create some user defined tasks
Tasks_list_t* task_list_elem = CreateTasks();
// Create the thread pool with 42 threads
Thpool_handle_t* pool = Create_pool(number_of_threads_to_be_created);

// populate the thread pool with tasks
for ( ; task_list_elem; task_list_elem = task_list_elem->next) {
   add_a_task_to_thpool (task_list_elem, pool);
}
// kick start the thread pool
thpool_run (pool);

// Now decide on the mechanism for collecting the results from tasks list.
// Some of the candidates are:
// 1. sleep till all is done (naive)
// 2. poll the tasks in the list for some state variable describing that the task has
//    finished. This can work quite well in some situations
// 3. Implement signal/callback mechanism that a task can use to signal that it has 
//    finished executing.

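The mechanism for collecting data from the tasks and the number of threads used in the pool should be chosen to reflect your requirements and the capabilities of the hardware and runtime environment.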
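Also note that this pattern says nothing about how you should "synchronize" your tasks with each other or with the outside environment. Error handling can also be a bit tricky (for example: what to do when one task fails). Those two aspects need to be thought through in advance; they can restrict the usage of the thread pool pattern.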
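A very reduced sketch of such a pool with pthreads might look like this. It assumes a fixed task array that is known up front, and it collects results the simple way, by joining the workers. All names (struct task, struct pool, pool_run_all, square_in_place) are hypothetical and unrelated to the pseudo-API above.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

/* A task is a function plus its argument. */
struct task {
    void (*run)(void *arg);
    void *arg;
};

/* Shared state: the task list and the index of the next unclaimed task. */
struct pool {
    struct task *tasks;
    size_t count;
    size_t next;
    pthread_mutex_t lock;
};

/* Example task for illustration: squares the long that arg points to. */
void square_in_place(void *arg)
{
    long *v = arg;
    *v *= *v;
}

static void *pool_worker(void *p)
{
    struct pool *pool = p;
    for (;;) {
        /* Claim one task index under the lock, then run it unlocked. */
        pthread_mutex_lock(&pool->lock);
        size_t i = pool->next;
        if (i < pool->count)
            pool->next++;
        pthread_mutex_unlock(&pool->lock);
        if (i >= pool->count)
            return NULL;            /* nothing left to do */
        pool->tasks[i].run(pool->tasks[i].arg);
    }
}

/* Hypothetical helper: runs all tasks on up to nthreads workers and waits.
 * Returns 0 on success, -1 if no worker could be started. */
int pool_run_all(struct task *tasks, size_t count, size_t nthreads)
{
    assert(nthreads > 0);
    struct pool pool = { .tasks = tasks, .count = count, .next = 0 };
    pthread_t *tid = malloc(nthreads * sizeof *tid);
    if (!tid)
        return -1;
    if (pthread_mutex_init(&pool.lock, NULL) != 0) {
        free(tid);
        return -1;
    }

    size_t started = 0;
    for (size_t t = 0; t < nthreads; t++)
        if (pthread_create(&tid[started], NULL, pool_worker, &pool) == 0)
            started++;
    /* Joining is the simplest way to know that every task has finished. */
    for (size_t t = 0; t < started; t++)
        pthread_join(tid[t], NULL);

    pthread_mutex_destroy(&pool.lock);
    free(tid);
    return started > 0 ? 0 : -1;
}
```

A longer-lived pool would keep its workers alive and let them sleep on a condition variable while the queue is empty; that is left out here to keep the sketch short.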

About thread pools:
http://en.wikipedia.org/wiki/Thread_pool_pattern
http://docs.oracle.com/cd/E19253-01/816-5137/ggedn/index.html

Some good literature for getting started with pthreads:
http://www.advancedlinuxprogramming.com/alp-folder/alp-ch04-threads.pdf

Answer 7

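Intel's C++ compiler is actually capable of automatically parallelizing your code. It is just a compiler switch you need to enable. It does not work as well as OpenMP, though (i.e. it does not always succeed, or the resulting program is slower). From Intel's website: "Auto-parallelization, which is triggered by the -parallel (Linux* OS and Mac OS* X) or /Qparallel (Windows* OS) option, automatically identifies those loop structures that contain parallelism. During compilation, the compiler automatically attempts to deconstruct the code sequences into separate threads for parallel processing. No other effort by the programmer is needed."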

Answer 8

To specifically address the "automatically multithreaded" part of the OP's question:

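One really interesting view of how to program parallelism was designed into a language called Cilk Plus, invented at MIT and now owned by Intel. To quote Wikipedia, the idea is that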

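"the programmer should be responsible for exposing the parallelism, identifying elements that can safely be executed in parallel; it should then be left to the run-time environment, particularly the scheduler, to decide during execution how to actually divide the work between processors."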

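Cilk Plus is a superset of standard C++. It just contains a few extra keywords (_Cilk_spawn, _Cilk_sync, and _Cilk_for) that allow the programmer to tag parts of their program as parallelizable. The programmer does not mandate that any code be run on a new thread; they just allow the lightweight runtime scheduler to spawn a new thread if and only if that is actually the right thing to do under the particular runtime conditions.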

To use Cilk Plus, just add its keywords to your code and build with Intel's C++ compiler.

Answer 9

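Your code is not automatically multithreaded by the compiler, if that was your question. Please note that the C standard itself (at least before C11) knows nothing about multithreading, since whether you can use multithreading does not depend on the language you use for coding, but on the destination platform you are coding for. Code written in C can run on pretty much anything for which a C compiler exists. A C compiler even exists for the C64 computer (almost completely ISO C99 conforming); however, to support multiple threads, the platform must have an operating system that supports this, and usually this means that at least certain CPU functionality must be present. An operating system can do multithreading almost exclusively in software; this will be awfully slow and there won't be memory protection, but it is possible. Even in that case, though, you need at least programmable interrupts.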

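So how to write multithreaded C code depends entirely on the operating system of your target platform. There are POSIX-conforming systems (OS X, FreeBSD, Linux, etc.) and systems that have their own library for that (Windows). Some systems have more than one library for it (e.g. OS X has the POSIX library, but there is also the Carbon Thread Manager that you can use in C, though I think it is rather legacy nowadays).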

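Of course there are cross-platform thread libraries, and some modern compilers support things like OpenMP, where the compiler automatically builds code to create threads on your chosen target platform; but not many compilers support it, and those that do are usually not feature complete. Usually you get the widest system support by using POSIX threads, more often called "pthreads". The only major platform that does not support them is Windows, and there you can use free third-party libraries like this one. Several other ports exist as well (Cygwin has one for sure). If you will have a UI of some kind one day, you may want to use a cross-platform library like wxWidgets or SDL, both of which offer consistent multithreading support on all supported platforms.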

Answer 10

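If an iteration of the loop is independent of the ones before it, then there is a very simple approach: try multiprocessing rather than multithreading.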

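Say you have 2 cores and ntimes is 100; then 100/2 = 50, so create 2 versions of the program, where the first iterates from 0 to 49 and the other from 50 to 99. Run them both at the same time, and your cores should be kept quite busy.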

This is a very simplistic approach, but you don't have to mess with thread creation, synchronization, etc.

Answer 11

You can use pthreads to do multithreading in C. Here is a simple example based on pthreads.

#include<pthread.h>
#include<stdio.h>

void *mythread1(void *arg);  //thread prototype
void *mythread2(void *arg);

int main(void){
    pthread_t thread[2];
    //starting the threads
    pthread_create(&thread[0],NULL,mythread1,NULL);
    pthread_create(&thread[1],NULL,mythread2,NULL);
    //waiting for completion
    pthread_join(thread[0],NULL);
    pthread_join(thread[1],NULL);

    return 0;
}

//thread definition
void *mythread1(void *arg){
    (void)arg;  //unused
    int i;
    for(i=0;i<5;i++)
        printf("Thread 1 Running\n");
    return NULL;
}
void *mythread2(void *arg){
    (void)arg;  //unused
    int i;
    for(i=0;i<5;i++)
        printf("Thread 2 Running\n");
    return NULL;
}

Reference: C program to implement Multithreading-Multithreading in C

Answer 12

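I think all the answers lack a concrete example that runs threads across different functions, passes parameters, and includes some benchmarks: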

// NB:  gcc -O3 pthread.c -lpthread && time ./a.out

#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include <string.h>

#include <stdbool.h>

typedef struct my_ptr {
    long n;
    long i;
}   t_my_ptr;

void *sum_primes(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    if (my_ptr->n < 0 ) // handle misuse of the function
        return (void *)-1;
    bool isPrime[my_ptr->i + 1];
    memset(isPrime, true, my_ptr->i + 1);

    if (my_ptr->n >= 2) { // only one even number can be prime: 2
        my_ptr->n += 2;
    }
    for (long i = 3; i <= my_ptr->i ; i+=2) { // after what only odd numbers can be prime numbers
        if (isPrime[i]) {
            my_ptr->n += i;
        }
        for (long j = i * i; j <= my_ptr->i; j+=i*2) // Eratosthenes' Algo, sieve all multiples of current prime, skipping even numbers.
            isPrime[j] = false;
    }
    //printf("%s: %ld\n", __func__, my_ptr->n); // a) if both 'a' and 'b' activated you will notice that both functions are computed asynchronously.
    return NULL; // a thread function must return a value
}

void *sum_square(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += (my_ptr->i * my_ptr->i) >> 3;
    //printf("%s: %ld\n", __func__, my_ptr->n); // b) if both 'a' and 'b' activated you will notice that both functions are computed asynchronously.
    return NULL;
}

void *sum_add_modulo_three(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i % 3;
    return NULL;
}

void *sum_add_modulo_thirteen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i % 13;
    return NULL;
}

void *sum_add_twice(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i + my_ptr->i;
    return NULL;
}

void *sum_times_five(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 5;
    return NULL;
}

void *sum_times_thirteen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 13;
    return NULL;
}

void *sum_times_seventeen(void *ptr) {
    t_my_ptr *my_ptr = ptr;
    my_ptr->n += my_ptr->i * 17;
    return NULL;
}

#define THREADS_NB 8

int main(void)
{
    pthread_t thread[THREADS_NB];
    void *(*fptr[THREADS_NB]) (void *ptr) =  {sum_primes, sum_square,sum_add_modulo_three, \
    sum_add_modulo_thirteen, sum_add_twice, sum_times_five, sum_times_thirteen, sum_times_seventeen};
    t_my_ptr arg[THREADS_NB];
    memset(arg, 0, sizeof(arg));
    long  iret[THREADS_NB];

    for (volatile long i = 0; i < 100000; i++) {
        //print_sum_primes(&prime_arg);
        //print_sum_square(&square_arg);
        for (int j = 0; j < THREADS_NB; j++) {
            arg[j].i = i;
            //fptr[j](&arg[j]);
            pthread_create( &thread[j], NULL, fptr[j], &arg[j]); // https://man7.org/linux/man-pages/man3/pthread_create.3.html
        }

        // Wait till threads are complete before main continues. Unless we
        // wait we run the risk of executing an exit which will terminate
        // the process and all threads before the threads have completed.
        for (int j = 0; j < THREADS_NB; j++)
            pthread_join(thread[j], NULL);

        //printf("Thread 1 returns: %ld\n",iret1); // if we care about the return value
    }
    for (int j = 0; j < THREADS_NB; j++)
        printf("Function %d: %ld\n", j, arg[j].n);

    return 0;
}

Output:

Function 0: 15616893616113
Function 1: 41666041650000
Function 2: 99999
Function 3: 599982
Function 4: 9999900000
Function 5: 24999750000
Function 6: 64999350000
Function 7: 84999150000

Results (using 8 threads)

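Without pthreads but with the optimization flag -O3: 9.2 s
With pthreads and no optimization flag: 31.4 s
With pthreads and the optimization flag -O3: 17.8 s
With pthreads and the optimization flag -O3, but without pthread_join: 2.0 s. However, this does not compute the correct output, because different threads access my_ptr->i at the same time (main overwrites it while the unjoined threads are still running).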

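Why is multithreading slower here? Simply because starting a thread costs a lot of cycles, so you have to make sure your functions do a fair amount of work. This first benchmark is somewhat biased, because the different functions are trivial to compute.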

Results (using 8 threads), with the body of every function replaced by that of sum_primes (to measure the benefit with heavier computations)

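Without pthreads but with auto-vectorization (-O3): 1 min 14.4 s
With pthreads and no optimization flag: 2 min 18.6 s
With pthreads and auto-vectorization (-O3): 54.7 s
With pthreads and auto-vectorization, but without pthread_join: 2.8 s. However, this does not compute the correct output, because different threads access my_ptr->i at the same time (main overwrites it while the unjoined threads are still running).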

Output:

Function 0: 15616893616113
Function 1: 15616893616113
Function 2: 15616893616113
Function 3: 15616893616113
Function 4: 15616893616113
Function 5: 15616893616113
Function 6: 15616893616113
Function 7: 15616893616113

This is more representative of the real power of multithreading!

Final words

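So unless you are multithreading functions with complex computations, or you do not need to join the threads, it is probably not worth it, because of the cost of starting and joining threads. But again, benchmark it!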

Note that auto-vectorization (enabled by -O3) gave a significant improvement in every run here, since using SIMD carries no thread start-up cost.

NB2: You can use iret[j] = pthread_create(...) to store the return code of pthread_create, which is 0 on success.
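As a small sketch of that point, the following checks the return code of pthread_create and also picks up the value a thread function returns, through the second argument of pthread_join. The names twice and run_twice are made up for illustration, and passing a small integer through the void * argument via intptr_t is a common idiom rather than something the C standard guarantees on every platform.

```c
#include <assert.h>
#include <pthread.h>
#include <stdint.h>

/* Example thread function: returns twice its input through the exit value. */
static void *twice(void *arg)
{
    intptr_t v = (intptr_t)arg;
    return (void *)(v * 2);
}

/* Hypothetical helper: runs twice() in a thread.
 * Returns 0 and stores the answer in *out on success; otherwise returns the
 * error number reported by pthread_create or pthread_join. */
int run_twice(long input, long *out)
{
    assert(out != NULL);
    pthread_t thread;
    void *retval = NULL;

    int rc = pthread_create(&thread, NULL, twice, (void *)(intptr_t)input);
    if (rc != 0)
        return rc;                        /* the thread never started */

    rc = pthread_join(thread, &retval);   /* retval receives twice()'s return */
    if (rc != 0)
        return rc;

    *out = (long)(intptr_t)retval;
    return 0;
}
```

Note that pthread functions report errors through their return value and do not set errno.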

Answer 13

C11 threads in glibc 2.28

Tested on Ubuntu 18.04 (glibc 2.27) by compiling glibc from source: Multiple glibc libraries on a single host

Example from: https://en.cppreference.com/w/c/language/atomic

#include <stdio.h>
#include <threads.h>
#include <stdatomic.h>

atomic_int acnt;
int cnt;

int f(void* thr_data)
{
    for(int n = 0; n < 1000; ++n) {
        ++cnt;
        ++acnt;
        // for this example, relaxed memory order is sufficient, e.g.
        // atomic_fetch_add_explicit(&acnt, 1, memory_order_relaxed);
    }
    return 0;
}

int main(void)
{
    thrd_t thr[10];
    for(int n = 0; n < 10; ++n)
        thrd_create(&thr[n], f, NULL);
    for(int n = 0; n < 10; ++n)
        thrd_join(thr[n], NULL);

    printf("The atomic counter is %d\n", acnt);
    printf("The non-atomic counter is %d\n", cnt);
}

Compile and run:

gcc -std=c11 main.c -pthread
./a.out

Possible output:

The atomic counter is 10000
The non-atomic counter is 8644

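Because the non-atomic variable is accessed from several threads without synchronization, the non-atomic counter is very likely to end up smaller than the atomic counter.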

TODO: disassemble and see what ++acnt; compiles to.

POSIX threads

#define _XOPEN_SOURCE 700
#include <assert.h>
#include <stdlib.h>
#include <pthread.h>

enum CONSTANTS {
    NUM_THREADS = 1000,
    NUM_ITERS = 1000
};

int global = 0;
int fail = 0;
pthread_mutex_t main_thread_mutex = PTHREAD_MUTEX_INITIALIZER;

void* main_thread(void *arg) {
    int i;
    for (i = 0; i < NUM_ITERS; ++i) {
        if (!fail)
            pthread_mutex_lock(&main_thread_mutex);
        global++;
        if (!fail)
            pthread_mutex_unlock(&main_thread_mutex);
    }
    return NULL;
}

int main(int argc, char **argv) {
    pthread_t threads[NUM_THREADS];
    int i;
    fail = argc > 1;
    for (i = 0; i < NUM_THREADS; ++i)
        pthread_create(&threads[i], NULL, main_thread, NULL);
    for (i = 0; i < NUM_THREADS; ++i)
        pthread_join(threads[i], NULL);
    assert(global == NUM_THREADS * NUM_ITERS);
    return EXIT_SUCCESS;
}

Compile and run:

gcc -std=c99 pthread_mutex.c -pthread
./a.out
./a.out 1

The first run works fine; the second fails because of the missing synchronization.

Tested on Ubuntu 18.04. GitHub upstream.

How to "multithread" C code
