Question

Recently, I stumbled upon an interview question that asks you to write code optimized for ARM, and for the iPhone in particular:

Write a function that takes an array of char (ASCII symbols) and finds the most frequent character.

char mostFrequentCharacter(char* str, int size)

The function should be optimized to run on dual-core ARM-based processors, with an infinite amount of memory.

On the face of it, the problem looks pretty simple, and here is the straightforward implementation that came to mind:

#define RESULT_SIZE 127

inline int set_char(char c, int result[])
{
    int count = result[c];
    result[c] = ++count;
    return count;
}

char mostFrequentChar(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char current_char;
    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(size_t i = 0; i<size; i++)
    {
        current_char = str[i];
        current_char_frequency = set_char(current_char, result);

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = current_char;
        }
    }

    return frequent_char;
}

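First, I did some basic code optimization: I moved the code that computes the most frequent char on every iteration into a separate for loop, and got a significant speedup. Instead of evaluating the following block of code size times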

if(current_char_frequency >= char_frequency)
{
    char_frequency = current_char_frequency;
    frequent_char = current_char;
}

we can find the most frequent char in O(RESULT_SIZE), where RESULT_SIZE == 127.

char mostFrequentCharOpt1(char str[], int size)
{
    int result[RESULT_SIZE] = {0};

    char frequent_char = '\0';

    int current_char_frequency = 0;
    int char_frequency = 0;

    for(int i = 0; i<size; i++)
    {
        set_char(str[i], result);
    }

    for(int i = 0; i<RESULT_SIZE; i++)
    {
        current_char_frequency = result[i];

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    return frequent_char;
}

Benchmark: iPhone 5s

size = 1000000
iterations = 500

// seconds = 7.842381
char mostFrequentChar(char str[], int size)

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

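On average, mostFrequentCharOpt1 is ~24% faster than the basic implementation.

Type optimization

ARM core registers are 32 bits wide. Therefore, changing all local variables of type char to type int prevents the processor from executing extra instructions to adjust the size of the local variable after each assignment.

Note: ARM64 provides 31 registers (x0-x30), each 64 bits wide, which can also be accessed in a 32-bit form (w0-w30). Hence, nothing special is needed to operate on the int data type. infocenter.arm.com - ARMv8 Registers

While comparing the functions at the assembly level, I noticed a difference in how ARM handles the int type and the char type. ARM uses the LDRB instruction to load a byte and the STRB instruction to store a byte into a single byte of memory. So, from my point of view, LDRB is a bit slower than LDR, because LDRB zero-extends the value every time it accesses memory and loads it into a register. In other words, we cannot simply load a byte into a 32-bit register; the byte has to be widened to a word.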

Benchmark: iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.905090
char mostFrequentCharOpt1(char str[], int size)

// seconds = 5.874684
int mostFrequentCharOpt2(char str[], int size)

Changing the char type to int did not give me a significant speedup on the iPhone 5s; by contrast, running the same code on the iPhone 4 gave a different result:

Benchmark: iPhone 4

size = 1000000
iterations = 500

// seconds = 28.853877
char mostFrequentCharOpt1(char str[], int size)

// seconds = 27.328955
int mostFrequentCharOpt2(char str[], int size)

Loop optimization

Next, I did a loop optimization: instead of incrementing the value of i, I decrement it.

before    
for(int i = 0; i<size; i++) { ... }

after
for(int i = size; i--; ) { ... }

Again, comparing the assembly code, I could see a clear difference between the two approaches.

mostFrequentCharOpt2                                              |      mostFrequentCharOpt3
0x10001250c <+88>:  ldr    w8, [sp, #28] ; w8 = i                 |      0x100012694 <+92>:  ldr    w8, [sp, #28]                                             ; w8 = i
0x100012510 <+92>:  ldr    w9, [sp, #44] ; w9 = size              |      0x100012698 <+96>:  sub    w9, w8, #1 ; w9 = i - 1                                           
0x100012514 <+96>:  cmp    w8, w9 ; if i<size                     |      0x10001269c <+100>: str    w9, [sp, #28] ; save w9 to memory
0x100012518 <+100>: b.ge   0x100012548 ; if true => end loop      |      0x1000126a0 <+104>: cbz    w8, 0x1000126c4 ; compare w8 with 0 and if w8 == 0 => go to 0x1000126c4
0x10001251c <+104>: ... set_char start routine                    |      0x1000126a4 <+108>: ... set_char start routine
...                                                               |      ...
0x100012534 <+128>: ... set_char end routine                      |      0x1000126bc <+132>: ... set_char end routine
0x100012538 <+132>: ldr    w8, [sp, #28] ; w8 = i                 |      0x1000126c0 <+136>: b      0x100012694 ; back to the first line
0x10001253c <+136>: add    w8, w8, #1 ; i++                       |      0x1000126c4 <+140>: ...
0x100012540 <+140>: str    w8, [sp, #28] ; save i to $sp+28       |      
0x100012544 <+144>: b      0x10001250c ; back to the first line   |      
0x100012548 <+148>: str    ...                                    |

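Here, instead of loading size from memory and comparing it with the incrementing i variable, we simply decrement i by 0x1 and compare the register holding i with 0.

Benchmark: iPhone 5s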

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt2(char str[], int size) //Type optimization

// seconds = 5.577797
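char mostFrequentCharOpt3(char str[], int size) //Loop optimization

Thread optimization

Reading the question carefully gives us at least one more optimization. The line ..optimized to run on dual-core ARM-based processors ... is a hint to optimize the code using pthread or gcd.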

int mostFrequentCharThreadOpt(char str[], int size)
{
    int s;
    int tnum;
    int num_threads = THREAD_COUNT; //by default 2
    struct thread_info *tinfo;

    tinfo = calloc(num_threads, sizeof(struct thread_info));

    if (tinfo == NULL)
        exit(EXIT_FAILURE);

    int minCharCountPerThread = size/num_threads;
    int startIndex = 0;

    for (tnum = num_threads; tnum--;)
    {
        startIndex = minCharCountPerThread*tnum;

        tinfo[tnum].thread_num = tnum + 1;
        tinfo[tnum].startIndex = minCharCountPerThread*tnum;
        tinfo[tnum].str_size = (tnum == num_threads - 1) ? (size - minCharCountPerThread*tnum) : minCharCountPerThread; // last thread also takes the remainder
        tinfo[tnum].str = str;

        s = pthread_create(&tinfo[tnum].thread_id, NULL,
                           (void *(*)(void *))_mostFrequentChar, &tinfo[tnum]);
        if (s != 0)
            exit(EXIT_FAILURE);
    }

    int frequent_char = 0;
    int char_frequency = 0;
    int current_char_frequency = 0;

    for (tnum = num_threads; tnum--; )
    {
        s = pthread_join(tinfo[tnum].thread_id, NULL);
    }

    for(int i = RESULT_SIZE; i--; )
    {
        current_char_frequency = 0;

        for (int z = num_threads; z--;)
        {
            current_char_frequency += tinfo[z].resultArray[i];
        }

        if(current_char_frequency >= char_frequency)
        {
            char_frequency = current_char_frequency;
            frequent_char = i;
        }
    }

    free(tinfo);

    return frequent_char;
}
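
The function above relies on struct thread_info and on the worker _mostFrequentChar, neither of which is shown in the question. Purely as an illustration, here is a minimal sketch of what they might look like. The field names are taken from the code above; the layout, the bounds check, and the worker name most_frequent_char_worker are assumptions, not the original source.

```c
#include <pthread.h>
#include <stddef.h>

#define RESULT_SIZE 127  /* same value as in the question */

/* Hypothetical per-thread bookkeeping. The field names mirror the ones
   used by mostFrequentCharThreadOpt; the layout itself is an assumption. */
struct thread_info {
    pthread_t thread_id;           /* filled in by pthread_create */
    int thread_num;                /* 1-based label of the thread */
    int startIndex;                /* first index of this thread's slice */
    int str_size;                  /* number of chars in the slice */
    char *str;                     /* the whole input string */
    int resultArray[RESULT_SIZE];  /* private histogram, so no locking */
};

/* Hypothetical worker: counts the characters of one slice into the
   thread's own histogram. It already has the signature pthread_create
   expects, so no function-pointer cast is needed. */
void *most_frequent_char_worker(void *arg)
{
    struct thread_info *info = arg;
    const char *slice = info->str + info->startIndex;

    for (int i = info->str_size; i--; ) {
        unsigned char c = (unsigned char)slice[i];
        if (c < RESULT_SIZE)       /* skip values outside the table */
            info->resultArray[c]++;
    }
    return NULL;
}
```

Because every thread writes only to its own resultArray, the per-thread histograms can be summed after pthread_join without any synchronization, which is what the merge loop above does.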

Benchmark: iPhone 5s

size = 1000000
iterations = 500

// seconds = 5.874684
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 3.758042
// THREAD_COUNT = 2
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization

Note: mostFrequentCharThreadOpt runs slower than mostFrequentCharOpt2 on the iPhone 4.

Benchmark: iPhone 4

size = 1000000
iterations = 500

// seconds = 25.819347
char mostFrequentCharOpt3(char str[], int size) //Loop optimization

// seconds = 31.541066
char mostFrequentCharThreadOpt(char str[], int size) //Thread optimization

Question

How well optimized are mostFrequentCharOpt3 and mostFrequentCharThreadOpt? In other words: are there any other ways to optimize both methods?

Source code

Answer 1

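Alright, here are some things you can try. I can't say with 100% certainty what will be effective in your situation, but from experience: if you have turned off all possible compiler optimizations and, as it seems, even a simple loop optimization works for you, then your compiler is pretty numb.

It slightly depends on your THREAD_COUNT. You say it is 2 by default, but you can spare some time if you are 100% sure it is 2. You know the platform you work on, so don't make anything dynamic without a reason if speed is your priority.

If THREAD_COUNT == 2, num_threads is an unnecessary variable and can be removed.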

int minCharCountPerThread = size/num_threads;

Then there is the old, much-discussed trick of bit shifting; try it:

int minCharCountPerThread = size >> 1; //divide by 2
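
To make the equivalence concrete, here is a small sketch with two hypothetical helpers, half_by_division and half_by_shift (names invented for illustration). They agree for non-negative sizes only: division truncates toward zero, while right-shifting a negative int is implementation-defined in C. An optimizing compiler will typically emit a shift for a division by a constant power of two on its own, so any gain should be measured rather than assumed.

```c
/* Half of a count using ordinary integer division. */
int half_by_division(int size)
{
    return size / 2;
}

/* Half of a count using a right shift.
   Only equivalent to size / 2 when size >= 0. */
int half_by_shift(int size)
{
    return size >> 1;
}
```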

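Next, you can try unrolling your loops: several loops run only 2 times, so if code size is not a problem, why not remove the loop altogether? This is really something you should try and measure to see whether it helps. I have seen cases where loop unrolling works great, and I have seen cases where loop unrolling slows down my code.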
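
The same idea can also be applied to the counting loop itself. The following is only a sketch of manual unrolling by four, with a hypothetical function name and a 256-entry table (an assumption, chosen so that any byte value is a valid index). Whether it beats the plain loop depends on the compiler and the device.

```c
#include <stddef.h>

#define TABLE_SIZE 256  /* assumption: one slot per possible byte value */

/* Adds the characters of str[0..size) to counts, four per iteration.
   The caller is expected to zero counts beforehand. */
void count_chars_unrolled(const char *str, size_t size,
                          unsigned counts[TABLE_SIZE])
{
    const unsigned char *p = (const unsigned char *)str;
    size_t i = 0;

    /* Main body: one loop test for every four characters. */
    for (; i + 4 <= size; i += 4) {
        counts[p[i]]++;
        counts[p[i + 1]]++;
        counts[p[i + 2]]++;
        counts[p[i + 3]]++;
    }

    /* Tail: the remaining 0 to 3 characters. */
    for (; i < size; i++)
        counts[p[i]]++;
}
```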

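Last thing: try using unsigned numbers instead of signed int (unless you really need signed). It is known that some tricks/instructions are only available for unsigned variables.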
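
One place where signedness matters for correctness, not just speed, is the table index. Where plain char is signed, a byte above 127 becomes a negative index. A minimal sketch, assuming a 256-entry table and a hypothetical helper bump_count in place of set_char:

```c
#define TABLE_SIZE 256  /* assumption: one slot per possible byte value */

/* Increments the slot for c and returns the new count.
   Converting to unsigned char first keeps the index in 0..255
   even on platforms where plain char is signed. */
unsigned bump_count(char c, unsigned counts[TABLE_SIZE])
{
    unsigned char index = (unsigned char)c;
    return ++counts[index];
}
```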

Answer 2

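There are quite a few things you could do, but the results will really depend on the specific ARM hardware the code runs on. For example, older iPhone hardware is completely different from the newer 64-bit devices: a totally different hardware architecture and a different instruction set. Older 32-bit ARM hardware contained some real "tricks" that could make things a lot faster, such as multiple-register read/write operations. One example optimization: instead of loading bytes, load whole 32-bit words and then operate on each byte in the register using bit shifts. If you are using 2 threads, another approach is to break up the memory access so that one memory page is processed by one thread while the second thread operates on the second memory page, and so on. That way, different registers in the different processors can do the maximum amount of work without reading or writing the same memory page (and memory access is usually the slow part). I would also suggest that you start with a good timing framework; I built a timing framework for ARM+iOS that you might find useful for this.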
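
The word-at-a-time idea can be sketched roughly as follows. This is an illustration, not code from the answer: the function name and the 256-entry table are assumptions, and memcpy is used for the 32-bit load so that the sketch stays valid for unaligned input (compilers commonly turn it into a single load).

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Adds the characters of str[0..size) to counts, reading the input
   four bytes at a time. The caller zeroes counts beforehand. */
void count_chars_wordwise(const char *str, size_t size,
                          unsigned counts[256])
{
    size_t i = 0;

    for (; i + 4 <= size; i += 4) {
        uint32_t word;
        memcpy(&word, str + i, sizeof word);  /* alignment-safe load */

        /* Split the word into its four bytes with shifts and masks.
           Byte order does not matter for counting. */
        counts[word & 0xFF]++;
        counts[(word >> 8) & 0xFF]++;
        counts[(word >> 16) & 0xFF]++;
        counts[word >> 24]++;
    }

    /* Tail: the remaining 0 to 3 characters, one byte at a time. */
    for (; i < size; i++)
        counts[(unsigned char)str[i]]++;
}
```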

Optimizing C code for ARM-based devices
