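I've written the following struct for use in an Arduino software PWM library I'm making, to PWM up to 20 pins at once (on an Uno) or 70 pins (on a Mega).
As written, the ISR portion of the code (eRCaGuy_SoftwarePWMupdate()), which processes an array of this struct, takes 133us to run. Very strangely, however, if I uncomment the line "byte flags1;" (in the struct), even though flags1 is not used anywhere yet, the ISR takes 158us to run. Then, if I also uncomment "byte flags2;" so that BOTH flags are uncommented, the runtime drops back to what it was before (133us).
Why is this happening, and how do I fix it? (I want consistently fast code for this particular function, not inexplicably fickle code.) Adding one byte slows the code down dramatically, but adding two makes no difference at all.
I'm trying to optimize the code (and I need to add another feature that requires one byte for flags), but I don't understand why adding one unused byte slows the code by 25us while adding two unused bytes doesn't change the runtime at all.
I need to understand this so that my optimizations stay consistent.
In the .h file (my original struct, using a C-style typedef'ed struct):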
typedef struct softPWMpin //global struct
{
    //VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
    //for pin write access:
    volatile byte pinBitMask;
    volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
    //for PWM output:
    volatile unsigned long resolution;
    volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
    //ex: if resolution is 256, topValue is 255
    //if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
    //if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
    //byte flags1;
    //byte flags2;
    //NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
    unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
} softPWMpin_t;
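The duty-cycle note in the struct comments can be checked with a few lines of host-side C++. This is only a sketch of the arithmetic: `dutyCyclePermille` is a hypothetical helper, not part of the library, and it reports tenths of a percent so that no floating point is needed.

```cpp
#include <cstdint>

// Hypothetical helper (not in the original library): duty cycle in tenths of
// a percent, following the comment "duty cycle = PWMvalue/(resolution - 1)".
// Assumes resolution >= 2 (so topValue is non-zero) and a PWMvalue small
// enough that PWMvalue * 1000 does not overflow 32 bits.
constexpr std::uint32_t dutyCyclePermille(std::uint32_t PWMvalue,
                                          std::uint32_t resolution)
{
    // topValue = resolution - 1; integer division truncates
    return (PWMvalue * 1000u) / (resolution - 1);
}

// The two examples from the comments, with resolution = 256 (topValue = 255)
static_assert(dutyCyclePermille(255, 256) == 1000, "255/255 = 100.0%");
static_assert(dutyCyclePermille(50, 256) == 196, "50/255 = 19.6%");
```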
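In the .h file (new, using a C++-style struct, as suggested in the comments, to see whether it makes any difference. It does not appear to make any difference, either in runtime or in compiled size):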
struct softPWMpin //global struct
{
    //VOLATILE VARIABLES (WILL BE ACCESSED IN AND OUTSIDE OF ISRs)
    //for pin write access:
    volatile byte pinBitMask;
    volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
    //for PWM output:
    volatile unsigned long resolution;
    volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
    //ex: if resolution is 256, topValue is 255
    //if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
    //if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
    //byte flags1;
    //byte flags2;
    //NON-VOLATILE VARIABLES (WILL ONLY BE ACCESSED INSIDE AN ISR, OR OUTSIDE AN ISR, BUT NOT BOTH)
    unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
};
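In the .cpp file (here I create the array of structs, and here is the update function, which is called at a fixed rate from an ISR via a timer interrupt):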
//static softPWMpin_t PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
static softPWMpin PWMpins[MAX_NUMBER_SOFTWARE_PWM_PINS]; //C++-style, old, MAX_NUMBER_SOFTWARE_PWM_PINS = 20; static to give it file scope only
//This function must be placed within an ISR, to be called at a fixed interval
void eRCaGuy_SoftwarePWMupdate()
{
    //Forced nonatomic block (ie: interrupts *enabled*)
    byte SREG_old = SREG; //[1 clock cycle]
    interrupts(); //[1 clock cycle] turn interrupts ON to allow *nested interrupts* (ex: handling of time-sensitive timing, such as reading incoming PWM signals or counting Timer2 overflows)
    {
        //first, increment all counters of attached pins (ie: where the value != PIN_NOT_ATTACHED)
        //pinMapArray
        for (byte pin=0; pin<NUM_DIGITAL_PINS; pin++)
        {
            byte i = pinMapArray[pin]; //[2 clock cycles: 0.125us]; No need to turn off interrupts to read this volatile variable here since reading pinMapArray[pin] is an atomic operation (since it's a single byte)
            if (i != PIN_NOT_ATTACHED) //if the pin IS attached, increment counter and decide what to do with pin...
            {
                //Read volatile variables ONE time, all at once, to optimize code (volatile variables take more time to read [I know] since their values can't be recalled from registers [I believe]).
                noInterrupts(); //[1 clock cycle] turn off interrupts to read non-atomic volatile variables that could be updated simultaneously right now in another ISR, since nested interrupts are enabled here
                unsigned long resolution = PWMpins[i].resolution;
                unsigned long PWMvalue = PWMpins[i].PWMvalue;
                volatile byte* p_PORT_out = PWMpins[i].p_PORT_out; //[0.44us raw: 5 clock cycles, 0.3125us]
                interrupts(); //[1 clock cycle]
                //handle edge cases FIRST (PWMvalue==0 and PWMvalue==topValue), since if an edge case exists we should NOT do the main case handling below
                if (PWMvalue==0) //the PWM command is 0% duty cycle
                {
                    fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.19us raw: 17 clock cycles, 1.0625us]
                }
                else if (PWMvalue==resolution-1) //the PWM command is 100% duty cycle
                {
                    fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH); //write HIGH [0.88us raw; 12 clock cycles, 0.75us]
                }
                //THEN handle main cases (PWMvalue is > 0 and < topValue)
                else //(0% < PWM command < 100%)
                {
                    PWMpins[i].counter++; //not volatile
                    if (PWMpins[i].counter >= resolution)
                    {
                        PWMpins[i].counter = 0; //reset
                        fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,HIGH);
                    }
                    else if (PWMpins[i].counter>=PWMvalue)
                    {
                        fastDigitalWrite(p_PORT_out,PWMpins[i].pinBitMask,LOW); //write LOW [1.18us raw: 17 clock cycles, 1.0625us]
                    }
                }
            }
        }
    }
    SREG = SREG_old; //restore interrupt enable status
}
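To make the counter logic of the update function easier to follow, here is a host-side model of what happens to a single attached pin on each call. It is a sketch under stated assumptions: `PinModel`, `updatePin`, and `countHighTicks` are hypothetical names, the port write is replaced by a plain `bool`, and interrupt handling is left out entirely. It needs C++14 or later for the `constexpr` loops.

```cpp
#include <cstdint>

// Hypothetical model of one entry of PWMpins[]; `level` stands in for the
// port bit that fastDigitalWrite() would set or clear.
struct PinModel {
    std::uint32_t resolution;
    std::uint32_t PWMvalue;
    std::uint32_t counter;
    bool level;
};

// One call models one pass of the ISR body for a single attached pin.
constexpr void updatePin(PinModel& p)
{
    if (p.PWMvalue == 0) {                        // 0% duty cycle
        p.level = false;
    } else if (p.PWMvalue == p.resolution - 1) {  // 100% duty cycle
        p.level = true;
    } else {                                      // 0% < duty cycle < 100%
        p.counter++;
        if (p.counter >= p.resolution) {
            p.counter = 0;                        // start of a new period
            p.level = true;
        } else if (p.counter >= p.PWMvalue) {
            p.level = false;
        }
    }
}

// Count how many of `ticks` consecutive updates leave the pin HIGH.
constexpr std::uint32_t countHighTicks(PinModel p, std::uint32_t ticks)
{
    std::uint32_t high = 0;
    for (std::uint32_t t = 0; t < ticks; ++t) {
        updatePin(p);
        if (p.level) ++high;
    }
    return high;
}
```

Starting from the state just after a counter reset (`counter = 0`, pin HIGH), a pin with `resolution = 256` and `PWMvalue = 50` is HIGH after 50 of the next 256 updates in this model.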
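I tried changing the alignment with the aligned attribute. My compiler is gcc.
Here is how I modified the struct in the .h file to add the attribute (it is on the last line). Note that I also reordered the struct members from largest to smallest: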
struct softPWMpin //C++ style
{
    volatile unsigned long resolution;
    volatile unsigned long PWMvalue; //Note: duty cycle = PWMvalue/(resolution - 1) = PWMvalue/topValue;
    //ex: if resolution is 256, topValue is 255
    //if PWMvalue = 255, duty_cycle = PWMvalue/topValue = 255/255 = 1 = 100%
    //if PWMvalue = 50, duty_cycle = PWMvalue/topValue = 50/255 = 0.196 = 19.6%
    unsigned long counter; //incremented each time update() is called; goes back to zero after reaching topValue; does NOT need to be volatile, since only the update function updates this (it is read from or written to nowhere else)
    volatile byte* volatile p_PORT_out; //pointer to port output register; NB: the 1st "volatile" says the port itself (1 byte) is volatile, the 2nd "volatile" says the *pointer* itself (2 bytes, pointing to the port) is volatile.
    volatile byte pinBitMask;
    // byte flags1;
    // byte flags2;
} __attribute__ ((aligned));
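The layout of the reordered struct can be reproduced on a desktop compiler for inspection. The sketch below is a model, not the library's code: it assumes data pointers are 2 bytes on the ATmega328P (so `uint16_t` stands in for `volatile byte*`), uses `#pragma pack(1)` to mimic byte alignment, and the `Model*` names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>

#pragma pack(push, 1)  // assumption: every member is byte-aligned, as on AVR
struct ModelNoFlags {
    std::uint32_t resolution;  // unsigned long is 4 bytes on AVR
    std::uint32_t PWMvalue;
    std::uint32_t counter;
    std::uint16_t p_PORT_out;  // stand-in for a 2-byte data pointer
    std::uint8_t pinBitMask;
};
struct ModelOneFlag {
    std::uint32_t resolution;
    std::uint32_t PWMvalue;
    std::uint32_t counter;
    std::uint16_t p_PORT_out;
    std::uint8_t pinBitMask;
    std::uint8_t flags1;
};
struct ModelTwoFlags {
    std::uint32_t resolution;
    std::uint32_t PWMvalue;
    std::uint32_t counter;
    std::uint16_t p_PORT_out;
    std::uint8_t pinBitMask;
    std::uint8_t flags1;
    std::uint8_t flags2;
};
#pragma pack(pop)

// Sizes reported in the question: 15, 16 and 17 bytes
static_assert(sizeof(ModelNoFlags) == 15, "no flags");
static_assert(sizeof(ModelOneFlag) == 16, "flags1 only");
static_assert(sizeof(ModelTwoFlags) == 17, "flags1 and flags2");

// Member offsets in the no-flags layout
static_assert(offsetof(ModelNoFlags, counter) == 8, "counter");
static_assert(offsetof(ModelNoFlags, p_PORT_out) == 12, "port pointer");
static_assert(offsetof(ModelNoFlags, pinBitMask) == 14, "bit mask");
```

These offsets agree with the displacements used in the -Os disassembly in this question (Z+12 and Z+13 for the port pointer, Y+8 to Y+11 for the counter, Y+14 for the bit mask).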
Source: https://gcc.gnu.org/onlinedocs/gcc-3.1/gcc/Type-Attributes.html
Here is what I have tried so far:
__attribute__ ((aligned));
__attribute__ ((aligned(1)));
__attribute__ ((aligned(2)));
__attribute__ ((aligned(4)));
__attribute__ ((aligned(8)));
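None of them seems to fix the problem I see when I add one flag byte. With the flag bytes left commented out, alignments 2 through 8 make the runtime longer than 133us, while aligned(1) makes no difference (the runtime stays at 133us), which suggests that aligned(1) is what already happens when no attribute is added at all. Also, even with the alignment options 2, 4, and 8, sizeof(PWMvalue) still returns the exact number of bytes in the struct, with no extra padding.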
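For comparison, here is how a type-level alignment request affects `sizeof` on a desktop compiler. The sketch uses the standard C++11 `alignas` specifier rather than GCC's attribute, and a 15-byte array as a stand-in for the struct; the type names are hypothetical. What avr-gcc did in the build described above is not verified here.

```cpp
#include <cstddef>

// 15 bytes of byte-aligned payload, standing in for the 15-byte struct
struct Payload15 { unsigned char bytes[15]; };

// Same payload with an explicit alignment request on the type
struct alignas(2) Aligned2 { unsigned char bytes[15]; };
struct alignas(4) Aligned4 { unsigned char bytes[15]; };
struct alignas(8) Aligned8 { unsigned char bytes[15]; };

// sizeof is always a multiple of the type's alignment, so that every
// element of an array of the type is itself correctly aligned
static_assert(sizeof(Payload15) == 15, "no padding without alignas");
static_assert(sizeof(Aligned2) == 16 && alignof(Aligned2) == 2, "padded to 16");
static_assert(sizeof(Aligned4) == 16 && alignof(Aligned4) == 4, "padded to 16");
static_assert(sizeof(Aligned8) == 16 && alignof(Aligned8) == 8, "padded to 16");
```

If the same rounding applies on the target, checking `sizeof(softPWMpin)` (the struct type) rather than the size of one member would show whether the attribute took effect.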
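...still no idea what's going on...
(See the comments below.) The optimization level definitely makes a difference. For example, when I changed the compiler optimization level from -Os to -O2, the base case stayed at 133us (as before), uncommenting flags1 gave me 120us (vs 158us), and uncommenting both flags1 and flags2 gave me 132us (vs 133us). This still doesn't answer my question, but at least I now know that optimization levels exist and how to change them.
Summary of the above paragraph: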
Processing time (of the eRCaGuy_SoftwarePWMupdate() function)
Optimization   No flags   w/flags1   w/flags1+flags2
Os             133us      158us      133us
O2             132us      120us      132us
Memory use (bytes: flash / global vars SRAM / sizeof(softPWMpin) / sizeof(PWMpins))
Optimization   No flags           w/flags1           w/flags1+flags2
Os             4020/591/15/300    3950/611/16/320    4020/631/17/340
O2             4154/591/15/300    4064/611/16/320    4154/631/17/340
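The memory figures in the table are internally consistent, which a few compile-time checks can confirm. This is just the arithmetic from the table; `kPins` and `arrayBytes` are names made up for the sketch.

```cpp
#include <cstddef>

// MAX_NUMBER_SOFTWARE_PWM_PINS = 20 in the question
constexpr std::size_t kPins = 20;

// sizeof(PWMpins) for a given sizeof(softPWMpin)
constexpr std::size_t arrayBytes(std::size_t structBytes)
{
    return kPins * structBytes;
}

static_assert(arrayBytes(15) == 300, "no flags");
static_assert(arrayBytes(16) == 320, "flags1 only");
static_assert(arrayBytes(17) == 340, "flags1 and flags2");

// Each extra flag byte costs 20 bytes of SRAM (591 -> 611 -> 631),
// which is exactly the growth of the array
static_assert(arrayBytes(16) - arrayBytes(15) == 611u - 591u, "first flag");
static_assert(arrayBytes(17) - arrayBytes(16) == 631u - 611u, "second flag");
```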
Sources on gcc compiler optimization levels:
- https://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html
- https://gcc.gnu.org/onlinedocs/gnat_ugn/Optimization-Levels.html
- http://www.rapidtables.com/code/linux/gcc/gcc-o.htm
How to change compiler settings in the Arduino IDE:
- http://www.instructables.com/id/Arduino-IDE-16x-compiler-optimisations-faster-code/
Structure packing information:
- http://www.catb.org/esr/structure-packing/
Data alignment:
- http://www.songho.ca/misc/alignment/dataalign.html
Writing efficient C code for 8-bit Atmel AVR microcontrollers:
- AVR035: Efficient C Coding for AVR - doc1497 - http://www.atmel.com/images/doc1497.pdf
- AVR4027: Tips and Tricks to Optimize Your C Code for 8-bit AVR Microcontrollers - doc8453 - http://www.atmel.com/images/doc8453.pdf
FOR NO FLAGS (flags1 and flags2 commented out) and Os optimization
Build preferences (from the buildprefs.txt file in the folder where Arduino puts the compiled code):
For me: "C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp"
build.arch = AVR
build.board = AVR_UNO
build.core = arduino
build.core.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\cores\arduino
build.extra_flags =
build.f_cpu = 16000000L
build.mcu = atmega328p
build.path = C:\Users\Gabriel\AppData\Local\Temp\build8427371380606368699.tmp
build.project_name = software_PWM_fade13_speed_test2.cpp
build.system.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\system
build.usb_flags = -DUSB_VID={build.vid} -DUSB_PID={build.pid} '-DUSB_MANUFACTURER={build.usb_manufacturer}' '-DUSB_PRODUCT={build.usb_product}'
build.usb_manufacturer =
build.variant = standard
build.variant.path = C:\Program Files (x86)\Arduino\hardware\arduino\avr\variants\standard
build.verbose = true
build.warn_data_percentage = 75
compiler.S.extra_flags =
compiler.S.flags = -c -g -x assembler-with-cpp
compiler.ar.cmd = avr-ar
compiler.ar.extra_flags =
compiler.ar.flags = rcs
compiler.c.cmd = avr-gcc
compiler.c.elf.cmd = avr-gcc
compiler.c.elf.extra_flags =
compiler.c.elf.flags = -w -Os -Wl,--gc-sections
compiler.c.extra_flags =
compiler.c.flags = -c -g -Os -w -ffunction-sections -fdata-sections -MMD
compiler.cpp.cmd = avr-g++
compiler.cpp.extra_flags =
compiler.cpp.flags = -c -g -Os -w -fno-exceptions -ffunction-sections -fdata-sections -fno-threadsafe-statics -MMD
compiler.elf2hex.cmd = avr-objcopy
compiler.elf2hex.extra_flags =
compiler.elf2hex.flags = -O ihex -R .eeprom
compiler.ldflags =
compiler.objcopy.cmd = avr-objcopy
compiler.objcopy.eep.extra_flags =
compiler.objcopy.eep.flags = -O ihex -j .eeprom --set-section-flags=.eeprom=alloc,load --no-change-warnings --change-section-lma .eeprom=0
compiler.path = {runtime.ide.path}/hardware/tools/avr/bin/
compiler.size.cmd = avr-size
Some of the assembly (Os, no flags):
00000328 <_Z25eRCaGuy_SoftwarePWMupdatev>:
328: 8f 92 push r8
32a: 9f 92 push r9
32c: af 92 push r10
32e: bf 92 push r11
330: cf 92 push r12
332: df 92 push r13
334: ef 92 push r14
336: ff 92 push r15
338: 0f 93 push r16
33a: 1f 93 push r17
33c: cf 93 push r28
33e: df 93 push r29
340: 0f b7 in r16, 0x3f ; 63
342: 78 94 sei
344: 20 e0 ldi r18, 0x00 ; 0
346: 30 e0 ldi r19, 0x00 ; 0
348: 1f e0 ldi r17, 0x0F ; 15
34a: f9 01 movw r30, r18
34c: e8 5a subi r30, 0xA8 ; 168
34e: fe 4f sbci r31, 0xFE ; 254
350: 80 81 ld r24, Z
352: 8f 3f cpi r24, 0xFF ; 255
354: 09 f4 brne .+2 ; 0x358 <_Z25eRCaGuy_SoftwarePWMupdatev+0x30>
356: 67 c0 rjmp .+206 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
358: f8 94 cli
35a: 90 e0 ldi r25, 0x00 ; 0
35c: 18 9f mul r17, r24
35e: f0 01 movw r30, r0
360: 19 9f mul r17, r25
362: f0 0d add r31, r0
364: 11 24 eor r1, r1
366: e4 59 subi r30, 0x94 ; 148
368: fe 4f sbci r31, 0xFE ; 254
36a: c0 80 ld r12, Z
36c: d1 80 ldd r13, Z+1 ; 0x01
36e: e2 80 ldd r14, Z+2 ; 0x02
370: f3 80 ldd r15, Z+3 ; 0x03
372: 44 81 ldd r20, Z+4 ; 0x04
374: 55 81 ldd r21, Z+5 ; 0x05
376: 66 81 ldd r22, Z+6 ; 0x06
378: 77 81 ldd r23, Z+7 ; 0x07
37a: 04 84 ldd r0, Z+12 ; 0x0c
37c: f5 85 ldd r31, Z+13 ; 0x0d
37e: e0 2d mov r30, r0
380: 78 94 sei
382: 41 15 cp r20, r1
384: 51 05 cpc r21, r1
386: 61 05 cpc r22, r1
388: 71 05 cpc r23, r1
38a: 51 f4 brne .+20 ; 0x3a0 <_Z25eRCaGuy_SoftwarePWMupdatev+0x78>
38c: 18 9f mul r17, r24
38e: d0 01 movw r26, r0
390: 19 9f mul r17, r25
392: b0 0d add r27, r0
394: 11 24 eor r1, r1
396: a4 59 subi r26, 0x94 ; 148
398: be 4f sbci r27, 0xFE ; 254
39a: 1e 96 adiw r26, 0x0e ; 14
39c: 4c 91 ld r20, X
39e: 3b c0 rjmp .+118 ; 0x416 <_Z25eRCaGuy_SoftwarePWMupdatev+0xee>
3a0: 46 01 movw r8, r12
3a2: 57 01 movw r10, r14
3a4: a1 e0 ldi r26, 0x01 ; 1
3a6: 8a 1a sub r8, r26
3a8: 91 08 sbc r9, r1
3aa: a1 08 sbc r10, r1
3ac: b1 08 sbc r11, r1
3ae: 48 15 cp r20, r8
3b0: 59 05 cpc r21, r9
3b2: 6a 05 cpc r22, r10
3b4: 7b 05 cpc r23, r11
3b6: 51 f4 brne .+20 ; 0x3cc <_Z25eRCaGuy_SoftwarePWMupdatev+0xa4>
3b8: 18 9f mul r17, r24
3ba: d0 01 movw r26, r0
3bc: 19 9f mul r17, r25
3be: b0 0d add r27, r0
3c0: 11 24 eor r1, r1
3c2: a4 59 subi r26, 0x94 ; 148
3c4: be 4f sbci r27, 0xFE ; 254
3c6: 1e 96 adiw r26, 0x0e ; 14
3c8: 9c 91 ld r25, X
3ca: 1c c0 rjmp .+56 ; 0x404 <_Z25eRCaGuy_SoftwarePWMupdatev+0xdc>
3cc: 18 9f mul r17, r24
3ce: e0 01 movw r28, r0
3d0: 19 9f mul r17, r25
3d2: d0 0d add r29, r0
3d4: 11 24 eor r1, r1
3d6: c4 59 subi r28, 0x94 ; 148
3d8: de 4f sbci r29, 0xFE ; 254
3da: 88 85 ldd r24, Y+8 ; 0x08
3dc: 99 85 ldd r25, Y+9 ; 0x09
3de: aa 85 ldd r26, Y+10 ; 0x0a
3e0: bb 85 ldd r27, Y+11 ; 0x0b
3e2: 01 96 adiw r24, 0x01 ; 1
3e4: a1 1d adc r26, r1
3e6: b1 1d adc r27, r1
3e8: 88 87 std Y+8, r24 ; 0x08
3ea: 99 87 std Y+9, r25 ; 0x09
3ec: aa 87 std Y+10, r26 ; 0x0a
3ee: bb 87 std Y+11, r27 ; 0x0b
3f0: 8c 15 cp r24, r12
3f2: 9d 05 cpc r25, r13
3f4: ae 05 cpc r26, r14
3f6: bf 05 cpc r27, r15
3f8: 40 f0 brcs .+16 ; 0x40a <_Z25eRCaGuy_SoftwarePWMupdatev+0xe2>
3fa: 18 86 std Y+8, r1 ; 0x08
3fc: 19 86 std Y+9, r1 ; 0x09
3fe: 1a 86 std Y+10, r1 ; 0x0a
400: 1b 86 std Y+11, r1 ; 0x0b
402: 9e 85 ldd r25, Y+14 ; 0x0e
404: 80 81 ld r24, Z
406: 89 2b or r24, r25
408: 0d c0 rjmp .+26 ; 0x424 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfc>
40a: 84 17 cp r24, r20
40c: 95 07 cpc r25, r21
40e: a6 07 cpc r26, r22
410: b7 07 cpc r27, r23
412: 48 f0 brcs .+18 ; 0x426 <_Z25eRCaGuy_SoftwarePWMupdatev+0xfe>
414: 4e 85 ldd r20, Y+14 ; 0x0e
416: 80 81 ld r24, Z
418: 90 e0 ldi r25, 0x00 ; 0
41a: 50 e0 ldi r21, 0x00 ; 0
41c: 40 95 com r20
41e: 50 95 com r21
420: 84 23 and r24, r20
422: 95 23 and r25, r21
424: 80 83 st Z, r24
426: 2f 5f subi r18, 0xFF ; 255
428: 3f 4f sbci r19, 0xFF ; 255
42a: 24 31 cpi r18, 0x14 ; 20
42c: 31 05 cpc r19, r1
42e: 09 f0 breq .+2 ; 0x432 <_Z25eRCaGuy_SoftwarePWMupdatev+0x10a>
430: 8c cf rjmp .-232 ; 0x34a <_Z25eRCaGuy_SoftwarePWMupdatev+0x22>
432: 0f bf out 0x3f, r16 ; 63
434: df 91 pop r29
436: cf 91 pop r28
438: 1f 91 pop r17
43a: 0f 91 pop r16
43c: ff 90 pop r15
43e: ef 90 pop r14
440: df 90 pop r13
442: cf 90 pop r12
444: bf 90 pop r11
446: af 90 pop r10
448: 9f 90 pop r9
44a: 8f 90 pop r8
44c: 08 95 ret
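One thing the listing makes visible is how `PWMpins[i]` is addressed: `ldi r17, 0x0F` loads the struct size (15), and each `mul r17, r24` / `mul r17, r25` pair followed by `subi`/`sbci` turns the index into an element address. That sequence appears four times in the loop body. The sketch below restates the arithmetic in C++; the constants are read off the listing, the function names are hypothetical, and it describes only this build (-Os, no flags).

```cpp
#include <cstdint>

// Read off the listing (assumptions of this sketch):
//   subi/sbci with 0xA8, 0xFE subtract 0xFEA8, i.e. add 0x0158 (pinMapArray)
//   subi/sbci with 0x94, 0xFE subtract 0xFE94, i.e. add 0x016C (PWMpins)
//   ldi r17, 0x0F loads sizeof(softPWMpin) == 15
constexpr std::uint16_t kPinMapBase = 0x0158;
constexpr std::uint16_t kPWMpinsBase = 0x016C;
constexpr std::uint16_t kStructSize = 15;

// What each "mul ... subi/sbci" sequence computes: the address of PWMpins[i]
constexpr std::uint16_t elementAddress(std::uint8_t i)
{
    return static_cast<std::uint16_t>(kPWMpinsBase + kStructSize * i);
}

// Address of PWMpins[i].pinBitMask, which the listing reads at offset 14
constexpr std::uint16_t pinBitMaskAddress(std::uint8_t i)
{
    return static_cast<std::uint16_t>(elementAddress(i) + 14);
}

// The 20-byte pinMapArray sits directly in front of PWMpins
static_assert(kPinMapBase + 20 == kPWMpinsBase, "arrays are adjacent");
static_assert(elementAddress(1) == 0x017B, "0x016C + 15");
```

Since the multiplier is the struct size, changing the size changes how the compiler can compute this address, so comparing the same sequence in the flags1 build's disassembly would be a reasonable next step. Taking a pointer to the element once per iteration is a common way to express the lookup in source; whether it changes the generated code would need to be checked in the disassembly.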
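Answer 0 (score: 5)
This is almost certainly an alignment problem. Judging by the size of your struct, the compiler appears to be packing it automatically.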
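The LDR instruction loads a 4-byte value into a register and operates on 4-byte boundaries. If it needs to load from a memory address that is not on a 4-byte boundary, it actually performs two loads and combines them to obtain the value at that address.
For example, if you want to load a 4-byte value at 0x02, the processor performs two loads, because 0x02 does not fall on a 4-byte boundary.
Say we have the following memory at address 0x00 and we want to load the 4-byte value at 0x02 into register r0:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|0x08|
Value   | 12 | 34 | 56 | 78 | 90 | AB | CD | EF | 12 |
------------------------------------------------------
r0: 00 00 00 00
It first loads the 4 bytes at 0x00, because that is the 4-byte segment containing 0x02, and stores the 2 bytes at 0x02 and 0x03 in the register:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value   | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1  |           **   **                     |
-------------------------------------------------
r0: 56 78 00 00
It then loads the 4 bytes at 0x04, which is the next 4-byte segment, and stores the 2 bytes at 0x04 and 0x05 in the register:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value   | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 2  |                     **   **           |
-------------------------------------------------
r0: 56 78 90 AB
As you can see, every time you access the value at 0x02, the processor has to split your instruction into two operations. If you wanted the value at 0x04 instead, the processor could do it in a single operation:
Address |0x00|0x01|0x02|0x03|0x04|0x05|0x06|0x07|
Value   | 12 | 34 | 56 | 78 | 90 | AB | CD | EF |
Load 1  |                     **   **   **   ** |
-------------------------------------------------
r0: 90 AB CD EF
In your example, with both flags1 and flags2 commented out, the size of your struct is 15. This means that every second struct in your array will be at an odd address, so none of its pointer or long members will be aligned correctly.
By introducing one of the flags variables, the size of your struct increases to 16, which is a multiple of 4. This ensures that all of your structs begin on a 4-byte boundary, so you likely won't run into alignment issues.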
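The load-splitting model described in this answer can be written down in a few lines. This sketch only restates that model (one access for a 4-byte-aligned address, two otherwise) and applies it to element offsets in an array; it assumes the array itself starts on a 4-byte boundary, the function names are hypothetical, and it does not establish that the target CPU behaves this way.

```cpp
#include <cstddef>

// Model from the answer: a 4-byte load that starts on a 4-byte boundary
// takes one access; any other start address takes two.
constexpr int loadsNeeded(std::size_t address)
{
    return (address % 4 == 0) ? 1 : 2;
}

// Offset of element `index` from the start of an array whose elements are
// `structSize` bytes each (array assumed to start on a 4-byte boundary)
constexpr std::size_t elementOffset(std::size_t index, std::size_t structSize)
{
    return index * structSize;
}

// The two cases walked through above
static_assert(loadsNeeded(0x02) == 2, "0x02 is not on a 4-byte boundary");
static_assert(loadsNeeded(0x04) == 1, "0x04 is on a 4-byte boundary");

// 15-byte structs: element 1 starts at offset 15, which is misaligned
static_assert(loadsNeeded(elementOffset(1, 15)) == 2, "15-byte elements");
// 16-byte structs: every element starts on a 4-byte boundary
static_assert(loadsNeeded(elementOffset(1, 16)) == 1, "16-byte elements");
```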
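There is probably a compiler flag that can help you with this, but in general it is good to be aware of how your structs are laid out. Alignment is a tricky issue, and only compilers that conform to the current standard have well-defined behavior.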