![]() |
Results of StarCore
Code Optimization |
![]() |
|
||
|
In this article we
demonstrate optimization results achieved by
applying sco to
number of benchmarks and applications. The FrameworkThe study comprises five benchmarks:
All benchmarks were compiled using ccsc100 compiler (version 1.5, build 113000-1633) utilizing all optimizations provided by it (-Ot2). The following is description of how the benchmark foo.c had been processed: ccsc100
-S -Ot2 foo.c
sco -i foo.sl -o foo.slo <sco_options> ccsc100 -S –Ot2 foo.c compiles the input file 'foo.c' and generates assembly output file 'foo.sl'. Note that –Ot2 was specified thus utilizing all optimizations provided by the compiler. sco -i foo.sl -o foo.slo <sco_options> optimizes the input file 'foo.sl' generating as an output file 'foo.slo'. <sco_options> depend on application and will be specified later. Files were foo.sl and foo.slo analyzed and compared. Note, that all code improvements reported are over performance of the code that is maximally optimized by the compiler. Note also that all results were obtained by calling sco with certain parameters (<sco_options>) - other parameter settings could result in even better improvements. ResultsConvolution&CorrelationClassical algorithm that may be found as part of many DSP applications:for
( i = 0; i < N; i++ ) {
ap
= A - I;
} bp = B - J; sum = 0; for ( q = 0; q < P; q++ ) { temp1
= *(ap += I);
}temp2 = *(bp += J); temp1 = temp1 * temp2; sum = sum + temp1; *(C += K) = sum; A += I; Compiler generated code for the innermost loop looks as following: loopstart3
PL003 [
iadd d9,d5 ;[54] 4%=1 move.l (r1)+,d1 ;[53] 0%=0 move.l (r2)+,d4 ;[53] 0%=0 ] [ impyuu d1,d4,d9 ;[54] 1%=0 iadd d5,d3 ;[55] 5%=1 impysu d1,d4,d5 ;[54] 1%=0 ] imacus d1,d4,d5 ;[54] 2%=0 asll #<16,d5 ;[54] 3%=0 loopend3
Using sco with <sco_options> set to –lu 4 –v generates the following equivalent of the above kernel: loopstart3
__dco_vr_7: [
move.l (r2)+,d8 impysu d13,d12,d5 iadd d7,d3 iadd d10,d6 ] [ iadd d6,d3 move.l (r1)+,d6 impyuu d15,d14,d14 asll #16,d2 ] [ impysu d6,d8,d7 move.l (r1)+,d1 move.l (r2)+,d4 iadd d14,d2 ] [ imacus d13,d12,d5 imacus d6,d8,d7 move.l (r1)+,d15 ] [ move.l (r2)+,d14 impyuu d13,d12,d9 asll #16,d5 iadd d2,d3 ] [ impyuu d6,d8,d11 impysu d1,d4,d6 iadd d9,d5 ] [ asll #16,d7 imacus d1,d4,d6 impysu d15,d14,d2 ] [ iadd d5,d3 iadd d11,d7 impyuu d1,d4,d10 asll #16,d6 ] [ imacus d15,d14,d2 move.l (r1)+,d13 move.l (r2)+,d12 ] loopend3 and processes four points in 9 clocks – 44% improvement. top Edge DetectionThis application detects the edges in a 256 gray-level 128 x 128 pixel image and is based on the routines and algorithms found in the book "C Language Algorithms for Digital Signal Processing" by P.M. Embree and B. Kimble.for
(r = 0; r < N - K + 1; r++) {
output_image_offset
=
output_image_ptr;
}
output_image_offset += dead_cols; col = 0; for (c = 0; c < N - K + 1; c++) { input_image_ptr
=
input_image;
}
input_image_ptr += (N * row); kernel_ptr = kernel; sum = 0; for (i = 0; i < K; i++) { input_image_offset
=
input_image_ptr;
}
input_image_offset += col; kernel_offset = kernel_ptr; for (j = 0; j < K; j++) { temp1 = *input_image_offset++; temp2 = *kernel_offset++; sum += temp1 * temp2; kernel_ptr += K; input_image_ptr += N; *output_image_offset++ = (sum / normal_factor); col++; Compiler generated code for the inner loop (which contains another loop) looks as following: loopstart2
L44[
doen3 #2 ;[0] @II4 move.l (r0)+,d5 ;[300] 0%=0 ] move.l (r3)+,d6 ;[301] 0%=0 [ impysu d5,d6,d7 ;[302] 1%=0 impyuu d5,d6,d12 ;[302] 1%=0 ] imacus d5,d6,d7 ;[302] 2%=0 asll #<16,d7 ;[302] 3%=0 falign loopstart3 L43 [ iadd d12,d7 ;[302] 4%=1 move.l (r0)+,d5 ;[300] 0%=0 move.l (r3)+,d6 ;[301] 0%=0 ] [ impyuu d5,d6,d12 ;[302] 1%=0 add d0,d7,d0 ;[301] 5%=1 impysu d5,d6,d7 ;[302] 1%=0 ] imacus d5,d6,d7 ;[302] 2%=0 asll #<16,d7 ;[302] 3%=0 loopend3 [ iadd d12,d7 ;[302] 4%=1 adda #>116,r0,r0 ;[0] ] add d0,d7,d0 ;[301] 5%=1 loopend2 each iteration of which is executed in 15 clocks. Using sco with <sco_options> set to –lu 2 generates the following equivalent of the above: loopstart2
L44:
[
move.l (r3)+,d6 move.l (r0)+,d5 ] [ move.l (r3)+,d13 move.l (r0)+,d14 impysu d5,d6,d7 ] [ move.l (r3)+,d8 move.l (r0)+,d9 impysu d14,d13,d1 ] [ imacus d5,d6,d7 impysu d9,d8,d10 ] [ imacus d14,d13,d1 asll #16,d7 impyuu d5,d6,d12 ] [ imacus d9,d8,d10 asll #16,d1 impyuu d14,d13,d4 ] [ iadd d12,d7 asll #16,d10 impyuu d9,d8,d11 iadd d4,d1 ] [ add d0,d7,d0 iadd d11,d10 adda #116,r0,r0 ] add d0,d1,d0 add d0,d10,d0 loopend2 each iteration of which is executed in 10 clocks – 33% improvement. top Matrix MultiplicationThis is another important algorithm that may be found as part of many DSP applications:A
= a_matrix;
B = b_matrix; for (i = 0; i < A_ROW; i++) { for
(j = 0; j < B_COL; j++) {
}
a
= A;
}
b = B; sum = 0; for (k = 0; k < B_ROW; k++) { term_a
= *a;
}
term_b = *b; a++; b += B_COL; sum += term_a * term_b; *c_matrix++ = sum; B++; A += A_COL; B = b_matrix; Compiler generated code for the innermost loop looks as following: loopstart3
L56
[
iadd d8,d4 ;[81] 4%=1 move.l (r1)+,d2 ;[77] 0%=0 move.l (r3)+n3,d3 ;[78] 0%=0 ] [ impyuu d2,d3,d8 ;[81] 1%=0 iadd d4,d1 ;[80] 5%=1 impysu d2,d3,d4 ;[81] 1%=0 ] imacus d2,d3,d4 ;[81] 2%=0 asll #<16,d4 ;[81] 3%=0 loopend3 and processes one point in 4 clocks. Using sco with <sco_options> set to –lu 2 –v generates the following equivalent of the above kernel: loopstart3
__dco_vr_5:
[
imacus d7,d6,d4 move.l (r1)+,d2 move.l (r3)+n3,d3 iadd d5,d0 ] [ impyuu d7,d6,d8 asll #16,d4 iadd d0,d1 move.l (r1)+,d7 ] [ impysu d2,d3,d0 iadd d8,d4 move.l (r3)+n3,d6 ] [ imacus d2,d3,d0 iadd d4,d1 impyuu d2,d3,d5 ] [ asll #16,d0 impysu d7,d6,d4 ] loopend3 and processes two points in 5 clocks – 38% improvement. top Reed-Solomon encoder/decoderEncoder/decoder for Reed-Solomon codes – we list it to show how effective sco may be when optimizing code that is not “loop based”. The source code of the Reed-Solomon encoder/decoder is too large to reproduce here – we will study only the inner part of the encoder:for
(i=kk-1; i>=0; i--)
{ feedback
=
index_of[data[i]^bb[nn-kk-1]] ;
}
; if (feedback != -1) { for
(j=nn-kk-1; j>0; j--)
}
if
(gg[j] != -1)
bb[0]
= alpha_to[(gg[0]+feedback)%nn] ; bb[j]
= bb[j-1]^alpha_to[(gg[j]+feedback)%nn] ;
else
bb[j]
= bb[j-1] ;
else { for
(j=nn-kk-1; j>0; j--)
}
; bb[j]
= bb[j-1] ;
bb[0]
= 0 ; which is translated by the compiler into the following (only part is listed): L24
[
move.l (r5),d2 ;[133] move.l (r2),d3 ;[134] B6 ] [ cmpeq.w #-1,d2 ;[133] add d2,d4,d0 ;[134] B6 move.w #255,d1 ;[134] B6 ] bt <L27 ;[133] jsr ___Qmod32_s ;[134] [ asll #<2,d0 ;[134] move.l #_alpha_to,r7 ;[134] ] move.l d0,r6 ;[134] nop ;[0] AGU stall adda r7,r6 ;[134] move.l (r6),d2 ;[134] [ eor d3,d2 ;[134] jmpd L26 ;[135] ] move.l d2,(r4) ;[134] L27 move.l (r2),d3 ;[136] move.l d3,(r4) ;[136] L26 [ sub #1,d5 ;[132] suba #<4,r4 ;[132] suba #<4,r5 ;[132] ] [ tstgt d5 ;[132] suba #<4,r2 ;[132] ] bt L24 Using sco with default options (<sco_options> is left empty) generates the following equivalent of the above kernel: L24:
[
move.l (r5),d2 move.w #255,d1 ] [ cmpeq.w #-1,d2 add d2,d4,d0 move.l (r2),d3 ] btd <L27 ift move.l d3,(r4) ; gso 4.15 - 3.12 jsr ___Qmod32_s ; gso 3.3 - 3.3 [ move.l #_alpha_to,r7 move.l d0,r6 ] nop move.l (r7+r6),d2 eor d3,d2 move.l d2,(r4) ; gso 6.9 - 8.14 L27: ; gso 0.0 - 2.2 L26: [ sub #1,d5 suba #4,r2 suba #4,r5 ] [ tstgt d5 suba #4,r4 ] bt L24 It is difficult to estimate the improvement achieved by sco, but the difference in the quality of codes is apparent. For example, note the way blocks at L27 was transformed and jmpd L26 was eliminated. top Customer-created applicationDue to obvious reasons we will omit C source listing of the code. It was pointed out that the following compiler-generated loop is greatly affects overall performance of the code:loopstart3
PL003 [
ift tfr d2,d5 ;[454] 7%=1 ift tfr d11,d9 ;[455] 7%=1 ifa move.f (r9)+n1,d2 ;[447] 0%=0 ] [ ift tfr d13,d8 ;[456] 7%=1 ift tfr d14,d7 ;[458] 7%=1 ifa move.f (r11)+n1,d11 ;[446] 0%=0 ] [ add d15,d11,d11 ;[446] 1%=0 add #<5,d14 ;[445] 8%=1 move.l (sp-68),d13 ;[447] 1%=0 move.f #2048,d12 ;[448] 1%=0 ] [ ift move.w (sp-98),d10 ;[457] 9%=1 ifa mac #8192,d2,d13 ;[447] 2%=0 ifa move.f (r10)+n1,d2 ;[448] 2%=0 ] [ macr d2,d12,d13 ;[448] 3%=0 mpy d11,d11,d2 ;[449] 3%=0 ] mpy d8,d2,d12 ;[451] 4%=0 mac -d5,d13,d12 ;[451] 5%=0 tstgt d12 ;[453] 6%=0 loopend3 which takes 10 clocks per iteration. Using sco with <sco_options> set to –v generates the following equivalent of the above kernel: falign
loopstart3
__dco_vr_2:
[
mac -d5,d13,d12 move.f (r11)+n1,d4 move.f (r9)+n1,d3 ] [ ift move.w (r1),d10 ifa tstgt d12 move.f #2048,d1 ] [ tfrt d13,d8 move.l (r0),d13 tfrt d11,d9 add d15,d4,d11 ] [ mac #8192,d3,d13 move.f (r10)+n1,d0 tfrt d2,d5 mpy d11,d11,d2 ] [ tfrt d14,d7 add #5,d14 macr d0,d1,d13 mpy d8,d2,d12 ] loopend3 which processes loop iteration in 5 clocks – 50% improvement. top |
![]() |