![]() |
StarCore
Code Optimization |
![]() |
|
||
|
choosing the right options The other approach
is to use advanced tools that optimize your code. In this our StarCore
Optimizer (sco) may help a lot. sco is a software package
specifically designed to optimize StarCore code by taking full advantage
of the options and features provided by the StarCore DSP. This document explains
some of the optimizations, provided by sco, and gives advise
on the best way to utilize them. choosing the
right options
sco utilizes incredible
number of different optimizations and algorithms to speed up the code.
Reviewing all of them will exceed the amount of space allowed for this
article and therefore we will omit it. Most of the optimization options
are enabled by default but many may be disabled using appropriate command
line or source code option (see the documentation for details). The special optimizations
provided by sco are loop unrolling and vectorization. Loop unrolling
Many applications
spend bulk of their time inside inner loop(s) thus making it extremely
important to optimize code of the inner loop. One of the efficient ways
to optimize loops is to perform loop unrolling - to duplicate the body
of a loop specified number of times (sco parameter ‘-lu #’).
Generating more instructions in the body of the loop, generally, creates
more opportunities for generating multinstuctions. However, before using
this powerful optimization, it is important to understand what exactly
sco perform. Understanding Loop unrolling.
The standard single
precision dot product loop: for ( i = 0; i < n; i++ ) is translated by
the compiler ( maximally optimized ) into: doen3 d7 PL001 [ which sco,
when unrolled by 3, transfer to: PL001: bmclr #$4000,sr.h __dco_lu_2: [
[ __dco_lu_1: Close examination
of this code reveals that it consists of 5 different units: 1. Check if loop is
executed for at least 3 iterations and therefore may be unrolled by
3: move.w #-3,d2 2. Unrolled loop setup: bmclr #$4000,sr.h 3. Unrolled loop body: __dco_lu_2: [ 4. Check if there are
iterations left to be performed and if there are not – jump to exit
from the unrolled loop; otherwise - execute the code that executes original
( not unrolled ) loop code ( see unit 5 ). sub d2,d1,d1 5. Execute the original
loop body: __dco_lu_0: [ This code is executed
when it is determined that loop doesn’t have enough iterations to be
unrolled (see unit 1) or when there are iterations left after unrolled
loop is executed (see unit 4). Note, that when unrolling
count is power of 2, sco generates more efficient code. For example,
when unrolling by 4, the generated code ( for the loop body ) looks
as following: falign __dco_lu_1: [ All of the above
leads to the following guidelines for using sco’s loop unroller: ·
Do not unroll loops that has only few iterations to execute or unroll by
the count significantly less then number of iterations. ·
Use power of 2 as loop unrolling count and/or choose the count that divides
number of loop iterations. ·
Do not use loop unrolling if it is important to minimize code size. Vectorization
Vectorization (sco
parameter ‘–v’) is another powerful optimization of the inner loop body.
It overlaps epilog code execution for current loop iteration with prolog
code execution for the next loop iteration. Essentially, the two consecutive
loop iterations are fitted in the block of the size of the loop iteration
(of size n). m instruction are chosen from the bottom of the first iteration
and combined with n - m instructions from the top of the second iteration.
The generated m + ( n - m ) = n instructions are optimized. This is
done for all m from 1 to n-1 and the best resulting code is chosen.
Of course, all necessary checkups to preserve logic of the code are
performed by sco and resulting code. Understanding vectorization
As an example, consider again the kernel of the dot product for ( i = 0; i < n; i++ ) unrolled by 2: doen3 d7 PL001 [ Running vectorization
on this code ( sco -v ) produces the following result: move.l d7,d1 __dco_vr_0: doen3 d1 __dco_vr_2: [ __dco_vr_1: In this case, the
loop body is: [ and m instructions iadd d2,d1 are overlapped with
n-m instructions [ Again, as with loop
unrolling, this code consist of number different units: 1. Check if loop is executed for at least 2 iterations (jump to __dco_vr_0) and if not - perform
one iteration and exit iterations (jump to __dco_vr_1): move.l d7,d1 2. Execute NM instructions from the top of the first iteration that were not
chosen to be combined with instructions of second iteration: __dco_vr_0: doen3 d1 3. Execute vectorized loop: falign __dco_vr_2: [ 4. Execute m instructions from the button of the second iteration that were
not chosen to be combined with instructions of first iteration: iadd d2,d1 __dco_vr_1: So the guidelines
for using sco’s vectorizer will be: ·
Do not attempt to vectorize code that has only few iterations to execute. ·
Use loop unrolling if there are too few instructions in the body of a loop
being vectorized. ·
Beware that vectorization of large inner loop may consume great amount
of CPU time. Results of optimization
and Information
The code patterns
studied above (dot-product) show the following improvements achieved
over the code fully optimized by the ccsc100 compiler:
The similar or better
code improvements were achieved for many others benchmarks and kernels
- click here for more information. |