From: "Jon Harris" <jon_harris@SPAM-B-GONEgeocities.com>
Subject: Trick: Taking advantage of parallel instructions for highly efficient
looping code
Date: 11 Apr 2000 00:00:00 GMT
Newsgroups: comp.dsp
THIS WORK IS PLACED IN THE PUBLIC DOMAIN
Name: Taking advantage of parallel instructions for highly efficient looping
code
Category: General software
Application: When writing standard DSP routines in assembly language for
architectures that include parallel instructions (load/store/compute).
Advantages: Minimizes time required to execute an algorithm
Introduction: Most DSP chips today include instructions that perform
parallel memory accesses and computations. By some simple rewriting and
reordering of your code, you can take advantage of these instructions to
speed up your inner loop code significantly.
The Trick:
Most DSP algorithms take the general form of:
loop for n times:
-Fetch some data
-Do some mathematical operations on the data
-store the result(s)
end loop
Furthermore, most DSPs today include assembly language instructions that can
fetch/store data and do a mathematical operation (e.g. add, multiply, or
MAC) in parallel. In order to write maximally efficient code, you must take
advantage of these instructions whenever possible. Obviously, coding the
algorithm as listed above is not the best way to do this. This trick will
explain simple ways to improves the efficiency over the form above. Many of
these are simple "common sense" type tricks, but since I have never seen
them written, I thought I would codify them.
If there is more than one fetch and computation to be done, a first level
improvement would be as follows:
loop for n times:
-Fetch first piece of data
-Fetch next piece of data, calculate using first piece (replicate as
necessary)
-Do final calculation
-store the result
end loop
This results in a little improvement, but still more can be made! Two
methods for re-arranging the loop are presented below:
Method 1: "Priming the pump"
In this method of re-writing the code, the inner loop minus the storage step
is rewritten just before the main loop. Then the storage of the result can
be done in parallel with either the next fetches or calculations, depending
on the specific nature of the algorithm and processor. This removes the
storage step from the inner loop. Here is the general method:
-Fetch some data
-Do some mathematical operations on the data
loop for n-1 times: /*note n-1 rather than n*/
-Fetch some data, store previous result(s)
-Do some mathematical operations on the data, store previous result(s)
end loop
-store final result(s)
Method 2: "Bogus write"
Another method exists, which eliminates the extra code above the loop and
often results in slightly more efficient code. The one downside is that the
first time the result(s) are stored, they are invalid, hence the name "bogus
write". However, in many cases you can overcome this problem by making your
storage buffers one memory location larger than they actually need to be and
setting your write pointer to the "dummy" location one address before the
real buffer starts. This is not always possible (e.g. with circular
buffers), but when it is, it is a worthwhile method. Here is the general
method:
loop for n times:
-Fetch some data, store result(s) /*1st time through bogus write*/
-Do some mathematical operations on the data, store the result(s) /*1st time
through bogus write*/
end loop
-store the final result(s)
An example will illustrate these techniques. The example used will be a
series of 16 simple 2:1 audio mixers where 2 input values are fetched,
multiplied by 2 gain values then added together and stored. The code will
be Analog Devices SHARC DSPs.
Simple-minded approach, least efficient:
pseudo-code:
loop for n times:
-fetch 1st audio value
-fetch 2nd audio value
-fetch 1st gain coef.
-fetch 2nd gain coef.
-multiply 1st audio by gain
-multiply 2nd audio by gain
-add the 2 products
-store the result
end loop
SHARC code:
/*Assume I0=audio data, I8=gain values, I1=result buffer
NUM_CHANNELS=16*/
LCNTR=NUM_CHANNELS, DO mix_loop-1 UNTIL LCE;
r0=dm(I0,1); /*fetch 1st audio value*/
r1=dm(I0,1); /*fetch 2nd audio value*/
r8=pm(I8,1); /*fetch 1st gain value*/
r9=pm(I8,1); /*fetch 2nd gain value*/
r0=r0*r8;
r1=r1*r9;
r2=r0+r1;
dm(I1,1)=r2; /*store result*/
mix_loop:
Execution takes 129 cycles.
Some parallelism added:
pseudo-code:
loop for n times:
-fetch 1st audio value, fetch 1st gain coef.
-fetch 2nd audio value, fetch 2nd gain coef., multiply 1st audio by gain
-multiply 2nd audio by gain
-add the 2 products
-store the result
end loop
SHARC code:
/*Further assume M1=1, M9=1*/
LCNTR=NUM_CHANNELS, DO mix_loop-1 UNTIL LCE;
r0=dm(I0,M1), r8=pm(I8,M9); /*fetch 1st audio & gain*/
r0=r0*r8, r1=dm(I0,M1), r9=pm(I8,M9); /*1st mult, fetch 2nd*/
r1=r1*r9; /*2nd mult*/
r2=r0+r1; /*add the 2 products*/
dm(I1,1)=r2; /*store the result*/
mix_loop:
Execution takes 97 cycles.
Using Method 1, "Priming the Pump"
pseudo code:
-fetch 1st audio value, fetch 1st gain coef.
-fetch 2nd audio value, fetch 2nd gain coef., multiply 1st audio by gain
-multiply 2nd audio by gain
-add the 2 products
loop for n-1 times:
-fetch 1st audio value, fetch 1st gain coef.
-fetch 2nd audio value, fetch 2nd gain coef., multiply 1st audio by gain
-multiply 2nd audio by gain, store past result
-add the 2 products
end loop
-store the final result
SHARC code:
r0=dm(I0,M1), r8=pm(I8,M9); /*fetch 1st audio & gain*/
r0=r0*r8, r1=dm(I0,M1), r9=pm(I8,M9); /*fetch 2nd audio & gain*/
r1=r1*r9;
r2=r0+r1;
LCNTR=NUM_CHANNELS-1, DO mix_loop-1 UNTIL LCE;
r0=dm(I0,M1), r8=pm(I8,M9); /*fetch 1st audio & gain*/
r0=r0*r8, r1=dm(I0,M1), r9=pm(I8,M9); /*1st mult, fetch 2nd*/
r1=r1*r9, dm(I1,1)=r2; /*2nd mult, store result*/
r2=r0+r1;
mix_loop:
dm(I1,1)=r2; /*store final result*/
Execution takes 66 cycles.
Using Method 2, "Bogus Write"
pseudo code:
loop for n times:
-add the 2 products (1st time bogus), fetch 1st audio value, fetch 1st gain
coef.
-fetch 2nd audio value, fetch 2nd gain coef., multiply 1st audio by gain
-multiply 2nd audio by gain, store past result (1st time bogus)
end loop
-add the 2 products for the final result
-store the final result
SHARC code:
/* Assume I1 is one memory location before results buffer*/
LCNTR=NUM_CHANNELS, DO mix_loop-1 UNTIL LCE;
r2=r0+r1, r0=dm(I0,M1), r8=pm(I8,M9); /*fetch 1st audio & gain*/
r0=r0*r8, r1=dm(I0,M1), r9=pm(I8,M9); /*fetch 2nd audio & gain*/
r1=r1*r9, dm(I1,M1)=r2; /*store, 1st time bogus write*/
mix_loop:
r2=r0+r1; /*do final add*/
dm(I1,M1)=r2; /*store final result*/
Execution is 51 cycles.
Note that care must be taken in register usage to assure results are not
overwritten before they are stored.
As you can see, the bogus write method is very efficient, executing in about
40% of the time for the simple method. It may be possible to further tweak
the priming the pump method to achieve better results, but it will still not
be as good as the bogus write method.
A final note on cache misses: In some architectures, instructions may not
execute in a single cycle unless the code is stored in cache. For example,
with the SHARC, any use of the PM bus where the code is not in cache results
in a 1 cycle stall. Hence, the first time through the loop you run the risk
of additional cycle stalls for every PM use. In this case, the bogus write
method has a further advantage over the priming the pump method because
there are less accesses to the PM bus that could cause cycle stalls.
Obviously, as the number of times through a loop increases, these "first
time" stalls become less significant, but they should be minimized if
optimally efficient code is desired.
|