Mandelbrot white paper

Implementing Fractals with
MMX™ Technology
Information for Developers and ISVs
From Intel® Developer Services
www.intel.com/IDS
March 1996
Information in this document is provided in connection with Intel® products. No license,
express or implied, by estoppel or otherwise, to any intellectual property rights is granted
by this document. Except as provided in Intel's Terms and Conditions of Sale for such
products, Intel assumes no liability whatsoever, and Intel disclaims any express or
implied warranty, relating to sale and/or use of Intel products including liability or
warranties relating to fitness for a particular purpose, merchantability, or infringement of
any patent, copyright or other intellectual property right. Intel products are not intended
for use in medical, life saving, or life sustaining applications. Intel may make changes to
specifications and product descriptions at any time, without notice.
Note: Please be aware that the software programming article below is posted as a public
service and may no longer be supported by Intel.
Copyright © Intel Corporation 2004
* Other names and brands may be claimed as the property of others.
Implementing Fractals with MMX™ Technology
March 1996
CONTENTS
1.0 Introduction
2.0 The Mandelbrot Set
3.0 Input and Output Data Representation
4.0 Code Partitioning
4.1 Data Setup
4.2 Inner Loop Section
4.2.1 Top Level Loops
4.2.2 Iteration Loop
4.2.3 Pixel Display
4.3 Code Comparison
4.4 Sample Code
5.0 Performance
APPENDIX A: Code for the Mandelbrot Set
1
Implementing Fractals with MMX™ Technology
March 1996
1.0 Introduction
The Intel Architecture MMX™ technology extensions use a Single Instruction Multiple Data (SIMD)
technique to increase the efficiency of the software by operating on multiple data elements in parallel. In
addition to the scalar registers the availability of eight 64-bit registers can potentially decrease the usage
of memory as temporary space thus reducing costly memory accesses. This application note illustrates
how to use the MMX technology to achieve better performance in generating Mandelbrot Set fractals. In
graphics applications fractal geometry methods are used to realistically describe and generate natural
objects such as mountains and clouds. Unlike continuous Euclidean shapes fractal objects retain their
sharpness on magnification, such that as the camera gets closer to an object smaller detail becomes more
visible. In this paper the code for an MMX technology function, a scalar assembly function, and a floating
point function are presented, and performance and accuracy trade-offs are explained. Note that the
concepts and coding tricks used in this paper are applicable to fractal functions other than the Mandelbrot
Set. Figure 1 shows a sample Mandelbrot Set fractal.
Figure 1: Sample Mandelbrot Set Fractal
2
Implementing Fractals with MMX™ Technology
March 1996
2.0 The Mandelbrot Set
Fractal geometry uses procedures to describe fragmented or irregular shaped objects. In theory these
objects are represented with procedures which are repeated infinitely. However, for graphical
applications, the procedures are run for a finite count. There are several classes of fractals, and one of
these uses nonlinear transformations named "invariant fractal sets". The Mandelbrot set, which is a selfsquaring fractal, is included in this class.
The Mandelbrot set is the set of complex values that do not diverge under the squaring transformation. In
this application note, a complex number, z is defined to be an ordered pair of real numbers and is defined
as:
z=x + jy ;where x=Re(z), y=Im(z)
The real and imaginary parts will further be referenced as c.r and c.i respectively.
Thus the squaring transformation for the Mandelbrot set can be written as:
z0 = z
zk = z2k-1 + z0, k=1,2,3,...
zk is repeatedly calculated until it can be determined whether or not the transform is diverging.The fractal
is the boundary of the convergence region in the complex plane.
To generate the Mandelbrot set a window is chosen in the complex plane. The major part of the set is in
the region:
-2.25 <= Re(z) <= 0.75
-1.25 <= Im(z) <= 1.25
By selecting smaller windows in this range one can effectively zoom in and generate more detailed
images. The algorithm maps the positions in this region to color coded pixel positions on the display
surface. By using the magnitude of the complex number one can determine whether or not the number
will diverge quickly. One can set a magnitude limit and iterate until this limit is reached or an arbitrary
iteration limit is reached. The iteration count then can be used as an index to a color palette and the
resulting pixel color is put on the display surface. The high level algorithm implemented in C is given
below:
void mandelbrot ( rMin, rInc, iMin,
{
//
//
iInc,
cols,
int xc, yc, ccount;
COMPLEX zInit;
PALETTE LocalPalette[256];
rInc is defined as (rMax - rMin) /cols;
iInc is defined as (iMax - iMin) /rows;
zInit.r = rMin;
for (xc=0; xc<cols; xc++)
{
zInit.i = iMin;
for (yc=0; yc<rows; yc++)
3
rows)
Implementing Fractals with MMX™ Technology
March 1996
{
ccount = Iterate(zInit);
PutPixel (xc, yc, LocalPalette[ccount]);
zInit.i = zInit.i + iInc;
}
zInit.r = zInit.r + rInc;
}
}
int Iterate (COMPLEX zInit)
{
int cnt= 0;
COMPLEX z;
z= zInit;
do
{
z = ComplexSqr(z);
c.r = c.r + zInit.r;
c.i = c.i + zInit.i;
cnt++;
}
while ((c.r*c.r+c.i*c.i <=4.0) & (cnt <255));
return cnt;
}
COMPLEX ComplexSqr(COMPLEX c)
{
COMPLEX result;
result.x = c.r * c.r - c.i * c.i ;
result.y = 2 * c.r * c.i ;
return result;
}
Figure 2. The High Level Algorithm Implemented in C
The Mandelbrot function as shown in Figure 2 calculates the color intensity for each pixel and maps them
to the display surface. The size of the display surface is defined by the row and column size, and the
complex window is defined by the complex values passed in to the function. The color intensity value is
returned by the iterate function. The maximum limit for the iteration count is set to 256, which will give
256 different colors with a 256 color palette. At each iteration the complex number is squared and
incremented by the delta values. Then the magnitude for the resulting complex number is calculated and
checked against the value 4.0. If the magnitude is equal or higher than 4.0 the iteration count is returned.
In the event that the magnitude does not diverge the maximum iteration count is returned.
4
Implementing Fractals with MMX™ Technology
March 1996
3.0 Input and Output Data Representation
The real and imaginary parts of the complex numbers are each represented as a word (16-bits) in fixed
point notation 5.11 (five bits for the whole part and eleven bits for the fractional part). This format was
chosen because the window for the Mandelbrot set can be represented with one decimal digit for the
whole part. Five bits are sufficient to represent the whole part of the signed number in binary. Eleven
fractional bits provide a reasonably sharp image. As sections of the image are enlarged a certain loss of
precision is possible.
By using words to represent the real and imaginary parts, two complex numbers can be stored in one 64bit register, as shown in Figure 3. It is desirable to store multiple numbers in the registers so that they can
be manipulated in parallel with MMX instructions.
Figure 3: Complex Numbers in 64-bit Register
The loop iteration count, which also represents the color intensity index, is an unsigned byte. Even though
eight bits are enough to access the palette a 32-bit integer register is used to avoid prefix penalties. The
color palette is an array of 256 24-bit RGB values derived from the Microsoft Windows System Palette.
Several MMX technology registers are used to hold intermediate values to avoid partial read/writes to
memory. As discussed in Section 2.0, the algorithm requires real and imaginary increment values to be
added to complex numbers in the two loops which process x- and y-coordinates. Two 64-bit registers are
initialized with the appropriate values to perform the increments in one cycle. The register contents are
displayed in Figure 4, below.
Figure 4: Increment Value Registers
Both the complex number magnitude and the iteration count need to be checked at the innermost loop. To
avoid two separate compares these two values are stored in one 64-bit register, and one compare
instruction is used. The register is shown in Figure 5.
5
Implementing Fractals with MMX™ Technology
March 1996
Figure 5: Magnitude/Iteration Count Register
6
Implementing Fractals with MMX™ Technology
March 1996
4.0 Code Partitioning
4.1 Data Setup
In this section the data setup is explained in detail to show how to effectively pass 32-bit and 16-bit
parameters to functions. The parameters to the function are:
mandmmx(
hdc, ptrmparams, CWIN_HEIGHT, CWIN_WIDTH, ptrPal);
The hdc is a window handle which is a 32-bit value and ptrmparams is a 32-bit pointer to the
MMXPARAMS struct which contains the values for the complex window range. CWIN_HEIGHT and
CWIN_WIDTH are 32-bit integer values for the display surface size, and the ptrPal is a 32-bit pointer to
the palette used. Please note that all parameters passed to the function are 32-bit aligned. The structure,
MMXPARAMS, is of particular interest since by setting the data up properly we can avoid partial memory
accesses and also read data 64-bits at a time. The structure is 16-bit word aligned. As explained in Section
3 the Mandelbrot MMX technology code operates on 16-bit words, however, passing in the parameters as
16-bit values would have been costly due to partial reads. The structure is defined as:
typedef struct defMMXPARAMS
{
int mm2_1;
//rMin
int mm2_2;
//iMin
int mm3_1;
//0
int mm3_2;
//iInc
int mm4_1;
//rInc
int mm4_2;
//iMin
} MMXPARAMS;
As the structure element names imply, the first two values are loaded as a 64-bit value into the register
mm2, the next two into mm3, and the last two into mm4. A packssdw instruction is performed on all three
registers to set the data up as 16-bit words. The data setup code is given in Figure 6. Memory is accessed
sequentially to shorten memory access time. Instructions are out of order for proper pairing.
mov edi, mparams
xor edx, edx
movq mm2, [edi]
;pointer to mparams
;clear x-coor counter
;load mm2
rMin iMin
movq mm3, [edi+8]
packssdw mm2, mm2
movq mm4, [edi+16]
packssdw mm3, mm3
;load mm3
0
iInc
;mm2 = rMin iMin rMin iMin
;load mm4
rInc iMin
;mm3 = 0
iInc 0
iInc
packssdw mm4, mm4
;mm4 = rInc iMin rInc iMin
Figure 6: Data Setup Code
The registers are formatted as shown below to facilitate efficiency in the code. The reason this format was
chosen will be evident when the code is discussed later on.
MM2
MM3
MM4
after loading
rMin iMin
0
iInc
rInc iMin
after
rMin
0
rInc
packing to 16-bits
iMin rMin iMin
iInc 0
iInc
iMin rInc iMin
4.2 Inner Loop Section
7
Implementing Fractals with MMX™ Technology
March 1996
The MMX technology code is written using the algorithm given in Figure 2. At the top level there are two
loops, one incrementing the x-coordinate and the other incrementing the y-coordinate. The iteration loop
where the code spends most of its time is the innermost loop. A section of code is used to set up the stack
for the Windows function to display the pixel on the surface. Each of these sections will be described
below.
4.2.1 Top Level Loops
The code section in Figure 7 shows the two top level loops. The code's efficiency is contained in the
method that the increments are performed. The code shown is not optimized.
loopx:
movq mm0, mm2
xor ecx, ecx
loopy:
pxor mm5, mm5
xor eax, eax
;mm0 gets a copy of m2
;clear y coor counter
;clear count and magnitude register
;color/count register
iter:
;;This is the iteration loop
doneiter:
;Set pixel data and make the call
call
SetPixelV@16
;call to SetPixelV
paddsw mm2, mm3
;add iInc to mm2
movq mm0, mm2
;mm0 gets m2
inc ecx
;increment yloop counter
cmp ecx, rows
;done yet?
jne loopy
pand mm2, dword ptr EVENWORDMASK
;mm2 rval 0 rval 0
paddsw mm2, mm4
;add rInc to mm2
inc edx
;increment x loop counter
cmp edx, cols
;done yet?
jne loopx
Figure 7: The Top Level Loops
4.2.2 Iteration Loop
In the iteration loop, MMX instructions are used to operate on multiple data in parallel to speed up the
calculations. For instructional purposes the code given here is not optimized. The performance optimized
code, which is given in Appendix A, needs to be examined carefully to understand the high instruction per
clock ratio of this particular section.
iter:
movq mm1, mm0
;mm0=mm1 is c.r c.i c.r c.i
pmullw mm1, dword ptr NEGMASK ; mm1 = c.r -c.i c.r c.i
pmaddwd mm1, mm0
;multiply add mm1, mm0
psrad mm1, 11
;adjust precision for fix pt
punpckldq mm5, mm1
;mm5 was 0 curcnt
;mm5 is now check curcnt
pcmpgtd mm5, dword ptr CHECKCTRMASK
; check if mm5 greater than 4.0 255
; if mm5 has any 1's then
; either condition was true
packssdw mm5, mm5
;pack upper and lower dwords into words
8
Implementing Fractals with MMX™ Technology
March 1996
movd ebx, mm5
cmp ebx, 0h
jne doneiter
;ebx gets mm5
;check if ebx is all zeros
;if not we are done iterating
inc eax
;increment color/counter register
movd mm5, eax
;lower dword of mm5 is color/cnter value
psrlq mm1, 32
;mm1 0 res.r
movq mm6,mm0
;mm6 gets a copy of mm0
psrlq mm0, 16
;mm0 = 0 c.r c.i c.r
pand mm0, dword ptr ODDWORDMASK ;mm0 = 0 c.r 0
c.r
pmaddwd mm0, mm6
;mm0 is 0+c.r*c.i 0+c.r*c.i
psrad mm0, 10
;adjust precision for fx pt
;actually (fracbits-1) for the multiply.
punpckldq mm0, mm1
;mm0 res.r res.i
packssdw mm0, mm0
;mm0 is res.r res.i res.r res.i
paddsw mm0,mm2
;mm1 is z^2+z0
jmp iter
Figure 8: The Iteration Loop
There is no pairing shown in the above code. Figure 8 is a translation of the code in Figure 2 to MMX
technology without regard to optimization. Note that in the first section with one pmaddwd instruction
both the result.r and the magnitude are calculated. Also, by using the mm5 register to perform the
compare both the magnitude and the maximum iteration count are checked in parallel. During the
calculation of result.i the multiply by two is done without an extra shift or multiply instruction by shifting
right one less than the number of fractional bits.
Even in this form, the MMX technology code will be quite efficient However, there are register
dependencies on registers mm0, mm1, and mm5 since the operations are performed sequentially. By
taking some of the instructions which work on mm0 and interleaving them with the instructions which use
mm1 and mm5, the pairing can be increased while lowering penalties. The resulting code with the optimal
pairing is in Appendix A.
4.2.3 Pixel Display
Pixel mapping to the display surface is accomplished by using the Windows API function SetPixelV. The
parameters for this function must be pushed on the stack before the call is made. The integer registers are
not preserved on returning from the function. Note that using this function is not the most optimal way of
putting the pixels on the display surface. However, it keeps the overall algorithm simpler. It is also
possible to store the pixel information in the memory and then use a high level block transfer function to
update the display. Interested readers are encouraged to try this approach.
The sequence to call the SetPixelV is given below:
mov eax, PPal[eax]
mov
ebx,hdc
push
eax
push
ecx
push
edx
push
ebx
call
SetPixelV@16
9
;index to the palette
;the handle
;the color
;the y coor
;the x coor
;call to SetPixelV
;@16 is reqired when the function
;is called from an assembly routine.
Implementing Fractals with MMX™ Technology
March 1996
Figure 9: Pixel Display
4.3 Code Comparison
The Mandelbrot algorithm shown in Figure 2 has also been implemented using integer assembly and
floating point code. These two versions are also included in Appendix A.
Note that the integer scalar code uses 32-bit registers but the fixed point format is left as 5.11. Prefix
penalties are avoided by using 32-bit registers and 32-bit instructions. Since there are different exit
conditions from the innermost loop, it is very difficult to save variables on the stack and retrieve them.
Thus, memory is used to store most of the intermediate values. Wherever possible registers are saved on
the stack. There is no apparent way to operate on two complex numbers simultaneously. However, with
careful use of all available general purpose registers, penalties are avoided. Also note that imul
instructions take ten clocks, as opposed to the single clock needed to issue pmaddwd instructions.
The floating point code uses single precision numbers. The code flow is similar to the integer version.
One of the drawbacks is having to perform operations on the top of the stack as mandated by the
definition of the instructions. This increases dependency on previous values and causes pipeline stalls.
Additionally, saving from the top of the stack to memory takes several clock cycles. Floating point
multiply instructions take three clocks each as opposed to the ten required by imul.
4.4 Sample Code
The executable file for the Mandelbrot Set is also included for downloading. This code is meant to be run
under Microsoft Windows '95 OS on an Intel system with an MMX technology processor, and has not
been tested on any other systems. The program can be run to obtain the number of CPU clocks execution
takes with different complex windows.
The user can select either default parameters or custom parameters for the complex window. Once
parameters are entered, the user executes the "run" instruction and the display is updated.
10
Implementing Fractals with MMX™ Technology
March 1996
5.0 Performance
All three versions of the Mandelbrot function are optimized for speed. Memory accesses are 32-bit
aligned and the instructions are paired where possible. The timings were measured with Intel's VTune 2.0
[please see the VTune Home Page for information on ordering this product.]. The application also
includes a utility which measures the number of CPU ticks taken by the different functions. The static
analysis data including timing and pairing information obtained with VTune is for the functions presented
in Appendix A. The clock cycles reported by the shell program include all the clock cycles for data setup
and call for the individual functions in the C portion. They also show all the clocks needed to draw the
complete object on the display surface using the SetPixelV function, thus they reflect real execution time.
All performance numbers are based on a prototype 150MHz Pentium(R) Processor with MMX
technology.
Using the default parameters for the complex window given in Section 2, and using 320 for
CWIN_WIDTH and 240 for CWIN_HEIGHT, the integer assembly version takes about 425 million
clocks, the floating point version takes about 407 million clocks, and the MMX technology version uses
about 286 million clocks. By changing the size of the complex window different clock counts are
obtained. With default parameters the MMX technology code is 42 % faster than the floating point code
and 49 % faster than the integer code. With certain complex windows the MMX technology version will
perform upto 100% faster than the other two versions. The clocks per pixel is high due to the inefficiency
of the SetPixelV function. However, the SetPixelV function adds the same number of cycles to each
function.
The values obtained using VTune are given in the tables below. The first table contains the three functions
outlined in the text. The second table lists the C-functions which are used to arrange the parameters and
call the functions listed in the first table. Note that the calling functions for integer and MMX technology
versions use more instructions for data setup for the fixed point conversion. Effects of cache misses and
other dynamic issues are not discussed since they do not play a major role with the algorithm used. The
color palette has only 256 entries and is probably in the cache most of the time.
Table 1-2) Performance Numbers
11
Implementing Fractals with MMX™ Technology
March 1996
APPENDIX A
TITLE
mandmmx
;***************************************************************************/
;*
;*
This program has been developed by Intel Corporation.
;*
You have Intel's permission to incorporate this code
;*
into your product, royalty free. Intel has various
;*
intellectual property rights which it may assert under
;*
certain circumstances, such as if another manufacturer's
;*
processor mis-identifies itself as being "GenuineIntel"
;*
when the CPUID instruction is executed.
;*
;*
Intel specifically disclaims all warranties, express or
;*
implied, and all liability, including consequential and
;*
other indirect damages, for the use of this code,
;*
including liability for infringement of any proprietary
;*
rights, and including the warranties of merchantability
;*
and fitness for a particular purpose. Intel does not
;*
assume any responsibility for any errors which may
;*
appear in this code nor any responsibility to update it.
;*
;* * Other brands and names are the property of their respective
;*
owners.
;*
;***************************************************************************/
; This program was assembled with Microsoft MASM 6.11d
;
; prevent listing of iammx.inc file
.nolist
INCLUDE iammx.inc
; IAMMX
Macros
.list
.586
.model FLAT, STDCALL
extern SetPixelV@16:proc
;Needed to put pixels on the surface
;****************************************************************************
;
Data Segment Declarations
;****************************************************************************
.data
ODDWORDMASK
QWORD
0000FFFF0000FFFFh
EVENWORDMASK
QWORD
0FFFF0000FFFF0000h
EVENDWORDMASK
QWORD
0FFFFFFFF00000000h
NEGMASK
QWORD
0001FFFF00010001h
CHECKCTRMASK
QWORD
00001FFF000000FFh
FOUR
DD
4.0; do not delete the period.
;Local vars for the mandfpu
fzinitX
DD
?
fzinitY
DD
?
fzX
DD
?
fzY
DD
?
frInc
DD
?
fiInc
DD
?
;Local vars for the mandasm
zinitX
DWORD
?
zinitY
DWORD
?
zX
DWORD
?
zY
DWORD
?
temp
DWORD
?
temp1
DWORD
?
;****************************************************************************
;
Constant Segment Declarations
;****************************************************************************
.const
12
Implementing Fractals with MMX™ Technology
March 1996
;****************************************************************************
;
Code Segment Declarations
;****************************************************************************
.code
COMMENT ^
void mandmmx (
int
*hdc,
int
mparams
int
cols,
int
rows,
int
PPal) ;
^
mandmmx PROC NEAR C USES eax ebx ecx edx,
hdc: PTR DWORD,
mparams: DWORD,
cols: DWORD, rows: DWORD, PPal: DWORD
mov edi, mparams
;pointer to mparams
xor edx, edx
;clear x-coor counter
movq mm2, [edi] ;load mm2
rMin iMin
movq mm3, [edi+8]
packssdw mm2, mm2
movq mm4, [edi+16]
packssdw mm3, mm3
;load mm3
0
iInc
;mm2 = rMin iMin rMin iMin
;load mm4
rInc iMin
;mm3 = 0
iInc 0
iInc
packssdw mm4, mm4
;mm4 =
loopx:
movq mm0, mm2
xor ecx, ecx
loopy:
pxor mm5, mm5
xor eax, eax
rInc iMin rInc iMin
;mm0 gets a copy of m2
;clear y coor counter
;clear count and magnitude register
;color/count register
iter:
movq mm1, mm0
;mm0=mm1 is c.x c.y c.x c.y
pmullw mm1, dword ptr NEGMASK ; mm1 = c.x -c.y c.x c.y
movq mm6,mm0
;mm6 gets a copy of mm0
pmaddwd mm1, mm0 ;multiply add mm1, mm0
psrlq mm0, 16
;mm0 = 0 c.x c.y c.x
pand mm0, dword ptr ODDWORDMASK ;mm0 = 0 c.x 0
c.x
psrad mm1, 11
;adjust precision for fix pt
punpckldq mm5, mm1
;mm5 was 0 curcnt
;mm5 is now check curcnt
pmaddwd mm0, mm6 ;mm0 is 0+c.x*c.y 0+c.x*c.y
pcmpgtd mm5, dword ptr CHECKCTRMASK
; check if mm5 greater than 4.0 255
; if mm5 has any 1's then
; either condition was true
psrlq mm1, 32
; mm1 0 res.x
packssdw mm5, mm5
;pack upper and lower dwords into words
psrad mm0, 10
;adjust precision for fx pt
;actually (fracbits-1) for the multiply.
movd ebx, mm5
;ebx gets mm5
cmp ebx, 0h
;check if ebx is all zeros
jne doneiter
;if not we are done iterating
inc eax
punpckldq mm0, mm1
movd mm5, eax
packssdw mm0, mm0
paddsw mm0,mm2
jmp iter
doneiter:
;increment color/counter register
;mm0 res.x res.y
;lower dword of mm5 is color/cnter value
;mm0 is res.x res.y res.x res.y
;mm1 is z^2+z0
13
Implementing Fractals with MMX™ Technology
March 1996
;Set pixel
mov eax, PPal[eax]
;index to the palette
push edx
;save xcoor counter
push ecx
;save the ycoor counter
mov
ebx,hdc
;the handle
push
eax
;the color
push
ecx
;the y coor
push
edx
;the x coor
push
ebx
call
SetPixelV@16
;call to SetPixelV
paddsw mm2, mm3
;add iInc to mm2
pop ecx
pop edx
movq mm0, mm2
;mm0 gets m2
inc ecx
;increment yloop counter
cmp ecx, rows
;done yet?
jne loopy
pand mm2, dword ptr EVENWORDMASK ;mm2 rval 0 rval 0
inc edx
;increment x loop counter
cmp edx, cols
;done yet?
paddsw mm2, mm4
;add rInc to mm2
jne loopx
Done:
emms
ret
mandmmx ENDP
;;*********************************************************************************
COMMENT ^
void mandasm (
int
*hdc,
int
mparams
int
cols,
int
rows,
int
PPal) ;
^
mandasm PROC NEAR C USES eax ebx ecx edx,
hdc: PTR DWORD,
mparams: DWORD,
cols: DWORD, rows: DWORD, PPal: DWORD
mov edi, mparams
mov edx, [edi]
mov zinitX, edx
xor edx, edx
loopx:
mov ecx, [edi+8]
mov zinitY, ecx
xor ecx, ecx
loopy:
mov eax, zinitX
mov ebx, zinitY
mov zX, eax
mov zY, ebx
xor eax, eax
iter:
push edx
mov ebx, zY
mov edx, zX
imul ebx, zY
imul edx, zX
;ptr to mparams
;clear x-coor counter
;imin
;clear y coor counter
;color/count register
;zY into register ebx
;zX into register edx
;ebx is c.y*c.y
;edx is c.x*c.x
14
Implementing Fractals with MMX™ Technology
March 1996
sar ebx, 11
mov temp, ebx
sar edx, 11
mov temp1, edx
add ebx, edx
pop edx
cmp ebx, 01FFFh
jg doneiter
cmp eax, 00FFh
jg doneiter
push edx
mov ebx, zY
inc eax
imul ebx, zX
mov edx, temp1
sar ebx, 10
add ebx, zinitY
sub edx, temp
mov zY, ebx
add edx, zinitX
mov zX, edx
pop edx
;shift for fixed pt
;temp is c.y*c.y
;shift for fixed pt
;temp1 is c.x*c.x
;ebx is c.x*c.x+c.y*c.y
;check if ebx is 4 or more
;if so we are done iterating
;have we reached the iteration count
;zY into register ebx
;increment counter
;ebx is c*x*c.y
;edx gets temp1=c.x*c.x
;2*c.x*c.y aligned for fixed pt.
;ebx is 2*c.x*c.y+zinitY
;edx is c.x*c.x-c.y*c.y
;zY is 2*c.x*c.y+zinitY
;edx is c.x*c.x-c.y*c.y+zinitX
;zX is c.x*c.x-c.y*c.y+zinitX
jmp iter
doneiter:
;Set pixel
mov eax, PPal[eax]
;index to the palette
push edx
push ecx
mov
ebx,hdc
;the handle
push
eax
;the color
push
ecx
;the y coor
push
edx
;the x coor
push
ebx
call
SetPixelV@16
;call to SetPixelV
mov edx, zinitY
pop ecx
add edx, [edi+12]
; add iInc
inc ecx
;increment yloop counter
mov zinitY, edx ;save new zinitY
pop edx
cmp ecx, rows
;done yet?
jne loopy
mov ebx, zinitX
inc edx
;increment x loop counter
add ebx, [edi+4] ;add rInc
cmp edx, cols
;done yet?
mov zinitX, ebx
;save new zinitX
jne loopx
Done:
ret
mandasm ENDP
;;**************************************************************************
COMMENT ^
void mandfpu (
int
*hdc,
int
mparams
int
cols,
int
rows,
int
PPal) ;
^
15
Implementing Fractals with MMX™ Technology
March 1996
mandfpu PROC NEAR C USES eax ebx ecx edx,
hdc: PTR DWORD,
mparams: DWORD,
cols: DWORD, rows: DWORD, PPal: DWORD
;;All floating point instructions are in capital letters.
mov edi, mparams
;ptr to mparams
FINIT
;initialize the Funit
mov edx, [edi]
xor eax, eax
;clear eax
mov fzinitX, edx
;rMin
mov ebx, [edi+4]
mov frInc, ebx
;rInc
mov ecx, [edi+12]
mov fiInc, ecx
;iInc
xor edx, edx
;clear x-coor counter
loopx:
mov ecx, [edi+8]
;iMin
mov fzinitY, ecx
;fzinitY set to iMin
xor ecx, ecx
;clear y coor counter
loopy:
mov eax, fzinitX
mov ebx, fzinitY
mov fzX, eax
;initialize fzX with fzinitX
mov fzY, ebx
;initialize fzY with fzinitY
xor ebx, ebx
;color intensity register
iter:
FLD fzY
;load zY to ST
FMUL fzY
;ST is c.y*c.y
FST ST(1)
;ST(1) is c.y*c.y
FLD fzX
FMUL fzX
;load zX to ST
;ST is c.x*c.x
FST ST(3)
;ST(3) is zX^2
;ST(3) is zX^2
;ST(2) is zY^2
;ST(1) is zY^2
;ST is zX^2
FADDP ST(1), ST
;ST is c.x*c.x+c.y*c.y
FCOMP FOUR
;check if ebx is 4 or more
;ST(1) is zX^2
;ST is zY^2
FSTSW AX
;flags stored in AX
test eax, 4500h
;mag greater then 4.
jz doneiter
;if so we are done iterating
test eax, 4000h
;mag equal to 4.
jnz doneiter
;if so we are done iterating
cmp ebx, 00FFh
;have we maxed out on colors?
jg doneiter
inc ebx
;increment color intensity
;ST(1) is zX^2
;ST is zY^2
FSUBR ST , ST(1)
;ST=ST(1)-ST;
FLD
FMUL
FADD
FADD
FSTP
fzX
fzY
ST, ST(0)
fzinitY
fzY
FADD fzinitX
FSTP fzX
;STACK is empty
;ST is fzX
;ST is c.x*c.y
;ST is 2*c.x*c.y
;ST is 2*c.x*c.y+fzinitY
;store and pop
;ST is c.x*c.x-c.y*c.y+fzinitX
;store and pop
16
Implementing Fractals with MMX™ Technology
March 1996
jmp iter
doneiter:
;Set pixel
mov eax, PPal[ebx]
push edx
push ecx
mov
ebx,hdc
push
eax
push
ecx
push
edx
push
ebx
call
SetPixelV@16
FLD fzinitY
FADD fiInc
pop ecx
pop edx
inc ecx
cmp ecx, rows
FSTP fzinitY
;STACK is empty
jne loopy
FLD fzinitX
inc edx
FADD frInc
cmp edx, cols
FSTP fzinitX
;STACK is empty
jne loopx
;index to the palette
;the
;the
;the
;the
handle
color
y coor
x coor
;call to SetPixelV
;ST is fzinitY
;ST fzinitY+fiInc
;increment yloop counter
;done yet?
;fzinitY=fzinitY+fiInc --pop ST
;ST is fzinitX
;increment x loop counter
;ST is zinitX+frInc
;done yet?
;fzinitX=fzinitX+frInc --pop ST
Done:
ret
mandfpu ENDP
END
17