
Has anybody profiled the math library?

Posted: Thu Dec 22, 2022 12:51 am
by Michael Kaiser

https://github.com/commanderx16/x16-docs/blob/master/X16 Reference - 05 - Math Library.md

Does anybody have any idea how many cycles each of the math library functions takes?



Posted: Thu Dec 22, 2022 5:46 am
by neutrino

Do the timer() .. for(i=0; i<1000; i++) { sin(1.234); } timer() thing?

Any particular reason for the interest in the benchmark data?



Posted: Sat Dec 24, 2022 3:59 am
by Michael Kaiser


On 12/22/2022 at 12:46 AM, neutrino said:




Do the timer() .. for(i=0; i<1000; i++) { sin(1.234); } timer() thing?



Any particular reason for the interest in the benchmark data?



The math library isn't really wrapped for use from C like that; it's really assembly oriented. I'm interested in the performance so I can decide whether it's worth writing faster math functions or just using the library. I haven't bothered to formally profile the routines, but they seem to run at approximately the same speed as BASIC. Currently I'm using them to build a table at the beginning of my program, which I will use for faster math during the run.



Posted: Sat Dec 24, 2022 4:16 am
by neutrino

Maybe it would be worthwhile to make a small wrapper interface from C to the assembly math routines then? I tested calling some CBM BASIC v2 math routines from cc65 a few years ago. It works all right once one gets past the number format.

OTOH, you could just read the system clock from assembler and time it that way?
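
The timing idea can be sketched in portable C. This is only an illustration: `time_calls` and `sample_work` are hypothetical names, `clock()` is the standard C function from `<time.h>`, and whether a given X16 toolchain provides it is an assumption to verify. A math library routine would still need an assembly wrapper before it could be called this way.

```c
#include <time.h>

/* Hypothetical signature for the routine being timed. */
typedef void (*bench_fn)(void);

/* Volatile so the compiler cannot optimize the sample work away. */
static volatile unsigned long bench_sink = 0;

/* Stand-in workload; a real test would call a wrapped math routine here. */
static void sample_work(void)
{
    bench_sink += 3;
}

/* Call fn() count times and return the elapsed clock() ticks. */
static clock_t time_calls(bench_fn fn, unsigned long count)
{
    unsigned long i;
    clock_t start = clock();

    for (i = 0; i < count; ++i) {
        fn();
    }
    return clock() - start;
}
```

Dividing the value returned by `time_calls(sample_work, 1000UL)` by `CLOCKS_PER_SEC` gives seconds. Timing an empty function the same way and subtracting it removes most of the loop overhead.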



Posted: Sat Dec 24, 2022 5:49 am
by Michael Kaiser


On 12/23/2022 at 11:16 PM, neutrino said:




Maybe it would be worthwhile to make a small wrapper interface from C to the math assembler then? I tested to call some CBM Basic v2 math from cc65 a few years ago. Works alright once one gets past the number format.



Otoh, you could just read the system clock from assembler and time it that way?



I may, but I've already determined it's "basic slow".  The math I need is simple enough to pre-calculate into tables, so I'm going with that.



Posted: Sat Dec 24, 2022 1:30 pm
by desertfish

Precalc tables all the way, especially since we have lots of memory on the X16?



Posted: Sat Dec 24, 2022 7:06 pm
by Daedalus


On 12/23/2022 at 9:59 PM, Michael Kaiser said:




I'm interested in the performance to decide if it's worth writing faster math functions or just using it.



All day long: It's better and faster to just write faster math functions. All day and in every way.

But wait! Now I'll completely contradict myself! The reason for this isn't because the math functions are inefficient or slow, it's just that they're generic. As such, it's also a waste of time to try to write a "better math library..." as it will have the same problem of being generic and slow.

What you do is write the math out as much as possible in the first place. Here's an example: say you have a bunch of on-screen things, each located at an x,y pixel offset from a base address, as you would in, say, an X16 app that uses bitmap graphics to draw stuff. The math is trivial when expressed in C: EffectiveAddr = BaseAddr + (X + (Y * 320)). If the base address is, say, 0x00000 (the very start of VRAM) and the mode is 320*240, you would be using that calculation everywhere. If you were coding in C, you might just write out that simple equation all over the place instead of trying to optimize it. That's a trap of using C: it's so easy to slap down crazy complicated math that you never stop to think whether you could just eliminate the math in the first place.

But wait. What if you never change the mode, and as such never change the number of pixels on a row, and also always draw those things at the same x and y location? Well heck, you need almost no math routines at all then. You can just let the assembler do it with:

offset_address = 168+(204*320)

That just creates a constant: a 24-bit number that is the precalculated address offset for that feature. In code, you can then use three macros to set the base address, add the offset, and then apply the result to the VERA address and stride registers, like this:

mem_SET_IMM_24 VRAM_bitmap, ZP24_R0

math_ADD_IMM_24 offset_address, ZP24_R0

vera_SET_VRAM_ADDR ZP24_R0, 0, $10    ;Addr0, stride 1

As you can see, I'm a big fan of macros. And I'm a fan of simply NOT DOING math I can just avoid completely. The only math in there is a 24-bit add of the offset into the zero page 24-bit temp storage. And if I wanted to... I could have just rolled the offset into the mem_SET macro by adding it to the IMM param: (VRAM_bitmap+(168+(204*320)))
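
In C terms, the same trick looks roughly like this. It is a sketch, not the macros from this post: `VRAM_BITMAP_BASE`, `effective_addr`, and `FEATURE_ADDR` are made-up names, one byte per pixel is assumed, and the `UL` suffixes assume a compiler such as cc65 where a plain `int` is only 16 bits wide.

```c
/* Assumed layout: 320 pixels per row, one byte per pixel, bitmap at VRAM 0. */
#define ROW_PIXELS        320UL
#define VRAM_BITMAP_BASE  0x00000UL

/* General form: evaluated at run time on every call. */
static unsigned long effective_addr(unsigned long base,
                                    unsigned long x, unsigned long y)
{
    return base + (x + (y * ROW_PIXELS));
}

/* Fixed position (168, 204): the compiler folds this into one constant. */
#define OFFSET_ADDRESS  (168UL + (204UL * ROW_PIXELS))
#define FEATURE_ADDR    (VRAM_BITMAP_BASE + OFFSET_ADDRESS)
```

With these values `FEATURE_ADDR` is 65448 ($FFA8), and no multiply is left in the running program.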

So yeah... make ca65 do the math, or store a precalculated lookup table in a file and just load it. Make a tool in C that stores the table and is never even IN the X16 app.
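
A host-side generator along those lines might look like the sketch below. It precomputes the VRAM offset of every row of a 320-wide bitmap and formats each one as three bytes in ca65 `.byte` syntax, ready to be pulled in with `.include`. The function names, the table contents, and the low/middle/high byte order are all assumptions chosen for illustration.

```c
#include <stdio.h>

#define ROW_PIXELS 320UL   /* assumed: 320 bytes per bitmap row */
#define ROW_COUNT  240U    /* assumed: 320x240 mode */

/* Offset of the first pixel of row y from the bitmap base. */
static unsigned long row_offset(unsigned y)
{
    return (unsigned long)y * ROW_PIXELS;
}

/* Format one 24-bit table entry as a ca65 .byte line (low byte first). */
static int format_row(char *buf, size_t size, unsigned y)
{
    unsigned long off = row_offset(y);

    return snprintf(buf, size, "    .byte $%02X, $%02X, $%02X",
                    (unsigned)(off & 0xFF),
                    (unsigned)((off >> 8) & 0xFF),
                    (unsigned)((off >> 16) & 0xFF));
}

/* Write the whole table to out, one row per line, under a label. */
static void write_table(FILE *out)
{
    char line[64];
    unsigned y;

    fputs("row_offsets:\n", out);
    for (y = 0; y < ROW_COUNT; ++y) {
        format_row(line, sizeof line, y);
        fprintf(out, "%s\n", line);
    }
}
```

Calling `write_table(stdout)` from `main` and redirecting the output to a file gives a table the X16 program can index by row, so the multiply by 320 never runs on the 6502.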

But when there's no way around it? Then you use a math routine in assembler, tuned to the exact data sizes you need. You can START with generic routines to get it to work; then, as you refactor the code or see that it needs optimization, "eliminate the math" to simplify the routine.

 

 



Posted: Sun Dec 25, 2022 12:57 am
by neutrino

Regarding math functions: I did some testing 6 days ago, trying to replicate the Commodore 64 sin() function:

Source: https://www.c64-wiki.com/wiki/POLY1

If 'x' is in radians.

x2 = x / (2*pi)
i  = int(x2)
f  = x2 - i

if( 0 <= f <= 0.25 )
    { f2 = f }
elsif( 0.25 < f <= 0.75 )
    { f2 = 0.5 - f }
elsif( 0.75 < f < 1 )
    { f2 = f - 1 }

# Polynomial used
result =    -14.381390672  * f2^11
             +42.007797122  * f2^9
             -76.704170257  * f2^7
             +81.605223686  * f2^5
             -41.341702104  * f2^3
              +6.2831853069 * f2

And the result of sin(x) is then in 'result'.

The precision is not quite everything the C64 number format permits, but it seems close enough. This appears to be how Commodore 64 BASIC computes sin(x): in essence a polynomial, and many other math functions seem to rely on polynomials too. One function that seems to be missing on the C64 is acos().

Of course f2 can be calculated with "f2  = x / (2*pi)" too, instead of 3 statements (if I recall it correctly).

The question is then how to arrive at these polynomials, and whether they can be improved. All this of course matters whenever math is needed where the CPU doesn't have the function you need and there is no library to get it done. And the number format used by existing libraries can be suboptimal for the task at hand.
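
For reference, the pseudocode above translates into C roughly as follows. This is a sketch of the steps as listed in this post, not a copy of the ROM routine: `c64_style_sin` is a made-up name, the polynomial is evaluated with Horner's rule in f2 squared, and the floor step assumes that `int()` rounds toward negative infinity as BASIC's INT() does.

```c
/* Sketch of the range reduction and polynomial from the post above. */
static double c64_style_sin(double x)
{
    const double two_pi = 6.283185307179586;
    double x2 = x / two_pi;     /* angle in full turns */
    long i = (long)x2;          /* truncates toward zero */
    double f, f2, s, p;

    if ((double)i > x2) {
        --i;                    /* turn truncation into floor for negative x */
    }
    f = x2 - (double)i;         /* fractional turn, 0 <= f < 1 */

    /* Fold the fractional turn into -0.25 .. 0.25. */
    if (f <= 0.25) {
        f2 = f;
    } else if (f <= 0.75) {
        f2 = 0.5 - f;
    } else {
        f2 = f - 1.0;
    }

    /* Horner's rule in s = f2^2, then one final multiply by f2. */
    s = f2 * f2;
    p = -14.381390672;
    p = p * s + 42.007797122;
    p = p * s - 76.704170257;
    p = p * s + 81.605223686;
    p = p * s - 41.341702104;
    p = p * s + 6.2831853069;
    return p * f2;
}
```

Horner's rule needs six multiplies by `s` and one by `f2`, instead of computing each odd power separately. The cast to `long` limits this sketch to angles whose turn count fits in a `long`.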

 



Posted: Tue Dec 27, 2022 5:51 pm
by kelli217

Complex math on the 6502 series of processors is always going to be slow. It's only got an 8-bit accumulator. It doesn't have any hardware multiplication. I agree with the suggestion to use precalc tables.



Posted: Wed Dec 28, 2022 3:27 pm
by Michael Kaiser


On 12/24/2022 at 7:57 PM, neutrino said:




Regarding math function. I did some testing 6 days ago trying to replicate the Commodore 64 sin() function:




I used the CX-16 floating point library to build SIN and COS tables. I'm using them to plot Cartesian coordinates given polar coordinates. I divided the circle into 16 parts, so the only valid values for theta are 0-15, and I also allow only 0-15 as valid values for r. Then I built a table of "sin_table[theta*16+r] = sin(theta)*r" and did the same for COS. So now, to plot the coordinate, all I do is this:


      ; X = theta * 16 + r
      lda theta
      asl              ; four left shifts multiply by 16
      asl              ; (asl shifts in a zero; rol would pull in the carry flag)
      asl
      asl
      clc
      adc r
      tax

      ; result = sin_table[x]
      lda sin_table,X
      sta result
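
The table layout can be sketched in C as well, for example in a host-side generator. Everything here is an assumption made for illustration: the post does not say how the values are scaled or stored, so this version rounds sin(theta)*r to a signed byte, takes theta in sixteenths of a circle, and uses a hard-coded quarter wave in place of the X16 floating point library.

```c
/* sin(k * 22.5 degrees) for k = 0..4; the rest follows by symmetry. */
static const double quarter_wave[5] = {
    0.0, 0.3826834324, 0.7071067812, 0.9238795325, 1.0
};

static signed char sin_table[16 * 16];

/* Sine of theta sixteenths of a full circle, theta in 0..15. */
static double sin16(unsigned theta)
{
    unsigned k = theta & 7;                 /* position within a half circle */
    double v = quarter_wave[k <= 4 ? k : 8 - k];

    return (theta & 8) ? -v : v;            /* second half circle is negative */
}

/* Fill sin_table[theta * 16 + r] = round(sin(theta) * r). */
static void build_sin_table(void)
{
    unsigned theta, r;

    for (theta = 0; theta < 16; ++theta) {
        for (r = 0; r < 16; ++r) {
            double v = sin16(theta) * (double)r;

            sin_table[theta * 16 + r] =
                (signed char)(v < 0.0 ? v - 0.5 : v + 0.5);
        }
    }
}

/* Same index math as the assembly: theta * 16 + r. */
static signed char polar_sin(unsigned theta, unsigned r)
{
    return sin_table[(theta << 4) + r];
}
```

Because both theta and r are limited to 0-15, the whole table is 256 bytes, which is why a single 8-bit X index is enough to reach every entry.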