Oxygen Basic

Information => Development => Topic started by: Mike Lobanovsky on September 16, 2014, 10:39:57 AM

Title: Fast Approximation Functions
Post by: Mike Lobanovsky on September 16, 2014, 10:39:57 AM: Hi Charles,

I was rather upset with the speed of my DynC floating point maths that depends on the calls to the msvcrt.dll runtime. So I decided to search the net for fast approximation functions that I might use in my code instead of their exact precision counterparts from the runtime library.

I succeeded in finding some that suit my 2D and 3D drawing tasks and run 5 to 8 times faster with acceptable precision.

When I however tried to use them in Oxygen scripts, they appeared to run much slower than I would expect. Please consider the following script (it contains references to the equivalent FBSL DynC functions) and the accompanying snapshot. I also discovered that O2 quads cannot be left or right shifted.

Why is all that, do you think?

Thanks.

Code: [Select]
INCLUDEPATH "$\inc\" INCLUDE "Console.inc" ! FUNCTION GetTickCount LIB "KERNEL32.DLL" () AS DWORD DWORD gtc, i double d function my_sqrt(float v) as float float vhalf = 0.5 * v union utype { float f int i } dim u as utype u.f = v u.i = 0x5F325A82 - (u.i >> 1) // initial guess return v * u.f * (1.50015 - vhalf * u.f * u.f) // Newton step end function function my_exp(double v) as double double d int* q = @d + 4 q[1] = (int)(1512775.0 * v + 1072632447.0) return d end function function my_log(float v) as float union utype { float f int i } dim u as utype u.f = v float y = u.i int shift = 1 << 23 y *= 1.0 / shift return 0.69314718 * (y - 126.94269504) end function gtc = GetTickCount() for i = 0 to 100000000 d = sqrt(5) next print "SQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr gtc = GetTickCount() for i = 0 to 100000000 d = my_sqrt(5) next print "MYSQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr gtc = GetTickCount() for i = 0 to 100000000 d = exp(5) next print "EXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr gtc = GetTickCount() for i = 0 to 100000000 d = my_exp(5) next print "MYEXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr gtc = GetTickCount() for i = 0 to 100000000 d = log(5) next print "LOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr gtc = GetTickCount() for i = 0 to 100000000 d = my_log(5) next print "MYLOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr print cr "Press any key to continue ..." waitkey ' // DynC Equivalents ' float my_sqrt(const float x) { // 8 times faster than MSVCRT, 6 times faster than in O2 ' const float xhalf = 0.5f * x; ' union { ' float f; ' int i; ' } u = {x}; ' u.i = 0x5F325A82 - (u.i >> 1); // initial guess ' return x * u.f * (1.50015f - xhalf * u.f * u.f); // Newton step ' } ' double my_exp(double v) { // 5 times faster than MSVCRT, 3 times faster than in O2 ' union { ' double d; ' long long l; ' } u; ' u.d = v; ' u.l = ((long long)(1512775 * v + 1072632447)) << 32; ' return u.d; ' } ' float my_log(float x){ // 7 times faster than MSVCRT, 3 times faster than in O2 ' union { float f; int i; } vx = {x}; ' float y = vx.i; ' y *= 1.0 / (1 << 23); ' return 0.69314718f * (y - 126.94269504f); ' }
.
Title: Re: Fast Approximation Functions
Post by: Charles Pegge on September 17, 2014, 01:45:56 AM: Hi Mike,

Most intriguing approximations!

for square roots, you can't significantly improve on fsqrt, which is all that Oxygen uses.

But for exp and log, the main improvement is to assign the numeric constants to static variables, because loading most number literals into the FPU cannot be done directly. Another significant optimisation is to avoid automatic precedence nesting, and use #noprec which computes left to right. Finally, casting is a little cleaner than unions, in this context.

Try these two, and see what improvements you get:
Code: [Select]
function my_exp(double v) as double 'assume indexbase 1 static double k1=1512775.0 static double k2=1072632447.0 int q[2] q[2] = k1 * v + k2 return (double) q end function int shift = 1 << 23 double ishift=1.0/shift function my_log(float v) as float static double k1 = 0.69314718 static double k2 = 126.94269504 float y=(int) v 'return k1 * (y*ishift - k2) #noprec return y * ishift - k2 * k1 end function
Still not as good as your Dyn C though.

PS: O2 quads are performed in the FPU when in 32bit mode - no shift instructions implemented for the FPU.
Title: Re: Fast Approximation Functions
Post by: Mike Lobanovsky on September 17, 2014, 04:21:52 AM: Hehe Charles,

Actually moving all the constants out of function bodies really did the trick! In fact, Oxygen's my_log now runs even a little bit faster than in DynC! :D (see attachment 1)

As for sqrt, I wasn't so much after it though I'm glad I added it too to my palette. I was after its reciprocal counterpart that's in fact a fundamental corner stone of all 3D trigonometry -- the inverse square root, or 1 / sqrt. The canonical function that I'm using was designed by John Carmack for his immortal Quake III shooter. Here's its original code with the author's notes intact: :) (move const float threehalfs = 1.5F; to the global space if your C compiler doesn't regard the const qualifier as a directive to do it automatically)

float Q_rsqrt( float number )
{
   long i;
   float x2, y;
   const float threehalfs = 1.5F;

   x2 = number * 0.5F;
   y = number;
   i = * ( long * ) &y; // evil floating point bit level hacking
   i = 0x5f3759df - ( i >> 1 ); // what the fuck?
   y = * ( float * ) &i;
   y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

   return y;
}

Surprisingly, the accuracy of approximation with just one Newton step (called "iteration" here) is not worse than 0.175% in the entire range of acceptable values! And it is significantly faster too than simple 1 / sqrt or 1 / my_sqrt. (see attachment 2)

These approximations are perfectly acceptable in present-day 3D. In fact a lot of modern video games use half-floats (2-byte floating-point values) for their geometry, texturing, and light including my favorite Resident Evil 4 thru 6 shooter. (come look John, she's starring the movies based on this game):

(http://cdn-premiere.ladmedia.fr/var/premiere/storage/images/cinema/news-cinema/milla-jovovich-et-le-realisateur-de-resident-evil-5-reagissent-aux-mauvaises-critiques-3507324/63784718-1-fre-FR/Milla-Jovovich-et-le-realisateur-de-Resident-Evil-5-reagissent-aux-mauvaises-critiques_portrait_w532.jpg)

Thanks for taking interest in the topic, Charles!

.
Title: Re: Fast Approximation Functions
Post by: Charles Pegge on September 17, 2014, 06:20:48 PM: Hi Mike,

I have been checking out SIMD instructions.

There is an inverse square root instruction rsqrtss. It uses a newton approximation and should beat any hack!

float f=5
rsqrtss xmm0,f
movss f,xmm0

On my system it is about 3x faster than the FPU based sqrt()

PS: The SIMD instructions were first implemented on the Pentium 3, in 1999 - around the same time as the release of Quake III !
Title: Re: Fast Approximation Functions
Post by: Mike Lobanovsky on September 17, 2014, 07:45:43 PM: Well you know, FBSL's DynAsm doesn't support SSE instruction sets, only MMX w/ extensions and 3DNow!.

Not all our community members had MMX-capable Pentiums in their computers at that time and I remember that Gerome's computer certainly didn't have one either. We even compiled the FBSL binaries without MMX optimizations so that everyone could enjoy the new builds. Otherwise they were getting angry with the devs and with me in particular... :)
Title: Re: Fast Approximation Functions
Post by: Charles Pegge on September 17, 2014, 08:22:49 PM: Most of the SIMD instructions are clean and simple to encode, though rather verbose with prefixes.

Some more thoughts on reciprocal square roots:

O2 defines all of its float functions as FPU macros in src/o2keyw.bas

definitions for recip and hypot

'data "hypot 17 fmul st0 : fxch : fmul st0 : faddp st1 : fsqrt"
'data "recip 17 fld1 : fdivrp"

Programmers can also encode at this level by using deff macros. It is easy to derive new float functions. The compiler preloads all the params onto the FPU stack.
Code: [Select]
deff rsqrt fsqrt : fld1 : fdivrp end deff deff hypot3 fmul st0 : fxch : fmul st0 : faddp st1 : fxch : fmul st0 : faddp st1 : fsqrt end deff float a=9,b=2,c=3,d d=rsqrt(a) d=recip(hypot3(a,b,c))