Author Topic: Fast Approximation Functions (Read 4729 times)

Mike Lobanovsky · « **on:** September 16, 2014, 10:39:57 AM »

Hi Charles,

I was rather upset with the speed of my DynC floating point maths that depends on the calls to the msvcrt.dll runtime. So I decided to search the net for fast approximation functions that I might use in my code instead of their exact precision counterparts from the runtime library.

I succeeded in finding some that suit my 2D and 3D drawing tasks and run 5 to 8 times faster with acceptable precision.

When I however tried to use them in Oxygen scripts, they appeared to run much slower than I would expect. Please consider the following script (it contains references to the equivalent FBSL DynC functions) and the accompanying snapshot. I also discovered that O2 quads cannot be left or right shifted.

Why is all that, do you think?

Thanks.

Code: [Select]

  INCLUDEPATH "$\inc\"
  INCLUDE "Console.inc"
  
  ! FUNCTION GetTickCount     LIB "KERNEL32.DLL" () AS DWORD
  
  DWORD gtc, i
  double d
  
  function my_sqrt(float v) as float
    float vhalf = 0.5 * v
    union utype {
      float f
      int i
    }
    dim u as utype
    u.f = v
    u.i = 0x5F325A82 - (u.i >> 1) // initial guess
    return v * u.f * (1.50015 - vhalf * u.f * u.f) // Newton step
  end function
  
  function my_exp(double v) as double
    double d
    int* q = @d + 4
    q[1] = (int)(1512775.0 * v + 1072632447.0)
    return d
  end function
  
  function my_log(float v) as float
    union utype {
      float f
      int i
    }
    dim u as utype
    u.f = v
    float y = u.i
    int shift = 1 << 23
    y *= 1.0 / shift
    return 0.69314718 * (y - 126.94269504)
  end function

  gtc = GetTickCount()
  for i = 0 to 100000000
    d = sqrt(5)
  next
  print "SQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
  gtc = GetTickCount()
  for i = 0 to 100000000
    d = my_sqrt(5)
  next
  print "MYSQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
  
  gtc = GetTickCount()
  for i = 0 to 100000000
    d = exp(5)
  next
  print "EXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
  gtc = GetTickCount()
  for i = 0 to 100000000
    d = my_exp(5)
  next
  print "MYEXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
  
  gtc = GetTickCount()
  for i = 0 to 100000000
    d = log(5)
  next
  print "LOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
  gtc = GetTickCount()
  for i = 0 to 100000000
    d = my_log(5)
  next
  print "MYLOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr

  print cr "Press any key to continue ..."
  waitkey
  
  ' // DynC Equivalents
  ' float my_sqrt(const float x) { // 8 times faster than MSVCRT, 6 times faster than in O2
    ' const float xhalf = 0.5f * x;
    ' union {
      ' float f;
      ' int i;
    ' } u = {x};
    ' u.i = 0x5F325A82 - (u.i >> 1); // initial guess
    ' return x * u.f * (1.50015f - xhalf * u.f * u.f); // Newton step
  ' }
  
  ' double my_exp(double v) { // 5 times faster than MSVCRT, 3 times faster than in O2
    ' union {
      ' double d;
      ' long long l;
    ' } u;
    ' u.d = v;
    ' u.l = ((long long)(1512775 * v + 1072632447)) << 32;
    ' return u.d;
  ' }
  
  ' float my_log(float x){ // 7 times faster than MSVCRT, 3 times faster than in O2
    ' union { float f; int i; } vx = {x};
    ' float y = vx.i;
    ' y *= 1.0 / (1 << 23);
    ' return 0.69314718f * (y - 126.94269504f);
  ' }

.

Charles Pegge · « **Reply #1 on:** September 17, 2014, 01:45:56 AM »

Hi Mike,

Most intriguing approximations!

for square roots, you can't significantly improve on fsqrt, which is all that Oxygen uses.

But for exp and log, the main improvement is to assign the numeric constants to static variables, because loading most number literals into the FPU cannot be done directly. Another significant optimisation is to avoid automatic precedence nesting, and use #noprec which computes left to right. Finally, casting is a little cleaner than unions, in this context.

Try these two, and see what improvements you get:

Code: [Select]

  function my_exp(double v) as double
    'assume indexbase 1
    static double k1=1512775.0
    static double k2=1072632447.0
    int q[2]
    q[2] = k1 * v + k2
    return (double) q
  end function
  
  int shift = 1 << 23
  double ishift=1.0/shift

  function my_log(float v) as float
    static double k1 = 0.69314718
    static double k2 = 126.94269504
    float y=(int) v
    'return k1 * (y*ishift - k2)
    #noprec return y * ishift - k2 * k1
  end function

Still not as good as your Dyn C though.

PS: O2 quads are performed in the FPU when in 32bit mode - no shift instructions implemented for the FPU.

Mike Lobanovsky · « **Reply #2 on:** September 17, 2014, 04:21:52 AM »

Hehe Charles,

Actually moving all the constants out of function bodies really did the trick! In fact, Oxygen's my_log now runs even a little bit faster than in DynC!

(see attachment 1)

As for sqrt, I wasn't so much after it though I'm glad I added it too to my palette. I was after its reciprocal counterpart that's in fact a fundamental corner stone of all 3D trigonometry -- the inverse square root, or 1 / sqrt. The canonical function that I'm using was designed by John Carmack for his immortal Quake III shooter. Here's its original code with the author's notes intact:

(move const float threehalfs = 1.5F; to the global space if your C compiler doesn't regard the const qualifier as a directive to do it automatically)

float Q_rsqrt( float number )
{
   long i;
   float x2, y;
   const float threehalfs = 1.5F;

   x2 = number * 0.5F;
   y = number;
   i = * ( long * ) &y; // evil floating point bit level hacking
   i = 0x5f3759df - ( i >> 1 ); // what the fuck?
   y = * ( float * ) &i;
   y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed

   return y;
}

Surprisingly, the accuracy of approximation with just one Newton step (called "iteration" here) is not worse than 0.175% in the entire range of acceptable values! And it is significantly faster too than simple 1 / sqrt or 1 / my_sqrt. (see attachment 2)

These approximations are perfectly acceptable in present-day 3D. In fact a lot of modern video games use half-floats (2-byte floating-point values) for their geometry, texturing, and light including my favorite Resident Evil 4 thru 6 shooter. (come look John, she's starring the movies based on this game):

Thanks for taking interest in the topic, Charles!

.

Charles Pegge · « **Reply #3 on:** September 17, 2014, 06:20:48 PM »

Hi Mike,

I have been checking out SIMD instructions.

There is an inverse square root instruction rsqrtss. It uses a newton approximation and should beat any hack!

float f=5
rsqrtss xmm0,f
movss f,xmm0

On my system it is about 3x faster than the FPU based sqrt()

PS: The SIMD instructions were first implemented on the Pentium 3, in 1999 - around the same time as the release of Quake III !

Mike Lobanovsky · « **Reply #4 on:** September 17, 2014, 07:45:43 PM »

Well you know, FBSL's DynAsm doesn't support SSE instruction sets, only MMX w/ extensions and 3DNow!.

Not all our community members had MMX-capable Pentiums in their computers at that time and I remember that Gerome's computer certainly didn't have one either. We even compiled the FBSL binaries without MMX optimizations so that everyone could enjoy the new builds. Otherwise they were getting angry with the devs and with me in particular...

Charles Pegge · « **Reply #5 on:** September 17, 2014, 08:22:49 PM »

Most of the SIMD instructions are clean and simple to encode, though rather verbose with prefixes.

Some more thoughts on reciprocal square roots:

O2 defines all of its float functions as FPU macros in src/o2keyw.bas

definitions for recip and hypot

'data "hypot 17 fmul st0 : fxch : fmul st0 : faddp st1 : fsqrt"
'data "recip 17 fld1 : fdivrp"

Programmers can also encode at this level by using deff macros. It is easy to derive new float functions. The compiler preloads all the params onto the FPU stack.

Code: [Select]

deff rsqrt
fsqrt : fld1 : fdivrp
end deff

deff hypot3
fmul st0 : fxch : fmul st0 : faddp st1 : fxch : fmul st0 : faddp st1  : fsqrt
end deff

float a=9,b=2,c=3,d

d=rsqrt(a)

d=recip(hypot3(a,b,c))

Oxygen Basic

News:

Author Topic: Fast Approximation Functions (Read 4729 times)

Mike Lobanovsky

Fast Approximation Functions

Charles Pegge

Re: Fast Approximation Functions

Mike Lobanovsky

Re: Fast Approximation Functions

Charles Pegge

Re: Fast Approximation Functions

Mike Lobanovsky

Re: Fast Approximation Functions

Charles Pegge

Re: Fast Approximation Functions