Oxygen Basic
Information => Development => Topic started by: Mike Lobanovsky on September 16, 2014, 10:39:57 AM
-
Hi Charles,
I was rather upset with the speed of my DynC floating point maths that depends on the calls to the msvcrt.dll runtime. So I decided to search the net for fast approximation functions that I might use in my code instead of their exact precision counterparts from the runtime library.
I succeeded in finding some that suit my 2D and 3D drawing tasks and run 5 to 8 times faster with acceptable precision.
When I however tried to use them in Oxygen scripts, they appeared to run much slower than I would expect. Please consider the following script (it contains references to the equivalent FBSL DynC functions) and the accompanying snapshot. I also discovered that O2 quads cannot be left or right shifted.
Why is all that, do you think?
Thanks.
INCLUDEPATH "$\inc\"
INCLUDE "Console.inc"
! FUNCTION GetTickCount LIB "KERNEL32.DLL" () AS DWORD
DWORD gtc, i
double d
function my_sqrt(float v) as float
float vhalf = 0.5 * v
union utype {
float f
int i
}
dim u as utype
u.f = v
u.i = 0x5F325A82 - (u.i >> 1) // initial guess
return v * u.f * (1.50015 - vhalf * u.f * u.f) // Newton step
end function
function my_exp(double v) as double
double d
int* q = @d + 4
q[1] = (int)(1512775.0 * v + 1072632447.0)
return d
end function
function my_log(float v) as float
union utype {
float f
int i
}
dim u as utype
u.f = v
float y = u.i
int shift = 1 << 23
y *= 1.0 / shift
return 0.69314718 * (y - 126.94269504)
end function
gtc = GetTickCount()
for i = 0 to 100000000
d = sqrt(5)
next
print "SQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
gtc = GetTickCount()
for i = 0 to 100000000
d = my_sqrt(5)
next
print "MYSQRT done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
gtc = GetTickCount()
for i = 0 to 100000000
d = exp(5)
next
print "EXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
gtc = GetTickCount()
for i = 0 to 100000000
d = my_exp(5)
next
print "MYEXP done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
gtc = GetTickCount()
for i = 0 to 100000000
d = log(5)
next
print "LOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
gtc = GetTickCount()
for i = 0 to 100000000
d = my_log(5)
next
print "MYLOG done in " (GetTickCount() - gtc) / 1000 " sec ==> " d cr
print cr "Press any key to continue ..."
waitkey
' // DynC Equivalents
' float my_sqrt(const float x) { // 8 times faster than MSVCRT, 6 times faster than in O2
' const float xhalf = 0.5f * x;
' union {
' float f;
' int i;
' } u = {x};
' u.i = 0x5F325A82 - (u.i >> 1); // initial guess
' return x * u.f * (1.50015f - xhalf * u.f * u.f); // Newton step
' }
' double my_exp(double v) { // 5 times faster than MSVCRT, 3 times faster than in O2
' union {
' double d;
' long long l;
' } u;
' u.d = v;
' u.l = ((long long)(1512775 * v + 1072632447)) << 32;
' return u.d;
' }
' float my_log(float x){ // 7 times faster than MSVCRT, 3 times faster than in O2
' union { float f; int i; } vx = {x};
' float y = vx.i;
' y *= 1.0 / (1 << 23);
' return 0.69314718f * (y - 126.94269504f);
' }
.
-
Hi Mike,
Most intriguing approximations!
for square roots, you can't significantly improve on fsqrt, which is all that Oxygen uses.
But for exp and log, the main improvement is to assign the numeric constants to static variables, because loading most number literals into the FPU cannot be done directly. Another significant optimisation is to avoid automatic precedence nesting, and use #noprec which computes left to right. Finally, casting is a little cleaner than unions, in this context.
Try these two, and see what improvements you get:
function my_exp(double v) as double
'assume indexbase 1
static double k1=1512775.0
static double k2=1072632447.0
int q[2]
q[2] = k1 * v + k2
return (double) q
end function
int shift = 1 << 23
double ishift=1.0/shift
function my_log(float v) as float
static double k1 = 0.69314718
static double k2 = 126.94269504
float y=(int) v
'return k1 * (y*ishift - k2)
#noprec return y * ishift - k2 * k1
end function
Still not as good as your Dyn C though.
PS: O2 quads are performed in the FPU when in 32bit mode - no shift instructions implemented for the FPU.
-
Hehe Charles,
Actually moving all the constants out of function bodies really did the trick! In fact, Oxygen's my_log now runs even a little bit faster than in DynC! :D (see attachment 1)
As for sqrt, I wasn't so much after it though I'm glad I added it too to my palette. I was after its reciprocal counterpart that's in fact a fundamental corner stone of all 3D trigonometry -- the inverse square root, or 1 / sqrt. The canonical function that I'm using was designed by John Carmack for his immortal Quake III shooter. Here's its original code with the author's notes intact: :) (move const float threehalfs = 1.5F; to the global space if your C compiler doesn't regard the const qualifier as a directive to do it automatically)
float Q_rsqrt( float number )
{
long i;
float x2, y;
const float threehalfs = 1.5F;
x2 = number * 0.5F;
y = number;
i = * ( long * ) &y; // evil floating point bit level hacking
i = 0x5f3759df - ( i >> 1 ); // what the fuck?
y = * ( float * ) &i;
y = y * ( threehalfs - ( x2 * y * y ) ); // 1st iteration
// y = y * ( threehalfs - ( x2 * y * y ) ); // 2nd iteration, this can be removed
return y;
}
Surprisingly, the accuracy of approximation with just one Newton step (called "iteration" here) is not worse than 0.175% in the entire range of acceptable values! And it is significantly faster too than simple 1 / sqrt or 1 / my_sqrt. (see attachment 2)
These approximations are perfectly acceptable in present-day 3D. In fact a lot of modern video games use half-floats (2-byte floating-point values) for their geometry, texturing, and light including my favorite Resident Evil 4 thru 6 shooter. (come look John, she's starring the movies based on this game):
(http://cdn-premiere.ladmedia.fr/var/premiere/storage/images/cinema/news-cinema/milla-jovovich-et-le-realisateur-de-resident-evil-5-reagissent-aux-mauvaises-critiques-3507324/63784718-1-fre-FR/Milla-Jovovich-et-le-realisateur-de-Resident-Evil-5-reagissent-aux-mauvaises-critiques_portrait_w532.jpg)
Thanks for taking interest in the topic, Charles!
.
-
Hi Mike,
I have been checking out SIMD instructions.
There is an inverse square root instruction rsqrtss. It uses a newton approximation and should beat any hack!
float f=5
rsqrtss xmm0,f
movss f,xmm0
On my system it is about 3x faster than the FPU based sqrt()
PS: The SIMD instructions were first implemented on the Pentium 3, in 1999 - around the same time as the release of Quake III !
-
Well you know, FBSL's DynAsm doesn't support SSE instruction sets, only MMX w/ extensions and 3DNow!.
Not all our community members had MMX-capable Pentiums in their computers at that time and I remember that Gerome's computer certainly didn't have one either. We even compiled the FBSL binaries without MMX optimizations so that everyone could enjoy the new builds. Otherwise they were getting angry with the devs and with me in particular... :)
-
Most of the SIMD instructions are clean and simple to encode, though rather verbose with prefixes.
Some more thoughts on reciprocal square roots:
O2 defines all of its float functions as FPU macros in src/o2keyw.bas
definitions for recip and hypot
'data "hypot 17 fmul st0 : fxch : fmul st0 : faddp st1 : fsqrt"
'data "recip 17 fld1 : fdivrp"
Programmers can also encode at this level by using deff macros. It is easy to derive new float functions. The compiler preloads all the params onto the FPU stack.
deff rsqrt
fsqrt : fld1 : fdivrp
end deff
deff hypot3
fmul st0 : fxch : fmul st0 : faddp st1 : fxch : fmul st0 : faddp st1 : fsqrt
end deff
float a=9,b=2,c=3,d
d=rsqrt(a)
d=recip(hypot3(a,b,c))