Topic: Optimization

Peter

  • Guest
Optimization
« on: January 14, 2014, 04:26:55 AM »
Deleted
« Last Edit: April 25, 2015, 03:19:28 AM by Peter »

Charles Pegge

  • Guest
Re: Optimization
« Reply #1 on: January 15, 2014, 04:48:58 AM »
This will shave a further 20-25% off the optimised code time. (I think this is close to the ultimate speed limit :))
Code:
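! GetTickCount Lib "kernel32.dll" () As Long

indexbase 0
int a[4000000]
single t1,t2

t1 = GetTickCount
for j=0 to 400
    addr edx,a            ' edx = address of a[0]
    mov ecx,0             ' ecx = element counter
    (
      mov [edx],ecx       ' store the counter in the current element
      inc ecx
      add edx,sizeof a    ' advance to the next element
      cmp ecx,4000000
      jl repeat           ' jump back to the start of the bracketed block
    )
next
t2 = GetTickCount
t2 = (t2-t1)/1000         ' elapsed time in seconds
mbox "time:  "  str(t2,6)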
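For readers more at home in C, the loop above can be sketched roughly as follows. This is only an illustrative translation, not Charles's code: `fill_passes` and `time_fill_seconds` are hypothetical names, and the standard `clock()` function stands in for GetTickCount (it measures processor time, not wall-clock ticks).

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: write 0..n-1 into a[], `passes` times, walking a
   pointer through the array the way the asm loop advances EDX. */
void fill_passes(int *a, size_t n, int passes)
{
    assert(a != NULL);
    for (int j = 0; j < passes; j++) {
        int *p = a;                      /* addr edx,a        */
        for (size_t i = 0; i < n; i++) {
            *p = (int)i;                 /* mov [edx],ecx     */
            p++;                         /* add edx,sizeof a  */
        }
    }
}

/* Hypothetical helper: time the fill in seconds using clock(). */
double time_fill_seconds(int *a, size_t n, int passes)
{
    clock_t t1 = clock();
    fill_passes(a, n, passes);
    clock_t t2 = clock();
    return (double)(t2 - t1) / CLOCKS_PER_SEC;
}
```

To mirror the original sizes you would call `time_fill_seconds(a, 4000000, 401)` on a statically allocated array, since `for j=0 to 400` runs 401 passes. An optimizing C compiler may notice that the earlier passes are redundant and drop them, so timings from this sketch are not directly comparable with the hand-written assembly.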

Mike Lobanovsky

  • Guest
Re: Optimization
« Reply #2 on: January 15, 2014, 08:26:26 AM »
@Charles

Thanks again for your heads-up and a fix regarding the indirection issue!

@Peter

In fact, I was asking about indirection in relation to this very topic. In C, accessing values (in this case, array elements) through an explicit pointer and an offset, as in *(p+4), can be noticeably faster than the equivalent array indexing, e.g. a[0 + 4]. Charles' C-style Oxygen code is likely to resolve to assembly in ways similar to C, so this tweak may also work for Oxygen.
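To make the two access styles concrete, here is a minimal C sketch of each. The function names `sum_indexed` and `sum_pointer` are made up for illustration. Note that C defines `a[i]` as `*(a + i)`, so both forms mean the same thing; whether one runs faster depends entirely on the compiler and its optimization settings.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: sum the array using index notation, a[i]. */
long sum_indexed(const int *a, size_t n)
{
    assert(a != NULL);
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += a[i];
    return total;
}

/* Hypothetical example: sum the array by stepping an explicit pointer. */
long sum_pointer(const int *a, size_t n)
{
    assert(a != NULL);
    long total = 0;
    for (const int *p = a, *end = a + n; p != end; p++)
        total += *p;
    return total;
}
```

Both functions return the same result for the same input, so any speed difference can be measured by timing them on one large array.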

So my intention was to submit an early alternative to your original unrolled loop with yet faster benchmarks using C-style indirection. ;) Regrettably, I failed because of this annoying bug in pointer arithmetic. I was unlucky to have downloaded my package from the Alpha Downloads link on the site's home page, which evidently didn't point in the right direction at that time.

Don't overestimate assembler code complexity. Once you make an effort and have a closer look at it, you'll be amazed how logical, pure and elegant it is even if your listings (scripts) are going to become a trifle longer. :)

Charles Pegge

  • Guest
Re: Optimization
« Reply #3 on: January 15, 2014, 08:50:44 AM »
Allow me to disclose a secret tool :)

#show statement

will reveal both the intermediate code and its translation to assembly code for most statements.

I put a fair amount of effort into optimizing indexes, so I think you will find that they compete very well with pointer expressions.

Code:
sys a[100]
sys b,p
sys i=1
#show b=a[10]
#show b=a[i]
#show b=a[i+10]
p=@a
#show b=*(p+i*4+10)

« Last Edit: January 15, 2014, 10:24:17 AM by Charles Pegge »

Mike Lobanovsky

  • Guest
Re: Optimization
« Reply #4 on: January 15, 2014, 10:37:54 AM »
Very, very handy Charles!

FBSL isn't open-source, so we don't dare plant secret stuff into public builds. Instead, anyone doing applied assembly in FBSL can PM me for a custom build with an extra Asm Logger window. It lets you inspect the entire build process line by line, with final statistics, label usage, NOP padding of forward references, predicted and actual opcode sizes, and total build speed (actually rather low due to the line-by-line display of logged data).

Not that many users applied though ... :) Here's how the Asm Logger window looks for the final build stage of a tiny asm block in the Eclecta Highlighting Editor's code:

[Image: Asm Logger window screenshot]

And finally, can I ask you for just one more little secret? If so, does Oxygen's asm align code and data?


P.S. I forgot to state that my hope for better performance using pointers and offsets instead of array indices was futile in principle. Very well done Charles, thanks!  8)
« Last Edit: January 15, 2014, 10:47:48 AM by Mike Lobanovsky »

Charles Pegge

  • Guest
Re: Optimization
« Reply #5 on: January 15, 2014, 12:15:58 PM »
Yes Mike, both data and code are automatically aligned. At least that is my intention.
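For anyone who wants to check alignment from the outside, here is a small C sketch of what "aligned" means for an address. The helper `is_aligned` is hypothetical, and nothing here describes how Oxygen itself performs alignment.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: returns 1 if p is a multiple of `alignment`,
   which must be a power of two (4, 8, 16, ...). */
int is_aligned(const void *p, size_t alignment)
{
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return ((uintptr_t)p & (alignment - 1)) == 0;
}
```

For example, a buffer declared as `alignas(16) int32_t data[4];` passes `is_aligned(data, 16)`, while the address one byte past its start fails `is_aligned(..., 4)`.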

Arrays with a constant index, like a[40], are resolved to a fixed offset at compile time, so the access looks exactly like that of a simple variable:

[EBX+offset]


Arrays with a simple index variable, like a[i], are generally SIB-encoded, with the index loaded into the ESI register, thus:

[EBX+ESI*4+offset]

All other types of indexes are pre-calculated and held in temp variables before the main expression is performed.
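The addressing forms above all reduce to base + index*scale + offset. The following C sketch models that arithmetic; `effective_address` is a hypothetical helper, and it assumes zero-based indexing with the base pointing at the array itself, which is a simplification of how Oxygen lays out variables relative to EBX.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of an x86 effective address:
   base + index*scale + displacement. */
uintptr_t effective_address(uintptr_t base, uintptr_t index,
                            unsigned scale, intptr_t disp)
{
    /* The SIB byte can only encode these scale factors. */
    assert(scale == 1 || scale == 2 || scale == 4 || scale == 8);
    return base + index * scale + (uintptr_t)disp;
}
```

In this model, a constant index such as a[40] on 4-byte elements becomes a pure displacement of 160 bytes with no index register, a[i] becomes index `i` with scale 4, and a compound index such as a[i*3+1] would first be computed into a temporary and then used as the index.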

It is possible to extract blocks of both o2 script (unlinked partial machine code) and the corresponding assembly code. This is best done with the exo2 compiler in a DOS console:

t.o2bas
Code:
###
function f(sys a) as sys
return a*2
end function

f 3
###

exo2 -b t.o2bas>t.txt

t.txt
Code:
                                '  '_2
                                '  'FUNCTION F
 E9 gf _over_                   '  jmp fwd _over_
!10
.f#sys                          '  .f#sys
 (                              '  (
 55                             '  push ebp
 8B EC                          '  mov ebp,esp
 83 C4 F0                       '  add esp,-16
 8D 7D F0                       '  lea edi,[ebp-0x10]
 33 C0                          '  xor eax,eax
 89 07                          '  mov [edi],eax
 8B 45 08                       '  mov eax,[ebp+0x8]
 6B C0 02                       '  imul eax,eax,2
 E9 gf _return_                 '  jmp fwd _return_
                                '  '_4
._exit_                         '  ._exit_
 8B 45 F0                       '  mov eax,[ebp-0x10]
._return_                       '  ._return_
 8B E5                          '  mov esp,ebp
 5D                             '  pop ebp
 C2 04 00                       '  ret 4
 )                              '  )
._over_                         '  ._over_
                                '  '_6
 6A 03                          '  push 3
 E8 gl f#sys                    '  call f#sys
                                '  '_7
._end_

This is more useful for compiler checking than developing user code.
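As a reading aid for the listing above, here is a rough C counterpart of the function, with the matching instructions noted in comments. It assumes a 32-bit `sys`, modelled here as `int32_t`; the overflow check is an addition for the C version only.

```c
#include <assert.h>
#include <stdint.h>

/* Assumed C counterpart of `function f(sys a) as sys`. */
int32_t f(int32_t a)
{
    /* Added for C only: signed overflow is undefined behaviour. */
    assert(a >= INT32_MIN / 2 && a <= INT32_MAX / 2);
    return a * 2;   /* mov eax,[ebp+0x8]  then  imul eax,eax,2 */
}
```

In the listing, `push ebp` / `mov ebp,esp` set up the stack frame, the argument is read from `[ebp+0x8]`, and `ret 4` removes the 4-byte argument on return, so the callee cleans the stack. The call `f 3` corresponds to `push 3` followed by `call f#sys`.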

Kuron

  • Guest
Re: Optimization
« Reply #6 on: January 15, 2014, 01:48:29 PM »
"... but rather everyone that's doing applied assembly in FBSL can PM me for a custom build with an extra Asm Logger window capability. It allows one to inspect the entire build process line by line with final statistics, label usage, NOP padding of forward references, prognosticated and de-facto opcode size, and total build speed (actually rather low due to line-by-line display of logged data)."

Very nice!