Author Topic: Tiny Benchmark Test  (Read 34841 times)

0 Members and 1 Guest are viewing this topic.

Charles Pegge

  • Guest
Re: Tiny Benchmark Test
« Reply #30 on: May 15, 2014, 02:54:42 AM »
Thanks Mike,

There are no further optimisations I can think of. We are down to manual assembler, which convinces me that the optimised C code employs the services of a least 2 cores, to be able to produce timings less than half of mine.

Jump Tables and Tokens:

maybe not so simple in C:
Code: [Select]
'SETUP
======
sys j
sys t[]={@AA,@BB,@CC} 'map label addreses into table

'TEST
=====

token=2
j=t[token] : jmp j

jmp fwd done

AA:
print "AA"
jmp fwd done

BB:
print "BB"
jmp fwd done

CC:
print "CC"
jmp fwd done

done:
« Last Edit: May 15, 2014, 03:15:41 AM by Charles Pegge »

Mike Lobanovsky

  • Guest
Re: Tiny Benchmark Test
« Reply #31 on: May 15, 2014, 03:26:52 AM »
... the optimised C code employs the services of a least 2 cores ...
Possibly though not directly via multi-threading. Also, there may be transparent optimizations with MMX or SSE+ facilities involved.

My Ida Pro shows that the VC12 optimizer uses MMX (GCC v4.3.3 doesn't) to enhance the test code performance. Then, apart from code alignment there's also such a thing as proper UV-paring of Pentium instructions. So your handcrafted assembly still has a chance to improve on its odds... :)

Jump Tables and Tokens:
....................
maybe not so simple in C:
....................
Yet perfectly feasible for GCC's crooked AT&T assembly ( ;) ) provided parameter passing is added to the scheme. Computed gotos are also part of C99 standard and fully supported in GCC (and in FBSL's Dynamic C too BTW).

Thanks a lot!
« Last Edit: May 15, 2014, 04:06:27 AM by Mike Lobanovsky »

Ed Davis

  • Guest
Re: Tiny Benchmark Test
« Reply #32 on: May 15, 2014, 05:13:23 AM »
I'm using Euphoria's latest build, v4.1 for Windows from their repository. Is that what you've used for the test?

I used 4.04.  But I should have used 3.1 :)



That 3.1 time is very impressive!  And note: exwc is the Windows Console interpreter.  The name was changed in the 4.0 line.

Mike Lobanovsky

  • Guest
Re: Tiny Benchmark Test
« Reply #33 on: May 15, 2014, 05:33:20 AM »
Ed, ;D

This is the most demonstrative demotivator to justify once again what I've always said:

COMMUNITY DRIVEN EQUAL-OPPORTUNITY PROJECTS ARE A HIGHWAY TO HELL!






[EDIT] In the same vein:

Petr Schreiber of thinBasic has just published a message about MS C# and VB.NET compiler code going open source.
« Last Edit: May 15, 2014, 07:00:33 AM by Mike Lobanovsky »

JRS

  • Guest
Re: Tiny Benchmark Test
« Reply #34 on: May 15, 2014, 08:05:35 AM »
Mike,

Thanks for your explanation and clarification of the trade-offs of typeless variables. By default, Script BASIC's variables are thread safe and uses Peter Verhas's MyAlloc for the memory manager/garbage collector. With a change of a #define, SB can use the standard C malloc/free if SB is only going to be used in a single process environment. I may generate a scriba in this model just for performance testing and comparison.
 

JRS

  • Guest
Re: Tiny Benchmark Test
« Reply #35 on: May 15, 2014, 08:35:01 AM »
Quote from: Mike
Petr Schreiber of thinBasic has just published a message about MS C# and VB.NET compiler code going open source.

.NET jQuery package - 6,080,575 downloads from nuget.

I don't think we are playing in the same ballpark.  :-\

Mike Lobanovsky

  • Guest
Re: Tiny Benchmark Test
« Reply #36 on: May 15, 2014, 09:48:36 AM »
We certainly aren't as public significance of what we and they are doing is definitively incongruous.

I think open source in both cases may be just a marketing ploy to increase the popularity of either platform among the ranks of independent developers. The more they trust the instruments they are given, the larger their incentive to use them on the respective platform and pay for that in full. Note well the project owners stay immutable in both cases regardless of who actually contributes to the respective products. Sort of public trust management of private assets, hehe...

Still I'm feeling again like I'm talking about matters that aren't exactly within my sphere of competence. :)


Aurel

  • Guest
Re: Tiny Benchmark Test
« Reply #37 on: May 16, 2014, 01:34:01 AM »
Quote
And do not try to beat Java and .NET. They are cheaters: they run the JIT source that's been heavily optimized when precompiled slowly for the first time and written into a disk file.

good to know Mike...and thanks for real results...
also i will add that euphoria is on the same track as Java or .Net or is really that optimized
that is faster than other bytecode interpreters.

Mike Lobanovsky

  • Guest
Re: Computed Goto's in DynC and GCC
« Reply #38 on: May 16, 2014, 02:28:45 AM »
.......
Jump Tables and Tokens:

maybe not so simple in C:
Code: [Select]
'SETUP
======
sys j
sys t[]={@AA,@BB,@CC} 'map label addreses into table

'TEST
=====

token=2
j=t[token] : jmp j

jmp fwd done

AA:
print "AA"
jmp fwd done

BB:
print "BB"
jmp fwd done

CC:
print "CC"
jmp fwd done

done:

Hi Charles,

GNU C supports computed goto's so your scheme can be used literally in both FBSL's Dynamic C and GCC alike. Both languages resolve goto to a jmp instruction. The scheme now works for the intended purposes in the FBSL source code.

Thanks again!

.

Charles Pegge

  • Guest
Re: Tiny Benchmark Test
« Reply #39 on: May 16, 2014, 03:16:18 AM »
Excellent Mike, The best possible performance can be obtained for token code. Calling functions has a significant overhead.

You can also optimise integer processing for simple variables and common tasks like iteration.

Assuming you don't want to take the JIT path, of course. I can see the portability advantage of not dealing with specific processors.

Mike Lobanovsky

  • Guest
Re: Tiny Benchmark Test
« Reply #40 on: May 16, 2014, 06:12:54 AM »
You can also optimise integer processing for simple variables and common tasks like iteration.

No way, Charles. There are no simple variables in FBSL's BASIC:
 
1. Every little thing is at least a 16-byte COM-style Variant. Besides the Variant proper, strings and compound data types grow sideways branches of associated readable/writable pool memory.

2. Functions grow a downwards tree-like sub-script of parameters, local Variant variables and code proper while the top Variant hosts the return value. DynAsm and DynC blocks are framed as ordinary functions with a sideways branch of associated executable pool memory filled with machine code bytes.

3. Namespaces and/or class instances are tree-like sub-scripts of their own that replicate the general structure of an FBSL main script .

4. Main script is a huge tree wherein:

-- the top Variant hosts the app exit code;
-- the trunk is made up of global Variant variables and global code, if any; and
-- the branches are function sub-trees, and/or namespaces, and/or class instances, if any, with their own functions/methods as described in Item 2.

An FBSL program's memory footprint is like a Christmas fir-tree - very attractive to look at but extremely prickly to the touch whenever you want to change the sources. ;D

Quote
Assuming you don't want to take the JIT path, of course.

FBSL v3.5's BASIC will stay as it is. FBSL v4's BASIC will be a strong data type JIT compiler - no more Variants except for the COM layer.

Quote
I can see the portability advantage of not dealing with specific processors.

Hmmm... That is, in the core engine? Yet there can be a range of connectable modules responsive to a call to IsProcessorFeaturePresent() - that's what our "big brothers" have (e.g. MS VC and GCC). And I think it will be reasonable to target Intel architectures for Windows/Linux/Mac mainly, at least at this stage of project development.

Charles Pegge

  • Guest
Re: Tiny Benchmark Test
« Reply #41 on: May 16, 2014, 10:02:17 PM »
Hi Mike,

I found this lecture very informative from the development angle. Have you seen it before?


Bjarne Stroustrup 2012 keynote presentation on C++11

http://channel9.msdn.com/Events/GoingNative/GoingNative-2012/Keynote-Bjarne-Stroustrup-Cpp11-Style

In particular, his discussion of linked-structures vs linear structures (around 46:00)
« Last Edit: May 16, 2014, 10:39:01 PM by Charles Pegge »

Aurel

  • Guest
Re: Tiny Benchmark Test
« Reply #42 on: May 17, 2014, 02:46:37 PM »
Is this what you call ' computed goto ' ?

Quote
Handler Chaining

What I have never seen a C or C++ compiler do for such interpreter
loop code is make one further optimization.
 What is if instead of generating the jump instruction to the top of the "for" loop the compiler
was smart enough to simply compile in the fetch and dispatch into each handler?
 In other words, if there was some kind of funky "goto" keyword syntax
that would allow you to write this in your source code to hint to the compiler to do that:

        switch(peekb(PC++))
            {
        default: /* undefined opcode! treat as nop */
        case opNop:
            goto case(peekb(PC++));

        case opIncX:
            X++;
            goto case(peekb(PC++));

        case opLdaAbs16:
            addr = peekw(PC);
            PC += 2;
            A = peekb(addr);
            goto case(peekb(PC++));

You would now have an interpreter loop that simply
jumped from instruction handler to instruction handler
without even looping. Unless I missed something obvious,
C and C++ lack the syntax to specify this design pattern,
and the optimizing compilers don't catch it. It is for this reason that for all
of my virtual machine projects over the past 20+ years I have resorted to using
assembly language to implement CPU interpreters. Because in assembly language you
can have a calculated jump target that you branch to from the end of each handler,
 as this x86 example code which represents the typical instruction dispatch code used
in most past versions of Gemulator and SoftMac and Fusion PC:

    movzx ebx,word ptr gs:[esi]
    add esi,2
    jmp fs:[ebx*4]

Aurel

  • Guest
Re: Tiny Benchmark Test
« Reply #43 on: May 17, 2014, 03:08:13 PM »

Mike Lobanovsky

  • Guest
Re: Tiny Benchmark Test
« Reply #44 on: May 17, 2014, 09:41:54 PM »
@Charles:

Thanks a lot for the both links!

@Aurel:

No, it isn't.

The article Charles pointed to explains computed goto's very clearly. In ordinary C code, you don't know the memory address of a label. The jmp inctruction is hardcoded by the compiler at compile time based on the label's relative offset from the command (goto) that actually jumps to the label:

goto Label;
................. // some other code
Label:


But you still don't know what the real address of the label is.

Computed goto's allow the programmer to retrieve the real address of the label, store it e.g. in a table (i.e. in an array) or even perform some mathematic operations on it if necessary, and use the address or the resultant math value as the target for the instruction that actually jumps to the label or address computed in a math expression. The goto syntax in this case is slightly different:

void *target = &&Label; // && retrieves Label address
target += 100; // add 100 bytes to Label address
goto *target; // jump 100 bytes farther in memory than where Label points to
................. // some other code
Label:


The abstract you quoted comes from Ed's article and it only discusses briefly why switch isn't perfect for an ideal interpreting loop. Yet Charles' article does it in much deeper detail and in better English.

I would however like to add two points to what is written in both articles:

1. The usual break causes two jumps to be performed in this hypothetic loop:

// there's an implicit BeginWhile label here
while (1) {
    switch (code[pc++]) {
    case OP_ONE:
        DoSomething();
        break;
    default:
        DoNothing();
    } // there's an implicit EndSwitch label here
} // there's an implicit EndWhile label here


- first, to the implicit EndSwitch label, and then, from the EndWhile label to the BeginWhile label where while (1) resides.

Assuming there's nothing to execute in between the EndSwitch and EndWhile labels, a more reasonable solution will be to use continue instead of break - it will spare one extra jump:

// there's an implicit BeginWhile label here
while (1) {
    switch (code[pc++]) {
    case OP_ONE:
        DoSomething();
        continue;
    default:
        DoNothing();
    } // there's an implicit EndSwitch label here
} // there's an implicit EndWhile label here


- here continue jumps directly to the implicit BeginWhile label.

2. All these considerations and timings matter only if the code within DoSomething() and DoNothing() is well-optimized, short and very very fast. If it isn't, you won't see any noticeable or measurable improvement from changing break's to continue's, or the entire switch block, to computed goto's and a jump table. :)