Author Topic: Idea for a scanner and ? about translating to OB (Read 2664 times)

jumpandrun · « **on:** May 11, 2012, 03:18:34 AM »

Recently I made up my mind about how to quickly scan a (source) string for keywords, i.e. like when writing a compiler (scanner).

Now, I am not very versed in school-book compiler techniques, so am doing it 'my way'. I have already created a compiler before (written in freebasic btw) that compiles a source in to C; and it did so pretty fast, compiling about a million lines of source code per second.
I used hashes and binary search method to identify keywords. But I think that my new technique would be faster.
I am courious what anybody else thinks about this.
However, I am not sure if a similiar technique already exists and I am only re-inventing the wheel here :D — likely it does since it is basically a form of tree traversal.

The technique explained:
I want to 'eliminate' the search and rather 'instantly' ('on the fly') find if an identifier is a keyword. Keywords consist of characters, the idea is that those characters form a path — indexes into a hierarchy of arrays (alphabets), that I populate beforehand in the appropriate way. Note: Some keywords may be subsets of other keywords, i.e. "for" occurs in "foreach"; in order to work one has two options and must strictly adhere to one of them when populating the structure: In such a case the precedence when populating is either from the shortes to the longest keyword or vice versa. I have chosen longest before shortes.

So, when I scan a source character by character, I simply traverse the hierarchy of arrays, and, if the traversal does not meet a dead end or does not end up with a wrong length, it is a keyword.
The 'length' parameter is necessary to resolve disambiguities, i.e. "for" and "foreach" would traverse along the same path but not to the same level.

Also the technique trades memory for speed and the 'garbage collector' is a little tricky too — the allocated memory has to be freed recursively.
Of course this is not a problem in environments with an automatic garbage collection :D

So far I have implemented the code in C (and Java) and am having trouble translating it into OB; that is, I do not want to produce memory holes with trying this and that, as the pointers-thing may work differently.
If someone could assist with it I would appreciate (see second code below, my version so far — not tested).
C code:

Code: [Select]


// scan.c

#include <stdio.h>
#include <stdlib.h>

#define TRUE -1
#define FALSE 0

struct N
{
	struct N* n;
	int l;
};

void populate(void); //init structure
int match(char*);
void gc(struct N*); //garbage collect

struct N* h;
struct N* a;
char* keywords[] = {"fi", "foreach", "for"};

int main()
{
	h = (struct N*)calloc(26, sizeof(struct N));
	char* keyword = "foreach";
	populate();
	if (match(keyword)) printf("match: \"%s\"\n", keyword); else printf("no match: \"%s\"\n", keyword);
	gc(h);
	free(h);
	return(0);
}

void populate()
{
	int k, l;
	for (k = 0; k < sizeof(keywords) / sizeof(char*); k++)
	{
		char* c;
		l = -1;
		for (c = keywords[k]; *c; c++) l++; //strlen
		a = h;
		for (c = keywords[k]; *c; c++)
		{
			a += *c - 'a';
			if (a->n == NULL) a->n = (struct N*)calloc(26, sizeof(struct N));
			a->l = l--;
			a = a->n;
		}
	}
}

int match(char* kword)
{
	int l = -1;
	a = h;
	char* c;
	for (c = kword; *c; c++)
	{
		if (*c < 'a' || *c > 'z') return(FALSE);
		a += *c - 'a';
		if (a->n == NULL) return(FALSE);
		l = a->l;
		a = a->n;
	}
	if (l != 0) return(FALSE);
	return(TRUE);
}

void gc(struct N* n)
{
	int i;
	for (i = 0; i < 'z' - 'a'; i++)
	{
		if (n->n != NULL)
		{
			gc(n->n);
			free(n->n);
		}
		++n;
	}
}

OB code:

Code: [Select]


// scan.bas
#case sensitive

#define TRUE -1
#define FALSE 0
#define ASCA 97
#define ASCZ 122
#lookahead
#indexbase 0

type N
	N* n
	int l
end type

N* h
N* a

string keywords[2] => ("end if", "foreach", "for")

function main as integer
	@h = getmemory 26 * sizeof N
	string keyword = "foreach"
	populate
	if match keyword
		print "match: " keyword
	else
		print "no match: " keyword
	end if
	gc @h
	freememory @h
end function

main

sub populate
	for k = 1 to spanof keywords
		l = len keywords[k] - 1
		@a = @h
		byte* c
		@c = *keywords[k]
		while c 
			&a += c - ASCA
			if @a.n == 0 then @a.n = getmemory 26 * sizeof N
			a.l = l--
			&a = &a.n
			@c++
		wend
	next
end sub

function match(string kword) as integer
	@a = @h
	byte* c
	@c = *kword
	while c
		if c < ASCA || c > ASCZ then return FALSE
		@a += c - ASCA
		if @a.n == 0 then return FALSE
		l = a.l
		@a = @a.n
		@c++
	wend
	if l != 0 then return FALSE
	return TRUE
end function

sub gc(N* n)
	for i = 0 to ASCZ - ASCA
		if @n.n != 0
			gc @n.n
			freememory @n.n
		end if
		@n++
	next
end sub

Peter · « **Reply #1 on:** May 11, 2012, 05:10:31 AM »

After I saw your source code, I think, OxygenBASIC is a misleading name.

jumpandrun · « **Reply #2 on:** May 11, 2012, 05:24:20 AM »

Made some corrections but still OB does not compile. No clue.

Hallo Peter, verstehe den Witz nicht

(do not understand the joke — do you mean because it is so much clearer? Yes it is, if it works.)

Charles Pegge · « **Reply #3 on:** May 11, 2012, 06:23:09 AM »

Peter means it does not look like Basic

I think I've got the concept though. This is feasible for a fixed set of words but would it cope with an indeterminate number of words and symbols?.

Charles

jumpandrun · « **Reply #4 on:** May 11, 2012, 07:22:38 AM »

No it is meant for a rather small set of keywords, i.e. like in a language like C.

Well, yes it does not look much basic-like, but this is always the case when you have to use pointers and such, I think (still it looks better in OB than in C imo).
In Java I can dynamically create new objects (arrays of objects that is), but I think in OB this is not possible (yet?).

Still I cannot compile it, OB complains about 'unidentified names', however they are obviously declared/defined :D

Basically the question is: what is the OB equivalent to a construct like

l = n->l (a variable is assigned the value of another variable via pointer)

or

a = a->n (a pointer is assigned the value of another pointer via pointer)

?

Charles Pegge · « **Reply #5 on:** May 11, 2012, 08:47:58 AM »

Once a pointered variable is set up with an address, it is treated like any other variable, and '->' arrows are not used

Code: [Select]

'DYNAMIC STRUCTURES AND POINTERS
'===============================

type linkedlist
  string key
  sys nex
end type

string listbuf=nuls 100*sizeof linkedlist
linkedlist *pl


@pl=strptr listbuf 'map to string space


'->' arrows are not used
'=======================

pl[1].key="foreach"
pl[2].key="for"
pl[2].nex=@pl[1]-@pl 'make relative link


'listbuf is easily extended
'==========================

listbuf+=nuls 100*sizeof linkedlist
@pl=strptr listbuf 'remember to remap!


print pl[1].key + "    " + pl[2].key  "   "

'jumping by link
'===============

linkedlist *pm
@pm=pl[2].nex+@pl 'try the link

print pm.key

Charles

Oxygen Basic

News:

Author Topic: Idea for a scanner and ? about translating to OB (Read 2664 times)

jumpandrun

Idea for a scanner and ? about translating to OB

Peter

Re: Idea for a scanner and ? about translating to OB

jumpandrun

Re: Idea for a scanner and ? about translating to OB

Charles Pegge

Re: Idea for a scanner and ? about translating to OB

jumpandrun

Re: Idea for a scanner and ? about translating to OB

Charles Pegge

Re: Idea for a scanner and ? about translating to OB