Author Topic: Nice Test Data File 33MB Approximately 3,173,959 records  (Read 6304 times)

0 Members and 1 Guest are viewing this topic.

kryton9

  • Guest
Nice Test Data File 33MB Approximately 3,173,959 records
« on: May 31, 2012, 08:48:45 PM »
I wanted to do some tests with large amounts of data and found this nice free download.
http://www.maxmind.com/app/worldcities
« Last Edit: May 31, 2012, 11:21:00 PM by kryton9 »

JRS

  • Guest
Re: Nice Test Data File 33MB Approximately 2,710,000 records
« Reply #1 on: May 31, 2012, 09:42:59 PM »
Nice find!

I think it would be really easy to wrap the GeoIP API in ScriptBasic. It could then be used (possibly static linked) with sbhttpd to manage access by geo-location. (select language translation for example)


kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #2 on: May 31, 2012, 11:02:09 PM »
Thanks John. I cleaned up the data. While importing into sqlite3 it was giving errors that are common when some fields have " and ' characters in between the , separator.

The cleaned up csv file:
http://www.kryton9.com/forweb/oxy/data.7z
SciTe can read csv file, but make sure your line numbers are set to a high enough value to show millions of lines.
SciTEGlobal.properties is where you make your changes, at line number 55 line.margin.width=8 is what I used.



Imported into database outlined below:
http://www.kryton9.com/forweb/oxy/MaxMindWorldCity.7z

The database is named MaxMindWorldCity.db  the table I imported into is mmwc
The fields are:
CountryCode char(2)
AscCityName varchar(100)
CityName varchar(255)
StateRegion char(2)
Population unsigned int
Latitude real
Longtitude real
« Last Edit: June 01, 2012, 12:05:12 AM by kryton9 »

Charles Pegge

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #3 on: June 01, 2012, 07:56:44 AM »
144 Meg!

This is the largest lump of data, I have handled so far, but a modern PC make light work of it. Using the import format, keeps it compact enough to load into memory, and hold in one string

Eglwyswrw is a small village about 12 miles from here :)

Code: OxygenBasic
  1.  
  2.  
  3. 'data source:
  4. 'http://www.maxmind.com/app/worldcities
  5. '33meg compressed
  6. '144meg uncompressed textfile
  7.  
  8.  
  9.  
  10. bstring dat=getfile "WorldCitiesPop.txt"
  11.  
  12. if len(dat)=0 then
  13.   print "need WorldCitiesPop.txt" : jmp fwd done
  14. end if
  15.  
  16. 'a=instr dat,chr(10)
  17. 'print a
  18.  
  19. 'Country Code   char(2) ISO 3166 Country Code,
  20. 'ASCII CityName varchar(100)    Name of city or town in ASCII encoding
  21. 'City Name      varchar(255)    Name of city or town in ISO-8859-1 encoding. A list of cities contained in GeoIP City is available. A localized CSV file containing city names in various languages is available.
  22. 'State/Region   char(2) For US, ISO-3166-2 code for the state/province name. Outside of the US, FIPS 10-4 code
  23. 'Population     unsigned int    Population of city (available for over 33,000 major cities only)
  24. 'Latitude       numeric (float) Latitude of city where IP is located
  25. 'Longitude      numeric (float)
  26.  
  27. 'type CityRecord
  28. 'char   CountryCode[2]
  29. 'char   CityNameA[100]
  30. 'char   CityNameL[255]
  31. 'dword  Population
  32. 'float  Latitude
  33. 'float  Longitude
  34. 'end type
  35.  
  36. '#recordof CityRecord
  37. 'CityRecord *c : @c=strptr dat
  38.  
  39.  
  40. 'TEXT FILE FORMAT
  41. '================
  42.  
  43. 'comma delimited (44)
  44. 'line feed (10) record separator
  45.  
  46. byte *byt : @byt=strptr dat
  47.  
  48. lf=chr(10)
  49. cm=chr(44)
  50.  
  51. i=instr dat,"eglwyswrw"
  52. 'i=instr dat, ",chicago"
  53. 'i=instr dat,"gb,london"
  54. if i=0 then i=1
  55. e=instr i,dat,lf
  56.  
  57. 'locate start of record
  58. '======================
  59.  
  60. do
  61.   if byt[ i ]=10 then i++ : exit do
  62.   if i=1 then exit do
  63.   i--
  64. end do
  65. b=i
  66.  
  67. print  mid dat,b,e-b
  68.  
  69. 'split into fields
  70. '=================
  71.  
  72. i=b
  73. pr=""
  74. tab=chr(9)
  75. cr=chr(13)+chr(10)
  76.  
  77. a=instr(i,dat,cm) : pr+="CountryCode: " tab mid(dat,i,a-i) cr : i=a+1
  78. a=instr(i,dat,cm) : pr+="City Key:    " tab mid(dat,i,a-i) cr : i=a+1
  79. a=instr(i,dat,cm) : pr+="City Name:   " tab mid(dat,i,a-i) cr : i=a+1
  80. a=instr(i,dat,cm) : pr+="Region:      " tab mid(dat,i,a-i) cr : i=a+1
  81. a=instr(i,dat,cm) : pr+="Population:  " tab mid(dat,i,a-i) cr : i=a+1
  82. a=instr(i,dat,cm) : pr+="Latitude:    " tab mid(dat,i,a-i) cr : i=a+1
  83. a=instr(i,dat,lf) : pr+="Longtude:    " tab mid(dat,i,a-i) cr : i=a+1
  84.  
  85. print pr
  86.  
  87.  
  88. 'count number of records in database
  89. '===================================
  90.  
  91. b=strptr dat
  92. mov ecx,b
  93. mov edx,0
  94. (
  95.  mov al,[ecx]
  96.  cmp al,0
  97.  jz exit
  98.  cmp al,10
  99.  (
  100.   jnz exit
  101.   inc edx
  102.  )
  103.  inc ecx
  104.  repeat
  105. )
  106. mov e,edx
  107.  
  108. print "Total Records: " e
  109.  
  110. frees dat
  111.  
  112.  
  113. done:
  114.  
« Last Edit: June 01, 2012, 09:07:35 AM by Charles Pegge »

JRS

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #4 on: June 01, 2012, 10:54:26 AM »
I thought it might be a good stress test to see if I could load the original 151 MB file into a string and SPLITA it into an array delimited by the comma  data separator. The array ended up being 19,043,754 string elements. I only have 2 GB of system memory in my laptop and the disk access light was lit up like a candle through the process. I can't believe how slow Linux cleanup of swap space takes. Can someone with few more gigs of system memory give this a shot and tell me how long it takes when done in memory?

Code: [Select]
IMPORT t.bas

s = t::LoadString("worldcitiespop.txt")

SPLITA s BY "," TO a

PRINT UBOUND(a),"\n"


« Last Edit: June 01, 2012, 08:45:41 PM by JRS »

kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #5 on: June 01, 2012, 11:26:24 AM »
Thanks for your code Charles. Can you modify it so it can get data out with something like the "like" command I used in my sqlite3 test? That is, you put in the starting few characters and a wildcard to bring matching records. 'can%'  the % is wildcard for the any number of characters afterwards. This way we have something compare the benchmarks with. I am impressed with sqlite3 speed.

In the meantime here is doing the same query in sqlite3 that you did in your example as a comparison in attached image.

Now the big question how do you pronounce that town?  My guess  "eagles were"

« Last Edit: June 01, 2012, 11:50:30 AM by kryton9 »

kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #6 on: June 01, 2012, 11:45:22 AM »
Code: [Select]
IMPORT t.bas

s = t::LoadString("worldcitiespop.txt")

SPLITA s BY "," TO a

PRINT UBOUND(a),"\n"
[/quote]

John, I would like to test it. What do I run where? t.bas just has a print statement in my oxygen folder. Is that SB code or O2? Sorry for my confusion.

Charles Pegge

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #7 on: June 01, 2012, 11:46:32 AM »
eglwyswrw:
egg loos ooroo

guess this one:
Scleddau

First I must trim the hedge before it gets dark, then I'll do a WildCard, Kent

Charles

kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #8 on: June 01, 2012, 11:58:25 AM »
Quote
guess this one:
Scleddau
  Schooled Dow

I updated my previous post with your code and benchmarked screenshots.

JRS

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #9 on: June 01, 2012, 12:37:19 PM »
Quote
John, I would like to test it. What do I run where? t.bas just has a print statement in my oxygen folder. Is that SB code or O2? Sorry for my confusion.

If your scriba.conf has the paths assigned to your module and include directories, the IMPORT t.bas is all you need. If you don't have the SB configuration setup (in binary form) then you will have to declare the LoadFile function in your program and IMPORT giving the full path surrounded by quotes. Here a minimal scriba.conf.txt you can start off with. (fixing paths) Make sure you compile the configuration file with the scriba -k scriba.conf.txt which generate the binary scriba.conf I normally put in my SB bin directory. (It's created in C:\WINDOW by default)

Code: [Select]
dll ".dll"

module "c:\\scriptbasic\\modules\\"
include "c:\\scriptbasic\\include\\"

maxinclude 100
maxstep 0
maxlocalstep 0
maxlevel 3000
maxmem 0

To run the script, copy (or give full path to) the "worldcitiespop.txt" file and type this in at the command line.

scriba yourscriptname.sb

You may want to add a NOW at the start and again at the end to time the process under Windows. (seconds to do this is fine)

Charles Pegge

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #10 on: June 01, 2012, 12:53:52 PM »
Scleddau:
skle thee

How about Cwmyreglwys?





Here is a simple leftmatch count to add to the demo code

Code: OxygenBasic
  1.  
  2. 'leftmatch (except for country code)
  3. '===================================
  4.  
  5. k="new"
  6. k=cm+k 'prefix comma
  7. a=1
  8. c=0
  9. do
  10.   a=instr a,dat,k
  11.   if a=0 then exit do
  12.   b=instr a,dat,lf
  13.   c++
  14.   'store result index here..
  15.  a=b
  16. end do
  17.  
  18. print "Count for leftmatch: '" k "'" tab c
  19.  
« Last Edit: June 01, 2012, 01:00:50 PM by Charles Pegge »

kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #11 on: June 01, 2012, 03:40:05 PM »
@John, I ran out of memory also and get an error.
I have 2902 MB, so 2.9GB free memory when I run the app, but it eats up the memory very quickly and runs out.

@Charles How about Cwmyreglwys? koo moor egg loos
I think sqlite3 is going to be tough to beat. I will keep testing it out. In the "eglwyswrw" test, Oxygen:  0.046 seconds.   Sqlite3:  0.00059 seconds.
« Last Edit: June 01, 2012, 03:57:52 PM by kryton9 »

JRS

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #12 on: June 01, 2012, 05:56:17 PM »
Quote
I ran out of memory also and get an error.

At least Linux let me run the script but using the disk for inadequate memory is painfully slow.

Charles Pegge

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #13 on: June 01, 2012, 07:23:52 PM »
Splitting it into 22 million strings is harsh indeed! :) Even the simplest field index, recording the positions of each comma would require an additional 88 meg.

I think SQlite must be very good at indexing its data. Oxygen's instring could be further improved, but only marginally. This a about as good as it gets for unstructured data (.046 secs)

Very close, Kent:

Cwmyreglwys
coom ir egg lewis
Cwm means bay and Eglwys means church.

Charles

PS:

Elvis Presley is an interesting name:

St Elvis is a village near here and the hills which form the interior of Pembrokeshire are called the Preselis.


« Last Edit: June 01, 2012, 07:43:03 PM by Charles Pegge »

kryton9

  • Guest
Re: Nice Test Data File 33MB Approximately 3,173,959 records
« Reply #14 on: June 01, 2012, 08:42:23 PM »
Quote
This a about as good as it gets for unstructured data (.046 secs)
That is still really fast for so many records!

Welsh is also spoken in a province in Argentina as well. http://en.wikipedia.org/wiki/Welsh_language
Thanks for the lessons, but I am terrible not only in remembering programming languages, but in human languages as well. The female side of my family can pick up languages in about 6 weeks, me it will take 6 lifetimes :)