Archive for August, 2009

RegExp in BAR: An Application

Monday, August 24th, 2009

We know regular expressions are used all the time. But what do they look like, and how do they fit into the larger scheme of things with BAR?

I’ve provided a case study here. There is an I.F. for Kroz level files (one of Apogee’s earliest game creations) on this site. When I designed the I.F. originally, it picked apart data from source code by using BAR’s 1.0 functionality, which was relatively limited–binary parsing only. But source code is best parsed as text, by regular expressions…which only debuted with version 1.3b of BAR.

No one should reasonably expect to write binary parsing code for most text formats, especially if you aren’t dealing with a cornucopia of powerfully optimized functions in your arsenal. Instead, just write regular expressions for the simplest of syntaxes and work your way up in complexity. Here’s a complete list of the regular expression assignments used in kroz_alt.bar, the 1.3b-capable version of the Kroz level file reader:


//Unordered level syntax
block unorganized textual df_header ::= "DF[" ["0-9"]+ "]:=" [^"'"]* "'";
block unorganized textual df_chars ::= [^"'"]*;
block unorganized textual df_transition ::= "'" ["\x0-\x20"]* "+" [^"'"]* "'";
block unorganized nofragment df_spawncounts { unittype = unsigned short; };

//Ordered level syntax
block unorganized textual procedure_header ::= "procedure Level" ["0-9"]+ ";" ["\x0-\x20"]* "begin";
block unorganized textual fp_padding ::= ^"FP[";
block unorganized textual fp_header ::= "FP[" ["0-9"]+ "]:=" ["\x0-\x20"]* "'";
block unorganized textual fp_chars ::= [^"'"]*;
block unorganized textual fp_remainder ::= ^"end;";
block unorganized textual nofragment fp_symbol_line {
unittype = unsigned char;
enum = kroz_char_enums;
};

//Transition syntax
block unorganized textual df_filler ::= ^"DF[";
block unorganized textual df_filler2 ::= "DF[" ^"DF[";
block unorganized textual fp_filler ::= ^"procedure Level";
block unorganized textual fp_filler2 ::= "procedure Level" ^"procedure Level";

You'll never need to write memcmps and strcmps and strtoks and mids and strlens and...you get the idea. All the above unorganized blocks instantly validate and size properly if the pattern matches!

Of course, there is often a need to put fields together in particular patterns, in which you must extract individual portions of each field. You don't want to make a truly massive regular expression for the entire dataset in such a case--it gives you only one node of data. Instead, you'll want to employ organized blocks to characterize the text fields in a more list-friendly fashion:


block organized ordered_level_line {
mainbody nodelist {
block fp_padding;
block fp_header;
block fp_chars;
};
bool Termination() { return (++iterations >= max_ylines); };
};

block organized ordered_level {
void Initialization() { iterations = 0; };

mainbody nodelist {
block procedure_header;
block organized ordered_level_lines {
mainbody nodelist repeats {
block ordered_level_line;
};
};
block fp_remainder;
};
};

block organized unordered_level {
mainbody nodelist {
block df_header;
block df_chars;
choice optional { block df_transition; };
choice optional { block df_chars; };
choice optional { block df_transition; };
choice optional { block df_chars; };
};
};

In the final release of the I.F., I've made the block definitions a bit more complex, of course. This is because the above definitions only give you the text fields as an ordered list--people would find the data a heckuva lot more useful if the fields had undergone alphanumeric-to-binary conversion, had enumerations broken out, etc.

To do this, just add some Deserialize calls and you're good to go.

The output for an unordered level looks like this:


struct unordered_spawn_counts {
slow_enemy = 600,
medium_enemy = 0,
fast_enemy = 0,
breakable_block = 0,
whip = 20,
stairs = 1,
chest = 0,
slow_time = 5,
gem = 30,
blindness_potion = 0,
teleport_scroll = 5,
key = 0,
door = 0,
solid_wall = 0,
speed_time = 0,
teleport_trap = 0,
river = 0,
power_ring = 0,
forest = 0,
tree = 0,
bomb = 0,
lava = 0,
pit = 0,
staff = 0,
tunnel = 0,
freeze_time = 0,
nugget_or_artifact = 20,
quake_trap = 5,
invisible_breakable_block = 0,
invisible_solid_wall = 0,
invisible_door = 0,
enemy_stop_space = 0,
enemy_activator = 0,
enemy_zap_spell = 0,
enemy_creation_trap = 0,
enemy_generator = 0,
enemy_activator2 = 0,
moving_block = 700...

A list of descriptive spawn counts for a randomly generated level! Now that's useful. But is it an improvement? See for yourself how Scott Miller encoded them originally:


DF[21]:=
{ 1 2 3 X W L C S + I T K D # F . R Q B V = A U Z * E ; : - @ ] G ( M )}
'600 20 1 5 30 5 20 5 700 '+
{ P ! O H N [ | " 4 5 6 7 8 9 Y 0 ~ $}

Good Lord! I’m dead serious! If you can make sense of that, you let me know!

As far as parsing difficulty is concerned, I’d place Kroz level files in the “moderate” category when it comes to what you can do with regular expressions. XML would fall into the “easy” category because it’s very sound and has a well-defined syntax. METAR would fall into the “hard” category because it’s ill-defined, inconsistent, and barely even human-readable. Not that any text format would be impossible.

Since BAR is a relatively new technology, I’m all ears for interesting new challenges people have with ETL or other readabilty/portability/conversion issues. Chances are good that BAR can tear it to pieces within hours.