Archive for the ‘What’s in a BAR?’ Category

RegExp in BAR: An Application

Monday, August 24th, 2009

We know regular expressions are used all the time. But what do they look like, and how do they fit into the larger scheme of things with BAR?

I’ve provided a case study here. There is an I.F. for Kroz level files (one of Apogee’s earliest game creations) on this site. When I designed the I.F. originally, it picked apart data from source code by using BAR’s 1.0 functionality, which was relatively limited–binary parsing only. But source code is best parsed as text, by regular expressions…which only debuted with version 1.3b of BAR.

No one should reasonably expect to write binary parsing code for most text formats, especially if you aren’t dealing with a cornucopia of powerfully optimized functions in your arsenal. Instead, just write regular expressions for the simplest of syntaxes and work your way up in complexity. Here’s a complete list of the regular expression assignments used in kroz_alt.bar, the 1.3b-capable version of the Kroz level file reader:


//Unordered level syntax
block unorganized textual df_header ::= "DF[" ["0-9"]+ "]:=" [^"'"]* "'";
block unorganized textual df_chars ::= [^"'"]*;
block unorganized textual df_transition ::= "'" ["\x0-\x20"]* "+" [^"'"]* "'";
block unorganized nofragment df_spawncounts { unittype = unsigned short; };

//Ordered level syntax
block unorganized textual procedure_header ::= "procedure Level" ["0-9"]+ ";" ["\x0-\x20"]* "begin";
block unorganized textual fp_padding ::= ^"FP[";
block unorganized textual fp_header ::= "FP[" ["0-9"]+ "]:=" ["\x0-\x20"]* "'";
block unorganized textual fp_chars ::= [^"'"]*;
block unorganized textual fp_remainder ::= ^"end;";
block unorganized textual nofragment fp_symbol_line {
unittype = unsigned char;
enum = kroz_char_enums;
};

//Transition syntax
block unorganized textual df_filler ::= ^"DF[";
block unorganized textual df_filler2 ::= "DF[" ^"DF[";
block unorganized textual fp_filler ::= ^"procedure Level";
block unorganized textual fp_filler2 ::= "procedure Level" ^"procedure Level";

You'll never need to write memcmps and strcmps and strtoks and mids and strlens and...you get the idea. All the above unorganized blocks instantly validate and size properly if the pattern matches!

Of course, there is often a need to put fields together in particular patterns, in which you must extract individual portions of each field. You don't want to make a truly massive regular expression for the entire dataset in such a case--it gives you only one node of data. Instead, you'll want to employ organized blocks to characterize the text fields in a more list-friendly fashion:


block organized ordered_level_line {
mainbody nodelist {
block fp_padding;
block fp_header;
block fp_chars;
};
bool Termination() { return (++iterations >= max_ylines); };
};

block organized ordered_level {
void Initialization() { iterations = 0; };

mainbody nodelist {
block procedure_header;
block organized ordered_level_lines {
mainbody nodelist repeats {
block ordered_level_line;
};
};
block fp_remainder;
};
};

block organized unordered_level {
mainbody nodelist {
block df_header;
block df_chars;
choice optional { block df_transition; };
choice optional { block df_chars; };
choice optional { block df_transition; };
choice optional { block df_chars; };
};
};

In the final release of the I.F., I've made the block definitions a bit more complex, of course. This is because the above definitions only give you the text fields as an ordered list--people would find the data a heckuva lot more useful if the fields had undergone alphanumeric-to-binary conversion, had enumerations broken out, etc.

To do this, just add some Deserialize calls and you're good to go.

The output for an unordered level looks like this:


struct unordered_spawn_counts {
slow_enemy = 600,
medium_enemy = 0,
fast_enemy = 0,
breakable_block = 0,
whip = 20,
stairs = 1,
chest = 0,
slow_time = 5,
gem = 30,
blindness_potion = 0,
teleport_scroll = 5,
key = 0,
door = 0,
solid_wall = 0,
speed_time = 0,
teleport_trap = 0,
river = 0,
power_ring = 0,
forest = 0,
tree = 0,
bomb = 0,
lava = 0,
pit = 0,
staff = 0,
tunnel = 0,
freeze_time = 0,
nugget_or_artifact = 20,
quake_trap = 5,
invisible_breakable_block = 0,
invisible_solid_wall = 0,
invisible_door = 0,
enemy_stop_space = 0,
enemy_activator = 0,
enemy_zap_spell = 0,
enemy_creation_trap = 0,
enemy_generator = 0,
enemy_activator2 = 0,
moving_block = 700...

A list of descriptive spawn counts for a randomly generated level! Now that's useful. But is it an improvement? See for yourself how Scott Miller encoded them originally:


DF[21]:=
{ 1 2 3 X W L C S + I T K D # F . R Q B V = A U Z * E ; : - @ ] G ( M )}
'600 20 1 5 30 5 20 5 700 '+
{ P ! O H N [ | " 4 5 6 7 8 9 Y 0 ~ $}

Good Lord! I’m dead serious! If you can make sense of that, you let me know!

As far as parsing difficulty is concerned, I’d place Kroz level files in the “moderate” category when it comes to what you can do with regular expressions. XML would fall into the “easy” category because it’s very sound and has a well-defined syntax. METAR would fall into the “hard” category because it’s ill-defined, inconsistent, and barely even human-readable. Not that any text format would be impossible.

Since BAR is a relatively new technology, I’m all ears for interesting new challenges people have with ETL or other readabilty/portability/conversion issues. Chances are good that BAR can tear it to pieces within hours.

Regular Expressions: A subset of BAR

Monday, July 27th, 2009

Great news! We’re now at 1.3b of the BAR engine. This means that you can define both text and binary syntaxes easily with BAR.

BAR was syntactically weak when it came to validating and sizing text strings in earlier versions. Take the text “procedure Level17 ;” for example. If this is a de-facto header, it doesn’t fit within a neat header structure in BAR. You’ll need to account for lots of variable-length portions of data, with optional whitespace, and character combinations not easily reconciled by the basic scripting functionality:

char procedure_start_string[] = "procedure Level";
block unorganized textual nofragment procedure_header_1 {
unittype = char;

bool Validation()
return (!memcmp(this, procedure_start_string, strlen(procedure_start_string)));
};
long BlockSize() {
return strlen(procedure_start_string);
};
};

And this is only one portion! The entire portion is characterized by the following:

block organized procedure_header {
mainbody nodelist {
block procedure_header_1;
block numerals;
choice optional { block whitespace; };
block semicolon;
};
};

Note we haven’t even declared how whitespace, numerals, or semicolon are supposed to characterize our fields. The bottom line, folks: this is a yucky, yucky way to characterize text formats.

With BAR 1.3b, you can simplify everything. Replace all the above with just this one line:

block unorganized textual procedure_header ::= "procedure Level" ["0-9"]+ ["\x0-\x20"]* ";"

That’s it! Just one line for a node with a complex syntax.

Regular expressions, which are often defined using either Perl “slashed” syntax or Extended-Backus-Naur Form (EBNF), are rather difficult to read if you’re not familiar with them. However, they are easy to understand once you get the hang of them, and syntactically, they are incredibly powerful.

In BAR’s case, I have chosen to use regular expression syntax that closely resembles the EBNF definitions found on W3C’s website for XML and other formats (http://www.w3.org). I’ve also been designing a still-unreleased I.F. called XML.BAR, which uses many of the same expressions from W3C as a way to characterize unorganized blocks in BAR.

BAR now supports most of the staples found in regular expressions:

  • Quoted strings: using “abc” or ‘abc’, indicates presence of whole, case-sensitive strings.
  • Character classes []: using brackets, indicates multiple character choices that can be present at any one particular character location.
  • Asterisk (*): place on end of expression to repeat indefinitely, and make expression optional.
  • Plus (+): place on end of expression to repeat indefinitely, and force presence of at least one iteration.
  • Question (?): place on end of expression to make expression optional (0 or 1 instance only).
  • Specific Repeat Counts {3, 5}: place on end of expression to make expression have a repeat count within a specific range. In this example, minimum is 3 iterations, maximum is 5 iterations.
  • NOT operator (^): place inside character class, in front of quoted string, or in front of parenthetical notation to match every possibility BUT the combination to the right.
  • AND, OR, and AND NOT operators ((space, |, -): adjacent expressions with just a space between them (AND), a pipe between them (OR), or a hyphen between them (AND NOT) act as boolean operators when testing multiple conditions in expressions.

There are still limitations:

  • Character classes allow a NOT operator inside brackets, but it must not be quoted.
  • Character classes have valid characters or ranges inside quotes (single or double). All markup is consistent with BAR’s backslash-oriented markup for string literals; there is no Perl-like markup for whitespace such as /s or related escapes.
  • To specify the hyphen character in a character class, it must be placed at the very beginning of the string. All other appearances count as range specifiers.
  • ^”abc” Has the effect of returning all characters leading UP to the combination “abc”, if it exists. If “abc” doesn’t exist, the entire set of remaining characters is returned.
  • [^"0-9"] Has the effect of matching all characters EXCEPT numerals.
  • ^(“abc” | “123″) Only looks at IMMEDIATE location for non-match to “abc” or “123″. Will not scan for either combination and then stop.
  • ["a-z"]* – “aa” Only excludes a combination that starts with “aa”. Will not extend to the first arbitrary point at which “aa” is found.
  • AND has higher priority than OR, which in turn has higher priority than AND NOT.
  • Unorganized blocks are forced to have 1-byte character unit type as well as the nofragment attribute.
  • You cannot specify already-declared names of unorganized blocks in these expressions. For example, you can’t declare “Name” first, then declare Name2 ::= Name ” ” Name ” ” Name.
  • Organized blocks cannot be declared using regular expressions.

I would eventually like to relax many of these restrictions, especially the last two. Feedback on what sort of improvements you’d like to see in this arena is more than welcome.

The file size and time-to-implementation for many of my formats in the works has dropped dramatically as a result of these changes! A few syntactic changes can go a long way. When each BAR I.F. can act as a unique integrated compiler, the possibilities are endless.

BAR Engine Update: 1.3

Wednesday, July 15th, 2009

The demo version of the software is now running 1.3. While BARfly still has most of these features hidden from the user, the engine is a lot more powerful than it used to be. In particular:

1) BAR Navigational Strings (BNS) now officially supported
2) Bookmark stacks (Push and Pop for bookmarks) now officially supported
3) More advanced reading, writing, insertion, and deletion operations
4) Many new I.F. operators and built-in functions
5) Ability to insert and delete nodes from I.F. functions
6) Native callback functions from I.F. functions
7) More diverse auto-advancement options for reading, writing, insertion, and deletion operations

BAR developers can take advantage of many new programming features, like scripts that can insert and delete nodes, more “STDLIB.H-type functions” like atof, atol, stricmp, and strtok, and more optimized compiled opcode execution.

Why does BARfly not look different? For a simple reason: most of these updates would only give you a few more function-call choices when you press F8. The real power lies in the type of implementation files you can create now!

Right on the heels of this update will come two even more important updates:

1) DecideChoice: a new method for large decision lists, which picks a choice using a numerical index rather than evaluating each individual list choice.
2) Unorganized block “regular expression” definitions: the ability to build text blocks using EBNF notation (like W3C uses to describe XML). This will enhance BAR, allowing it to perform high-quality text-parsing operations with ease!

Related to that last point, should BAR be changed to BATAR, or “Binary AND Text Artifact Reference?” Well, not really. Text is really a subset of binary, so it’s still fair to call it BAR.

Characterization vs. Conversion

Thursday, July 9th, 2009

NOTE: This entry is rather technical in nature, geared towards programmers.

A fellow named Robby recently posed a software engineer’s dilemma when it comes to characterizing a file like a database. The question is, at what point is the data useful to me? It might be useful only when converted into the data I want. Or, alternatively, it might only be useful completely raw. Or, possibly, somewhere in the middle.

In other words, where, in the process of deserialization, do you “stop?”

The nice thing about BAR is that you can choose, in the schema, exactly how far you wish to go when you deserialize. You are still limited by the nature of the file format itself, of course: those formats that are constructed with little consideration given to hierarchy, organization, or resynchronization on error will limit a person’s options.

Robby’s example was a JPEG file. Good example, because I offer a free JPEG I.F. on the website.

At the rawest of raw, use FLAT. This yields the entire JPEG file as a single unorganized block. You can read or write anything with FLAT–but chances are you want just a bit more detail.

The next step up is the free format, which breaks the data into segments. The actual image scan itself, though, is untouched.

The next step up is characterization of the bit scan fields (Arithmetic or Huffman). However, no attempt at converting the fields takes place.

The next step up is converting the bit fields into data that can be used. But…I’m being too kind here! In fact, there are four or five individual “stopping points” you could rest at, since many decoding steps are necessary for JPEG. This includes…

1) Arithmetic/Huffman field translation
2) IDCT (Inverse Discrete Cosine Transform) translation
3) Quantization
4) Component generation (generally YUV)
5) Image pixel generation (generally RGB)

Robby’s question was what was useful to him. But I’m asking a more radical question: what are all the possible ways this format can be useful to you?

BAR gives you an unprecedented luxury in being able to “see” the progress as it’s being done. If you need to develop an encoding or decoding implementation on your own, you generally have to rely on classic debugging and testing techniques: conditional breakpoints, single-stepping, debug-output dumps of iterative data. Not to mention cumbersome exception-handlers when you’ve inevitably screwed up.

If you screw up in BAR? No exceptions. Full call stack report. Full node record report. Immediate API return. All that, and it’s platform-independent, AND language-independent.

BAR can be used to characterize. BAR can be used to convert. I won’t presume to know exactly what each individual wants to do with his or her files. The power of BAR is really in the question, not the answer.

What’s in a BAR?

Tuesday, June 16th, 2009

Files, files, and more files. Am I contributing to implementation chaos by creating yet another file format? Not really. BAR is one of those formats that, like XML Schema, can have a “schema for itself.” You can get the BAR format specification on my website.

You have your JAR files and your BAR files. You have your EXE files and your OUT files. Code is code, data is data, and there always seems to be yet one more mechanism for interpreting the huge numbers of bytes that fly across your computer every day.

As for BAR, it has code and data, but unlike other mechanisms for general-purpose code execution or data representation, BAR is intended to be ultra-light. This means that whatever it takes to represent the contents of a file, be it code, markup, or some combination of the two, BAR gets it done with the least amount of resources.

Specifically, a BAR implementation file contains the following:

1) Header: information about the file.
2) Value table: long integers to be looked up later (enumerated constants, tokens, and critical-step UIDS).
3) Construct definitions: the “building blocks” of a file format, for the most part data representation.
4) Variable definitions: individual named quantities used to store data in structures, local variables, global variables, and parameter variables.
5) Enumerated constant definitions: collections of enumerated constants.
6) Function definitions: support functions (critical-step methods and utility functions).
7) Compiled script byte code.
8) String look-up table.
9) (Optional) The uncompiled source code.

And the beauty of all this? It fits into only a few kilobytes. Many BAR implementation files are only around 5,000 bytes!

Platform independence is a nice thing to have. It’s also somewhat taxing to design for platform independence. Many platforms, notably the Java runtime, are HUGE. This is because the runtime needs to do many things that a program is not expected to store in its own code, like compiled low-level calculation routines, interrupt callback handling, low-level network processing, etc. The operating system might provide similar services, or it might not.

What about the BAR engine’s runtime? The Win32 static library version weighs in at only a few hundred kilobytes.

The main reason the BAR engine and associated BAR implementation files are so light is that the purpose is limited. You are NOT trying to display an impressive front-end, handle real-time input, program hardware, or handle networking directly. The sole purpose of a BAR I.F. is to represent information. You can tweak a BAR I.F. to perform conversions to “cook” the data in a file, but this is optional. Use BAR to characterize primarily, and convert secondarily.

Right now, there are about 25 small, basic implementation files on the BARfly website. Not all the file formats, but still enough to characterize much of the information floating around the web. Take the average size of a BAR I.F., which is about 10 KB, and multiply by 100, and you’ve only got 1 MB worth of schemas to characterize virtually every format in existence!

It will be a good day when BAR becomes the end-all for representing file data. Contributions from BARfly Gold users will only make the job easier.