Great news! We’re now at version 1.3b of the BAR engine, which means you can define both text and binary syntaxes easily with BAR.
BAR was syntactically weak when it came to validating and sizing text strings in earlier versions. Take the text “procedure Level17 ;” for example. If this is a de-facto header, it doesn’t fit within a neat header structure in BAR. You’ll need to account for lots of variable-length portions of data, with optional whitespace, and character combinations not easily reconciled by the basic scripting functionality:
char procedure_start_string[] = "procedure Level";
block unorganized textual nofragment procedure_header_1 {
    unittype = char;
    bool Validation() {
        return (!memcmp(this, procedure_start_string, strlen(procedure_start_string)));
    };
    long BlockSize() {
        return strlen(procedure_start_string);
    };
};
And this is only one portion! The entire header is characterized by the following:
block organized procedure_header {
    mainbody nodelist {
        block procedure_header_1;
        block numerals;
        choice optional { block whitespace; };
        block semicolon;
    };
};
Note we haven’t even declared how whitespace, numerals, or semicolon are supposed to characterize our fields. The bottom line, folks: this is a yucky, yucky way to characterize text formats.
With BAR 1.3b, you can simplify everything. Replace all the above with just this one line:
block unorganized textual procedure_header ::= "procedure Level" ["0-9"]+ ["\x0-\x20"]* ";"
That’s it! Just one line for a node with a complex syntax.
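If you are more at home with a conventional regex engine, the expression above can be approximated with Python's re module. This is only a rough translation that mirrors the shape of the pattern. The names PROCEDURE_HEADER and header_length are invented for the example, and BAR's actual matching rules may differ in the details.

```python
import re

# Hypothetical Python approximation of the BAR expression
#   "procedure Level" ["0-9"]+ ["\x0-\x20"]* ";"
# The BAR engine is not involved; this only mirrors the pattern's shape.
PROCEDURE_HEADER = re.compile(
    r"procedure Level"   # literal, case-sensitive prefix
    r"[0-9]+"            # one or more numerals
    r"[\x00-\x20]*"      # optional run of control characters and spaces
    r";"                 # terminating semicolon
)

def header_length(text):
    """Return how many characters the header occupies, or None if absent."""
    match = PROCEDURE_HEADER.match(text)
    return match.end() if match else None

print(header_length("procedure Level17 ; begin"))  # 19, the length of "procedure Level17 ;"
print(header_length("procedure Main;"))            # None, the prefix does not match
```

A single pattern handles both validation (does it match?) and sizing (where does the match end?), which is the same pairing the one-line BAR definition replaces.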
Regular expressions, which are often defined using either Perl “slashed” syntax or Extended-Backus-Naur Form (EBNF), are rather difficult to read if you’re not familiar with them. However, they are easy to understand once you get the hang of them, and syntactically, they are incredibly powerful.
In BAR’s case, I have chosen to use regular expression syntax that closely resembles the EBNF definitions found on W3C’s website for XML and other formats (http://www.w3.org). I’ve also been designing a still-unreleased I.F. called XML.BAR, which uses many of the same expressions from W3C as a way to characterize unorganized blocks in BAR.
BAR now supports most of the staples found in regular expressions:
- Quoted strings: using "abc" or 'abc', indicates presence of whole, case-sensitive strings.
- Character classes []: using brackets, indicates multiple character choices that can be present at any one particular character location.
- Asterisk (*): place on end of expression to repeat indefinitely while also making the expression optional (zero or more iterations).
- Plus (+): place on end of expression to repeat indefinitely, and force presence of at least one iteration.
- Question (?): place on end of expression to make expression optional (0 or 1 instance only).
- Specific Repeat Counts {3, 5}: place on end of expression to make expression have a repeat count within a specific range. In this example, minimum is 3 iterations, maximum is 5 iterations.
- NOT operator (^): place inside character class, in front of quoted string, or in front of parenthetical notation to match every possibility BUT the combination to the right.
- AND, OR, and AND NOT operators (space, |, -): adjacent expressions with just a space between them (AND), a pipe between them (OR), or a hyphen between them (AND NOT) act as boolean operators when testing multiple conditions in expressions.
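For comparison, here is how most of the staples listed above might be written in Python's re syntax. The pairings are my own reading of the descriptions, not output from the BAR engine, and the BAR-side strings are illustrative spellings only. AND is treated as plain sequencing, which is how the procedure_header example reads.

```python
import re

# (BAR-style expression, assumed Python equivalent, what it is meant to show)
EXAMPLES = [
    ('"abc"',         r"abc",        "quoted string: literal, case-sensitive"),
    ('["a-c"]',       r"[a-c]",      "character class"),
    ('["0-9"]*',      r"[0-9]*",     "asterisk: zero or more"),
    ('["0-9"]+',      r"[0-9]+",     "plus: one or more"),
    ('"-"?',          r"-?",         "question: zero or one"),
    ('["0-9"]{3, 5}', r"[0-9]{3,5}", "repeat count: written without a space in Python"),
    ('[^"0-9"]',      r"[^0-9]",     "NOT inside a character class"),
    ('"a" "b"',       r"ab",         "AND: one expression followed by the next"),
    ('"cat" | "dog"', r"cat|dog",    "OR: either alternative"),
]

def matched_prefix(pattern, text):
    """Return the text matched at the start of `text`, or None if no match."""
    match = re.match(pattern, text)
    return match.group(0) if match else None

for bar_expr, py_pattern, note in EXAMPLES:
    print(f"{bar_expr:16} ~ {py_pattern:12} ({note})")

print(matched_prefix(r"[0-9]{3,5}", "1234567"))
print(matched_prefix(r"cat|dog", "dogma"))
```

The AND NOT operator has no direct one-symbol counterpart in Python's syntax, so it is left out of this table.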
There are still limitations:
- Character classes allow a NOT operator inside brackets, but it must not be quoted.
- Character classes have valid characters or ranges inside quotes (single or double). All markup is consistent with BAR’s backslash-oriented markup for string literals; there is no Perl-like shorthand for whitespace such as \s or related escapes.
- To specify the hyphen character in a character class, it must be placed at the very beginning of the string. All other appearances count as range specifiers.
- ^"abc" Has the effect of returning all characters leading UP to the combination "abc", if it exists. If "abc" doesn’t exist, the entire set of remaining characters is returned.
- [^"0-9"] Has the effect of matching all characters EXCEPT numerals.
- ^("abc" | "123") Only looks at IMMEDIATE location for non-match to "abc" or "123". Will not scan for either combination and then stop.
- ["a-z"]* - "aa" Only excludes a combination that starts with "aa". Will not extend to the first arbitrary point at which "aa" is found.
- AND has higher priority than OR, which in turn has higher priority than AND NOT.
- Unorganized blocks are forced to have 1-byte character unit type as well as the nofragment attribute.
- You cannot specify already-declared names of unorganized blocks in these expressions. For example, you can’t declare Name first, then declare Name2 ::= Name " " Name " " Name.
- Organized blocks cannot be declared using regular expressions.
I would eventually like to relax many of these restrictions, especially the last two. Feedback on what sort of improvements you’d like to see in this arena is more than welcome.
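Several of the NOT and AND NOT behaviors listed above are easier to see in action. The following Python sketch imitates them as I have described them. It is an approximation under stated assumptions, not the BAR engine: all names are invented, and treating the parenthesized NOT as a zero-width check is a guess, since nothing above says whether it consumes characters.

```python
import re

def up_to(text, stop):
    """Imitate ^"abc": everything before the first `stop`, or all of `text`."""
    index = text.find(stop)
    return text if index == -1 else text[:index]

# ^("abc" | "123"): tested only at the current position, with no scanning ahead.
# Assumption: modeled as a zero-width negative lookahead.
NOT_ABC_OR_123_HERE = re.compile(r"(?!abc|123)")

# ["a-z"]* - "aa": a lowercase run that is rejected only if it starts with "aa".
LOWER_NOT_STARTING_AA = re.compile(r"(?!aa)[a-z]*")

# A hyphen placed first in a character class is a literal hyphen.
SIGN_OR_DIGITS = re.compile(r"[-0-9]+")

print(up_to("xyzabcdef", "abc"))                           # xyz
print(up_to("xyz", "abc"))                                 # xyz (no "abc", so everything)
print(NOT_ABC_OR_123_HERE.match("abcdef") is None)         # True: "abc" is right here
print(NOT_ABC_OR_123_HERE.match("xabc") is None)           # False: "abc" is one step later
print(LOWER_NOT_STARTING_AA.match("aardvark") is None)     # True: starts with "aa"
print(LOWER_NOT_STARTING_AA.match("baab").group(0))        # baab: inner "aa" is ignored
print(SIGN_OR_DIGITS.match("-42x").group(0))               # -42
```

Note how the "aa" in the middle of "baab" does not cut the match short, which is the behavior the AND NOT item describes.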
The file size and time-to-implementation for many of my formats in the works have dropped dramatically as a result of these changes! A few syntactic changes can go a long way. When each BAR I.F. can act as a unique integrated compiler, the possibilities are endless.