Archive for July, 2009

Regular Expressions: A subset of BAR

Monday, July 27th, 2009

Great news! We’re now at 1.3b of the BAR engine. This means that you can define both text and binary syntaxes easily with BAR.

BAR was syntactically weak when it came to validating and sizing text strings in earlier versions. Take the text “procedure Level17 ;” for example. If this is a de-facto header, it doesn’t fit within a neat header structure in BAR. You’ll need to account for lots of variable-length portions of data, with optional whitespace, and character combinations not easily reconciled by the basic scripting functionality:

char procedure_start_string[] = "procedure Level";
block unorganized textual nofragment procedure_header_1 {
    unittype = char;

    bool Validation() {
        return (!memcmp(this, procedure_start_string, strlen(procedure_start_string)));
    };
    long BlockSize() {
        return strlen(procedure_start_string);
    };
};

And this is only one portion! The entire header is characterized by the following:

block organized procedure_header {
    mainbody nodelist {
        block procedure_header_1;
        block numerals;
        choice optional { block whitespace; };
        block semicolon;
    };
};

Note we haven’t even declared how whitespace, numerals, or semicolon are supposed to characterize our fields. The bottom line, folks: this is a yucky, yucky way to characterize text formats.

With BAR 1.3b, you can simplify everything. Replace all the above with just this one line:

block unorganized textual procedure_header ::= "procedure Level" ["0-9"]+ ["\x0-\x20"]* ";"

That’s it! Just one line for a node with a complex syntax.
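For readers more used to Perl-style patterns, here is a rough Python equivalent of that one-liner. It is only an approximation written with Python's `re` module, not the BAR engine's own matcher, and the helper name `header_size` is made up for this sketch. It plays the role of both Validation() and BlockSize() from the long-hand version:

```python
import re

# Approximate Python counterpart of:
#   "procedure Level" ["0-9"]+ ["\x0-\x20"]* ";"
PROCEDURE_HEADER = re.compile(
    r"procedure Level"   # literal, case-sensitive string
    r"[0-9]+"            # one or more numerals
    r"[\x00-\x20]*"      # zero or more control/whitespace characters
    r";"                 # closing semicolon
)

def header_size(text):
    """Return the number of characters the header occupies, or None if it is invalid."""
    match = PROCEDURE_HEADER.match(text)
    return match.end() if match else None

print(header_size("procedure Level17 ; begin"))  # 19
print(header_size("procedure Main;"))            # None
```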

Regular expressions, which are often defined using either Perl “slashed” syntax or Extended Backus-Naur Form (EBNF), are rather difficult to read if you’re not familiar with them. However, they are easy to understand once you get the hang of them, and syntactically, they are incredibly powerful.

In BAR’s case, I have chosen to use regular expression syntax that closely resembles the EBNF definitions found on W3C’s website for XML and other formats (http://www.w3.org). I’ve also been designing a still-unreleased I.F. called XML.BAR, which uses many of the same expressions from W3C as a way to characterize unorganized blocks in BAR.

BAR now supports most of the staples found in regular expressions:

  • Quoted strings: using “abc” or ‘abc’, indicates presence of whole, case-sensitive strings.
  • Character classes []: using brackets, indicates multiple character choices that can be present at any one particular character location.
  • Asterisk (*): place at the end of an expression to allow any number of repetitions, including none (0 or more instances).
  • Plus (+): place at the end of an expression to allow any number of repetitions, but require at least one (1 or more instances).
  • Question (?): place at the end of an expression to make the expression optional (0 or 1 instance only).
  • Specific Repeat Counts {3, 5}: place on end of expression to make expression have a repeat count within a specific range. In this example, minimum is 3 iterations, maximum is 5 iterations.
  • NOT operator (^): place inside character class, in front of quoted string, or in front of parenthetical notation to match every possibility BUT the combination to the right.
  • AND, OR, and AND NOT operators (space, |, -): adjacent expressions with just a space between them (AND), a pipe between them (OR), or a hyphen between them (AND NOT) act as boolean operators when testing multiple conditions in expressions.
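If you already know Perl-style patterns, the table below may help as a mental map. The BAR spellings on the left are my reading of the list above and have not been run through the engine. The Python patterns in the middle are approximations only, and I'm assuming that AND (a space) behaves as plain sequencing, as it does in W3C's EBNF:

```python
import re

# (BAR spelling as assumed from the list above, approximate Python pattern, sample that matches)
EQUIVALENTS = [
    ('"abc"',          r"abc",         "abc"),    # quoted string
    ('["a-z"]',        r"[a-z]",       "q"),      # character class
    ('["0-9"]*',       r"[0-9]*",      ""),       # asterisk: 0 or more
    ('["0-9"]+',       r"[0-9]+",      "2009"),   # plus: 1 or more
    ('"-"?',           r"-?",          "-"),      # question: 0 or 1
    ('["0-9"]{3, 5}',  r"[0-9]{3,5}",  "1234"),   # repeat count (Python allows no space here)
    ('[^"0-9"]',       r"[^0-9]",      "x"),      # NOT inside a character class
    ('"ab" "cd"',      r"abcd",        "abcd"),   # AND: one expression after the other
    ('"ab" | "cd"',    r"ab|cd",       "cd"),     # OR
]

def all_samples_match():
    """Check that every sample string is matched in full by its Python pattern."""
    return all(re.fullmatch(pattern, sample) for _, pattern, sample in EQUIVALENTS)

print(all_samples_match())
```

The AND NOT operator (-) has no direct one-character counterpart in Python, so it is left out of this table.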

There are still limitations:

  • Character classes allow a NOT operator inside brackets, but it must not be quoted.
  • Character classes have valid characters or ranges inside quotes (single or double). All markup is consistent with BAR’s backslash-oriented markup for string literals; there is no Perl-like shorthand for whitespace such as \s or related escapes.
  • To specify the hyphen character in a character class, it must be placed at the very beginning of the string. All other appearances count as range specifiers.
  • ^"abc" has the effect of returning all characters leading UP to the combination "abc", if it exists. If "abc" doesn’t exist, the entire set of remaining characters is returned.
  • [^"0-9"] has the effect of matching all characters EXCEPT numerals.
  • ^("abc" | "123") only looks at the IMMEDIATE location for a non-match to "abc" or "123". It will not scan for either combination and then stop.
  • ["a-z"]* - "aa" only excludes a combination that starts with "aa". It will not extend to the first arbitrary point at which "aa" is found.
  • AND has higher priority than OR, which in turn has higher priority than AND NOT.
  • Unorganized blocks are forced to have 1-byte character unit type as well as the nofragment attribute.
  • You cannot specify already-declared names of unorganized blocks in these expressions. For example, you can’t declare “Name” first, then declare Name2 ::= Name " " Name " " Name.
  • Organized blocks cannot be declared using regular expressions.
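The NOT and AND NOT behaviors in this list are the easiest to misread, so here is how I would approximate three of them with Python lookaheads. These patterns are my own stand-ins, not what the BAR engine does internally. In particular, the list doesn't say whether ^("abc" | "123") consumes any characters, so the sketch below consumes none:

```python
import re

# ^"abc": everything leading up to "abc", or all remaining input if "abc" never appears.
UP_TO_ABC = re.compile(r"(?:(?!abc).)*", re.DOTALL)

# ^("abc" | "123"): only tests the immediate location (assumed zero-width here).
NOT_ABC_OR_123_HERE = re.compile(r"(?!abc|123)")

# ["a-z"]* - "aa": a lowercase run, rejected only when it STARTS with "aa".
LOWER_NOT_AA = re.compile(r"(?!aa)[a-z]*")

print(UP_TO_ABC.match("xyzabc123").group())     # xyz
print(UP_TO_ABC.match("no match here").group()) # no match here
print(LOWER_NOT_AA.match("aab"))                # None
print(LOWER_NOT_AA.match("abaa").group())       # abaa
```

Note the last line: the later "aa" does not cut the match short, which mirrors the limitation described above.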

I would eventually like to relax many of these restrictions, especially the last two. Feedback on what sort of improvements you’d like to see in this arena is more than welcome.

The file size and time-to-implementation for many of my formats in the works have dropped dramatically as a result of these changes! A few syntactic changes can go a long way. When each BAR I.F. can act as a unique integrated compiler, the possibilities are endless.

BAR Engine Update: 1.3

Wednesday, July 15th, 2009

The demo version of the software is now running 1.3. While BARfly still has most of these features hidden from the user, the engine is a lot more powerful than it used to be. In particular:

1) BAR Navigational Strings (BNS) now officially supported
2) Bookmark stacks (Push and Pop for bookmarks) now officially supported
3) More advanced reading, writing, insertion, and deletion operations
4) Many new I.F. operators and built-in functions
5) Ability to insert and delete nodes from I.F. functions
6) Native callback functions from I.F. functions
7) More diverse auto-advancement options for reading, writing, insertion, and deletion operations

BAR developers can take advantage of many new programming features, like scripts that can insert and delete nodes, more “STDLIB.H-type functions” like atof, atol, stricmp, and strtok, and more optimized compiled opcode execution.

Why does BARfly not look different? For a simple reason: most of these updates would only give you a few more function-call choices when you press F8. The real power lies in the type of implementation files you can create now!

Right on the heels of this update will come two even more important updates:

1) DecideChoice: a new method for large decision lists, which picks a choice using a numerical index rather than evaluating each individual list choice.
2) Unorganized block “regular expression” definitions: the ability to build text blocks using EBNF notation (like W3C uses to describe XML). This will enhance BAR, allowing it to perform high-quality text-parsing operations with ease!
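To give a feel for the first item, here is the general idea behind an index-based pick, sketched in Python. DecideChoice isn't released yet, so nothing here reflects its real syntax. The record types and function names are invented purely to contrast "test every choice in turn" with "jump straight to the right one":

```python
# Hypothetical decision list for illustration only.
RECORD_TYPES = ["header", "payload", "checksum", "trailer"]

def pick_by_evaluation(tag):
    """Walk the list and test each choice until one validates."""
    for index, name in enumerate(RECORD_TYPES):
        if index == tag:   # stands in for a per-choice validation routine
            return name
    return None

def pick_by_index(tag):
    """Use the numerical index directly, without testing the other choices."""
    return RECORD_TYPES[tag] if 0 <= tag < len(RECORD_TYPES) else None

print(pick_by_evaluation(2), pick_by_index(2))
```

Both functions return the same answer; the difference is that the second does a constant amount of work no matter how long the list grows.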

Related to that last point, should BAR be changed to BATAR, or “Binary AND Text Artifact Reference”? Well, not really. Text is really a subset of binary, so it’s still fair to call it BAR.

BNS: BAR Navigational Strings

Wednesday, July 15th, 2009

Here’s something that was indirectly supported by BARfly, but only recently supported by the underlying BAR engine with version 1.3.

A BAR Navigational String, or BNS, is a “path” string used to locate BAR nodes in the tree control using absolute or relative positions. I’ve actually modeled it off the UNIX/DOS/Windows method of accessing files and directories through pathnames.

For example, if you want to access a file, called “thisfile.txt”, located inside the “thisdir” directory, which is itself nested inside another “outerdir” directory, the pathname would look like this:

“outerdir/thisdir/thisfile.txt”

Similarly, in BAR, you can access a node several levels “deep” in the tree by referring to children with names or numbers. Each token, separated by forward or backward slashes, represents one navigation. Take the following example:

“outernode/thisnode/62”

This BNS says, execute Child(“outernode”), followed by Child(“thisnode”), followed by Child(62).

Neat, isn’t it? You can compress many navigation operations into only one function argument. The best part is that almost nothing new needs to be learned. The philosophy behind node navigation is totally synonymous with file system navigation!

There are other parallels, too. A summary of how BNS compares with file system pathnames:

“/”
File system: Go to root directory
BNS: Execute Toplevel()

“a/b/c”
File system: Go to subdirectory “a,” then subdirectory “b,” then refer to “c” (can be either file or directory)
BNS: Execute Child(“a”), then Child(“b”), then Child(“c”)

“../otherdir”
File system: Go up to parent directory, then down again to subdirectory “otherdir”
BNS: Execute Parent(), then Child(“otherdir”)

“.”
File system: Refer to same directory
BNS: No navigation
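The four cases above are simple enough to mimic on a toy tree. The Python sketch below is not the BAR API: the `Node` class and `navigate` function are made up for illustration, and only the method names echo Toplevel(), Child(), and Parent():

```python
class Node:
    """A toy tree node; a stand-in for a BAR node, not the real thing."""
    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []
        if parent is not None:
            parent.children.append(self)

    def child(self, key):
        """Find a child by zero-based index (int) or by name (str)."""
        if isinstance(key, int):
            return self.children[key]
        for candidate in self.children:
            if candidate.name == key:
                return candidate
        raise KeyError(key)

    def toplevel(self):
        """Climb to the root of the tree."""
        node = self
        while node.parent is not None:
            node = node.parent
        return node

def navigate(node, bns):
    """Follow a basic BNS path: "/", names, numbers, ".." and "." only."""
    if bns.startswith(("/", "\\")):
        node = node.toplevel()
    for token in bns.replace("\\", "/").split("/"):
        if token in ("", "."):
            continue                      # no navigation
        if token == "..":
            node = node.parent
        elif token.isdigit():
            node = node.child(int(token))
        else:
            node = node.child(token)
    return node

root = Node("root")
a = Node("a", root)
b = Node("b", a)
c = Node("c", b)
otherdir = Node("otherdir", a)
print(navigate(root, "a/b/c").name)        # c
print(navigate(b, "../otherdir").name)     # otherdir
```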

Beyond this point, BNS and file system pathnames diverge. For example, BNS does not currently support wildcards, and BNS is capable of other operations that file systems cannot perform. Some additional BNS syntax:

“44”
Navigate to the child at zero-based index 44 (the 45th child of the current node)

“>>”
Navigate to next 2 siblings (call Next(1) twice)

“<<<”
Navigate to previous 3 siblings (call Previous(1) three times)

“+50”
Navigate 50 siblings forward (call Next(50))

“-30”
Navigate 30 siblings backward (call Previous(30))

“^13”
Search forward for construct UID of 13 (call Search_Forward(13))

“^container7”
Search forward for construct or variable name of “container7” (call Search_Forward(“container7”))

Eventually, BNS will support even more radical navigation possibilities, like wildcards, named bookmarks, and maybe even special subroutine-based navigations.

Ultimately, my goal is to merge BNS with file system logic. Think about what you could do if your ability to access information is not confined to just the file system or even the file contents: you could “collect” or “populate” information inside a file from the command line of an application! Too bad no operating systems let you dive this “deep” right now.

Pop quiz: what does this BNS do?

“/1078/mychild/./nextchild/../+8/otherchild/-1/32/>>>>>/^1”

Answer:

Toplevel(); Child(1078); Child(“mychild”); ; Child(“nextchild”); Parent(); Next(8); Child(“otherchild”); Previous(); Child(32); Next(5); Search_Forward(1);
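If you'd rather check that answer mechanically, a small translator will do it. This Python sketch is my own, not part of the BAR engine: the function `bns_to_calls` is invented for illustration and simply turns each token into the name of the call described in this post. It writes Previous(1) where the answer above writes Previous(), and it folds runs of “>” or “<” into a single Next(n) or Previous(n):

```python
import re

def bns_to_calls(bns):
    """Translate a BNS string into the list of navigation calls it stands for."""
    calls = []
    if bns.startswith(("/", "\\")):
        calls.append("Toplevel()")
    for token in re.split(r"[/\\]", bns):
        if token in ("", "."):
            continue                                   # "." means no navigation
        if token == "..":
            calls.append("Parent()")
        elif token.isdigit():
            calls.append(f"Child({token})")            # zero-based child index
        elif set(token) == {">"}:
            calls.append(f"Next({len(token)})")
        elif set(token) == {"<"}:
            calls.append(f"Previous({len(token)})")
        elif token[0] == "+" and token[1:].isdigit():
            calls.append(f"Next({int(token[1:])})")
        elif token[0] == "-" and token[1:].isdigit():
            calls.append(f"Previous({int(token[1:])})")
        elif token[0] == "^":
            target = token[1:]
            arg = target if target.isdigit() else f'"{target}"'
            calls.append(f"Search_Forward({arg})")
        else:
            calls.append(f'Child("{token}")')          # child by name
    return calls

quiz = "/1078/mychild/./nextchild/../+8/otherchild/-1/32/>>>>>/^1"
print("; ".join(bns_to_calls(quiz)))
```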

Characterization vs. Conversion

Thursday, July 9th, 2009

NOTE: This entry is rather technical in nature, geared towards programmers.

A fellow named Robby recently posed a software engineer’s dilemma when it comes to characterizing a file like a database. The question is, at what point is the data useful to me? It might be useful only when converted into the data I want. Or, alternatively, it might only be useful completely raw. Or, possibly, somewhere in the middle.

In other words, where, in the process of deserialization, do you “stop?”

The nice thing about BAR is that you can choose, in the schema, exactly how far you wish to go when you deserialize. You are still limited by the nature of the file format itself, of course: those formats that are constructed with little consideration given to hierarchy, organization, or resynchronization on error will limit a person’s options.

Robby’s example was a JPEG file. Good example, because I offer a free JPEG I.F. on the website.

At the rawest of raw, use FLAT. This yields the entire JPEG file as a single unorganized block. You can read or write anything with FLAT–but chances are you want just a bit more detail.

The next step up is the free format, which breaks the data into segments. The actual image scan itself, though, is untouched.

The next step up is characterization of the bit scan fields (Arithmetic or Huffman). However, no attempt at converting the fields takes place.

The next step up is converting the bit fields into data that can be used. But…I’m being too kind here! In fact, there are four or five individual “stopping points” you could rest at, since many decoding steps are necessary for JPEG. This includes…

1) Arithmetic/Huffman field translation
2) Dequantization
3) IDCT (Inverse Discrete Cosine Transform) translation
4) Component generation (generally YCbCr, often loosely called YUV)
5) Image pixel generation (generally RGB)
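To make a few of these stopping points concrete, here is the standard baseline-JPEG math for three of them in plain Python: scaling coefficients by the quantization table, the 8x8 IDCT, and the JFIF color conversion. This is textbook arithmetic, not code from the BAR JPEG I.F., and the sample block and its flat quantization table are invented. You could stop and inspect the data after any one of these functions:

```python
import math

def dequantize(coeffs, qtable):
    """Multiply each coefficient by its quantization table entry."""
    return [[coeffs[v][u] * qtable[v][u] for u in range(8)] for v in range(8)]

def idct_8x8(block):
    """Direct (slow) 2-D inverse DCT of one 8x8 block, with the +128 level shift."""
    def c(k):
        return 1 / math.sqrt(2) if k == 0 else 1.0
    out = [[0.0] * 8 for _ in range(8)]
    for y in range(8):
        for x in range(8):
            total = 0.0
            for v in range(8):
                for u in range(8):
                    total += (c(u) * c(v) * block[v][u]
                              * math.cos((2 * x + 1) * u * math.pi / 16)
                              * math.cos((2 * y + 1) * v * math.pi / 16))
            out[y][x] = total / 4 + 128
    return out

def ycbcr_to_rgb(y, cb, cr):
    """JFIF conversion from one YCbCr sample to an 8-bit RGB pixel."""
    r = y + 1.402 * (cr - 128)
    g = y - 0.344136 * (cb - 128) - 0.714136 * (cr - 128)
    b = y + 1.772 * (cb - 128)
    return tuple(max(0, min(255, round(value))) for value in (r, g, b))

# A block holding only a DC coefficient decodes to a flat patch.
coeffs = [[0] * 8 for _ in range(8)]
coeffs[0][0] = 2
qtable = [[16] * 8 for _ in range(8)]
samples = idct_8x8(dequantize(coeffs, qtable))
print(round(samples[0][0]))                            # 132
print(ycbcr_to_rgb(round(samples[0][0]), 128, 128))    # (132, 132, 132)
```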

Robby’s question was what was useful to him. But I’m asking a more radical question: what are all the possible ways this format can be useful to you?

BAR gives you an unprecedented luxury in being able to “see” the progress as it’s being done. If you need to develop an encoding or decoding implementation on your own, you generally have to rely on classic debugging and testing techniques: conditional breakpoints, single-stepping, debug-output dumps of iterative data. Not to mention cumbersome exception-handlers when you’ve inevitably screwed up.

If you screw up in BAR? No exceptions. Full call stack report. Full node record report. Immediate API return. All that, and it’s platform-independent, AND language-independent.

BAR can be used to characterize. BAR can be used to convert. I won’t presume to know exactly what each individual wants to do with his or her files. The power of BAR is really in the question, not the answer.

Why Use BAR?

Thursday, July 9th, 2009

One of the most common questions I’ve found people asking me is this: Why would I want to use BAR or BARfly? What advantage do I gain by using this product?

Hmmm…if the product features described in the Main BARfly Website don’t provide a good answer, it will be a hard question to answer.

It’s possible your needs are very specific. For this reason, I’ve provided in the documentation some ideas about who you might be and why you would want to use BARfly. A quick rehash:

  1. Software Developers: People wanting to write code to support and maintain particular file formats
  2. Software Architects: People wanting to design structural elements to a software application
  3. Software Testers: People wanting to examine the contents of generated files or memory content
  4. Security Auditors: People wanting to study a company’s ability to keep data secure from hackers and crackers
  5. Database Administrators: People wanting to detect flaws and inefficiencies in a database, as well as develop solutions
  6. System Troubleshooters: People wanting to audit, diagnose, and fix files (tasks that were expensive or impossible before BARfly)
  7. Network Administrators: People wanting to examine traffic over a network in a schema-oriented fashion
  8. System Administrators: People wanting to do a number of the things listed above
  9. Cryptographers and Cryptanalysts: People trying to design and crack encrypted formats
  10. Casual Validators: Analysts who want to check a file for consistency
  11. Data Entry Specialists: Individuals who must perform high-throughput data entry and format conversions
  12. Very Curious People: Individuals wanting to find out what all that weird unreadable stuff on their hard drive is

There are three builds of BARfly, which have capabilities reflecting the needs of the user:

  1. BARfly Bronze: Contains only viewing capability. You can view files, but you cannot edit them. Nor can you develop your own BAR implementation files.
  2. BARfly Silver: Contains viewing and editing capability. You can view files, edit them, and save them. You cannot develop your own BAR implementation files.
  3. BARfly Gold: Contains viewing and editing capability, plus the ability to develop BAR implementation files. This build comes with an integrated compiler, BARCC, that allows a user unlimited ability to create, edit, test, and use customized schemas.

Now, as for using the BAR engine in your own application, I’ll let you answer that question yourself. There are a HUGE number of technologies and languages being used. The fact that many “meta-languages” have been created has actually made the problem worse, because when people program in a meta, it actually reduces the readability and comprehensibility of the code or markup being written.

What’s to gain by learning an entirely new one? A lot, actually. For the most part, there’s very little to learn. If you know C++, you’ll know about 98% of BAR. BAR tries to keep the scripting and data-definition language low-level for handling low-level data. Where BAR is truly unique is in two simple definition types: blocks, and lists.

Think of BAR as the “XML Schema” for all files, text or binary. There are profound advantages to having applications reference platform-independent schemas for all their architecture, whether it’s in-memory only, secondary-storage only, or some combination of the two.

As of this writing, no attempt has been made to offer the BAR software development kit on this website. If enough people have looked at the documentation and are interested in trying it out, I’ll release it at a very reasonable cost, perhaps even for free.