Help building a parser combinator?

A place to discuss the implementation and style of computer programs.

Moderators: phlip, Moderators General, Prelates

Robert'); DROP TABLE *;
Posts: 730
Joined: Mon Sep 08, 2008 6:46 pm UTC
Location: in ur fieldz

Help building a parser combinator?

Postby Robert'); DROP TABLE *; » Tue Aug 25, 2015 1:17 am UTC

I am currently building a program to read files from a legacy system originally built in 1999-era C, so each file involved is an opaque series of bytes that would be meaningless without the format specification. (Which I thankfully have.) The actual data/format itself is not any more structured than, "This struct is first and runs for this many bytes, next is a number which is x bytes long, then after that there are that many of this other type..." and does not allow for arbitrarily-deep nesting. There are also no potential branches or optional items in this particular format.

(The new program is in C# and VS15, but I'm more inquiring about the architecture rather than specific coding details.)

I'm aware that the simplest way to parse this data is to do as the original C did, and simply have a procedure that extracts each item in turn and populates structs with them, leaving all of the knowledge of the format implicit in the code. However, doing it that way sounds hard to maintain and prone to bugs like off-by-one errors when accessing the structs. I would like to be able to describe the parsing by specifying how to parse the fundamental types involved (like 2-byte ints, null-terminated strings, and "number followed by n of this other type") and then specifying the whole format by gluing the parsers together.

However, I'm not sure about expressing this in code - about what method(s) a generic "parser" object should have, and how they can return something that can be glued together easily. My first idea was to simply have a method that takes (by reference) a stream and returns an int/string/whatever, but that would implicitly advance the stream and be hard to combine. (What should the combination of an int-parser followed by a string-parser be? If it's a Tuple<int, string> then combining that with something else leads to a Tuple<Tuple<int, string>, ...>.) While I like my code to be pure where possible, I can't think of any more elegant solution.
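To make the nesting problem concrete, here's a rough sketch (Python rather than C# for brevity; all the names are made up):

```python
import struct

# Each parser is a function: (bytes, offset) -> (value, new_offset).
def parse_int(data, pos):
    # Hypothetical 2-byte little-endian int, like the legacy format's.
    return struct.unpack_from("<h", data, pos)[0], pos + 2

def parse_str(data, pos):
    # Null-terminated string.
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("ascii"), end + 1

def seq(p1, p2):
    # Naive sequencing: always pairs the two results into a tuple.
    def combined(data, pos):
        a, pos = p1(data, pos)
        b, pos = p2(data, pos)
        return (a, b), pos
    return combined

# Chaining a third parser nests the tuples: ((int, str), str)
p = seq(seq(parse_int, parse_str), parse_str)
value, _ = p(b"\x2a\x00abc\x00def\x00", 0)
# value == ((42, "abc"), "def") -- the awkward nesting in question
```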

Does anyone have any advice on this problem? Can this be done elegantly with a more functional approach, rather than an OO one?
...And that is how we know the Earth to be banana-shaped.

Flumble
Yes Man
Posts: 2248
Joined: Sun Aug 05, 2012 9:35 pm UTC

Re: Help building a parser combinator?

Postby Flumble » Tue Aug 25, 2015 10:05 am UTC

If you use an off-the-shelf parser generator (for your language of choice), you can clearly and concisely specify your parser rules and the generator will give you a decorated parse tree (using types, classes, dictionaries or whatever your chosen language favours) for every file.
A lot of parser generators also integrate nicely into a project, so I'd advise you to give one a try instead of building your own. Yes, it increases the amount of external code, but I think a tool like yacc is trustworthy enough to add without analysis, and it saves you the time of validating the correctness of your own parser combinator.


So much for my preaching, time to try to give an answer.
If you want to combine multiple production rules into one in an LL(1) language*, every alternative must have a distinct first token/character; otherwise, the combined parser doesn't know which subrule to choose. This also means you need a "matches" function for every rule/parser that returns whether a token will be accepted by (one of) its rule(s). Every rule/parser should also have a "run" function that does the parsing and returns a node containing references to its child nodes (the return values of the inner production rules), if any, plus any other relevant metadata. Every node should be an instance of a subclass of an AST Node, which should probably have a "getChildren" function.
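A bare-bones sketch of that matches/run shape, using a toy two-rule grammar of my own invention (Python here; nothing about it is specific to the OP's format):

```python
# Toy LL(1) setup: each parser exposes matches(tok) and run(stream).
# Node, Digit, Letter and Either are all invented for illustration.
class Node:
    def __init__(self, name, children=None, value=None):
        self.name, self.children, self.value = name, children or [], value
    def get_children(self):
        return self.children

class Digit:
    def matches(self, tok):
        return tok.isdigit()
    def run(self, stream):
        return Node("digit", value=stream.pop(0))

class Letter:
    def matches(self, tok):
        return tok.isalpha()
    def run(self, stream):
        return Node("letter", value=stream.pop(0))

class Either:
    # Combination of alternatives: legal in LL(1) only because the
    # sub-rules' first tokens are distinct.
    def __init__(self, *alts):
        self.alts = alts
    def matches(self, tok):
        return any(a.matches(tok) for a in self.alts)
    def run(self, stream):
        for a in self.alts:
            if a.matches(stream[0]):
                return Node("either", children=[a.run(stream)])
        raise SyntaxError("no alternative matches %r" % stream[0])

stream = list("a1")
rule = Either(Digit(), Letter())
first = rule.run(stream)   # consumes "a" -> wraps a letter node
second = rule.run(stream)  # consumes "1" -> wraps a digit node
```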

If you're looking for a more pure/functional approach, your stream object should be passed into every rule/parser and returned in a modified state again. In the classical procedural approach, you can just put the stream object in a higher scope and make sure you unget the token you fetched to try the "matches" functions.

*meaning you can read the string from front to back one token at a time and it's still deterministic, which I guess your files are.
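A minimal sketch of the pure version, where the state goes in and a new state comes back out (Python again; `take` and `then` are invented names):

```python
# Pure style: a parser maps an input state to (result, new_state) and
# never mutates anything. Here "state" is simply (data, offset).
def take(n):
    def parser(state):
        data, pos = state
        return data[pos:pos + n], (data, pos + n)
    return parser

def then(p1, p2):
    # Sequencing threads the state through explicitly.
    def parser(state):
        a, state = p1(state)
        b, state = p2(state)
        return (a, b), state
    return parser

state0 = (b"ABcd", 0)
result, state1 = then(take(2), take(2))(state0)
# result == (b"AB", b"cd"); state0 is untouched, so retrying a
# different rule is just "use the old state again" -- no unget needed.
```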

Sandor
Posts: 180
Joined: Sat Feb 13, 2010 8:25 am UTC

Re: Help building a parser combinator?

Postby Sandor » Wed Aug 26, 2015 1:04 pm UTC

Flumble wrote:If you want to combine multiple production rules into one in an LL(1) language*, every alternative must have a distinct first token/character.

My understanding is that the OP has a fixed binary data format, and not anything that can be parsed as a language. It sounds like s/he wants something like DFDL (although that particular example only works for Java or Scala, not C#).

Tub
Posts: 472
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Help building a parser combinator?

Postby Tub » Wed Aug 26, 2015 10:32 pm UTC

Robert'); DROP TABLE *; wrote:I'm aware that the simplest way to parse this data is to do as the original C did, and simply have a procedure that extracts each item in turn and populates structs with them, leaving all of the knowledge of the format implicit in the code.

There's nothing wrong with that approach. Using exceptions frees you from littering your code with error checks (for I/O errors mostly), so you can write parser code that's as terse, readable and maintainable as a grammar definition. Heck, looking at the DFDL link posted before me, I'd consider straight code way more readable.

Using a grammar definition does not mean that you're not going to write bugs or that you're not going to have off-by-one errors. If you swap two fields, it doesn't matter whether you did so in code or in a grammar definition: the parsing will break in a difficult-to-debug way.

Robert'); DROP TABLE *; wrote:My first idea was to simply have a method that takes (by reference) a stream and returns a int/string/whatever, but that would implicitly advance the stream and be hard to combine.

Isn't implicit advancing exactly what you want for easy combining? If you're not advancing past everything you've read, that means you will read and parse some bytes multiple times. Why would you want that?

Robert'); DROP TABLE *; wrote:(What should the combination of an int-parser followed by a string-parser be? If it's a Tuple<int, string> then combining that with something else leads to a Tuple<Tuple<int, string>...>) While I like my code to be pure where possible, I can't think of any more elegant solution.

The way I understand your post, the format is fixed, it's a legacy format that will not be changed or expanded, and the data has a well defined meaning. Why would you try to implement a flexible, generalized solution when all you need is a special case?

I'd make the meaning explicit by reintroducing the C structs, if only to have names for your values.

Code: Select all

class City {
  int zipcode;
  string cityname;
}


is so much more readable than

Code: Select all

Tuple<int, string>


so why don't you use it? If you already implemented methods for reading the basic types, the parser method for the struct would be something like

Code: Select all

City Stream::readCity() {
  City c;
  c.zipcode = this->readInt();
  c.cityname = this->readString();
  return c;
}


Write parsers for your basic types, nest them into parsers for the simple structs, nest those into parsers for your larger structs, nest those into a parser for a struct encompassing the whole file. Am I missing a requirement that wouldn't be solved with that approach?

Robert'); DROP TABLE *;
Posts: 730
Joined: Mon Sep 08, 2008 6:46 pm UTC
Location: in ur fieldz

Re: Help building a parser combinator?

Postby Robert'); DROP TABLE *; » Thu Aug 27, 2015 1:00 am UTC

Sandor wrote:
Flumble wrote:If you want to combine multiple production rules into one in an LL(1) language*, every alternative must have a distinct first token/character.

My understanding is that the OP has a fixed binary data format, and not anything that can be parsed as a language. It sounds like s/he wants something like DFDL (although that particular example only works for Java or Scala, not C#).

This is correct. The input isn't divided into distinct tokens by anything within the file.

Flumble wrote:If you use an off-the-shelf parser generator (for your language of choice), you can clearly and concisely specify your parser rules and the generator will give you a decorated parse tree (using types, classes, dictionaries or whatever your chosen language favours) for every file.

Thanks for the pointer. However, since the format I'm parsing includes length-prefixed lists, I'm not entirely sure how useful an off-the-shelf parser generator will be, as I don't know how to express length-prefixed structures concisely in a context-free/regular language.
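The "number followed by n of this other type" parsers I mentioned would presumably look something like this (Python sketch for brevity; `parse_u16` and `counted` are invented names):

```python
import struct

def parse_u16(data, pos):
    # Hypothetical 2-byte little-endian unsigned count.
    return struct.unpack_from("<H", data, pos)[0], pos + 2

def counted(count_parser, item_parser):
    # Length-prefixed list: the result of the first parser decides how
    # many times the second one runs -- easy in code, awkward to state
    # in a context-free grammar.
    def parser(data, pos):
        n, pos = count_parser(data, pos)
        items = []
        for _ in range(n):
            item, pos = item_parser(data, pos)
            items.append(item)
        return items, pos
    return parser

parse_list = counted(parse_u16, parse_u16)
items, end = parse_list(b"\x03\x00\x0a\x00\x0b\x00\x0c\x00", 0)
# items == [10, 11, 12], end == 8
```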

If you want to combine multiple production rules into one in an LL(1) language*, every alternative must have a distinct first token/character. Otherwise, the combined parser doesn't know which subrule to choose.

There are no choices or lookaheads in this format.

This also means you need a "matches" function for every rule/parser that returns whether a token will be accepted by (one of) its rule(s).

An issue is that some terms in the format (like the numbers) will accept any string of bytes of the right length, and whether or not those bytes are "really" a valid whatever depends on where they appear in the file. (e.g. the first 32 bytes are, by spec, a string, but nothing in the file actually indicates that they are not also a valid number.)
Tub wrote:Isn't implicit advancing exactly what you want for easy combining? If you're not advancing past everything you've read, that means that you will read and parse some bytes multiple times. Why would you want that?

Originally for "elegance", but having thought about it some more, I think that was a bit of a misplaced optimziation, and writing an impure function would work better.

so why don't you use it? If you already implemented methods for reading the basic types, the parser method for the struct would be something like

Code: Select all

City Stream::readCity() {
  City c;
  c.zipcode = this->readInt();
  c.cityname = this->readString();
  return c;
}

Write parsers for your basic types, nest them into parsers for the simple structs, nest those into parsers for your larger structs, nest those into a parser for a struct encompassing the whole file. Am I missing a requirement that wouldn't be solved with that approach?

That is essentially what I'm going for, but I would like not to have to write that method manually. Instead, I'd like to say something like:

Code: Select all

cityParser = doMagicCombining(new IntParser(), new StringParser())
City c = cityParser.parse(stream)
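where the magic combining could be something like this sketch: hand the combiner a constructor plus the field parsers, and have it feed the flat list of results to the constructor instead of nesting tuples (Python for brevity; my real code is C#, and all names here are made up):

```python
import struct
from collections import namedtuple

# Stand-in field parsers: each returns (value, new_offset).
def int_parser(data, pos):
    return struct.unpack_from("<h", data, pos)[0], pos + 2

def str_parser(data, pos):
    end = data.index(b"\x00", pos)
    return data[pos:end].decode("ascii"), end + 1

def combine(ctor, *field_parsers):
    # Run each field parser in order, then construct the object from
    # the flat list of results -- no Tuple<Tuple<...>> nesting.
    def parser(data, pos=0):
        values = []
        for p in field_parsers:
            value, pos = p(data, pos)
            values.append(value)
        return ctor(*values), pos
    return parser

City = namedtuple("City", "zipcode cityname")
city_parser = combine(City, int_parser, str_parser)
city, _ = city_parser(b"\x39\x30Springfield\x00")
# city == City(zipcode=12345, cityname="Springfield")
```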
...And that is how we know the Earth to be banana-shaped.

DaveInsurgent
Posts: 207
Joined: Thu May 19, 2011 4:28 pm UTC
Location: Waterloo, Ontario

Re: Help building a parser combinator?

Postby DaveInsurgent » Thu Aug 27, 2015 5:54 am UTC

Would something like Protocol Buffers help you either as something you can depend on and build from, or draw inspiration from at the very least?

https://developers.google.com/protocol- ... s/overview

Tub
Posts: 472
Joined: Wed Jul 27, 2011 3:13 pm UTC

Re: Help building a parser combinator?

Postby Tub » Thu Aug 27, 2015 5:38 pm UTC

Robert'); DROP TABLE *; wrote:That is essentially what I'm going for, but I would like not to have to write that method manually. Instead, I'd like to say something like:

Code: Select all

cityParser = doMagicCombining(new IntParser(), new StringParser())
City c = cityParser.parse(stream)

You still need a class definition for City somewhere, right? And with that approach, you still need to define a constructor for City, which has to have its parameters ordered in exactly the order they're found in your binary format, right? Don't discount that code when comparing simplicity.


I see two possible approaches:

a) you want your data types to be independent of the binary format. You want them not just for deserialization, but they're going to be used heavily in other parts of the program.

For example, the on-disk-format has an array of ints, but your objects should have a tree structure of ints for faster lookups. Or the objects should have additional properties not found in the on-disk-format. Or something is stored as a byte on disk, but you want it to be an int in your object, or maybe an enum.

In this case, do as I suggested above. Design your objects the way you need them, then make a function for each that grabs a few values from the stream and constructs an object from them. The overhead is one relatively simple function per class for deserialization, and that's about the most concise way you can express it. No grammar file can make it simpler, since a grammar does not know how to convert a bunch of deserialized values into your desired but slightly incompatible object.


b) you're ok with those data types being an exact match to the binary format. Your objects' attributes are going to have types like uint16_t instead of int, but that's fine with you.

In that case, just define your objects with the correct types and attribute order, and have the compiler do the rest. It should be possible with introspection in Java or C#, and with variadic templates in C++11.

In this case, the specification of your binary format is nothing but a bunch of nested classes, specifying type, name and order of each value. I don't think it can get simpler than that.
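To sketch option (b) -- Python introspection here standing in for C# reflection, and the annotation-as-format-code convention is invented for the example:

```python
import struct
from dataclasses import dataclass, fields

# Invented convention for this sketch: each field's annotation is a
# struct format code, so the class itself *is* the format spec.
@dataclass
class Header:
    version: "H"   # uint16
    count: "H"     # uint16
    scale: "d"     # double

def read_struct(cls, data, pos=0):
    # Walk the fields in declaration order, unpacking each in turn --
    # the moral equivalent of reflecting over a C# class's properties.
    values = []
    for f in fields(cls):
        fmt = "<" + f.type
        values.append(struct.unpack_from(fmt, data, pos)[0])
        pos += struct.calcsize(fmt)
    return cls(*values), pos

data = struct.pack("<HHd", 1, 3, 2.5)
header, end = read_struct(Header, data)
# header == Header(version=1, count=3, scale=2.5)
```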

Yakk
Poster with most posts but no title.
Posts: 11128
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Help building a parser combinator?

Postby Yakk » Thu Aug 27, 2015 6:58 pm UTC

This is a C++ approach that relies on overloading etc.

I'll assume you want to be able to read, and write. If you just need to read, skip the writing part.

There are two kinds of archive -- read and write.

Both have the same operator or method overridden (in C++, I do this with operator overloading, with different const qualifiers for the two operators).

In read, it sucks from the archive into some data field. In write, it takes from the data field and writes to the archive.

Structures of data contain an overload that runs the operator on each of the elements in-order.

Variable-sized arrays with a prefix size read/write the size, then read/write the elements in turn. (This often requires a different implementation.) Variable-sized arrays with a flag terminator do something slightly different, but similar.

Now composition consists of building a structure, then writing the archive method that both reads and writes.
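A sketch of that single-description idea in Python (names invented; in C++ the `io` calls would be the overloaded operator):

```python
import struct

class Archive:
    # One description method per struct drives both directions:
    # a reading archive fills fields in, a writing archive copies out.
    def __init__(self, reading, data=b""):
        self.reading = reading
        self.data = bytearray(data)
        self.pos = 0

    def io(self, obj, field, fmt):
        if self.reading:
            value = struct.unpack_from(fmt, self.data, self.pos)[0]
            setattr(obj, field, value)
            self.pos += struct.calcsize(fmt)
        else:
            self.data += struct.pack(fmt, getattr(obj, field))

class Point:
    def archive(self, ar):
        # The single source of truth for the on-disk layout; reading
        # and writing can't drift apart because this is both.
        ar.io(self, "x", "<h")
        ar.io(self, "y", "<h")

p = Point(); p.x, p.y = 3, 4
out = Archive(reading=False)
p.archive(out)                                   # serialize

q = Point()
q.archive(Archive(reading=True, data=out.data))  # deserialize
# q.x == 3 and q.y == 4
```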

---

For your original plan, that is sensible. Combining a (io, int&)->io with a (io, double&)->io produces a (io, int&, double&)->io. Combining that with a (io, string&)->io produces a (io, int&, double&, string&)->io. Sure the combination is a bit strange, but it works.

The hard part here is to glue said parsing primitives *to your run-time data*, not defining that composition operation.

Code: Select all

struct inner {
  double x, y;
  static parser<inner> parse() {
    return [](auto* self){
      return parser(&self->x) + parser(&self->y);
    };
  }
};
struct outer {
  int n;
  inner bob;
  static parser<outer> parse() {
    return [](auto* self){
      return parser(&self->n) + parser(&self->bob);
    };
  }
};

but while this returns a parser, the boilerplate ends up approaching the directly-say-how-to-parse option.
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

Robert'); DROP TABLE *;
Posts: 730
Joined: Mon Sep 08, 2008 6:46 pm UTC
Location: in ur fieldz

Re: Help building a parser combinator?

Postby Robert'); DROP TABLE *; » Sun Aug 30, 2015 12:13 am UTC

Tub wrote:b) you're ok with those data types being an exact match to the binary format. Your objects' attributes are going to have types like uint16_t instead of int, but that's fine with you.
In that case, just define your objects with the correct types and attribute order, and have the compiler do the rest. It should be possible with introspection in java or C#, and with variadic templates in C++11.

...I completely missed that I could use reflection and normal class composition for this. I think I'll use this approach for at least the first iteration, so I can easily get the parsed data into memory, and possibly use a more complicated strategy (e.g. what you mentioned in part a) once I have a better idea of what I need to do with the data after that.

Yakk wrote:Both have the same operator or method overriden (In C++, I do this with operator overloading with different const qualifiers for the two operators).

Perhaps I'm just too conventional, but this is a very weird concept for me. Why in particular does the same method have to do both? Is it just a matter of reuse?

For your original plan, that is sensible. Combining a (io, int&)->io with a (io, double&)->io produces a (io, int&, double&)->io. Combining that with a (io, string&)->io produces a (io, int&, double&, string&)->io. Sure the combination is a bit strange, but it works.

I don't quite understand the notation here. Is this a function call to save or load? (Or both?) Is the composition in question simply function composition (as it looks like) or is intended to be more complicated?

The hard part here is to glue said parsing primitives *to your run-time data*, not defining that composition operation.
*code*
but while this returns a parser, the boilerplate ends up approaching the directly-say-how-to-parse option.

I'd have to agree that that looks very similar to the direct method, although I find it much harder to parse.
...And that is how we know the Earth to be banana-shaped.

Yakk
Poster with most posts but no title.
Posts: 11128
Joined: Sat Jan 27, 2007 7:27 pm UTC
Location: E pur si muove

Re: Help building a parser combinator?

Postby Yakk » Sun Aug 30, 2015 12:46 am UTC

Your serialization and deserialization code should be the same code, so they stay in sync automatically.

(A,B)->C represents "a mapping from types (A,B) to type C".

In the above case, I'm trying to steal Haskell monad patterns without understanding them.

If you have a function that takes an IO state and a double, and returns an IO state (possibly modified), one way to chain them is stack-based (logical stack, not stack as in the implementation of automatic storage in C++).

So (IO, X)->IO takes a stack with an IO on top, and an X next. It pops the IO and the X, it processes, then pushes IO back on the stack.

Then (IO, X)->IO followed by (IO,Y)->IO composes to (IO, X, Y)->IO -- it pops IO, X, then Y off the stack, processes, then pushes IO back onto the stack.

How this is implemented may have little to do with stacks -- it just reflects the logic of how composition can be done.

An unstreamer is a (stream_state, T&)->stream_state function. Two unstreamers compose to (stream_state, T&, U&)->stream_state "logically". Now, the functions need not "return" the stream state; note, however, that `operator>>` *does* return (a reference to) the stream state, probably because it makes this kind of composition easy.
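A literal rendering of that composition, with one-slot lists standing in for C++ references (Python sketch; all names invented):

```python
import struct

# An "unstreamer" here is (state, ref) -> state, where state is a byte
# offset into a shared buffer and ref is a one-slot list playing T&.
DATA = struct.pack("<hd", 7, 1.5)

def read_int(pos, ref):
    ref[0] = struct.unpack_from("<h", DATA, pos)[0]
    return pos + 2

def read_double(pos, ref):
    ref[0] = struct.unpack_from("<d", DATA, pos)[0]
    return pos + 8

def compose(f, g):
    # (IO, X)->IO followed by (IO, Y)->IO gives (IO, X, Y)->IO:
    # the returned state of f is fed straight into g.
    def h(pos, xref, yref):
        return g(f(pos, xref), yref)
    return h

read_both = compose(read_int, read_double)
x, y = [None], [None]
end = read_both(0, x, y)
# x[0] == 7, y[0] == 1.5, end == 10
```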
One of the painful things about our time is that those who feel certainty are stupid, and those with any imagination and understanding are filled with doubt and indecision - BR

Last edited by JHVH on Fri Oct 23, 4004 BCE 6:17 pm, edited 6 times in total.

Robert'); DROP TABLE *;
Posts: 730
Joined: Mon Sep 08, 2008 6:46 pm UTC
Location: in ur fieldz

Re: Help building a parser combinator?

Postby Robert'); DROP TABLE *; » Tue Sep 01, 2015 1:01 am UTC

Yakk wrote:Your serialization and deserialization code should be the same code, so they stay in sync automatically.

How is this actually implemented, though? Getting from a stream and storing in a struct seems to be fundamentally different from the reverse.


(A,B)->C represents "a mapping from types (A,B) to type C".
In the above case, I'm trying to steal Haskell monad patterns without understanding them.

If you have a function that takes an IO state and a double, and returns an IO state (possibly modified), one way to chain them is stack-based (logical stack, not stack as in the implementation of automatic storage in C++).

So (IO, X)->IO takes a stack with an IO on top, and an X next. It pops the IO and the X, it processes, then pushes IO back on the stack.

Then (IO, X)->IO followed by (IO,Y)->IO composes to (IO, X, Y)->IO -- it pops IO, X, then Y off the stack, processes, then pushes IO back onto the stack.

How this is implemented may have little to do with stacks -- it just reflects the logic of how composition can be done.

Ah, I understand now, although I'm not entirely sure how to actually implement that in a general way.
...And that is how we know the Earth to be banana-shaped.

