tetsujin wrote: I think there's not much to be gained from going with something simpler than JSON, and probably something to be lost. Getting reliable delimitation of values is great, but that's pretty much all you get... And while it's great that two or three utilities already support CSV, it doesn't amount to much. Two or three utilities support null-delimited values, and they're not the same ones. There's probably another format or two that fit the pattern. Point is, that still leaves the majority unaddressed...
It's actually the reverse case. Most utilities are already nearly CSV compliant. It's just a matter of changing delimiters and/or properly escaping the output. CSV is also fairly widely available, and if not it's trivial to parse/generate.
Only if you ignore the deeper issues.
For instance, how do you embed a comma, or a newline, as part of one of your fields? The common rules are:
1: If your field contains anything funky, wrap it in double quotes.
2: If you want to put a double-quote character in a field, double it.
Now if you want to actually follow those rules, then parsing CSV is harder than just setting $IFS or whatever. You need to recognize things like the fact that a comma within quotes isn't a field separator, and a newline within quotes isn't a record separator. You can't just readline() and split(). It's still an easy format to parse, but I would argue:
1: it's not significantly easier to parse than JSON, not if you want to get it right(1). That is, you need to actually parse it, rather than just reading a line at a time and scanning for commas.
2: If you use a CSV library to parse it (and, thus, get it right) - then it's no easier than using, say, a JSON library to do the same thing.
3a: the fact that implementations of CSV so frequently ignore the quoting issue means that you'll wind up in scenarios where you'll think you can just pass a chunk of data to a particular program, when in fact that program's gonna mangle the stream. (This is much less of an issue with JSON, as the format was established with most or all of those decisions already made, and there's a clear authority on decisions regarding the JSON format - go to json.org and you will see the rules that define the JSON format. Go to Wikipedia and you will see the most common rules used to define CSV - along with regional variations (like semicolons as field separator for countries that use comma as a decimal point in numbers(2)), inconsistently-supported rules (like all the quoting rules - the stuff that makes payload encapsulation fully-featured and robust, but which is often omitted from CSV parser implementations because people think, "Oh, comma-separated values, I'll just call split()."), would-be "standards" like RFC 4180, and so on.)
3b: Alternately, if you say, "I'm just gonna use the simple subset of CSV that doesn't use quoting rules" - then you can't encapsulate quotes or newlines inside field values, and you'll choke if you process a stream written by a program that does use that syntax.
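To make point 1 concrete, here's a minimal sketch in Python (the sample record is invented for illustration):

    import csv, io, json

    # One CSV record whose first field contains a comma and a newline;
    # per the quoting rules above, the field is wrapped in double quotes.
    raw = '"Smith, John\nJr.",42\r\n'

    # Naive approach - readline() and split() - mangles the record:
    print(raw.split("\n")[0].split(","))        # ['"Smith', ' John']

    # A real CSV parser gets it right...
    print(list(csv.reader(io.StringIO(raw))))   # [['Smith, John\nJr.', '42']]

    # ...but at that point it's no more work than a JSON parser:
    print(json.loads('["Smith, John\\nJr.", 42]'))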
CSV can easily accomplish meta or binary data.
Well, yes, more or less: but there are a few things to consider:
1: the receiving program doesn't know it's binary data, let alone what it's supposed to represent.
2: By ASCII-encoding the payload you're inflating its size a lot. (Doubling it for a hex encoding, or increasing it by about a third in the case of base-64.)
3: If you use an encoding like base-64, then you lose a valuable property of a bytestream: the byte boundaries of the source data no longer fall on byte boundaries in the encoding. This means, for instance, that if you have five bytes of source data ready to send over the pipe, you can only send four of them: the first three bytes of the payload are encoded as four characters of base-64, then the fourth byte and the first four bits of the fifth byte are sent as the next two characters. You can't send the next character of base-64 (containing the remaining four bits of the fifth source byte) until you have the sixth source byte ready to go. If this doesn't sound like a big deal, consider the case where a program is streaming data it receives over the network, or over a pipeline from another program. In that case, you don't know when or even if that sixth byte is going to be available. But the next program in the pipeline may nevertheless be able to take action if it gets that fifth source byte: for instance, after seeing the fifth byte it may be prepared to close the pipe (causing a SIGPIPE on the sender next time it tries to send more data), or it may have enough information to send another piece of data to the next program in the pipe.
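Both points are easy to check with, say, Python's standard base64 module (a quick sketch, not part of the original argument):

    import base64

    data = bytes(range(5))                    # five source bytes

    # Size inflation: hex doubles the payload; base-64 adds about a third.
    print(len(data.hex()))                    # 10 hex characters for 5 bytes
    print(len(base64.b64encode(b"abcdef")))   # 8 base-64 characters for 6 bytes

    # Boundary problem: a block encoder can only finish 5 bytes by padding,
    # which terminates the stream (the '=' below). A streaming encoder that
    # may get more input has to hold back the seventh character, because the
    # low 4 bits of byte 5 share a character with the high 2 bits of byte 6.
    print(base64.b64encode(data))             # b'AAECAwQ='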
It is possible to define an encoding within the bounds of CSV that tells the receiver that a field's characters represent some binary encoding of data, or that the binary blob should actually be interpreted as some media format or other. But since CSV doesn't provide even the most basic mechanisms for doing so, providing that functionality means essentially inventing a new format, with CSV as nothing more than the base-level encoding for that stream. You would either need to get all the various utilities to recognize this common format you've implemented on top of CSV (meaning it's not really "CSV" you're establishing as the baseline format, but rather something on top of it), or else let them all remain ignorant of the relevance of that binary encoding - which is equivalent to not providing any better support for binary data than what already exists in the shell and its associated utilities.
And if you did invent that layer on top of CSV, containing your conventions for establishing different data types and so on, a fair bit of that work would essentially be reinventing what JSON already provides: things like nested structures, possibly with named fields (as a rudimentary way of attaching metadata).
Nested structures are also possible, however, it's a mess and nesting just isn't useful enough to bother. Especially since relational structures accomplish the same goal (and are easily expressed in CSV.)
I don't agree that nested structures aren't "useful enough" - and your criterion there, "useful enough to bother", is skewed by the fact that you're starting with a format that doesn't naturally lend itself to that kind of functionality.
For instance, what's "useful enough to bother" using nested structures in JSON? It's trivial. You just use 'em. Anything that processes JSON will understand that it's a nested structure, and any JSON parser will be able to tell you exactly what the correct (decoded) payload for each field is.
In CSV, it involves questions like: what character escaping mechanism are you using? What do you use as a secondary delimiter for the nested fields? How do you encapsulate the secondary delimiter within a value field? Such decisions are beyond the scope of CSV itself. Again, it's a matter of having to define a set of conventions on top of CSV to get that kind of functionality.
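For comparison, a minimal sketch (field names invented) of the JSON side, where none of those questions even come up:

    import json

    # Nesting in JSON is trivial; any JSON parser recovers the exact structure.
    record = {
        "name": "report.txt",
        "perms": {"owner": "rw", "group": "r", "acl": ["alice:rw", "bob:r"]},
    }
    decoded = json.loads(json.dumps(record))
    print(decoded["perms"]["acl"][0])   # 'alice:rw' - an unambiguous round trip

    # The CSV equivalent needs ad-hoc conventions - a secondary delimiter,
    # an escape for that delimiter, etc. - none of which CSV defines, e.g.:
    #   report.txt,"owner=rw;group=r;acl=alice:rw|bob:r"     (invented syntax)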
Expressing a nested structure by linking to other records has a few problems: for instance, you may not want to treat the nested data the same way as you treat the record containing it. If you're filtering records out of the stream, say, you'd want the nested data to go away if you filter out the record containing it. But if your "nested" data is really just another record referenced in the field of another record in the stream, then your filter program doesn't know that, and if the "nested" record passes the filter rules then it'll go to the next stage of stream processing.
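A quick sketch of that failure mode, with made-up records:

    # Truly nested records: dropping a record drops its children with it.
    nested = [
        {"id": 1, "keep": False, "children": [{"id": 10}, {"id": 11}]},
        {"id": 2, "keep": True,  "children": [{"id": 20}]},
    ]
    print([r for r in nested if r["keep"]])   # children of record 1 vanish too

    # Relational encoding: the "nested" data is just more top-level records.
    # A filter that sees one record at a time passes the orphans through.
    flat = [
        {"id": 1, "keep": False},
        {"id": 10, "parent": 1}, {"id": 11, "parent": 1},
        {"id": 2, "keep": True},
        {"id": 20, "parent": 2},
    ]
    print([r for r in flat if r.get("keep", True)])   # ids 10 and 11 survive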
Using nested structures in data streams isn't "useful enough to be worth the trouble" right now because it really is quite a bit of trouble. Using any kind of delimiter means either you can't have that delimiter as part of a field payload, or you need to define a syntax to work around that, and then parse the stream instead of just scanning for delimiters. To nest structures, you either need some kind of encapsulation syntax (which pretty much requires true parsing), or another delimiter for each level of nesting (which, again, either means another character you can't encode as part of a field payload, or an escape syntax that you'll need to parse.)
Working with a stream format you need to parse sucks right now because most tools don't include stream parsers. Most tools don't include stream parsers because there's no consensus on a "common" meta-format that these parsers would target. The only alternative would be to bundle into each program sufficient functionality to allow the user to define a parser that the program would use to process their input. But that's a complicated thing to implement, a complicated thing to use, and if different utilities had different implementations of that parser, with different variations on the syntax used to define the parser... That would turn into a major headache quickly.

Hence the whole idea here of establishing a "common meta-format" for the shell and its core utilities (which is, some would claim, an idea that flies in the face of everything Unix stands for) - it would eliminate the whole problem of telling a consumer program how a connected generator program has chosen to encapsulate its value fields. The underlying mechanisms of creating and parsing those streams are no less complicated, but the fact that the various programs would already support such a format (and the corresponding data model - the common set of ideas about data structures that come with the format) means that the user doesn't have to explicitly code that stuff into his scripts.
If we're choosing a meta-format for this job, then we have the opportunity to pick something that would solve so many problems that it could dramatically increase the power of the Unix shell at the same time. All kinds of problems that are presently "too difficult to bother with" could suddenly become much easier. Potentially so much easier that we won't even think of them as "problems" at all. I think that is the situation with nested structures: if we make them easy to use, we'll stop avoiding them, and take advantage of them in situations where it makes sense to do so.
I have to believe nested structures are useful because we use them all the time in other programming languages, in our filesystems, in our documents... It's an idea that clearly works. You have to figure, also, that one common scenario for nested structures will be that you're actually just encoding and passing over the stream a piece of data that was created somewhere else - data which was originally represented as some sort of nested structure. If the streaming format supports those data structure concepts, then the translation process is pretty straightforward, and if two different people were to guess how that structure would be translated, it's likely they'd arrive at the same answer, and be unsurprised to find it's the same answer the computer came up with.
CSV is extensible. If I wanted to I could add another field, it would just have to come last (since CSV uses location to determine references, whereas JSON and XML use names to find things.)
There are all kinds of scenarios where this just isn't adequate. We've seen plenty of them already in the classic UNIX tools, most of which do use some kind of line-based, "simple" delimited-field format.
A very basic example is, what happens to the format after a lot of these additions/removals are performed? You wind up with a bunch of unused fields kept around as vestigial place-holders, and a bunch of new fields tacked on to the end.
Or what if two different people, working on diverging implementations of the same program, both add a new field to that stream format? Naturally, they'll both add the new field on the end, and in both cases it'll be the Nth field. This is the sort of thing you might get from, say, different implementations of "ls -l" or "ps". Scripts working on the output of those utilities won't be portable because that Nth field has different meaning depending on which version of that utility is installed.
By contrast: if the fields are marked with some kind of tag, something non-ordinal with plenty of available space for meaningfully defining new tags, then there's at least a pretty good chance the two implementations won't clash. You could still get scenarios like both implementations defining a field called "extended-permissions" or something similarly generic, but there's at least a decent chance that they won't, and (if they do) it's relatively easy to correct the issue by getting the implementers to coordinate a bit and avoid reusing each other's tags (or respect a tag naming scheme that would keep their fields from clashing - like a domain name-based scheme, for instance) unless they're truly compatible.
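A sketch of the clash (all field values and tag names here are invented):

    import json

    # Two diverging implementations each append "their" extension as the 4th
    # positional field - a consumer has no way to tell which one it got:
    impl_a = "foo.txt,644,1024,acl-blob"
    impl_b = "foo.txt,644,1024,selinux-context"

    # With named tags - say, a domain-based naming scheme - both additions
    # can coexist, and an unknown tag is at least identifiable by name:
    rec_a = {"name": "foo.txt", "mode": "644", "org.example.acl": "..."}
    rec_b = {"name": "foo.txt", "mode": "644", "com.vendor.selinux": "..."}
    print(json.dumps({**rec_a, **rec_b}))   # both extension fields survive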
The questions can go deeper, if you are willing to go that far: for instance, the PNG format makes an effort to provide answers to questions like, "If I don't recognize this field, can I just ignore it?" (Personally I'm not sure if I will ever go that far - though it's a nice feature to have if you want to make scripts really reliable. You can then do things like say, "I don't recognize this field, but it's marked as ancillary so it's OK." instead of "I don't recognize this field called 'comment' - PANIC!" Though I think the question of whether a piece of data is "ancillary" may be too complicated to answer with a simple Boolean value.) Another useful feature might be to name fields in such a way that, even if you don't recognize the field, you can know what sort of data it's providing: for instance, a prefix like "perm" might mean "this field tells you who can read/write/execute this file" - while the field itself defines some permission scheme beyond the scope of classic Unix - like access control lists or whatever. Which, incidentally, would be another case where nested structures would be useful.
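One way such a convention might look, sketched over JSON-style named fields (the prefixes and field names are entirely made up):

    # A consumer that knows only a few fields, plus the naming convention:
    record = {
        "name": "foo.txt",
        "perm:posix": "rw-r--r--",
        "perm:acl": ["alice:rw", "bob:r"],    # unknown scheme, known category
        "ancillary:comment": "scratch file",  # safe to skip if unrecognized
    }

    KNOWN = {"name", "perm:posix"}
    for key in record:
        if key in KNOWN:
            continue
        prefix = key.split(":", 1)[0]
        if prefix == "ancillary":
            print("skipping ancillary field:", key)    # "it's OK"
        elif prefix == "perm":
            print("unknown permission scheme:", key)   # know the kind of data
        else:
            print("unrecognized field:", key)          # the PANIC case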
(1: "getting it right" is one of the main reasons I want to create a new shell in the first place. There's all these cases where creating a shell script to do a job is just as easy as it should be - as long as you don't hit one of those cases that breaks the simple version of the script. Getting it right for all cases is harder, especially these different utilities, even different versions of utilities, don't agree on even a basic set of rules about how streams are formatted. My aim is to make a shell that helps to correct this situation. Using type information means the shell can tell the user when some things aren't going to work... Reliably delineating the boundaries of value data means the user doesn't have to cope with quoting rules when writing a stream-processing script... And giving the serialization format some flexibility, in the form of named fields and nested structures, gives the users greater expressive power to do the things they want to do, without resorting to arcane measures.)
(2: The regional issues surrounding the use of the comma as a decimal point in some countries have given me cause to reflect on the syntax I'm designing for my shell: ideally, I would want people in those regions to be able to use comma in the way they're accustomed to using it. If comma is a decimal point where they live, then it should be a decimal point in their shell. But as it stands, in my design, I use comma as a high-precedence command/value separator, and I rely on it heavily. Semicolon performs a similar job but has lower precedence in the syntax. I could potentially implement a mode in which comma is not a command separator at all, and in which it serves as a decimal point in the numeric syntax... The main problem there is that if I provide regional "modes" for the syntax itself, then that impacts cross-region script compatibility. I have given some thought to problems of writing reliably portable scripts in general: things like a "portable script" mode, which would include flagging auto-conversions as portability errors - so the portable script mode could dictate that comma is a separator, or require that the comma mode be explicitly stated in the script... Things like that. Apart from that, there would just be issues of the syntax being slightly less convenient to use without the comma as a separator (having to put parens around things, and so on). In any case, though, I don't think the serialization format, if it's text-based, should have to support a similar mode switch. The serialization format shouldn't generally be something people work directly with, so it's not subject to UI considerations like L10N.)