My Unix CLI manifesto, aka why PowerShell is the bees knees



Postby EvanED » Fri Jul 13, 2012 2:43 am UTC

tetsujin wrote:Another solution would be to use something like FUSE to create filesystem-level access to xattrs. Requires more setup at the admin level, of course, but it could be one of the cleaner solutions overall.

Huh. That's actually a really good idea, and now I'm wondering if I should just do that instead - maybe there's no shell support really needed at all. Do you know offhand how hard it is to write a wrapper file system like that?

I guess the things to consider, though, are what kinds of things do you want programs to be able to do with data from xattrs? Simple cases like piping data from or to them are pretty straightforward. Providing a full (random-access) file interface to xattr data and passing in the xattr as though it were a filename is a bit more complicated. I'm not sure I'd bother going to that extent. :)

I'd really like to be able to do that though; it'd be nice to be able to open one in an editor or something.

I thought process redirection was pretty neat and I was kind of surprised to learn about /dev/fd*.

Yeah. I'm actually not quite sure how to use it programmatically (like how you make one), but it is pretty cool. Also, on some other systems it uses other mechanisms, like opening a file somehow in the shell's process and then passing a /proc/pid/fd/# path.

It seems like kind of a hackish way to get a file descriptor reference into a program that (probably) isn't written to accept numeric FDs for its file arguments... But at the same time it's a pretty elegant approach.

Actually I tend to go with the elegant view. In some sense it just unifies things: there don't have to be two separate namespaces, fd numbers and paths.

Postby tetsujin » Fri Jul 13, 2012 7:20 am UTC

EvanED wrote:
tetsujin wrote:Another solution would be to use something like FUSE to create filesystem-level access to xattrs. Requires more setup at the admin level, of course, but it could be one of the cleaner solutions overall.

Huh. That's actually a really good idea, and now I'm wondering if I should just do that instead - maybe there's no shell support really needed at all. Do you know offhand how hard it is to write a wrapper file system like that?


Afraid not. The last time I took a serious look at userspace filesystems on Linux was before FUSE existed; I forget what the old library was called - maybe just "userfs".

(EDIT): Took a quick look and it seems pretty simple on the implementation side. There's a Python binding, too. Basically you just create a program that implements the various operations - that is, you provide back-end implementations for open(), read(), and so on. The open() backend needn't do much, mostly just return an error code indicating whether the operation succeeds (though naturally your FUSE module could go further, actually record "this file was opened" and use that knowledge to speed up the implementation). The backend implementations for calls like read() and write() take the path to the file (rather than any kind of numeric identifier for it), so it seems pretty straightforward.

The documentation kind of sucks, but the IBM page I linked is pretty helpful, as are the example filesystems included in the FUSE git tree...

If you wanted to be really tricky, you could implement the xattr interface in FUSE, then overlay it onto the existing filesystem with unionfs... Though personally I'd hesitate to do that, for fear of possible performance penalties.
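
To make it concrete, here's a rough, untested sketch of what an xattrs-as-files FUSE module could look like. It assumes the fusepy bindings and Python 3.3+'s os.listxattr()/os.getxattr() on Linux; the layout (one directory per file, one read-only virtual file per attribute) is just for illustration, not a finished design.

Code: Select all

#!/usr/bin/env python3
# Read-only FUSE view of extended attributes: for each entry in SOURCE,
# the mountpoint shows a directory of the same name whose contents are
# that entry's xattrs, readable as ordinary files.
import errno, os, stat, sys
from fuse import FUSE, FuseOSError, Operations   # fusepy

class XattrFS(Operations):
    def __init__(self, source):
        self.source = os.path.abspath(source)

    def _split(self, path):
        # "/somefile/user.comment" -> ("<source>/somefile", "user.comment")
        parts = path.strip("/").split("/", 1)
        real = os.path.join(self.source, parts[0]) if parts[0] else self.source
        return real, (parts[1] if len(parts) > 1 else None)

    def getattr(self, path, fh=None):
        real, attr = self._split(path)
        if attr is None:
            if path != "/" and not os.path.lexists(real):
                raise FuseOSError(errno.ENOENT)
            return {"st_mode": stat.S_IFDIR | 0o555, "st_nlink": 2}
        try:
            size = len(os.getxattr(real, attr))
        except OSError:
            raise FuseOSError(errno.ENOENT)
        return {"st_mode": stat.S_IFREG | 0o444, "st_nlink": 1, "st_size": size}

    def readdir(self, path, fh):
        real, attr = self._split(path)
        if path == "/":
            return [".", ".."] + os.listdir(self.source)
        return [".", ".."] + list(os.listxattr(real))

    def read(self, path, size, offset, fh):
        real, attr = self._split(path)
        return os.getxattr(real, attr)[offset:offset + size]

if __name__ == "__main__":
    source, mountpoint = sys.argv[1], sys.argv[2]
    FUSE(XattrFS(source), mountpoint, foreground=True, ro=True)
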

I thought process redirection was pretty neat and I was kind of surprised to learn about /dev/fd/*.

Yeah. I'm actually not quite sure how to use it and such programattically (like how you make one), but it is pretty cool. Also, on some other systems it uses other mechanisms, like opening a file somehow in the shell's process and then passing a /proc/pid/fd/# path.


It's pretty simple. /dev/fd/ is just a link to /proc/self/fd/ (on my system, anyway). When a process opens that directory, it will see a list of the file descriptors that it has open.

Code: Select all

$ echo /dev/fd/*      # Whether echo is internal to the shell or external, the glob is expanded by the shell, so we see the shell's set of open fd's.
/dev/fd/0 /dev/fd/1 /dev/fd/2 /dev/fd/255 /dev/fd/3
# That's stdin, stdout, stderr, and the last one is the file descriptor the shell used to open the directory /dev/fd/.  (fd 255 in bash is the TTY.)
$ ls /dev/fd 7< /dev/null 8< /dev/null             # We're not globbing, so the list of fd's we get back is generated by the ls process...
0 1 2 3 7 8
#  Again, stdin, stdout, stderr, the fd used to read the directory, and the two files we provided with redirection syntax.


To get a filehandle to a process the shell is launching is pretty simple: it's just like any other form of redirection. Open the file, fork the new process, use dup2() to change the numeric fd to what you want, and then exec() the program. As long as the file isn't set to close on exec() or something, the program running in the new process will inherit the open file descriptor. And if the file is open in the new process, then a corresponding entry in /dev/fd/ will exist.
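
In rough Python terms, the whole trick looks something like this (just a sketch - a real shell does this in C with proper error handling; it assumes Linux's /dev/fd and Python 3.4+'s os.set_inheritable()):

Code: Select all

#!/usr/bin/env python3
# Minimal sketch of process substitution: open a pipe, fork a producer
# whose stdout feeds the pipe, and hand the consumer a /dev/fd/N path.
import os

def run_with_substitution(consumer_argv, producer_argv):
    r, w = os.pipe()
    os.set_inheritable(r, True)          # make sure it survives exec()
    os.set_inheritable(w, True)

    if os.fork() == 0:                   # child #1: the producer
        os.close(r)
        os.dup2(w, 1)                    # producer's stdout -> pipe
        os.execvp(producer_argv[0], producer_argv)

    os.close(w)
    if os.fork() == 0:                   # child #2: the consumer
        # The consumer just sees an ordinary-looking path argument.
        os.execvp(consumer_argv[0], consumer_argv + ["/dev/fd/%d" % r])

    os.close(r)
    os.wait()
    os.wait()

if __name__ == "__main__":
    # Roughly equivalent to:  wc -l <(ls /etc)
    run_with_substitution(["wc", "-l"], ["ls", "/etc"])
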

This raises another situation I think isn't necessarily healthy in the existing Unix shells: if you have a file open in the shell, any command you run will inherit that file descriptor by default. I tend to think that when you run a command from the shell, the new program shouldn't inherit any file descriptors you don't explicitly pass to it (apart from stdio, I mean). To address that, I want the shell to deal less in numeric fd's and more in symbolic ones.
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

Postby la11111 » Tue Jul 24, 2012 12:52 am UTC

So, I've been off learning python these past couple of weeks, but I'm back now with a bit of knowledge, and I've done a lot more research on this subject. I'm really glad that there's so much activity on this thread!

I'm in agreement with those who are concerned about the idea of trying to create an entirely new shell syntax that radically departs from current norms. As has been pointed out many times throughout this thread, writing a grammar for a programming language is very difficult. There's a reason why humans don't write parsers! ;P There are thousands of corner cases that you would never imagine at the outset. Not to mention the usability aspect - and in a shell, usability is the main goal. Redefining operators that have been used in specific ways for 40 years is probably not going to help this thing catch on.

I use PowerShell almost every day, so that's what I'm using as a reference (and obviously, bash on Linux). If you really look at what's going on with PowerShell, it's almost verbatim C# - or at least a scriptified version of it. My feeling is that, since PowerShell is a .NET program, what probably goes on behind the scenes is this: there's a PowerShell parser, which is a modified version of the C# parser, and the AST that results from parsed posh code is probably a C# tree (or gets translated into one), which is then byte-compiled and sent off to the .NET VM just as if it were a block of C#.

At least, if I were designing a Windows shell based on .NET, that's how I would imagine doing it...

Another good example of this is BeanShell for Java. I haven't looked deeply into it, but it seems to work on the same basic concept.

--

Really, my point is: there's no need to re-invent the wheel here, especially since we're using Python. Python lays its guts out on the table for us to use and abuse to an extent greater than any other interpreter system I'm aware of. It has built-in libraries for working with the Python tokenizer and ASTs, and there's a pure-Python lex/yacc implementation (PLY) to go with them. So, instead of hassling with creating an entirely new language from scratch - why not just use a subset / derivation of Python itself? Use the test to take the test?

Python is already super simple and has a massive user base, and - not to trivialize the scope of what I'm suggesting - I don't think it would be necessary to change it too heavily to transform it into a pretty nice, usable command shell (especially as a subset language).

So, that's what I've been looking into. I've explored a few different routes, but the one I've found to be the most promising is python4ply - essentially a Python interpreter, written in Python, using only PLY and the built-in ast libs. I've been working on wrapping my mind around it, and it seems like it would be pretty easy to distill it down into a Python-compatible shell syntax... at least, easy compared to creating an entirely new one from scratch...

Some changes that might be useful for starters:

- block delimiters, to make one-liners possible -- please don't crucify me
- constructs like $variable, $(eval and substitute the returned object), and `bash eval`, to make "$interpolation" possible
- "command -a arg" becomes command("-a arg") or similar
- possibly use eq, lt, gt, etc. instead of ==, <, >, to free up the > >> < << operators for redirection
- pipelines: Python already has a '|' operator (used for set union, among other things), and I've seen it used to implement pipeline functionality (e.g. python-grapevine); it might be possible to just wrap non-builtins with a Popen() and create a set consisting of each line returned, to implement pipes that work between built-ins and external programs (rough sketch below)
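
Rough sketch of that last idea - overloading '|' on a little wrapper object that chains external commands together with subprocess pipes. The class and names here are made up purely for illustration:

Code: Select all

#!/usr/bin/env python3
# Toy pipeline built on Python's '|' operator: each Cmd wraps an argv,
# __or__ concatenates the stages, and run() wires them up with pipes.
import subprocess

class Cmd:
    def __init__(self, *argv):
        self.stages = [list(argv)]

    def __or__(self, other):
        piped = Cmd.__new__(Cmd)
        piped.stages = self.stages + other.stages
        return piped

    def run(self):
        procs, prev = [], None
        for argv in self.stages:
            p = subprocess.Popen(argv, stdin=prev, stdout=subprocess.PIPE)
            if prev is not None:
                prev.close()             # parent's copy of the previous read end
            procs.append(p)
            prev = p.stdout
        out, _ = procs[-1].communicate()
        for p in procs[:-1]:
            p.wait()
        return out

if __name__ == "__main__":
    # Roughly:  ls /etc | sort -r | head -n 3
    print((Cmd("ls", "/etc") | Cmd("sort", "-r") | Cmd("head", "-n", "3")).run().decode(), end="")
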

Other than that, I'd imagine we could keep the rest of it as it is. Not to mention, full Python compatibility could be maintained rather easily, since most of the necessary changes would be entirely syntactic and could pretty much just be substituted in.

Ultimately, I hope we don't get too caught up in the shell syntax just yet, as I think it's a bit premature to start trying to nail that down. The most important part is definitely going to be the data interchange mechanism. Second to that, I think, will be the re-implementation of all of coreutils (which I think should also be done in pure python to the greatest extent possible, so that this could be cross-platform) ... until a usable base system has been implemented, it's going to be hard to determine how usable the shell syntax is anyway.

So, that's my RFC. ;) flame on!

Postby tetsujin » Thu Aug 02, 2012 6:24 am UTC

So, anyway:

First, about designing shell syntax in detail: I think it's actually very important to do that as soon as possible. The reason I say this is that there were all kinds of "great" ideas I had for the shell, but when I tried to figure out how to make them work in terms of actual syntax, I discovered that many of them weren't really workable. Getting things nailed down as soon as possible is an important part of keeping the rest of the design grounded in some form of reality.

Second, regarding the redefinition of time-honored syntax: to a certain extent I think it's unavoidable if you want to accomplish anything new. It's bound to trip people up, and it's true that anything like that will do a certain amount of damage to a new shell's potential for catching on. But I think in some cases it's worth it. The whole point of creating a new shell is to make it different from what came before - different in useful, exciting, even compelling ways. Doing that may mean breaking some established, understood rules about how the shell works. Decisions like that should be made carefully, scrutinized and reviewed, and rescinded if need be - but if a decision like that actually is worth the trouble it causes, then it should be embraced.

(As it pertains to my design... I am open to the idea that I may have to rescind certain design decisions that I feel attached to - like hijacking the angle bracket characters for various purposes (including as comparison operators) - as has been pointed out in this thread, people are very accustomed to those being redirection syntax, so making them anything else could be a very hazardous decision. I will give it more thought.)

As for Python as the basis of a shell language - personally I think it's not the right choice. Python carries a lot of syntax rules of its own that wouldn't jibe with the calling syntax of existing programs. In a shell, people are used to typing the command and its arguments with very little decoration - "cvs commit" and so on. In a Python-like language, undecorated names are looked up in the symbol table. You can do things in Python like change how symbol lookup works, so that it includes a path search as part of that lookup - but it's not context-sensitive, so it'll look up "commit" too. It's the same problem I discussed at length about my own design: treating undecorated names as symbols to be looked up on the search path was actually incredibly destructive to the way people are used to working in the shell. A lot of existing programs take arguments that are undecorated names, and if the shell hijacks those, it makes a big mess of things. Python, properly extended and with the right set of utilities behind it, could be a usable shell, but its usefulness would become rather more limited once you got to running programs that weren't designed with that shell environment in mind. The process of calling those programs from the shell can't be made awkward...

Another issue with using Python as the basis for a shell is more of a political issue. Basically, if the shell is based in Python, then Perl users or Ruby users might feel that it's not for them. It seems to me that if you even let on that the shell has ties to Python, then people's perceptions of it change. Python people may embrace it but everyone else will feel that it's a "Python thing" - which is an unfair judgment but nevertheless (IMO) likely. (Though, on the other hand, is it really an unfair judgment? Python will always be the environment that ties in to the Python-based shell the best, any other programming language is bound to be trailing behind.) An implementation in C or C++ is more likely to be regarded as "neutral" - people aren't as likely to think it's harboring favoritism toward any one programming environment.

Lucky me, I've got a brand-new laptop (a Thinkpad convertible tablet, with a damn nice keyboard!) to replace my old broken junky netbook, so hopefully now I'll be able to make some progress on my shell implementation again.

Postby la11111 » Sat Aug 04, 2012 2:20 am UTC

Yeah, I probably should have waited to comment on this thread until I'd done more research.

I'm going to reach back into this thread a bit, so bear with me.

As for Python as the basis of a shell language - personally I think it's not the right choice


I partially agree. But also, a lot of its constructs are already very simple and bash-like. for x in y, no curlies, etc. That said, my original approach was totally backwards.

about designing shell syntax in detail: I think it's actually very important to do that as soon as possible.


I guess I'm the opposite way - I want to get the big stuff nailed down as soon as possible ;) not very detail-oriented here.

I've been researching parsers, grammar writing, etc. over the past month, which is part of the reason I'm taking that viewpoint. To illustrate: I wrote a program to generate graphs from context-free grammars to help visualize their structure. Here's Python -> [ http://la11111.devio.us/python_grammar.svg ]. If you look, you can see a pretty clear delineation between the different levels of the syntax, with control structures at the top and the nitty-gritty in its own little world at the bottom. There's a lot of tweaking you can do down there without disrupting any of the higher-order structures. (edit: although upon further investigation - especially with LR parsers like yacc - it seems that a good approach would be to work from both directions simultaneously. So you do have a point there!)

I discovered that many of them weren't really workable


It's quite possible to do things like use '>' as both a redir and a boolean operator (and not a bad idea, IMO). As a boolean operator, '>' would be nested within a test, which would then be nested within a statement. The '>' redir operator is a statement delineator, equivalent to semicolon. So from the perspective of a parser, there's three levels of scope separating them.

how to reliably access local vs. global pipelines (i.e. pipelines that are defined at the scope of the whole job for the purpose of patching together a non-linear job vs. pipe names established internally by individual programs.)


RPC system


Interesting... I personally think it wouldn't be a bad idea to create a pipeline proxy to allow (transparent) tunneling of pipe-like channels (e.g. over TCP), connecting up runspaces in different scopes of execution, generators, etc. As has already been established in this thread, pipes aren't exactly the most complicated data structures in the world, and to the programs that use them they look just like any other file descriptor. Also, 64K is a pretty small buffer nowadays. So I'm thinking: an RPC system with anonymous-pipe backwards compatibility. I think this would be a good place to employ FUSE.

Most of the work I've done so far has been to define and implement a binary interchange format for the shell. Apart from my distaste for text-encoding everything there are other reasons I think XML and JSON aren't suitable for use as the interchange format...


Strongly disagree. Ever heard of DCOM?

And what about when people start passing arbitrary data structures on this object pipeline? There's no way you can come up with a generic pretty-printer that performs well on trees, forests, graphs, and whatever other wacky data structures people use, while the programs using those structures would know how to interpret and represent them.


* I do think it's possible to write a reasonably generic formatter


Agreed. If you're talking about pickles, especially, you'll really only have arrays and hashes - so a list, a table, and (at most) a graph (== tree). I agree with the idea of allowing each program to specify its own preferred output format in a header property of its output objects. And hey, out-tree would be a pretty cool command :)

Lucky me, I've got a brand-new laptop (a Thinkpad convertible tablet, with a damn nice keyboard!) to replace my old broken junky netbook, so hopefully now I'll be able to make some progress on my shell implementation again.


nice! when all our powers combine ....

Just throwing this out there - I think this thread deserves its own wiki. This format is getting cumbersome.

Postby tetsujin » Mon Aug 06, 2012 6:37 am UTC

la11111 wrote:It's quite possible to do things like use '>' as both a redir and a boolean operator (and not a bad idea, IMO). As a boolean operator, '>' would be nested within a test, which would then be nested within a statement. The '>' redir operator is a statement delineator, equivalent to semicolon. So from the perspective of a parser, there's three levels of scope separating them.


The problem there is that when you have different contexts to access the different operators, there will be things you can't do in one context (at least not easily) that you can do in the other. The more alike the two contexts appear to be, the bigger a problem those differences will pose.

Most of the work I've done so far has been to define and implement a binary interchange format for the shell. Apart from my distaste for text-encoding everything there are other reasons I think XML and JSON aren't suitable for use as the interchange format...


Strongly disagree. Ever heard of DCOM?


I don't care about DCOM. :)

Well, it's important to be clear about context here. Clearly XML and JSON can be used as interchange formats for data exchange between programs - that's largely the reason they exist in the first place. But specifically as an interchange format for the shell, there are some reasons I am heavily biased against anything text-based.

The big one, for me, is that the shell interchange format should be able to encapsulate anything without a large overhead. If the format is text-based, then embedding arbitrary binary data means encoding it with something like base64. That's probably my main issue with using XML or JSON as the primary interchange format for the shell: I want to be able to package just about anything, without necessarily having to filter it.

As a secondary interchange format, I think supporting XML or JSON is a great idea, because people are going to want to do that. There's already an abundance of tools for working in those formats, and people are less likely to feel they're tying themselves to the new shell if they can do their I/O in a format that wasn't invented for it. And if the program design follows a certain discipline, and the shell can be informed that a given program does its I/O in JSON or XML, then that's nearly as good as if the program were written to use the shell's "native" interchange format - the shell will just translate the data stream transparently in cases where it needs to.

nice! when all our powers combine ....

Just throwing this out there - I think this thread deserves its own wiki. This format is getting cumbersome.


I don't know about a wiki at this point. It seems like EvanED and I are going in two fairly different directions, and you may be going in a third. That, to me, says it's time to talk, not to collaboratively edit documents.

Could be we're not ever going to go in the same direction. Could be we're both just going to implement our own thing and see how it turns out. That's not a bad way to go, really.

Postby la11111 » Mon Aug 06, 2012 9:45 pm UTC

That's not a bad way to go, really.


agreed.

Out of curiosity, what sort of format are you proposing, and why? I haven't thought it through all that thoroughly, as I'm not at that point in the process just yet, but I'm having trouble imagining a situation where it would be necessary to handle anything more complex than structured text. Binary data already has an embedded structure, so I don't see the value of encapsulating it further rather than just sending it as a raw stream through a standard pipe, socket, etc.

I guess also, as a related question to the one above: what problems do you have with JSON or XML specifically? (XML I can see, as the overhead would be hundreds of percent, but I'd think that with JSON or equivalent the impact would be negligible.) Maybe I'm missing something, or maybe you're just imagining a completely different use case than I am. Either way, I'm interested to hear what your motivation is.

I don't care about DCOM. :)


Starting with Windows 8 / PowerShell 3, they're moving away from all that towards a standardized HTTP/*RPC-like interface for their CIM stuff, so neither do I :) That's gonna be awesome.
I am a mad scientist - don't take the above too seriously.

Postby tetsujin » Tue Aug 07, 2012 4:40 am UTC

la11111 wrote:Out of curiosity, what sort of format are you proposing and why? I haven't thought it through all that thoroughly, as I'm not there in the process just yet, but I'm having trouble imagining a situation where it would be necessary to handle anything more complex than structured text. Binary data already has an embedded structure, so I guess I don't see the value of encapsulating it further rather than just sending it as a raw stream through a standard pipe, socket, etc.


At a very basic level, there are two reasons you'd ever want to encapsulate a piece of data. First, you might want to send more than one of them, in which case the encapsulation delimits the individual pieces of data. Second, you might want to include some other data, structural context, or metadata, so the receiver knows more about what to do with the data when it's received.

A very simple case is, you're sending a chunk of binary data to another program and you want the receiver to know (and not have to guess) the type of that data. Encapsulation in this case provides just that metadata, the data stream type.

But there's no reason the shell and associated utilities shouldn't be capable of handling, for instance, a list of MPEG video streams or ISO filesystem images passed in through an input pipe. So, again, delimiting the values comes in handy there.

The structure I'm designing has a bunch of built-in types plus the ability to identify others by name - as well as various mechanisms for encoding binary data blocks to provide delimitation of values and termination of the stream. So, for instance, if you're careful about how you set the encoding options, you could package a program's output in this format just by prepending a header (which means that the shell can write this header to the stream, and then pass the input end of the pipe to the new child process and let it do its thing.) Or if you know the size of a binary field before you start writing it to the file, you can specify its size and not have to do any encoding on it. But the format still needs work, I think. Among other things, I think I'd better be sure that it has the same expressive flexibility as XML - and once it has that, I may want to re-think how I designed in other features.
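
Just to illustrate the general shape of the thing - this is not my actual format, and the field layout here is made up - a "type tag plus length-prefixed payload" framing might look like:

Code: Select all

#!/usr/bin/env python3
# Toy framing: each value is (tag length, tag, payload length, payload).
# Knowing the payload size up front means the binary data itself needs
# no escaping or base64-style encoding.
import io, struct

def write_value(stream, type_name, payload):
    tag = type_name.encode()
    stream.write(struct.pack("!B", len(tag)) + tag)   # 1-byte tag length, then tag
    stream.write(struct.pack("!Q", len(payload)))     # 8-byte payload length
    stream.write(payload)

def read_value(stream):
    head = stream.read(1)
    if not head:
        return None                                   # end of stream
    tag = stream.read(struct.unpack("!B", head)[0]).decode()
    size = struct.unpack("!Q", stream.read(8))[0]
    return tag, stream.read(size)

if __name__ == "__main__":
    buf = io.BytesIO()
    write_value(buf, "text", b"hello")
    write_value(buf, "mpeg-video", b"\x00\x01 ...raw bytes go here untouched... ")
    buf.seek(0)
    value = read_value(buf)
    while value is not None:
        print(value[0], len(value[1]), "bytes")
        value = read_value(buf)
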

Postby EvanED » Fri Aug 10, 2012 6:04 am UTC

OK, I have some stuff to report!

I have a utility to print out directory listings in a JSON format, a version of stat that outputs JSON, some additional utilities to filter JSON entries in a couple different ways, and a utility to display JSON entries in a tabular format.

Here are some quick demos:
Spoiler:

Code: Select all

$ ./list-directory | ./display-table
in-directory  inode    kind               name         
------------  -------  -----------------  --------------
.             1310954  [u'directory']     support       
.             1317706  [u'regular file']  stat~         
.             1310948  [u'directory']     .             
.             1310949  [u'directory']     coreutils     
.             1317615  [u'regular file']  display-table
.             1310942  [u'directory']     ..           
.             1317710  [u'regular file']  modify       
.             1317708  [u'regular file']  modify~       
.             1317628  [u'regular file']  sort-objects 
.             1317707  [u'regular file']  stat         
.             1317716  [u'regular file']  select~       
.             1317610  [u'regular file']  list-directory
.             1317717  [u'regular file']  select


The rendering of the kind field is bad currently, but I've been too lazy to improve it. I'll get there one day. (This listing is done without stat-ing each file, BTW, which is why list-directory and stat are separate utilities, at least for now.)

Code: Select all

$ ./list-directory | ./modify --select-key=name,in-directory | ./display-table
in-directory  name         
------------  --------------
.             support       
.             stat~         
.             .             
.             coreutils     
.             display-table
.             ..           
.             modify       
.             modify~       
.             sort-objects 
.             stat         
.             select~       
.             list-directory
.             select


Code: Select all

$ ./list-directory | ./modify --select-key=name,in-directory | ./sort-objects --key=name | ./display-table
in-directory  name         
------------  --------------
.             .             
.             ..           
.             coreutils     
.             display-table
.             list-directory
.             modify       
.             modify~       
.             select       
.             select~       
.             sort-objects 
.             stat         
.             stat~         
.             support


Code: Select all

$ ./list-directory | ./sort-objects --key=name | ./stat  | ./modify --remove-key=inode,device-type,number-hard-links,ctime,owner-uid,owner-gid,access-time,modification-time,device,filesystem-blocksize,number-blocks | ./display-table
in-directory  kind               name            permissions  size
------------  -----------------  --------------  -----------  ----
.             [u'directory']     .               509          4096
.             [u'directory']     ..              509          4096
.             [u'directory']     coreutils       509          4096
.             [u'regular file']  display-table   509          159
.             [u'regular file']  list-directory  509          148
.             [u'regular file']  modify          509          152
.             [u'regular file']  modify~         436          150
.             [u'regular file']  select          509          152
.             [u'regular file']  select~         436          152
.             [u'regular file']  sort-objects    509          150
.             [u'regular file']  stat            509          150
.             [u'regular file']  stat~           436          150
.             [u'directory']     support         509          4096


OK, this last command is a bit long, but I have some plans (or semi-plans) for things that could make it better. Also not ideal is the rendering of the permission bits (a plain number instead of rwxrwxrwx stuff) and of the size (instead of using KB, MB, GB, etc.).


So I think that's pretty cool.

I've got a couple things I'm not sure what I want to do about yet, and may ask for some feedback at some point. This is all up on my Github (you'll also need this), but at the moment the documentation is, um, in the spoiler above. :-) So you'll have to figure out WTF you have to do to get things to work.

(OK, here's a little information for Linux people: run python setup.py install (with an optional --prefix) in the pyreaddir directory. Then go into futureix/coreutils/src and edit the shell wrappers there so they set PYTHONPATH to whatever it should be set to for you. Some things that are advertised to work, like recursive file listing, may or may not actually work.)

Postby tetsujin » Fri Aug 10, 2012 3:48 pm UTC

Seems like the display of the permissions field is kind of messed up, too (octal permissions being displayed in decimal, so "509" instead of "0775"). Though it's early yet. I'll definitely check it out when I have some time.

Postby EvanED » Fri Aug 10, 2012 3:56 pm UTC

tetsujin wrote:Seems like the display of the permissions field is kind of messed up, too (octal permissions being displayed in decimal, so "509" instead of "0775")

Yeah, that's part of it.

Problem is, I don't really know how to do it right yet; that's one of the things I'm trying to work out. Right now there's no mechanism for saying "display this field in octal", so I'd have to either print all integers in octal or put some code in the display-table utility to say "oh, I'm printing out a field called 'permissions', print it in octal". I don't really want to do the latter, because I would rather provide a general mechanism, but I haven't really figured out what I want that to look like. (It'd also be nice to have it print, say, "foo/" for directories by synthesizing information from the name and directory columns. And to omit/include columns by default, so you don't have to list a bajillion columns to drop just to get everything to fit in one screen width - beyond which the display-table script doesn't do anything useful at all. :-))

(Hmm, thinking about it, I guess I could output the permissions value from the stat program as a string, and then the special-case code would at least be in a place that sorta makes sense, instead of in the renderer.)

Postby tetsujin » Fri Aug 10, 2012 5:28 pm UTC

What if you bundled style data with the table? If you could do that (and be sure that the various tools in the chain will pass it along) then the main challenge would be, I guess, defining a style language you can use that will let you say things like "present these values in octal"...

I'm not sure how I'd handle the problem, personally. The octal representation is nice and compact (though these days I think we have to consider other security models besides the classic one) and close to the implementation, and easy to process... So personally I wouldn't want to encode that field as a string field (as opposed to a numeric field - though the impact there is less since JSON is text-oriented anyway) just to get the desired formatting. I think maybe what I'd do is define a "unix permissions" data type for the shell, and then when writing the table, indicate that although the field is encoded numerically, it represents a "unix permissions" object. That data type can then control its own presentation when the table is displayed.

...Though that approach has its problems. I mean, if the permissions field is wrapped in a data type, does that mean the user has to invoke that constructor in order to make a permissions field value that they can assign to a file's permissions? Does the table need to print out that constructor syntax for each value in the table? Or can the "numeric unix permissions" data type be transparent? (In Python you can make rather transparent data types, of course - for instance by instantiating an object and then replacing its __repr__ method - I don't have an entirely clear idea of how something like that would play out in my shell design, though.) The whole "data type", in those terms, would be nothing more than a way of bundling that style data with a value.
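
As a toy example of what I mean by "transparent" (this is just Python, not my shell, and the class name is made up):

Code: Select all

# An int subclass that behaves like a plain number everywhere but
# presents itself in octal. The "data type" here is really just a way
# of carrying presentation style along with the value.
class UnixPermissions(int):
    def __repr__(self):
        return "0o%o" % int(self)
    __str__ = __repr__

p = UnixPermissions(0o775)
print(p)           # 0o775 - presentation controlled by the type
print(p & 0o222)   # 144   - arithmetic falls through to plain int
print(p == 509)    # True  - underneath, it's still just the number
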

Another approach might be to simply write the number to the stream in octal format, and mandate that the original formatting of a value should be preserved when the value passes through a filter. That would work well with a textual format like JSON - since it is text it includes the textual representation of the number. You could, in effect, take the parsed value of the number as the data for the field, and the specific way it's formatted as a piece of metadata which you seek to preserve and use later. Except that, technically, integers in JSON are only allowed to be decimal, which may make the approach unworkable if you're hoping to remain compatible with all the JSON libraries and tools out there.

Postby EvanED » Fri Aug 10, 2012 7:43 pm UTC

(Incidentally, dates are another thing, like the permissions, where what gets displayed to the user and what's in the internal representation are probably best as different things. "Aug 10, 2012, 12:30pm" is fine for a user, but then you need to parse it and worry about precision and such if you want to do something with the actual value.)

tetsujin wrote:What if you bundled style data with the table? If you could do that (and be sure that the various tools in the chain will pass it along) then the main challenge would be, I guess, defining a style language you can use that will let you say things like "present these values in octal"...

That's sorta the idea I have in mind. I don't show any of the raw output above, but if you were to look at it, objects from list-directory have embedded type information, currently

Code: Select all

{
    "type name":               "file info",
    "default display columns": ["name", "path", "kind"]
}

but this is ignored by the rest of the toolchain, because I decided I didn't really know how I wanted to do it yet. (I'm not sure whether I want to attach the type description to every object, attach it to just one object and have it "persist" until it's replaced, or have an entirely separate metadata object in the stream. And I'm not sure what I want to support for rendering and such.)
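
If and when I do start honoring it, the renderer side could be something like this (a sketch only - this is not what display-table currently does, and the helper here is invented):

Code: Select all

#!/usr/bin/env python3
# Sketch: read one JSON object per line from stdin and render a table,
# using "-meta type" -> "default display columns" when it's present.
import json, sys

def render(rows):
    if not rows:
        return
    meta = rows[0].get("-meta type", {})
    cols = meta.get("default display columns")
    if not cols:
        cols = sorted(k for k in rows[0] if not k.startswith("-meta"))
    widths = [max([len(c)] + [len(str(r.get(c, ""))) for r in rows]) for c in cols]
    print("  ".join(c.ljust(w) for c, w in zip(cols, widths)))
    print("  ".join("-" * w for w in widths))
    for r in rows:
        print("  ".join(str(r.get(c, "")).ljust(w) for c, w in zip(cols, widths)))

if __name__ == "__main__":
    render([json.loads(line) for line in sys.stdin if line.strip()])
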

Another approach might be to simply write the number to the stream in octal format, and mandate that the original formatting of a value should be preserved when the value passes through a filter. That would work well with a textual format like JSON - since it is text it includes the textual representation of the number.

Well, not really; if it's actually a number ({"perm" : 0777}) then it's a number and will be parsed as such, so you can't keep formatting, as you mostly say later. You'd have to do what I tacked on at the end of an earlier post and output it as a string ({"perm": "0777"}) which, of course, is totally doable, but isn't always what you want. (Again, dates are an example where this isn't very good.)

Postby tetsujin » Fri Aug 10, 2012 11:26 pm UTC

EvanED wrote:
Another approach might be to simply write the number to the stream in octal format, and mandate that the original formatting of a value should be preserved when the value passes through a filter. That would work well with a textual format like JSON - since it is text it includes the textual representation of the number.

Well, not really; if it's actually a number ({"perm" : 0777}) then it's a number and will be parsed as such, so you can't keep formatting, as you mostly say later. You'd have to do what I tacked on at the end of an earlier post and output it as a string ({"perm": "0777"}) which, of course, is totally doable, but isn't always what you want. (Again, dates are an example where this isn't very good.)


There's no reason you can't keep formatting, you'd just have to make sure your tools actually do keep formatting. It just takes discipline, and clear statement of the fact that this is an expectation for programs that will do their thing in this environment. But it'd be a lot of work to make sure programs retained specifics of representation, and there's only so many problems in the world that can be solved with that kind of approach... So it's probably better to focus on a more general solution to the problem.

To me the way to solve the date thing is to make that, too, a data type of its own. (Meaning, at the serialization level, there's something that explicitly says "this is a datestamp", and in the shell there's a clear concept of what that means and how to work with it.) To me it's not as complicated a question as the permissions field thing, in part 'cause, unlike an octal number, it's hard to do anything too useful with a date field (such as comparison, subtraction/addition, time zone adjustment, etc.) without it being wrapped in a class.

Tables in general have been a bit of a puzzle for me in my design. In a problem like this, the output of "list-files" is a list of data structures, with each structure containing some information about a file. For display purposes it's worthwhile to present this in a table format, as it's much easier to read. But apart from (perhaps) how they are displayed, there's no real difference between an "ordinary" list of data structures of the same type, and a similar one that's to be displayed as a table. So I guess one way to approach the issue would be to say that table-style just kicks in when displaying a sequence of structures with the same fields...
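
In other words, something like this little heuristic (a toy sketch in Python, not how my shell actually decides anything):

Code: Select all

# "Table style kicks in automatically": if every record in a sequence
# is a structure with the same fields, hand it to a table renderer;
# otherwise fall back to per-value display.
def looks_tabular(records):
    return (len(records) > 1
            and all(isinstance(r, dict) for r in records)
            and len({frozenset(r) for r in records}) == 1)

print(looks_tabular([{"name": "x", "size": 3}, {"name": "y", "size": 70}]))  # True
print(looks_tabular([{"name": "x"}, "hello", 42]))                           # False
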

Though I expect not every program's output will necessarily look nice with the value renderer's defaults (or conversely, I don't think it's necessarily practical to come up with a set of value renderer defaults that will work well for all programs) so that kind of brings me back to the idea of bundling style info into the stream.

Yet another option would be to explicitly select the renderer at the tail end of the pipeline. I'm not as fond of this idea because when the tail end of the data stream is headed to the user's console, there should be a good default, and I'd like to make that default as good as possible. But passing the style info as metadata, again, leads to the questions of when filter programs should be expected to pass on that metadata and when it's OK (or desirable) for them to skip it.

Postby EvanED » Sat Aug 11, 2012 12:11 am UTC

tetsujin wrote:There's no reason you can't keep formatting, you'd just have to make sure your tools actually do keep formatting. It just takes discipline, and clear statement of the fact that this is an expectation for programs that will do their thing in this environment. But it'd be a lot of work to make sure programs retained specifics of representation, and there's only so many problems in the world that can be solved with that kind of approach... So it's probably better to focus on a more general solution to the problem.

It also means that the interchange format stops being JSON and becomes this-thing-that-looks-a-lot-like-JSON-but-has-subtly-different-semantics, and it would cease to work with existing JSON tools.

To me the way to solve the date thing is to make that, too, a data type of its own. (Meaning, at the serialization level, there's something that explicitly says "this is a datestamp", and in the shell there's a clear concept of what that means and how to work with it.) To me it's not as complicated a question as the permissions field thing, in part 'cause, unlike an octal number, it's hard to do anything too useful with a date field (such as comparison, subtraction/addition, time zone adjustment, etc.) without it being wrapped in a class.

The date/time type is something I'm planning on. But I disagree with the second part of what you say there. For instance, it's not unreasonable to store a time as just a timestamp (offset from 1970 or 1900 or whatever) -- and then as a number it's perfectly easy to do comparison, subtraction, addition, and -- with additional metadata about what timezone it is currently -- timezone adjustment. :-) (I'll have to check out what the range and precision of that is.)
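
E.g., in Python terms (just to illustrate what I mean - the value on the wire stays a plain number, and the timestamp here is arbitrary):

Code: Select all

from datetime import datetime, timedelta, timezone

ts = 1344643200                  # plain seconds-since-1970 on the wire
later = ts + 3600                # addition/subtraction is just arithmetic
print(later - ts, later > ts)    # 3600 True

# Timezone adjustment only needs the extra "which zone?" metadata:
utc = datetime.fromtimestamp(ts, tz=timezone.utc)
local = utc.astimezone(timezone(timedelta(hours=-5)))    # e.g. a UTC-5 offset
print(utc.isoformat(), local.isoformat())
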

Yet another option would be to explicitly select the renderer at the tail end of the pipeline. I'm not as fond of this idea because when the tail end of the data stream is headed to the user's console, there should be a good default, and I'd like to make that default as good as possible.

This is what I'm doing now, but I plan on it being only a temporary measure for the most part. I don't really like a lot of the "are you working on a TTY or pipe?" autodetection that goes on with current tools so I'm not a fan of the idea of doing the same thing to detect the end of the pipeline, but it's probably for the best (at least until I write my own shell itself, probably around 2030 :-)).

As for picking between table or some other format, my current plan is to just sort of try it out for a bit and see what's needed.

Postby tetsujin » Sat Aug 11, 2012 9:14 am UTC

EvanED wrote:The date/time type is something I'm planning on. But I disagree with the second part of what you say there. For instance, it's not unreasonable to store a time as just a timestamp (offset from 1970 or 1900 or whatever) -- and then as a number it's perfectly easy to do comparison, subtraction, addition, and -- with additional metadata about what timezone it is currently -- timezone adjustment. :-) (I'll have to check out what the range and precision of that is.)


Everything except the ability to look at it and know what year/month/day/etc. it represents. :) Representing a date with a non-date-specific type means it'll be either legible or computable - to get both you need to wrap it in a class.

I get your point about not going to octal in JSON. I'm really just exploring possibilities.

This is what I'm doing now, but I plan on it being only a temporary measure for the most part. I don't really like a lot of the "are you working on a TTY or pipe?" autodetection that goes on with current tools so I'm not a fan of the idea of doing the same thing to detect the end of the pipeline, but it's probably for the best (at least until I write my own shell itself, probably around 2030 :-)).


I think you've misunderstood what I've proposed a bit. The programs you're running don't need to detect the end of the pipeline - the shell can do it. (Especially since the program at the end of the pipeline probably doesn't know anything too specific about the data... That is, "ls" might be programmed in a way that it's good at formatting its own output for a TTY, but "sort" wouldn't be written to pretty-print "ls" output.)

In PowerShell, from what I understand of it, decisions like this are generally written out explicitly by the user - the pretty-printer is the last program in the pipeline. But the shell could potentially take on some of those responsibilities itself, if the stream includes metadata that describes how the output should be displayed, or what program on the system should be invoked as a default pretty-printer.

(Though things like that - stream metadata indicating programs that should be run to format display of values - it's a useful way to do things but it could be dangerous, too... I've given some thought to process jails as a way to make things like that safer, but I think it's a domain where I have to tread lightly.)

Postby Derek » Sat Aug 11, 2012 9:40 am UTC

la11111 wrote:It's quite possible to do things like use '>' as both a redir and a boolean operator (and not a bad idea, IMO). As a boolean operator, '>' would be nested within a test, which would then be nested within a statement. The '>' redir operator is a statement delineator, equivalent to semicolon. So from the perspective of a parser, there's three levels of scope separating them.

Oh I'm sure someone could still come up with a particularly bad case for this to cause a collision. Keep in mind that the *nix systems allow any character in the filename except null and /.

Postby tetsujin » Sun Aug 12, 2012 5:26 am UTC

Derek wrote:
la11111 wrote:It's quite possible to do things like use '>' as both a redir and a boolean operator (and not a bad idea, IMO). As a boolean operator, '>' would be nested within a test, which would then be nested within a statement. The '>' redir operator is a statement delineator, equivalent to semicolon. So from the perspective of a parser, there's three levels of scope separating them.

Oh I'm sure someone could still come up with a particularly bad case for this to cause a collision. Keep in mind that the *nix systems allow any character in the filename except null and /.


That's not really the issue to worry about. Whether the angle brackets are used for redirection or comparison, or both, the character can still be used in filenames via quoting, in the same way that it could be done presently. It is a potential issue when new syntax would force users to quote things they wouldn't normally expect to need quotes, but since the angle brackets are already "special" within the shell, they're already characters that people are used to quoting when using them as part of a filename.

Postby Derek » Sun Aug 12, 2012 6:20 am UTC

tetsujin wrote:
Derek wrote:
la11111 wrote:It's quite possible to do things like use '>' as both a redir and a boolean operator (and not a bad idea, IMO). As a boolean operator, '>' would be nested within a test, which would then be nested within a statement. The '>' redir operator is a statement delineator, equivalent to semicolon. So from the perspective of a parser, there's three levels of scope separating them.

Oh I'm sure someone could still come up with a particularly bad case for this to cause a collision. Keep in mind that the *nix systems allow any character in the filename except null and /.


That's not really the issue to worry about. Whether the angle brackets are used for redirection or comparison, or both, the character can still be used in filenames via quoting, in the same way that it could be done presently. It is a potential issue when new syntax would force users to quote things they wouldn't normally expect to need quotes, but since the angle brackets are already "special" within the shell, they're already characters that people are used to quoting when using them as part of a filename.

That's not what I mean, although it is one potential issue. Here's an example, though I don't know exactly what kind of syntax you're imagining: "if a > 0" - what should this command do? It naturally looks like a comparison, but what if I have a program called "if", and I'm passing it an argument "a" and redirecting its output to a file named "0"? For most reasonable syntaxes I can imagine using, such a command could be constructed. For example, if you wanted to add parentheses or braces, I could just make those part of the file name.

Actually, what do existing shells do about this? Windows is stricter about characters that can't be used in file names (\/:*?"<>| are all banned), but *nix really allows almost anything. Anyone want to test this?

Postby EvanED » Sun Aug 12, 2012 6:47 am UTC

You could name a command 'if' and the shell would still treat 'if whatever' as a conditional. In such cases you can run the command by specifying it as a path (e.g. ./if or /usr/bin/if) or by saying "command if whatever".

I didn't test this, but I'm pretty sure that's what the behavior is.

Postby tetsujin » Sun Aug 12, 2012 8:15 am UTC

Derek wrote:That's not what I mean, although it is one potential issue. Here's an example, though I don't know exactly what kind of syntax you're imagining: "if a > 0" - what should this command do? It naturally looks like a comparison, but what if I have a program called "if", and I'm passing it an argument "a" and redirecting its output to a file named "0"? For most reasonable syntaxes I can imagine using, such a command could be constructed. For example, if you wanted to add parentheses or braces, I could just make those part of the file name.

Actually, what do existing shells do about this? Windows is stricter about characters that can't be used in file names (\/:*?"<>| are all banned), but *nix really allows almost anything. Anyone want to test this?


What la11111 was suggesting was a kind of modal syntax - in order to use angle brackets as comparison operators, you'd have to be in the "comparison mode". In this mode, presumably, you'd be unable to perform redirection at all.

So it'd go like this... At an ordinary command prompt:

Code: Select all

$ some_cmd > filename


That would be a redirect. But if you did this in the context of a test:

Code: Select all

# Comparisons performed within the brackets are done according to "test syntax"
$ a = 4
$ if [ $a < 15 ]; then echo hello; fi     #  That "<" is not redirection!
hello


In this case, '<' is a comparison operator rather than a redirection operator. Within this imaginary example, '[' is not a command as it is in bash - rather it is true syntax, and within the square brackets, the shell parses the command with "test syntax" - meaning the angle brackets have a different function and possibly different precedence as well.

If you were to do that in bash, the "< 15" would be treated as a redirection - it would error out if the shell can't open the file "15" - and if it can open the file "15" the test won't do what the user wants:

Code: Select all

# in bash...
$ touch 15
$ a=40
$ if [ $a < 15 ]; then echo hello; fi
hello
$ echo [ $a < 15 ]
[ 40 ]
# The shell took "< 15" as a redirection, and then took the following token, ']', as another argument to "echo"
# Also note that, in bash, when you write something like "echo [ $a -lt $b ]" it doesn't perform a comparison.
# When "[" is run as a command with ($a, "-lt", $b, "]") as its arguments, it performs a test.
# Constructs like "if" and "while" run whatever command follows them and operate on the return value.


Shifting between syntax modes like this could work, but I'm not fond of this approach because I think it's confusing to change the rules of the syntax so casually. What if someone wants to include a redirection as part of their test? Do they then have to get out of "test syntax"? Or "escape" the comparison operator somehow to make it a redirect operator again? I think it's better, to the extent possible, to keep the syntax consistent in different contexts.

(EDIT): As for the whole "angle brackets can be part of filenames" thing - well, yes. But if the character is syntax, then you can escape it when providing a command argument, to avoid it being treated as syntax:

Code: Select all

$ echo foo\>bar    # ">" is syntax for redirection, we don't want redirection to happen, we just want an angle bracket as part of our command argument.
$ echo "foo>bar"    # Likewise...
$ touch "foo>bar"   # These work for filenames, too.  With double-quotes and backslashes you can get around anything.  It just won't necessarily be pretty...


In bash, "[" for tests isn't syntax, it's actually a command (/bin/[ - though it's a builtin so the one in /bin is mostly used by simpler shells.) As previously noted, this has some implications... "[" is a command, so the test only happens if "[" is in command position. There has to be whitespace around "[" and "]" because the shell won't treat them as separate tokens otherwise. (And because [] is part of bash globbing syntax - so the whitespace serves to tell the shell that it's not a glob and to leave the characters alone) And the "]" doesn't actually inform the shell that the test syntax is over - hence, you need a semicolon after the "]" so the shell knows where the test ends and the "then" clause is likely to begin...

Personally, I think all those factors make the square-bracket test "syntax" in the Unix shell pretty confusing:

Code: Select all

$ if [-e filename]; then echo exists; fi
bash: [-e: command not found
# Huh??  No, I understand why it says that, "if" is followed by a command, and without whitespace, the command name is "[-e" - but it's confusing because "[]" looks like syntax.  I'd expect it to be treated as a separate token.
$ if [ -e filename ] then echo exists; fi
bash: syntax error near unexpected token 'fi'
# The error, of course, is right after the square brackets. No semicolon, so "then echo exists" is turned into three more arguments to the "[" command:
# "[" "-e" "filename" "]" "then" "echo" "exists"
# Bash thinks the error is at "fi" because "if" has to be followed by a test, and then by "then".  But here, it's the test and then immediately "fi".
$ echo [ -e filename ]     #Bash, I want you to print "true" or something...
[ -e filename ]
# Bash tests don't work that way...  the test has to be run as a command.  It's not syntax, it gets no special treatment from the shell.


Honestly, I find it tougher than you might expect to avoid this stuff in my own design. For instance, I mentioned before that my shell design doesn't have arithmetic syntax, but I intend to fake it by allowing numeric variables to take arguments as a way to trigger an evaluation:

Code: Select all

$ 5 + 8
13
# the number 5 was called with the arguments "+" and (8).  The numeric class recognized that as an arithmetic evaluation and returned the result.

But I run into the same problems - whitespace sensitivity and so on...
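
If it helps make the idea concrete, here's a rough sketch of the dispatch half of it - in Go rather than shell code, and purely illustrative (none of this is real syntax from my design):

Code: Select all

// Sketch only: a numeric value that, when "called" with arguments,
// treats them as an arithmetic expression, so "5 + 8" evaluates to 13.
package main

import (
  "fmt"
  "strconv"
)

type Number struct{ Value float64 }

// Call interprets args as "<operator> <operand>".
func (n Number) Call(args []string) (float64, error) {
  if len(args) != 2 {
    return 0, fmt.Errorf("expected an operator and one operand, got %d args", len(args))
  }
  rhs, err := strconv.ParseFloat(args[1], 64)
  if err != nil {
    return 0, err
  }
  switch args[0] {
  case "+":
    return n.Value + rhs, nil
  case "-":
    return n.Value - rhs, nil
  case "*":
    return n.Value * rhs, nil
  case "/":
    return n.Value / rhs, nil
  default:
    return 0, fmt.Errorf("unknown operator %q", args[0])
  }
}

func main() {
  result, _ := Number{Value: 5}.Call([]string{"+", "8"})
  fmt.Println(result) // 13
}


The dispatch itself is the easy part; deciding that "+" and "8" are separate arguments in the first place is exactly where the whitespace sensitivity bites.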
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Tue Aug 14, 2012 7:33 am UTC

I gave the "futureix" stuff a try. Well, not a lot to try at the moment but it was interesting to take a look.

Maybe not worth reporting bugs at this point, but:

Code: Select all

$ mkdir a; cd a; touch x y z; mkdir b; touch b/n1 b/n2; cd ..
$ ./list_directory ./*
{"-meta type": {"type name": "file info", "default display columns": ["name", "path", "kind"]}, "kind": ["regular file"], "in-directory": "a/b", "inode": 1630986, "name": "n1"}
{"-meta type": {"type name": "file info", "default display columns": ["name", "path", "kind"]}, "kind": ["regular file"], "in-directory": "a/b", "inode": 1630987, "name": "n2"}
{"-meta type": {"type name": "file info", "default display columns": ["name", "path", "kind"]}, "kind": ["directory"], "in-directory": "a/b", "inode": 1630981, "name": ".."}
{"-meta type": {"type name": "file info", "default display columns": ["name", "path", "kind"]}, "kind": ["directory"], "in-directory": "a/b", "inode": 1630985, "name": "."}
a/x
a/y
a/z


Files a/x, a/y, and a/z match the glob but aren't directories, so the program spits out invalid JSON...
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby EvanED » Tue Aug 14, 2012 3:27 pm UTC

Huh. OK, so I checked it out and I know what's causing it, but I don't really know what the right way to fix it is yet. I'll think about it.

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Tue Aug 14, 2012 6:32 pm UTC

Just wanted to stop by and say hi. I've been lurking on this thread for awhile... on account of being in the middle of such a rewrite myself. Though I'm doing it in Go. Currently, I've got several utilities "functional," but I haven't started work on making them different (wanting to solve the general problem before I work on making the i/o more interesting.)

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Wed Aug 15, 2012 9:54 pm UTC

tomCar wrote:Just wanted to stop by and say hi. I've been lurking on this thread for awhile... on account of being in the middle of such a rewrite myself. Though I'm doing it in Go. Currently, I've got several utilities "functional," but I haven't started work on making them different (wanting to solve the general problem before I work on making the i/o more interesting.)


What sort of direction do you want to take your project? When you say you haven't started work on making them different - what sort of differences do you envision? What do you have in mind for the I/O?
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Fri Aug 17, 2012 10:33 pm UTC

tetsujin wrote:
tomCar wrote:Just wanted to stop by and say hi. I've been lurking on this thread for awhile... on account of being in the middle of such a rewrite myself. Though I'm doing it in Go. Currently, I've got several utilities "functional," but I haven't started work on making them different (wanting to solve the general problem before I work on making the i/o more interesting.)


What sort of direction do you want to take your project? When you say you haven't started work on making them different - what sort of differences do you envision? What do you have in mind for the I/O?


I'm probably going to go the same route as EvanED, with JSON passed between utilities. Go can marshal to/from JSON so it's really just a matter of making structures to contain the information. Since the same function also handles marshaling to a "generic" structure, I just have to add a section that handles generic objects.
So for example, ls might output a JSON object with 5 fields (I know coreutils has more, but I don't really feel like going outside Go's entirely portable stdlib to get the rest of them): Name, Mode, User, Bytes, ModTime. Mode is the full -rwxrwxrwx notation and ModTime is in RFC3339. Translated into Go (with two helper methods to help format the time and size; note I haven't arranged to test this yet, it's just an idea for now):

Code: Select all

package fileinfo

import (
  "fmt"
  "time"
)

type FileInfo struct {
  Name string
  Mode string
  User string
  Bytes float64
  ModTime string
}

// Date renders ModTime in ls -l style: e.g. "Aug 17 12:00" within the current year, "Aug 17  2011" otherwise.
func (f FileInfo) Date() (ret string) {
  t, err := time.Parse(time.RFC3339, f.ModTime) // time.RFC3339 is the layout constant; the literal string "RFC3339" would not parse anything
  if err != nil {
    panic(err.Error())
  }
  ret += t.Month().String()[:3] + " "
  ret += fmt.Sprintf("%2d", t.Day()) + " "
  today := time.Now()
  if t.Year() == today.Year() {
    hour, min, _ := t.Clock()
    ret += fmt.Sprintf("%2d:%02d", hour, min)
  } else {
    ret += fmt.Sprintf("%5d", t.Year())
  }
  return
}

// Size renders Bytes with a metric suffix (KB, MB, ...), scaling by powers of 1000.
func (f FileInfo) Size() (ret string) {
  s := f.Bytes
  i := 0
  for ; i < 9; i++ {
    if s < 1 {
      s *= 1000
      break
    }
    s /= 1000
  }
  i--
  switch i {
    case 1: ret = fmt.Sprintf("%gKB",s)
    case 2: ret = fmt.Sprintf("%gMB",s)
    case 3: ret = fmt.Sprintf("%gGB",s)
    case 4: ret = fmt.Sprintf("%gTB",s)
    case 5: ret = fmt.Sprintf("%gPB",s)
    case 6: ret = fmt.Sprintf("%gEB",s)
    case 7: ret = fmt.Sprintf("%gZB",s)
    case 8: ret = fmt.Sprintf("%gYB",s)
    default: ret = fmt.Sprintf(" %gB",s)
  }
  return
}


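For what it's worth, the marshal/unmarshal step itself is only a few lines. A self-contained sketch (the struct here just mirrors the one above, and the values are made up):

Code: Select all

package main

import (
  "encoding/json"
  "fmt"
)

type FileInfo struct {
  Name    string
  Mode    string
  User    string
  Bytes   float64
  ModTime string
}

func main() {
  // One JSON object per value, ready to go on its own line of a pipe.
  out, err := json.Marshal(FileInfo{
    Name: "x", Mode: "-rw-r--r--", User: "tom", Bytes: 5000, ModTime: "2012-08-17T12:00:00Z",
  })
  if err != nil {
    panic(err)
  }
  fmt.Println(string(out))

  // The "generic" direction: anything JSON-shaped lands in a map.
  var generic map[string]interface{}
  if err := json.Unmarshal(out, &generic); err != nil {
    panic(err)
  }
  fmt.Println(generic["Name"], generic["Bytes"]) // x 5000
}
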
*EDIT*
One thing I'm worried about is that not all of the utilities actually need or should have full JSON support. cat for example just takes file names and outputs their concatenated contents, JSON would only complicate this. So we'd have this situation where a bunch of utilities output in JSON and a bunch of others which don't, it might be confusing (especially to new-comers.)

Yet another case, if the stat-ing is taken out of ls, then there's no reason to retain the full JSON notation in ls either. And perhaps more to the point stat could probably suffice with a CSV output. sort would of course be more tricky though. If files are uniformly formatted most of the issues go away... it's just that this is a very bad assumption (and I haven't played with current sort implementations enough to know how it handles it.) Ideally, I could also specify that field 3 is a month and do that type of comparison... but I'm not sure how that invocation would end up looking.
Maybe:
sort +M3

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Mon Aug 27, 2012 12:39 am UTC

tomCar wrote:*EDIT*
One thing I'm worried about is that not all of the utilities actually need or should have full JSON support. cat for example just takes file names and outputs their concatenated contents, JSON would only complicate this. So we'd have this situation where a bunch of utilities output in JSON and a bunch of others which don't, it might be confusing (especially to new-comers.)


EvanED seems to handle this in his design by putting each value on a separate line, and making each line a stand-alone JSON value. So you could run the stream through an ordinary line-oriented text filter like grep or head, and the output of the filter can still be used by the line-oriented JSON utilities. I don't know if the ultimate plan is to use line-oriented text processors and line-oriented JSON processors together, but for now that certainly seems like a possibility.
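
Consuming that kind of stream is about as simple as producing it - something along these lines (a Go sketch that assumes nothing beyond "one JSON value per line", not any particular tool's output):

Code: Select all

package main

import (
  "bufio"
  "encoding/json"
  "fmt"
  "os"
)

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    var value interface{} // could be a string, a number, an object, a list...
    if err := json.Unmarshal(scanner.Bytes(), &value); err != nil {
      fmt.Fprintln(os.Stderr, "not a JSON value:", err)
      continue
    }
    fmt.Printf("%T: %v\n", value, value)
  }
  if err := scanner.Err(); err != nil {
    fmt.Fprintln(os.Stderr, err)
    os.Exit(1)
  }
}


Because the framing is just lines, sticking a head -5 or a grep in front of it doesn't break anything for the consumer.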

My plan for dealing with this issue is to keep track of which programs deal in which kind of data, and enforce compatibility rules, or auto-convert between formats under certain circumstances. "Value streams" would be fundamentally incompatible with "raw streams", and so would require the user to provide an explicit conversion step. So existing programs would still work in the shell just fine, and they could be chained together like always - they just wouldn't be able to directly interact with the structured data streams that "native" programs pass around.

The problem of retaining all these existing tools within a new and fairly different environment is a challenging one. I don't think there's going to be a truly palatable way forward. So what I aim for is to make it easy for people to keep doing the things they already do, but also make it easy and advantageous to mix that with the new stuff that I'm going to provide. It's still not a perfect solution: if I implement my own version of "find", for instance, then people who type "find" expecting to find a POSIX-style implementation will be annoyed. I can invent syntax or whatever to make both "find"s accessible, but whatever I do, one "find" or the other will suffer in the new environment. If I make the new "find" the one that's run when the user types "find", then the new environment is that much less comfortable for people not interested in delving too deep into the features of the new environment. If the new one is shuffled to a different name or invoked by a special syntax to avoid the conflict, then the new features of the environment are hindered. It strikes me as a bit of a no-win scenario - faced with that, I think the best bet is to bite the bullet, accept that it'll be slightly awkward to use some existing programs in the new environment, and make the new bits as good as possible.

tomCar wrote:Yet another case, if the stat-ing is taken out of ls, then there's no reason to retain the full JSON notation in ls either. And perhaps more to the point stat could probably suffice with a CSV output. sort would of course be more tricky though. If files are uniformly formatted most of the issues go away... it's just that this is a very bad assumption (and I haven't played with current sort implementations enough to know how it handles it.) Ideally, I could also specify that field 3 is a month and do that type of comparison... but I'm not sure how that invocation would end up looking.
Maybe:
sort +M3


Remember that filenames can contain "special" characters like comma, tab, or quotes. To reliably encapsulate that data, you need to establish a convention for how you delimit the boundaries of the field, and how you encode the data in the field when the field contains something that looks like a delimiter. If you have a format that can do all that, then JSON isn't much of a stretch beyond that - so I think there's no point going to a simpler format for ls - JSON is simple enough, but still robust enough to handle anything that will come out of ls.

As for the sort issue - I figure the thing to do is (to the extent possible) make sure that the fields are already encoded in such a way that there's an obvious and appropriate default for how the field would be sorted. So numbers are sorted as numbers (in particular, they're not sorted as strings, we don't want (10<2) to be true...) - dates are stored as dates, and sorted by the order in which they occur.

For cases where this isn't enough, the sort utility can provide additional options: which fields to sort in what order, custom chunks of code for performing element comparisons, and so on.
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Mon Aug 27, 2012 3:31 am UTC

tetsujin wrote:
tomCar wrote:*EDIT*
One thing I'm worried about is that not all of the utilities actually need or should have full JSON support. cat for example just takes file names and outputs their concatenated contents, JSON would only complicate this. So we'd have this situation where a bunch of utilities output in JSON and a bunch of others which don't, it might be confusing (especially to new-comers.)


EvanED seems to handle this in his design by putting each value on a separate line, and making each line a stand-alone JSON value. So you could run the stream through an ordinary line-oriented text filter like grep or head, and the output of the filter can still be used by the line-oriented JSON utilities. I don't know if the ultimate plan is to use line-oriented text processors and line-oriented JSON processors together, but for now that certainly seems like a possibility.

My plan for dealing with this issue is to keep track of which programs deal in which kind of data, and enforce compatibility rules, or auto-convert between formats under certain circumstances. "Value streams" would be fundamentally incompatible with "raw streams", and so would require the user to provide an explicit conversion step. So existing programs would still work in the shell just fine, and they could be chained together like always - they just wouldn't be able to directly interact with the structured data streams that "native" programs pass around.

The problem of retaining all these existing tools within a new and fairly different environment is a challenging one. I don't think there's going to be a truly palatable way forward. So what I aim for is to make it easy for people to keep doing the things they already do, but also make it easy and advantageous to mix that with the new stuff that I'm going to provide. It's still not a perfect solution: if I implement my own version of "find", for instance, then people who type "find" expecting to find a POSIX-style implementation will be annoyed. I can invent syntax or whatever to make both "find"s accessible, but whatever I do, one "find" or the other will suffer in the new environment. If I make the new "find" the one that's run when the user types "find", then the new environment is that much less comfortable for people not interested in delving too deep into the features of the new environment. If the new one is shuffled to a different name or invoked by a special syntax to avoid the conflict, then the new features of the environment are hindered. It strikes me as a bit of a no-win scenario - faced with that, I think the best bet is to bite the bullet, accept that it'll be slightly awkward to use some existing programs in the new environment, and make the new bits as good as possible.

Conversion isn't that difficult and does allow for each tool to be as simple as necessary (e.g. cat.) However, it does mean that a new-comer will have to learn and remember what format each tool outputs. With a small number of formats this isn't a big deal, but it is an impediment to adoption. Of course, as you say the new tools will behave differently and that is already an impediment to adoption.

Of course, I could make all tools uniform and have all i/o be JSON. Which collectively is elegant, but individually ugly:

Code: Select all

$cat file
[{"name":"file", "content":"This is a file.\nI go on\nand on\nand on."}]

grep has a few options with how to deal with this: field, field+lines, lines, or objects (ideally lines + objects would be the same.)

I'm not inclined to make the old tools available. I'd just as soon make each tool either a.) not care about old formats, b.) recognize and suggest, or c.) recognize and auto-correct. a is easiest for me and c is easiest for users. b makes sure that no unexpected behavior occurs (in case the correction is inaccurate); however, if it's accurate users might complain that "it knows what I want so why doesn't it do it."

Interestingly, as I talk about it the original implementations seem more and more reasonable. I'm now more or less of the opinion that lines/_SV (CSV,TSV,...) might be the best format and that the real issue lies in tools being inconsistent or inflexible. Of course, it could just be that the Plan 9 utilities are just better than the GNU ones (and thus I'm less inclined to make a radical change.)
tomCar wrote:Yet another case, if the stat-ing is taken out of ls, then there's no reason to retain the full JSON notation in ls either. And perhaps more to the point stat could probably suffice with a CSV output. sort would of course be more tricky though. If files are uniformly formatted most of the issues go away... it's just that this is a very bad assumption (and I haven't played with current sort implementations enough to know how it handles it.) Ideally, I could also specify that field 3 is a month and do that type of comparison... but I'm not sure how that invocation would end up looking.
Maybe:
sort +M3


tetsujin wrote:Remember that filenames can contain "special" characters like comma, tab, or quotes. To reliably encapsulate that data, you need to establish a convention for how you delimit the boundaries of the field, and how you encode the data in the field when the field contains something that looks like a delimiter. If you have a format that can do all that, then JSON isn't much of a stretch beyond that - so I think there's no point going to a simpler format for ls - JSON is simple enough, but still robust enough to handle anything that will come out of ls.

As for the sort issue - I figure the thing to do is (to the extent possible) make sure that the fields are already encoded in such a way that there's an obvious and appropriate default for how the field would be sorted. So numbers are sorted as numbers (in particular, they're not sorted as strings, we don't want (10<2) to be true...) - dates are stored as dates, and sorted by the order in which they occur.

For cases where this isn't enough, the sort utility can provide additional options: which fields to sort in what order, custom chunks of code for performing element comparisons, and so on.

Filenames are handled by the current shells just fine. Space separated with any filename containing a space quoted (and further quotes escaped.) As far as names in a line in a file... for line based tools you simply aren't allowed to put a newline within one entry and CSV has specifications similar to the shell.
At least as far as ls goes I believe CSV is robust enough to handle it (ls currently looks like TSV.) The issue with ls is that there's a lot of information in the output and sort isn't really able to handle it all.

Number detection is easy enough, but what about month names? I might want to sort them alphabetically or chronologically. Similarly, some date formats work well with string sorting (YYYY/MM/DD) but some don't (e.g. if month or day doesn't have to be 2 digits, "12" comes before "8" for example.) I'm definitely of the opinion that sort needs to have custom field orders and custom field type specifications (number, month, date? <-8/26/2012 or 26/8/2012 or 2012/8/26?) The question is how? Maybe a CSV flag:
sort --type=MM/DD/YYYY,NUM,STR --order=1,3,2
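
To show the kind of field-aware comparison I'm after, here's a throwaway Go sketch (the M/D/YYYY layout is just an arbitrary choice for the example):

Code: Select all

package main

import (
  "fmt"
  "sort"
  "time"
)

func main() {
  // Sorted as strings these come out "12/..." < "8/..." < "9/...", which is wrong.
  fields := []string{"8/26/2012", "12/1/2011", "9/3/2012"}

  dates := make([]time.Time, len(fields))
  for i, f := range fields {
    t, err := time.Parse("1/2/2006", f) // M/D/YYYY, single digits allowed
    if err != nil {
      panic(err)
    }
    dates[i] = t
  }
  sort.Slice(dates, func(i, j int) bool { return dates[i].Before(dates[j]) })
  for _, d := range dates {
    fmt.Println(d.Format("2006-01-02")) // 2011-12-01, 2012-08-26, 2012-09-03
  }
}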

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Mon Aug 27, 2012 8:07 am UTC

My point about special characters in filenames is that, while the current shells handle that with no problem in their syntax, handling that kind of case in a pipeline is more problematic, and one of the scenarios that keeps coming up as we talk about why pipelines should adhere to a common serialization format in the first place. If your format supports quoting rules sufficient to encode reliably the different characters that may appear in a string... Then at that point, it seems to me you're most of the way toward using JSON anyway, and you may as well go the rest of the way so at least you have something to point at and say, "this is the serialization format being used here."

The trick with newcomers needing to know what format each tool outputs is that... mostly they don't. In the cases where they do, static type-checking gives them a helpful message and their (malformed) job doesn't run.
I figure there's three basic levels of support a program can implement for the shell's data model:
First is true "native support" - which means the program does its I/O a format defined for the shell, or something that has an equivalent level of expressive power... Like a JSON or XML encoding that supports the concepts of the shell's data model.
Second level would be "wrapped" programs: programs that do I/O in a format that suits them - but the program's installation provides info the shell can use to translate the stream to other formats. For instance, a program may write out a list of newline-delimited strings, and when the shell runs that program in a pipeline, it would know to insert a format translator into the pipeline, to scan for newlines and output a properly-delimited string list.
Third level is "raw" - basically the shell knows nothing about these programs' I/O disciplines and so doesn't let them play with non-raw stuff

So for both first and second levels, the program is set up such that the user doesn't need to think about how the program does I/O. It's just that raw boundary that bites people
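
For the second level, the translator the shell inserts could be about this simple - a Go sketch, assuming the structured side uses EvanED-style one-JSON-value-per-line output (this isn't actual code from either project):

Code: Select all

// Wraps a raw newline-delimited string stream as one JSON string per line,
// so a "wrapped" program's output can feed a structured pipeline.
package main

import (
  "bufio"
  "encoding/json"
  "fmt"
  "os"
)

func main() {
  scanner := bufio.NewScanner(os.Stdin)
  for scanner.Scan() {
    encoded, err := json.Marshal(scanner.Text()) // quotes and escapes the raw line
    if err != nil {
      fmt.Fprintln(os.Stderr, err)
      os.Exit(1)
    }
    fmt.Println(string(encoded))
  }
}


A translator like that inherits the limits of the raw format it wraps, of course - a newline embedded in a filename still splits into two "values" - which is the usual argument for going to the first level when you can.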

As for "allowing" pre-existing utilities... depends on if you want people to use the shell. If the shell won't run the programs people already like, they won't use the shell.
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Mon Aug 27, 2012 10:31 am UTC

tetsujin wrote:My point about special characters in filenames is that, while the current shells handle that with no problem in their syntax, handling that kind of case in a pipeline is more problematic, and one of the scenarios that keeps coming up as we talk about why pipelines should adhere to a common serialization format in the first place. If your format supports quoting rules sufficient to encode reliably the different characters that may appear in a string... Then at that point, it seems to me you're most of the way toward using JSON anyway, and you may as well go the rest of the way so at least you have something to point at and say, "this is the serialization format being used here."

The trick with newcomers needing to know what format each tool outputs is that... mostly they don't. In the cases where they do, static type-checking gives them a helpful message and their (malformed) job doesn't run.
I figure there's three basic levels of support a program can implement for the shell's data model:
First is true "native support" - which means the program does its I/O a format defined for the shell, or something that has an equivalent level of expressive power... Like a JSON or XML encoding that supports the concepts of the shell's data model.
Second level would be "wrapped" programs: programs that do I/O in a format that suits them - but the program's installation provides info the shell can use to translate the stream to other formats. For instance, a program may write out a list of newline-delimited strings, and when the shell runs that program in a pipeline, it would know to insert a format translator into the pipeline, to scan for newlines and output a properly-delimited string list.
Third level is "raw" - basically the shell knows nothing about these programs' I/O disciplines and so doesn't let them play with non-raw stuff

So for both first and second levels, the program is set up such that the user doesn't need to think about how the program does I/O. It's just that raw boundary that bites people

As for "allowing" pre-existing utilities... depends on if you want people to use the shell. If the shell won't run the programs people already like, they won't use the shell.

I'm not actually planning any shell/pipeline rewrites (for now.) So I'm still using "raw." It's simply that I want individual tools to be more consistent. Which is why I worry about formatting, it's still important to me.

By "allowing" I was more talking about the format/flags that old utilities used. I'd really rather the user not have the old utilities, but as it is now, if you mess with the paths you can have multiple tool sets on your system (I have GNU and Plan 9 tools on my system.) I'm not certain that I want to retain compatibility with older tools or even provide suggestions as to what my invocation would look like (as mentioned, the former has issues if it's not accurate and the latter gets complaints if it is.) Meaning, I would rather keep old format/flag info away from my stuff.

Of course, this does put me a little off from the central idea of this thread... but whether it's pipeline or utility based format change there's a lot of cross-over and interesting ideas to be had all around.

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Mon Aug 27, 2012 7:54 pm UTC

Also, I want to address this...

tomCar wrote:Of course, I could make all tools uniform and have all i/o be JSON. Which collectively is elegant, but individually ugly:

Code: Select all

$cat file
[{"name":"file", "content":"This is a file.\nI go on\nand on\nand on."}]

grep has a few options with how to deal with this: field, field+lines, lines, or objects (ideally lines + objects would be the same.)


There's no need for the metafield "name" there. "cat" is only expected to provide the content from the files. So if we take the standpoint that the output of any "native" (to the new shell) program is going to output a "list of values", then we can write the output of "cat" in a couple different ways (in JSON) - all of which have relatively low overhead:

1: output stream is a newline-separated list of single-line JSON values (evanED's serialization style, with the whole file contents as a single "value"):

Code: Select all

$ cat file
"This is a file./nI go on/nand on/nand on."


2: output stream as a whole is a JSON list:

Code: Select all

$ cat file
["This is a file./nI go on/nand on/nand on."]


A third option is that cat's output isn't encoded at all, and that it simply yields the raw contents of the file (which may themselves be an encoded stream) for another program to consume:

Code: Select all

$ cat file
This is a file.
I go on
and on
and on.


In all cases (except #3) we have the somewhat unfortunate overhead of syntax wrapping the file's contents, as well as various characters translated to escape-character syntax. From a more classic perspective, this is pure waste. After all, "cat" never needs to output multiple "values" (if you specify multiple filenames, cat is supposed to concatenate them together) - its output is always a single text stream, and you can tell when the stream ends by looking for an EOF condition on the pipe. (read(pipe, ...) -> 0) This waste is justified by the fact that getting all the core utilities to speak the same "language" makes it easier to get those programs to work together.

"cat" is an extreme example, however. A more typical case (and this is one I go back to again and again, it seems!) is "find". Though really any program that outputs a list of strings would work as an example. But there are reasons why "find" makes such a great example... It's not an artificial example, it's something people actually use - but the contents of its output are dictated by the files present on the disk... And while it's common (but not universal!) practice in Unix to avoid whitespace in filenames, there are times when we do wind up with filenames including whitespace: files we download from people who don't follow those conventions, filesystems we access over the network, or filesystems from other OS'es we may have set up for dual-boot. So that leaves us with the question of how we delimit those filenames in the output of a program like "find".

The default output of "find" is newline-delimited: which fails if a filename contains a newline (which can happen but I think it's very rare in practice.) This would be quite sufficient for most purposes, except that the shell doesn't entirely support that...

Code: Select all

$ for f in $(find); do ... ; done   # The shell will split the output of $(find) on $IFS, which is any whitespace by default.  You can set IFS to just a newline (IFS=$'\n'), but the unquoted expansion still goes through globbing, so it's fragile...
$ find | while read f; do ... ; done    #This will work, however.
$ find | xargs -d "\n" cmd             # GNU xargs supports -d, but not all versions of xargs do.  Without -d, xargs applies its own quoting rules, which affect how it handles whitespace and quotes - so plain xargs can still fail:
$ touch '"a b"'; ls
"a b"
$ find | xargs echo   # Default quoting rules kick in, so we see...
. ./a b
# No quotes!  See, this is the problem with ad-hoc serialization rules...


And for those cases where it's not sufficient, there's "-print0" which uses zero-bytes as the delimiter. The main problem with that option is that it's not present in all implementations of find and xargs - and not in many other tools either, it seems...

Code: Select all

$ find -print0 | xargs -0 echo
. ./"a b"
#Yes!  Perfect!  There's also "grep -Z" (though the option doesn't seem to actually work on my machine), "perl -0", "sort -z", and maybe some others.
$ find -print0 | while read -0 f; do ...; done    # Doesn't work!  Shell's "read" has no -0 option (bash can fake it with IFS= read -r -d '', but that's hardly obvious)
$ IFS="\0"; for f in $(find -print0); do ...; done   # Again, doesn't work!  Can't set $IFS to the null byte.


If the convention for using null-byte as a delimiter in data streams were universally adopted in the shell, that would improve the overall situation quite a lot. The main limitation is that data streams wouldn't be able to contain the null-byte as part of the value... Which usually isn't a problem if you're writing text... There are other low-ASCII characters which could be used as well, except that they do vary from platform to platform and it's dangerous to assume that no one would ever want one as part of the payload....

And that's the great thing about encoding the value streams: you can have anything as part of the value payload and still know, reliably, where the boundary between values lies.

For most line-oriented utilities, respecting a common format like JSON wouldn't be a big deal: a lot of them are halfway there already (because they support some form of quoting syntax in their I/O) - the only thing that's missing is that they haven't all agreed on a single I/O meta-format for basic things like delimiting values.

And that's why even the simplest programs should encode their results in JSON or whatever other meta-format is used as the shell's value representation. We accept a bit of waste in order to get everything working together nicely.
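
To make that concrete, a find-like lister that plays by those rules doesn't take much code. A Go sketch (not my tool, not EvanED's - just an illustration):

Code: Select all

// Walks a directory tree and emits each path as a JSON string, one per line.
// Spaces, quotes, even embedded newlines in names survive intact.
package main

import (
  "encoding/json"
  "fmt"
  "os"
  "path/filepath"
)

func main() {
  root := "."
  if len(os.Args) > 1 {
    root = os.Args[1]
  }
  err := filepath.Walk(root, func(path string, info os.FileInfo, err error) error {
    if err != nil {
      return err
    }
    line, err := json.Marshal(path)
    if err != nil {
      return err
    }
    fmt.Println(string(line))
    return nil
  })
  if err != nil {
    fmt.Fprintln(os.Stderr, err)
    os.Exit(1)
  }
}


Run it over the earlier example and the file named "a b" (quotes and all) comes out as one unambiguous value per line; a name with an embedded newline would, too.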
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Mon Aug 27, 2012 8:43 pm UTC

tetsujin wrote:snip cat stuff

I was making it purposefully ugly, I was not suggesting that as what cat should look like.
tetsujin wrote:In all cases (except #3) we have the somewhat unfortunate overhead of syntax wrapping the file's contents, as well as various characters translated to escape-character syntax. From a more classic perspective, this is pure waste. After all, "cat" never needs to output multiple "values" (if you specify multiple filenames, cat is supposed to concatenate them together) - its output is always a single text stream, and you can tell when the stream ends by looking for an EOF condition on the pipe. (read(pipe, ...) -> 0) This waste is justified by the fact that getting all the core utilities to speak the same "language" makes it easier to get those programs to work together.

Which may suggest that cat doesn't belong to the common format... instead it may be better to ensure that generated files/pipes are already in the appropriate format. In which case the format should be concatenate-able. _SV works best for this, but JSON and XML also work (XML's root nodes must be merged, so it's not a true concatenation, but close enough.)

tetsujin wrote:And that's why even the simplest programs should encode their results in JSON or whatever other meta-format is used as the shell's value representation. We accept a bit of waste in order to get everything working together nicely.

I'm not suggesting otherwise. I just want to make sure that any waste that occurs is necessary waste (and that the chaos of formats we currently have is avoided.) Personally, I think (C|T)SV is the most minimal format that meets the requirements. Further, TSV is already sort of used (ls and sort know about it;) it should be a relatively simple matter to make TSV the formal format.
Last edited by tomCar on Tue Aug 28, 2012 5:17 pm UTC, edited 1 time in total.

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Tue Aug 28, 2012 9:06 am UTC

I think there's not much to be gained from going with something simpler than JSON, and probably something to be lost. Getting reliable delimitation of values is great, but that's pretty much all you get... And while it's great that two or three utilities already support CSV, it doesn't amount to much. Two or three utilities support null-delimited values, and they're not the same ones. There's probably another format or two that fit the pattern. Point is, that still leaves the majority unaddressed - it doesn't save you all that much work, particularly compared with a format like JSON that's already widely supported in different programming languages.
Personally I don't think a minimalist format is the thing to shoot for. If we're to reimplement the shell environment and base it around a common interchange format, better to take advantage of the potential there, choose something that broadens the shell's capabilities in addition to simplifying pipeline programming. I'm thinking type system, nested structures, metadata, the ability to handle chunks of binary data, etc
Rather than thinking about whether waste is "necessary" I think it's better to consider what you get for it. This is the basic philosophy that justifies XML, as I understand it... The format's wasteful, but in exchange you get lots of elbow room to make backward-compatible improvements to a file format you've defined within the bounds of XML. If you think of that in terms of the shell, you could define the output of a program in such a way that if you add another field to that output later, your existing scripts will still work.
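
As a tiny illustration of that last point with JSON (the field names here are invented for the example): a consumer written when the stream only had "name" keeps working after the producer grows an "owner" field, because unknown fields are simply ignored.

Code: Select all

package main

import (
  "encoding/json"
  "fmt"
)

// The consumer was written against the old output, which only had "name".
type OldRecord struct {
  Name string `json:"name"`
}

func main() {
  // The producer has since started emitting an extra "owner" field.
  newOutput := `{"name":"notes.txt","owner":"alice"}`

  var rec OldRecord
  if err := json.Unmarshal([]byte(newOutput), &rec); err != nil {
    panic(err)
  }
  fmt.Println(rec.Name) // notes.txt - the old script still works
}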
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Tue Aug 28, 2012 8:33 pm UTC

tetsujin wrote:I think there's not much to be gained from going with something simpler than JSON, and probably something to be lost. Getting reliable delimitation of values is great, but that's pretty much all you get... And while it's great that two or three utilities already support CSV, it doesn't amount to much. Two or three utilities support null-delimited values, and they're not the same ones. There's probably another format or two that fit the pattern. Point is, that still leaves the majority unaddressed...

It's actually the reverse case. Most utilities are already nearly CSV compliant. It's just a matter of changing delimiters and/or properly escaping the output. CSV is also fairly widely available, and if not it's trivial to parse/generate.

CSV can easily accomplish meta or binary data.

Code: Select all

00011011,2
FF,16

Code: Select all

[{"number":"00011011","base":2},
{"number":"FF","base":16}]

Code: Select all

<root>
    <number value="00011011" base="2" />
    <number value="FF" base="16" />
</root>

Naturally, the JSON and XML versions could be shortened. However, they still wouldn't be as short as the CSV version which represents the exact same thing (a binary number.)
Nested structures are also possible, however, it's a mess and nesting just isn't useful enough to bother. Especially since relational structures accomplish the same goal (and are easily expressed in CSV.)
CSV is extensible. If I wanted to I could add another field, it would just have to come last (since CSV uses location to determine references, whereas JSON and XML use names to find things.)

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby EvanED » Wed Aug 29, 2012 2:14 pm UTC

However, you can't remove a field, and you can't create a stream of heterogeneous types as each row isn't self-describing. (I guess you could allow multiple header rows, and the latest seen is what applies.)

Not worth the space savings IMO when you lose several other benefits (even given the multiple header rows idea).

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Wed Aug 29, 2012 6:43 pm UTC

EvanED wrote:However, you can't remove a field, and you can't create a stream of heterogeneous types as each row isn't self-describing. (I guess you could allow multiple header rows, and the latest seen is what applies.)

Not worth the space savings IMO when you lose several other benefits (even given the multiple header rows idea).

Why would you want to remove a field? If you're filtering for output then it doesn't make a difference what format you use. If you're removing important information... you're still going to have issues in JSON and XML.

Depending on requirements, you don't even need headers for heterogeneous types. Just introduce metadata:

Code: Select all

int,123
float,123.456
string,"Hi"

Of course, you'd need some way to specify what means what, but you have to do that for JSON and XML as well.

There really aren't too many features that CSV doesn't have. As commented, nested structures are the only thing that's really hard in CSV, and you can simply do relations instead (which provides the same feature set.) I don't really mean to convince you that JSON is a bad option, just that I think CSV does the job and wastes less. That makes CSV the best format for my project - something I decided on during this discussion (I started out with JSON, but as I talked through how things would work I realized I wasn't going to use any feature that JSON adds.)

User avatar
tetsujin
Posts: 426
Joined: Thu Nov 15, 2007 8:34 pm UTC
Location: Massachusetts
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tetsujin » Wed Aug 29, 2012 8:48 pm UTC

tomCar wrote:
tetsujin wrote:I think there's not much to be gained from going with something simpler than JSON, and probably something to be lost. Getting reliable delimitation of values is great, but that's pretty much all you get... And while it's great that two or three utilities already support CSV, it doesn't amount to much. Two or three utilities support null-delimited values, and they're not the same ones. There's probably another format or two that fit the pattern. Point is, that still leaves the majority unaddressed...

It's actually the reverse case. Most utilities are already nearly CSV compliant. It's just a matter of changing delimiters and/or properly escaping the output. CSV is also fairly widely available, and if not it's trivial to parse/generate.


Only if you ignore the deeper issues.

For instance, how do you embed a comma, or a newline, as part of one of your fields? The common rules are:

1: If your field contains anything funky, wrap it in double quotes.
2: If you want to put a double-quote character in a field, double it.

Now if you want to actually follow those rules, then parsing CSV is harder than just setting $IFS or whatever. You need to recognize things like the fact that a comma within quotes isn't a field separator, and a newline within quotes isn't a record separator. You can't just readline() and split(). It's still an easy format to parse, but I would argue:

1: it's not significantly easier to parse than JSON, not if you want to get it right1. That is, you need to actually parse it, rather than just reading a line at a time and scanning for commas.
2: If you use a CSV library to parse it (and, thus, get it right) - then it's no easier than using, say, a JSON library to do the same thing.
3a: the fact that implementations of CSV so frequently ignore the quoting issue means that you'll wind up in scenarios where you'll think you can just pass a chunk of data to a particular program, when in fact that program's gonna mangle the stream. (This is much less of an issue with JSON, as the format was established with most or all of those decisions already made, and there's a clear authority on decisions regarding the JSON format - go to json.org and you will see the rules that define the JSON format. Go to Wikipedia and you will see the most common rules used to define CSV - along with regional variations (like semicolons as field separator for countries that use comma as a decimal point in numbers2), inconsistently-supported rules (like all the quoting rules - the stuff that makes payload encapsulation fully-featured and robust, but which is often omitted from CSV parser implementations because people think, "Oh, comma-separated values, I'll just call split()."), would-be "standards" like RFC4180, and so on.)
3b: Alternately, if you say, "I'm just gonna use the simple subset of CSV that doesn't use quoting rules" - then you can't encapsulate quotes or newlines inside field values, and you'll choke if you process a stream written by a program that does use that syntax.
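
To make that concrete, here's a quick Go sketch using the standard encoding/csv and encoding/json packages, with one record whose first field contains a comma, a doubled quote, and a newline:

Code: Select all

package main

import (
  "encoding/csv"
  "encoding/json"
  "fmt"
  "strings"
)

func main() {
  // One CSV record, two fields; the first field is quoted and contains
  // a comma, an escaped (doubled) quote, and an embedded newline.
  raw := "\"report, \"\"final\"\"\nv2.txt\",1234\n"

  // Naive readline()-and-split sees the embedded newline as a record boundary.
  fmt.Println(len(strings.Split(raw, "\n"))) // 3 pieces (including the trailing empty one)

  // A real CSV parser gets it right...
  records, err := csv.NewReader(strings.NewReader(raw)).ReadAll()
  if err != nil {
    panic(err)
  }
  fmt.Printf("%d record(s), field 0 = %q\n", len(records), records[0][0])

  // ...but so does a JSON parser, with no more effort on the caller's part.
  var rec []interface{}
  if err := json.Unmarshal([]byte(`["report, \"final\"\nv2.txt", 1234]`), &rec); err != nil {
    panic(err)
  }
  fmt.Println(rec[0].(string) == records[0][0]) // true - same payload either way
}
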

tomCar wrote:CSV can easily accomplish meta or binary data.


Well, yes, more or less: but there are a few things to consider:

1: the receiving program doesn't know it's binary data, let alone what it's supposed to represent.
2: By ASCII-encoding the payload you're inflating its size a lot. (Doubling it for a HEX encoding, or increasing its size by about 1/3 in the case of base-64)
3: If you use an encoding like base-64, then you lose a valuable property of a bytestream: the byte boundaries of the source data no longer fall on byte boundaries in the encoding. This means, for instance, if you have five bytes of source data ready to send over the pipe, you can only send four of them: The first three bytes of the payload are encoded as four bytes of base-64, then the next byte and the first four bits of the next byte are sent as the next two bytes of base-64. You can't send the next byte of base-64 (containing the remaining four bits of the fifth source byte) until you have the sixth source byte ready to go. If this doesn't sound like a big deal, consider the case where a program is streaming data it receives over the network, or over a pipeline from another program. In that case, you don't know when or even if that sixth byte is going to be available. But the next program in the pipeline may nevertheless be able to take action if it gets that fifth source byte: for instance, after seeing the fifth byte it may be prepared to close the pipe (causing a SIGPIPE on the sender next time it tries to send more data), or it may have enough information to send another piece of data to the next program in the pipe after itself.

It is possible to define an encoding within the bounds of CSV that will provide the receiver with information like the fact that the string of characters in the field represents some binary encoding of data, or that the binary blob actually should be interpreted as some media format or other. But the point is that since CSV doesn't provide even the most basic mechanisms for doing so, then to provide that functionality you'd have to essentially invent a new format, using CSV as nothing more than the base-level encoding for that stream. You would either need to get all the various utilities to recognize this common format you've implemented on top of CSV (meaning it's not really "CSV" you're establishing as the baseline format, but rather something on top of it), or else let them all remain ignorant of the relevance of that binary encoding, which is equivalent to not providing any better support for binary data than what already exists in the shell and in its associated utilities.

And if you did invent that layer on top of CSV, containing your conventions for establishing different data types, etc. - a fair bit of that work would be essentially reinventing what JSON's already providing. Things like nested structures, possibly with named fields (as a rudimentary way of attaching metadata)

tomCar wrote:Nested structures are also possible, however, it's a mess and nesting just isn't useful enough to bother. Especially since relational structures accomplish the same goal (and are easily expressed in CSV.)


I don't agree that nested structures aren't "useful enough" - and your criterion there, "useful enough to bother", is skewed by the fact that you're starting with a format that doesn't naturally lend itself to that kind of functionality.

For instance, what's "useful enough to bother" using nested structures in JSON? It's trivial. You just use 'em. Anything that processes JSON will understand that it's a nested structure, and any JSON parser will be able to tell you exactly what the correct (decoded) payload for each field is.
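
For example (a made-up record, anticipating the access-control idea a bit further down - this isn't any real tool's output):

Code: Select all

package main

import (
  "encoding/json"
  "fmt"
)

type ACLEntry struct {
  Who  string `json:"who"`
  Mode string `json:"mode"`
}

type FileRecord struct {
  Name    string     `json:"name"`
  PermACL []ACLEntry `json:"perm-acl"` // a nested list of structures; no extra conventions needed
}

func main() {
  rec := FileRecord{
    Name:    "report, final.txt", // commas, quotes, newlines in the payload need no special care
    PermACL: []ACLEntry{{Who: "alice", Mode: "rw"}, {Who: "www-data", Mode: "r"}},
  }
  out, err := json.Marshal(rec)
  if err != nil {
    panic(err)
  }
  fmt.Println(string(out))
  // {"name":"report, final.txt","perm-acl":[{"who":"alice","mode":"rw"},{"who":"www-data","mode":"r"}]}
}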

In CSV, it involves questions like, what character escaping mechanism are you using? What do you use as a secondary delimiter for the nested fields? How do you encapsulate the secondary delimiter within a value field? Such decisions are beyond the scope of CSV itself. Again, it's a matter of having to define a set of conventions on top of CSV to get that kind of functionality.

Expressing a nested structure by linking to other records has a few problems: for instance, you may not want to treat the nested data the same way as you treat the record containing it. If you're filtering records out of the stream, for instance, you'd want the nested data to go away if you filter out the record containing it. But if your "nested" data is really just another record referenced in the field of another record in the stream, then your filter program doesn't know that, and if the "nested" record passes the filter rules then it'll go to the next stage of stream processing.

Using nested structures in data streams isn't "useful enough to be worth the trouble" right now because it really is quite a bit of trouble. Using any kind of delimiter means either you can't have that delimiter as part of a field payload, or you need to define a syntax to work around that, and then parse the stream instead of just scanning for delimiters. To nest structures, you either need some kind of encapsulation syntax (which pretty much requires true parsing), or another delimiter for each level of nesting (which, again, either means another character you can't encode as part of a field payload, or an escape syntax that you'll need to parse.)

Working with a stream format you need to parse sucks right now because most tools don't include stream parsers. Most tools don't include stream parsers because there's no consensus on a "common" meta-format that these parsers would target. The only alternative would be to bundle into each program sufficient functionality to allow the user to define a parser that the program would use to process their input. But that's a complicated thing to implement, a complicated thing to use, and if different utilities had different implementations of that parser, with different variations on the syntax used to define the parser... That would turn into a major headache quickly. Hence, the whole idea here, of establishing a "common meta-format" for the shell and its core utilities (which is, some would claim, an idea that flies in the face of everything Unix stands for) which would eliminate the whole problem of telling a consumer program how a connected generator program has chosen to encapsulate its value fields. The underlying mechanisms of creating and parsing those streams are no less complicated, but the fact that the various programs would already support such a format (and the corresponding data model - the common set of ideas about data structures that come with the format) means that the user doesn't have to explicitly code that stuff into his scripts.

If we're choosing a meta-format for this job, then we have the opportunity to pick something that would solve so many problems that it could dramatically increase the power of the Unix shell at the same time. All kinds of problems that are presently "too difficult to bother with" could suddenly become much easier. Potentially so much easier that we won't even think of them as "problems" at all. I think that is the situation with nested structures: if we make them easy to use, we'll stop avoiding them, and take advantage of them in situations where it makes sense to do so.

I have to believe nested structures are useful because we use them all the time in other programming languages, in our filesystems, in our documents... It's an idea that clearly works. You have to figure, also, that one common scenario for nested structures will be that you're actually just encoding and passing over the stream a piece of data that was created somewhere else, data which was originally represented as some sort of nested structure. If the streaming format supports those data structure concepts, then the translation process is pretty straightforward, and if two different people were to guess how that structure would be translated, it's likely they'd arrive at the same answer, and be unsurprised to find it's the same answer the computer came up with.

tomCar wrote:CSV is extensible. If I wanted to I could add another field, it would just have to come last (since CSV uses location to determine references, whereas JSON and XML use names to find things.)


There are all kinds of scenarios where this just isn't adequate. We've seen plenty of them already in the classic UNIX tools, most of which do use some kind of line-based, "simple" delimited-field format.

A very basic example is, what happens to the format after a lot of these additions/removals are performed? You wind up with a bunch of unused fields kept around as vestigial place-holders, and a bunch of new fields tacked on to the end.

Or what if two different people, working on diverging implementations of the same program, both add a new field to that stream format? Naturally, they'll both add the new field on the end, and in both cases it'll be the Nth field. This is the sort of thing you might get from, say, different implementations of "ls -l" or "ps". Scripts working on the output of those utilities won't be portable because that Nth field has different meaning depending on which version of that utility is installed.

By contrast: if the fields are marked with some kind of tag, something non-ordinal with plenty of available space for meaningfully defining new tags, then there's at least a pretty good chance the two implementations won't clash. You could still get scenarios like both implementations defining a field called "extended-permissions" or something similarly generic, but there's at least a decent chance that they won't, and (if they do) it's relatively easy to correct the issue by getting the implementers to coordinate a bit and avoid reusing each others' tags (or respect a tag naming scheme that would keep their fields from clashing - like a domain name-based scheme, for instance) unless they're truly compatible.

The questions can go deeper, if you are willing to go that far: for instance, the PNG format makes an effort to provide answers to questions like, "If I don't recognize this field, can I just ignore it?" (Personally I'm not sure if I will ever go that far - though it's a nice feature to have if you want to make scripts really reliable. You can then do things like say, "I don't recognize this field, but it's marked as ancillary so it's OK." instead of "I don't recognize this field called 'comment' - PANIC!" Though I think the question of whether a piece of data is "ancillary" may be too complicated to answer with a simple Boolean value.) Another useful feature might be to name fields in such a way that, even if you don't recognize the field, you can know what sort of data it's providing: for instance, a prefix like "perm" might mean "this field tells you who can read/write/execute this file" - while the field itself defines some permission scheme beyond the scope of classic Unix - like access control lists or whatever. Which, incidentally, would be another case where nested structures would be useful. :)


(1: "getting it right" is one of the main reasons I want to create a new shell in the first place. There's all these cases where creating a shell script to do a job is just as easy as it should be - as long as you don't hit one of those cases that breaks the simple version of the script. Getting it right for all cases is harder, especially when these different utilities, even different versions of utilities, don't agree on even a basic set of rules about how streams are formatted. My aim is to make a shell that helps to correct this situation. Using type information means the shell can tell the user when some things aren't going to work... Reliably delineating the boundaries of value data means the user doesn't have to cope with quoting rules when writing a stream-processing script... And giving the serialization format some flexibility, in the form of named fields and nested structures, gives the users greater expressive power to do the things they want to do, without resorting to arcane measures.)

(2: The regional issues surrounding the use of the comma as a decimal point in some countries have given me cause to reflect on the syntax I'm designing for my shell: ideally, I would want people in those regions to be able to use comma in the way they're accustomed to using it. If comma is a decimal point where they live, then it should be a decimal point in their shell. But as it stands, in my design, I use comma as a high precedence command/value separator, I rely on it heavily. Semicolon performs a similar job but it has lower precedence in the syntax. I could potentially implement a mode in which comma is not a command separator at all, and in which it serves as a decimal point in the numeric syntax... The main problem there is that if I provide regional "modes" for the syntax itself, then that impacts cross-region script compatibility. I have given some thought to problems of writing reliably portable scripts in general: things like a "portable script" mode, which would include flagging auto-conversions as portability errors - so the portable script mode could dictate that comma is a separator, or require that the comma mode be explicitly stated in the script... Things like that. Apart from that, there would just be issues of the syntax being slightly less convenient to use without the comma as a separator (having to put parens around things, and so on). In any case, though, I don't think the serialization format, if it's text-based, should have to support a similar mode switch. The serialization format shouldn't generally be something people work directly with, so it's not subject to UI considerations like L10N.)
---GEC
I want to create a truly new command-line shell for Unix.
Anybody want to place bets on whether I ever get any code written?

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Wed Aug 29, 2012 10:48 pm UTC

tetsujin wrote:
tomCar wrote:
tetsujin wrote:I think there's not much to be gained from going with something simpler than JSON, and probably something to be lost. Getting reliable delimitation of values is great, but that's pretty much all you get... And while it's great that two or three utilities already support CSV, it doesn't amount to much. Two or three utilities support null-delimited values, and they're not the same ones. There's probably another format or two that fit the pattern. Point is, that still leaves the majority unaddressed...

It's actually the reverse case. Most utilities are already nearly CSV compliant. It's just a matter of changing delimiters and/or properly escaping the output. CSV is also fairly widely available, and if not it's trivial to parse/generate.


Only if you ignore the deeper issues.

For instance, how do you embed a comma, or a newline, as part of one of your fields? The common rules are:

1: If your field contains anything funky, wrap it in double quotes.
2: If you want to put a double-quote character in a field, double it.

Now if you want to actually follow those rules, then parsing CSV is harder than just setting $IFS or whatever. You need to recognize things like the fact that a comma within quotes isn't a field separator, and a newline within quotes isn't a record separator. You can't just readline() and split(). It's still an easy format to parse, but I would argue:

1: it's not significantly easier to parse than JSON, not if you want to get it right[1] (see the sketch after this list). That is, you need to actually parse it, rather than just reading a line at a time and scanning for commas.
2: If you use a CSV library to parse it (and, thus, get it right) - then it's no easier than using, say, a JSON library to do the same thing.
3a: the fact that implementations of CSV so frequently ignore the quoting issue means that you'll wind up in scenarios where you'll think you can just pass a chunk of data to a particular program, when in fact that program's gonna mangle the stream. (This is much less of an issue with JSON, as the format was established with most or all of those decisions already made, and there's a clear authority on decisions regarding the JSON format - go to json.org and you will see the rules that define the JSON format. Go to Wikipedia and you will see the most common rules used to define CSV - along with regional variations (like semicolons as the field separator for countries that use the comma as a decimal point in numbers[2]), inconsistently-supported rules (like all the quoting rules - the stuff that makes payload encapsulation fully-featured and robust, but which is often omitted from CSV parser implementations because people think, "Oh, comma-separated values, I'll just call split()."), would-be "standards" like RFC 4180, and so on.)
3b: Alternately, if you say, "I'm just gonna use the simple subset of CSV that doesn't use quoting rules" - then you can't encapsulate quotes or newlines inside field values, and you'll choke if you process a stream written by a program that does use that syntax.
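
To make the quoting point concrete, here's a minimal Python sketch (the field contents are just illustrative) showing why a naive split() gets the boundaries wrong while a real CSV parser doesn't:

Code: Select all

import csv, io

record = 'a,"hello, ""world""\nsecond line",c\r\n'

print(record.split(","))                      # naive split: the comma and newline inside the quotes produce bogus fields
print(next(csv.reader(io.StringIO(record))))  # real parser: ['a', 'hello, "world"\nsecond line', 'c']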

It certainly is true that CSV lacks a formal specification. However, I don't feel that it matters. I'm merely using "CSV" as a convenient placeholder term. The correct phrasing would be something more like "the format that Go recognizes as CSV, which happens to follow the most common rules." Additionally, I fully intend for the separator character to be replaceable. That is, by default, the format is "CSV" but you can easily throw a "-s=:" flag to convert to using colons (this functionality is inherited from Go's library.)

In short, I have a rigidly defined format in mind and it's helpful to explain it as CSV because a CSV library will be able to read the format.
CSV can easily handle metadata or binary data.


Well, yes, more or less: but there are a few things to consider:

1: the receiving program doesn't know it's binary data, let alone what it's supposed to represent.
2: By ASCII-encoding the payload you're inflating its size considerably (doubling it for a hex encoding, or increasing it by about a third in the case of base-64).
3: If you use an encoding like base-64, then you lose a valuable property of a bytestream: the byte boundaries of the source data no longer fall on byte boundaries in the encoding. This means, for instance, that if you have five bytes of source data ready to send over the pipe, you can only deliver four of them: the first three bytes of the payload are encoded as four characters of base-64, then the fourth byte and the first four bits of the fifth are sent as the next two characters. You can't send the character after that (containing the remaining four bits of the fifth source byte) until the sixth source byte is ready to go. If this doesn't sound like a big deal, consider the case where a program is streaming data it receives over the network, or over a pipeline from another program. In that case, you don't know when, or even if, that sixth byte is going to be available. But the next program in the pipeline may nevertheless be able to take action once it gets that fifth source byte: for instance, after seeing the fifth byte it may be prepared to close the pipe (causing a SIGPIPE on the sender the next time it tries to send more data), or it may have enough information to send another piece of data to the next program in the pipe after itself.
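
Here's a small Python sketch of that boundary problem (the five buffered bytes are just illustrative); with the stock base64 module you can only flush whole three-byte groups, and even with bit-level packing the tail of the fifth byte has to wait for the sixth:

Code: Select all

import base64

pending = b"\x01\x02\x03\x04\x05"             # five source bytes buffered so far

# base-64 turns each 3-byte group into 4 output characters, so only complete
# groups can be flushed without knowing the bytes that haven't arrived yet.
flushable = len(pending) - len(pending) % 3
print(base64.b64encode(pending[:flushable]))  # b'AQID' can go on the wire now
print(pending[flushable:])                    # b'\x04\x05' has to wait

# (Packing bits by hand would let you emit two more characters - all of byte 4
# and the top half of byte 5 - but the bottom half of byte 5 still can't be
# encoded until byte 6 shows up, which is the point above.)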


1. JSON/XML don't solve this problem. You still need a convention for it.
2. Same with JSON/XML
3. And JSON/XML help with this how?
It is possible to define an encoding within the bounds of CSV that will provide the receiver with information like the fact that the string of characters in the field represents some binary encoding of data, or that the binary blob actually should be interpreted as some media format or other. But the point is that since CSV doesn't provide even the most basic mechanisms for doing so, then to provide that functionality you'd have to essentially invent a new format, using CSV as nothing more than the base-level encoding for that stream. You would either need to get all the various utilities to recognize this common format you've implemented on top of CSV (meaning it's not really "CSV" you're establishing as the baseline format, but rather something on top of it), or else let them all remain ignorant of the relevance of that binary encoding, which is equivalent to not providing any better support for binary data than what already exists in the shell and in its associated utilities.

It's exactly the same as in JSON/XML. To indicate format you need a field/name convention. If adding this information corresponds to creating a new format, then every single different type of object represented in JSON/XML is a different format. I've already shown how easy it is to add extra information; it does not require a new format.
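
A tiny Python sketch of what I mean - the "png-image" tag and the "type" key are a made-up convention, which is exactly the point: the consumer has to be told the convention either way:

Code: Select all

import csv, io, json

# Made-up convention: in the CSV form, column 1 is the type tag; in the JSON
# form, the "type" key is. Either way the consumer needs to know this up front.
csv_record  = next(csv.reader(io.StringIO("png-image,iVBORw0KGgo=\n")))
json_record = json.loads('{"type": "png-image", "data": "iVBORw0KGgo="}')

print(csv_record[0])        # png-image  (found by position)
print(json_record["type"])  # png-image  (found by name)
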
And if you did invent that layer on top of CSV, containing your conventions for establishing different data types, etc. - a fair bit of that work would be essentially reinventing what JSON's already providing: things like nested structures, possibly with named fields (as a rudimentary way of attaching metadata).

Except you still need to do this in JSON. At best JSON provides you with six types (array, object, string, number, boolean, null), and these are hardly exhaustive, so you will need to define conventions for indicating other types even in JSON. Metadata doesn't require named fields; named fields can be added in CSV without issue (however, doing so would make the choice of CSV over JSON basically meaningless, so it's my intention not to do so.)
Nested structures are also possible; however, it's a mess and nesting just isn't useful enough to bother with, especially since relational structures accomplish the same goal (and are easily expressed in CSV.)


I don't agree that nested structures aren't "useful enough" - and your criterion there, "useful enough to bother", is skewed by the fact that you're starting with a format that doesn't naturally lend itself to that kind of functionality.

For instance, what's "useful enough to bother" using nested structures in JSON? It's trivial. You just use 'em. Anything that processes JSON will understand that it's a nested structure, and any JSON parser will be able to tell you exactly what the correct (decoded) payload for each field is.
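
For example, pulling a value out of a nested JSON record in Python (the data here is made up) takes nothing beyond the parser itself:

Code: Select all

import json

record = json.loads('{"name": "foo.txt", "perm": {"acl": [{"user": "evan", "mode": "rw"}]}}')
print(record["perm"]["acl"][0]["user"])   # -> evan, with no extra delimiter or escaping rules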

You're missing the point: the point was that no utility uses them and that I don't believe nested structures would make life easier. In other words, show me how you'd use nested structures to simplify the usage of some utility. Then we can discuss whether or not nested structures are useful.
In CSV, it involves questions like, what character escaping mechanism are you using? What do you use as a secondary delimiter for the nested fields? How do you encapsulate the secondary delimiter within a value field? Such decisions are beyond the scope of CSV itself. Again, it's a matter of having to define a set of conventions on top of CSV to get that kind of functionality.

Those questions are already answered and were not a big deal in the first place.
Expressing a nested structure by linking to other records has a few problems: for instance, you may not want to treat the nested data the same way as you treat the record containing it. If you're filtering records out of the stream, for instance, you'd want the nested data to go away if you filter out the record containing it. But if your "nested" data is really just another record referenced in the field of another record in the stream, then your filter program doesn't know that, and if the "nested" record passes the filter rules then it'll go to the next stage of stream processing.

This could be an issue. I don't intend to use relations in the first place; I was merely suggesting that they can accomplish the same goals.
However, I will note that there's no reason a filter program couldn't know how to remove residual records. After all, a referenced record probably belongs to a different "table", and if we're removing the field that references it then it shouldn't be that hard to find the referenced record and remove it as well. If we're referencing something from the same table... well, we probably don't want to remove that record anyway (as it probably has useful information.)
Using nested structures in data streams isn't "useful enough to be worth the trouble" right now because it really is quite a bit of trouble. Using any kind of delimiter means either you can't have that delimiter as part of a field payload, or you need to define a syntax to work around that, and then parse the stream instead of just scanning for delimiters. To nest structures, you either need some kind of encapsulation syntax (which pretty much requires true parsing), or another delimiter for each level of nesting (which, again, either means another character you can't encode as part of a field payload, or an escape syntax that you'll need to parse.)

As I've said, demonstrate that nesting will be useful. There may be very good reasons why it wasn't included in the original tools. Until you come up with an example of a nested structure that makes life easier... there's no way to progress on this topic. Otherwise, I'm perfectly familiar with what nesting requires.
Working with a stream format you need to parse sucks right now because most tools don't include stream parsers. Most tools don't include stream parsers because there's no consensus on a "common" meta-format that these parsers would target. The only alternative would be to bundle into each program sufficient functionality to allow the user to define a parser that the program would use to process their input. But that's a complicated thing to implement, a complicated thing to use, and if different utilities had different implementations of that parser, with different variations on the syntax used to define the parser... That would turn into a major headache quickly. Hence, the whole idea here, of establishing a "common meta-format" for the shell and its core utilities (which is, some would claim, an idea that flies in the face of everything Unix stands for) which would eliminate the whole problem of telling a consumer program how a connected generator program has chosen to encapsulate its value fields. The underlying mechanisms of creating and parsing those streams are no less complicated, but the fact that the various programs would already support such a format (and the corresponding data model - the common set of ideas about data structures that come with the format) means that the user doesn't have to explicitly code that stuff into his scripts.

No disagreement here.
If we're choosing a meta-format for this job, then we have the opportunity to pick something that would solve so many problems that it could dramatically increase the power of the Unix shell at the same time. All kinds of problems that are presently "too difficult to bother with" could suddenly become much easier. Potentially so much easier that we won't even think of them as "problems" at all. I think that is the situation with nested structures: if we make them easy to use, we'll stop avoiding them, and take advantage of them in situations where it makes sense to do so.

I'm not opposed to nested structures. I just don't think that there's enough information to come to the conclusion that they are a necessary feature in a data format. I've suggested the relational model can fill in this gap, but I don't really want to use it either (because I'm not sure there's any need for it either.)
I have to believe nested structures are useful because we use them all the time in other programming languages, in our filesystems, in our documents... It's an idea that clearly works. You have to figure, also, that one common scenario for nested structures will be that you're actually just encoding and passing over the stream a piece of data that was created somewhere else, data which was originally represented as some sort of nested structure. If the streaming format supports those data structure concepts, then the translation process is pretty straightforward, and if two different people were to guess how that structure would be translated, it's likely they'd arrive at the same answer, and be unsurprised to find it's the same answer the computer came up with.

You're begging the question here, assuming that we have a nested structure to deal with before we've decided to represent something as a nested structure.
Personally, I've never seen a nested structure in a document, and I would just as soon the file system not be a nested structure (but relational instead.)
Programming languages are a little different, though. They can (and usually do) define their structures before actually creating/using them. This means that if you want to do something with a structure you don't have to look at all of it; you can just look at the one component you want. This is not the case on the shell. The entire data structure is going to get written to your terminal, and if you have to wade through a nested structure on the terminal... well, good luck (I don't envy you.) In this case, references work better since you don't have everything and the kitchen sink in one place.
CSV is extensible. If I wanted to I could add another field; it would just have to come last (since CSV uses location to determine references, whereas JSON and XML use names to find things.)


There are all kinds of scenarios where this just isn't adequate. We've seen plenty of them already in the classic UNIX tools, most of which do use some kind of line-based, "simple" delimited-field format.

A very basic example is, what happens to the format after a lot of these additions/removals are performed? You wind up with a bunch of unused fields kept around as vestigial place-holders, and a bunch of new fields tacked on to the end.

Because ",,,,Hi" is so much worse than "{"field":"Hi"}" which is assuming that I would even have those fields stick around. If it's unused then it can be safely deleted in most cases.
Or what if two different people, working on diverging implementations of the same program, both add a new field to that stream format? Naturally, they'll both add the new field on the end, and in both cases it'll be the Nth field. This is the sort of thing you might get from, say, different implementations of "ls -l" or "ps". Scripts working on the output of those utilities won't be portable because that Nth field has different meaning depending on which version of that utility is installed.

Not my problem. They should have thought it through better. Which is, by the way, what I want to encourage. I don't want people treating their data structures lightly. I don't want people doing something simply because they could.
By contrast: if the fields are marked with some kind of tag, something non-ordinal with plenty of available space for meaningfully defining new tags, then there's at least a pretty good chance the two implementations won't clash. You could still get scenarios like both implementations defining a field called "extended-permissions" or something similarly generic, but there's at least a decent chance that they won't, and (if they do) it's relatively easy to correct the issue by getting the implementers to coordinate a bit and avoid reusing each others' tags (or respect a tag naming scheme that would keep their fields from clashing - like a domain name-based scheme, for instance) unless they're truly compatible.

Oh, hey. Coordination. Developers can do that? 8-)
It's really not that hard to get together and agree to a single order. No more so than it is to get people to not use the same fields.
The questions can go deeper, if you are willing to go that far: for instance, the PNG format makes an effort to provide answers to questions like, "If I don't recognize this field, can I just ignore it?" (Personally I'm not sure if I will ever go that far - though it's a nice feature to have if you want to make scripts really reliable. You can then do things like say, "I don't recognize this field, but it's marked as ancillary so it's OK." instead of "I don't recognize this field called 'comment' - PANIC!" Though I think the question of whether a piece of data is "ancillary" may be too complicated to answer with a simple Boolean value.) Another useful feature might be to name fields in such a way that, even if you don't recognize the field, you can know what sort of data it's providing: for instance, a prefix like "perm" might mean "this field tells you who can read/write/execute this file" - while the field itself defines some permission scheme beyond the scope of classic Unix - like access control lists or whatever. Which, incidentally, would be another case where nested structures would be useful. :)

Lists are best represented as lists... not nested structures :P
At no point in time is a script creator free from having to think about what's going on. The creator will know what fields need to exist and what can be discarded. Requiring fields to have names may make this simpler - I can agree to that, but I'm not sure whether it's worthwhile... I just think it's in the same class of improvements as how it would be easier to just use '/' for all forms of division in Haskell, but it's really not an issue that you need to use 'div' for ints.

See, there are two kinds of useful: useful because I like it, and useful because it gets stuff that needs doing done. I care only about the latter. So if I can't see what would require a feature to be doable... I can't in good conscience include that feature (no definite pro but a minor definite con = no such feature.)
(1: "getting it right" is one of the main reasons I want to create a new shell in the first place. There's all these cases where creating a shell script to do a job is just as easy as it should be - as long as you don't hit one of those cases that breaks the simple version of the script. Getting it right for all cases is harder, especially these different utilities, even different versions of utilities, don't agree on even a basic set of rules about how streams are formatted. My aim is to make a shell that helps to correct this situation. Using type information means the shell can tell the user when some things aren't going to work... Reliably delineating the boundaries of value data means the user doesn't have to cope with quoting rules when writing a stream-processing script... And giving the serialization format some flexibility, in the form of named fields and nested structures, gives the users greater expressive power to do the things they want to do, without resorting to arcane measures.)
If I felt expressive power was the problem I would agree with you. But I don't. I feel that inconsistency is the problem. In which case, I don't want to change things for people too much, but just enough to make everything consistent.
(2: The regional issues surrounding the use of the comma as a decimal point in some countries have given me cause to reflect on the syntax I'm designing for my shell: ideally, I would want people in those regions to be able to use the comma in the way they're accustomed to using it. If the comma is a decimal point where they live, then it should be a decimal point in their shell. But as it stands, in my design, I use the comma as a high-precedence command/value separator, and I rely on it heavily. The semicolon performs a similar job but has lower precedence in the syntax. I could potentially implement a mode in which the comma is not a command separator at all, and in which it serves as a decimal point in the numeric syntax... The main problem there is that if I provide regional "modes" for the syntax itself, then that impacts cross-region script compatibility. I have given some thought to the problem of writing reliably portable scripts in general: things like a "portable script" mode, which would include flagging auto-conversions as portability errors - so the portable-script mode could dictate that the comma is a separator, or require that the comma mode be explicitly stated in the script... Things like that. Apart from that, there would just be issues of the syntax being slightly less convenient to use without the comma as a separator (having to put parens around things, and so on). In any case, though, I don't think the serialization format, if it's text-based, should have to support a similar mode switch. The serialization format shouldn't generally be something people work directly with, so it's not subject to UI considerations like L10N.)

I suppose it's good to consider regional variations... but I really don't care. It's not that much trouble to use a decimal point to indicate the decimal place. I've worked in both systems... and it just isn't a big deal to get users to switch. Of course, with my tools... the user could easily switch to using colons as field separators and neatly side-step the issue.

EvanED
Posts: 4331
Joined: Mon Aug 07, 2006 6:28 am UTC
Location: Madison, WI
Contact:

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby EvanED » Thu Aug 30, 2012 3:30 am UTC

tomCar wrote:It certainly is true that CSV lacks a formal specification. However, I don't feel that it matters. I'm merely using "CSV" as a convenient placeholder term. The correct phrasing would be something more like "the format that Go recognizes as CSV, which happens to follow the most common rules." Additionally, I fully intend for the separator character to be replaceable. That is, by default, the format is "CSV" but you can easily throw a "-s=:" flag to convert to using colons (this functionality is inherited from Go's library.)

Don't discount the fact that you need a description like that to describe what your format is. Saying CSV isn't enough to describe the structure. Saying JSON is.

"I have a rigidly defined format" isn't enough -- you also have to be able to tell people what that format is.

It's exactly the same as in JSON/XML. To indicate format you need a field/name convention.

Except that in the case of JSON, JSON is the field/name convention. (Well, I guess you have to say "objects are JSON dictionaries/objects"). Contrast with CSV: you have to say something like "lines with some special marker (or at some special location) give the names of the fields in each record". [But see two responses from now; there's a more fundamental disagreement that obviates what I say here.]

I've already shown how easy it is to add extra information; it does not require a new format.

You've shown how to do it. Tetsujin (I suspect) and I don't think that you've shown how to do it well. Being forced to keep around extra fields that are not applicable is a significant problem; you simply dismissed it out of hand.

Except you still need to do this in JSON. At best JSON provides you with six types (array, object, string, number, boolean, null), and these are hardly exhaustive, so you will need to define conventions for indicating other types even in JSON. Metadata doesn't require named fields; named fields can be added in CSV without issue (however, doing so would make the choice of CSV over JSON basically meaningless, so it's my intention not to do so.)

Ah, the fact that you don't have named fields is very important -- I think this is the fundamental disagreement. To my thinking, named fields are basically essential for usability. If an object has 3 numbers in it or something like that, I don't want the user to have to remember "the first is this, the second is that, the third is this other thing." The user should be able to filter rows using properties over columns given by name, or request that a table be printed with column headings. Take the output of ps aux, for instance -- I want something like that, but with the headings still available even if you pass the output through the equivalent of grep first (which currently will drop the otherwise extremely useful column headings.)

If you don't care about that sort of thing, I can see how CSV makes sense -- but that's a pretty strong disagreement there.

However, I will make note that there's no reason a filter program couldn't know how to remove residual records.

What if a column is not applicable to some objects but is applicable to others? (For instance, a setuid? column for directories and files.)

As I've said, demonstrate that nesting will be useful.

Code: Select all

$ ls
{ "name" : "bar.txt",
  "mtime" : { "year": 2009, "month": 8, "day": 29, "hour": 22, "minute": 21} }
{ "name" : "foo.txt",
  "mtime" : { "year": 2012, "month": 8, "day": 29, "hour": 22, "minute": 21} }

$ ls | select where .mtime.year ">=" 2010
{ "name" : "foo.txt",
  "mtime" : { "year": 2012, "month": 8, "day": 29, "hour": 22, "minute": 21} }

instead of

Code: Select all

$ ls
{ "name" : "bar.txt",
  "mtime" : 1251602460 }
{ "name" : "foo.txt",
  "mtime" : 1346296860 }

$ ls | select where .mtime ">=" 1262325600
{ "name" : "foo.txt",
  "mtime" : 1346296860 }


Sure, you could add some special date handling to the shell utilities. And then someone will come up with another type that's nested, and you'll have to add that. Then I'll give you another. Wouldn't it be better to make it Just Work from the start?
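
And the machinery behind a select like that doesn't need to be much. Here's a rough Python sketch of a hypothetical select-where filter, assuming one JSON object per input line rather than the pretty-printed output above (the script name and invocation are made up):

Code: Select all

#!/usr/bin/env python3
# Hypothetical usage: ls | select_where.py .mtime.year '>=' 2010
import json, operator, sys

OPS = {">=": operator.ge, "<=": operator.le, ">": operator.gt,
       "<": operator.lt, "==": operator.eq}

path = sys.argv[1].lstrip(".").split(".")   # e.g. ".mtime.year" -> ["mtime", "year"]
test = OPS[sys.argv[2]]
value = json.loads(sys.argv[3])

for line in sys.stdin:
    record = json.loads(line)
    field = record
    for key in path:                        # walk the dotted path into the nesting
        field = field[key]
    if test(field, value):
        print(line, end="")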

Or as another example, using a format that supports nesting means that the shell utilities can operate on formats with nesting (admittedly with conversion). For instance, I could create a shell pipeline that works on XML configuration files. Using a format like CSV will prevent you from doing that, unless you encode the nesting in some obnoxious way.
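
Something like this Python sketch for the conversion step (the sample config and the naive handling of attributes/repeated tags are just for illustration):

Code: Select all

import json
import xml.etree.ElementTree as ET

def to_dict(elem):
    # Leaf elements become their text; anything else becomes a nested object.
    # (Repeated child tags would collide in this naive version.)
    if len(elem) == 0:
        return elem.text
    return {child.tag: to_dict(child) for child in elem}

xml_config = "<server><name>demo</name><listen><port>8080</port></listen></server>"
root = ET.fromstring(xml_config)
print(json.dumps({root.tag: to_dict(root)}, indent=2))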

There may be very good reasons why it wasn't included in the original tools.

You can say that about anything. "There may be very good reasons why CSV wasn't used in the original tools."

Edit: I guess you could say

Code: Select all

$ ls
{ "name" : "bar.txt",
  "mtime-year": 2009, "mtime-month": 8, "mtime-day": 29, "mtime-hour": 22, "mtime-minute": 21 }
{ "name" : "foo.txt",
  "mtime-year": 2012, "mtime-month": 8, "mtime-day": 29, "mtime-hour": 22, "mtime-minute": 21}

$ ls | select where .mtime-year ">=" 2010
{ "name" : "foo.txt",
  "mtime-year": 2012, "mtime-month": 8, "mtime-day": 29, "mtime-hour": 22, "mtime-minute": 21}


But in addition to being (IMO) really ugly, it runs into problems if you want to hold, say, a list of files.

tomCar
Posts: 2
Joined: Tue Aug 14, 2012 6:14 pm UTC

Re: My Unix CLI manifesto, aka why PowerShell is the bees kn

Postby tomCar » Thu Aug 30, 2012 6:49 am UTC

EvanED wrote:
It's exactly the same as in JSON/XML. To indicate format you need a field/name convention.

Except that in the case of JSON, JSON is the field/name convention. (Well, I guess you have to say "objects are JSON dictionaries/objects"). Contrast with CSV: you have to say something like "lines with some special marker (or at some special location) give the names of the fields in each record". [But see two responses from now; there's a more fundamental disagreement that obviates what I say here.]

I should probably clarify that I was talking about metadata. You need a convention to denote which field contains type information (or whatever other metadata you need.) This requires a means of specifying that field: in CSV it's a number and in JSON/XML it's a name, but any program that works with the data will need to be told what that is (therefore, a convention.)

I've already shown how easy it is to add extra information; it does not require a new format.

You've shown how to do it. Tetsujin (I suspect) and I don't think that you've shown how to do it well. Being forced to keep around extra fields that are not applicable is a significant problem; you simply dismissed it out of hand.

So you find that the method you're using doesn't work well? That's a little confusing, 'cause, if you didn't notice, my example was patterned after how metadata is introduced in JSON/XML. There is, of course, your idea of using type headers. I'm not certain how well it would work (since I'd need to add some way of detecting them), but it could work.
Except you still need to do this in JSON. At best JSON provides you with six types (array, object, string, number, boolean, null), and these are hardly exhaustive, so you will need to define conventions for indicating other types even in JSON. Metadata doesn't require named fields; named fields can be added in CSV without issue (however, doing so would make the choice of CSV over JSON basically meaningless, so it's my intention not to do so.)

Ah, the fact that you don't have named fields is very important -- I think this is the fundamental disagreement. To my thinking, named fields are basically essential for usability. If an object has 3 numbers in it or something like that, I don't want the user to have to remember "the first is this, the second is that, the third is this other thing." The user should be able to filter rows using properties over columns given by name, or request that a table be printed with column headings. Take the output of ps aux, for instance -- I want something like that, but with the headings still available even if you pass the output through the equivalent of grep first (which currently will drop the otherwise extremely useful column headings.)

Whereas I don't really think it's difficult to keep in mind the record structure that you're using. Knowing the names of the fields means that you already know what structure you're using. Records may not be as self-descriptive as an object, but they are fairly intuitive if you've read the manual (see the output of ls -l.) For someone learning the shell as a "first language" I could see named fields as being hugely beneficial... but I'm not inclined to tailor my format to a single group. If you find it hard to remember what you're working with, you still have options in CSV. You could write a comment or put an echo statement describing each field at the top of your script and refer to that when you get lost. However, for people who don't need that extra information, CSV represents a more compact choice (plus I think it's easier to read in general.)
However, I will make note that there's no reason a filter program couldn't know how to remove residual records.

What if a column is not applicable to some objects but is applicable to others? (For instance, a setuid? column for directories and files.)

Just leave it blank then. It's not expensive at all to have blank fields in CSV.

Code: Select all

info,,more info

If having extra commas isn't your thing... it should be possible to identify rows with fewer fields and treat them differently (of course, CSV doesn't handle multiple optional fields as well as JSON would, but I don't know if that's a common enough scenario that I need to worry about it.)
As I've said, demonstrate that nesting will be useful.

Code: Select all

$ ls
{ "name" : "bar.txt",
  "mtime" : { "year": 2009, "month": 8, "day": 29, "hour": 22, "minute": 21} }
{ "name" : "foo.txt",
  "mtime" : { "year": 2012, "month": 8, "day": 29, "hour": 22, "minute": 21} }

$ ls | select where .mtime.year ">=" 2010
{ "name" : "foo.txt",
  "mtime" : { "year": 2012, "month": 8, "day": 29, "hour": 22, "minute": 21} }

instead of

Code: Select all

$ ls
{ "name" : "bar.txt",
  "mtime" : 1251602460 }
{ "name" : "foo.txt",
  "mtime" : 1346296860 }

$ ls | select where .mtime ">=" 1262325600
{ "name" : "foo.txt",
  "mtime" : 1346296860 }


Personally, I see nothing wrong with the second form. However, your later flat examples look fine in CSV.

Code: Select all

$ ls
bar.txt,2009,8,29,22,21
foo.txt,2012,8,29,22,21
$ ls | select where .2 ">=" 2010
foo.txt,2012,8,29,22,21

Of course, this ignores the fact that ls wouldn't actually output just names and dates... but whatever, it works for discussion. I didn't make it clear, but I intend to include pattern matching (with several predefined patterns, plus wildcard notation.) So I'll probably have YYYY/MM/DD.hh:mm:ss as my default date/time format.
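
(Part of why a flat, zero-padded format like that is enough: with fixed-width fields, string order and chronological order agree, so a plain comparison works. A quick Python check with made-up values:)

Code: Select all

dates = ["2009/08/29.22:21:00", "2012/08/29.22:21:00", "2010/01/01.00:00:00"]
print([d for d in dates if d >= "2010/01/01.00:00:00"])
# -> ['2012/08/29.22:21:00', '2010/01/01.00:00:00']
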
Sure, you could add some special date handling to the shell utilities. And then someone will come up with another type that's nested, and you'll have to add that. Then I'll give you another. Wouldn't it be better to make it Just Work from the start?

Dates don't really need to be nested; flat structures (even specially formatted) work fine. You may not prefer it, but I'm more than content to play the "come up with a nested format and I'll tell you how to do it with a flat structure" game. Besides, it's not my responsibility to deal with 3rd-party data structures. I'm only interested in the core utilities and whether or not nested structures would make using them easier (and dates are a poor example, considering the nested structure looks horrible.)
Or as another example, using a format that supports nesting means that the shell utilities can operate on formats with nesting (admittedly with conversion). For instance, I could create a shell pipeline that works on XML configuration files. Using a format like CSV will prevent you from doing that, unless you encode the nesting in some obnoxious way.

And how would you deal with INI or plain-text config files? It's nice to get things "for free," but there are lots of other config file formats that neither of us will be able to easily operate on.
There may be very good reasons why it wasn't included in the original tools.

You can say that about anything. "There may be very good reasons why CSV wasn't used in the original tools."

Except, CSV was used. Just not with commas as the separator, nor was it used consistently.

