DGSH - directed graph shell. adding parallelism to shell & pipes


Samba - linux mailing list
This is an interesting take on a 25+ year-old idea of ‘Multipipes’ in the Unix shell. It goes well beyond the ‘parallel’ command or managing a bunch of named pipes.
It is based on ‘bash’, with another 12 or so commands modified to read from & write to multiple pipes.

One that appeals to me is ‘grep’. It takes 0-2 input streams and writes to 0-4 streams.
        Available output streams (via arguments): matching files, non-matching files, matching lines, and non-matching lines
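
As a rough illustration of the "read once, write to several streams" idea, here is a minimal Python sketch. It is not dgsh's grep and does not use dgsh's interface; the function name `split_matches` and the sample data are made up for the example. It routes each input line to a "matching" or "non-matching" stream in a single pass.

```python
import io
import re


def split_matches(lines, pattern, matching, non_matching):
    """Read `lines` once, writing each line to exactly one of two outputs."""
    regex = re.compile(pattern)
    for line in lines:
        target = matching if regex.search(line) else non_matching
        target.write(line)


# Hypothetical sample data, held in memory for the sake of the example.
source = io.StringIO("alpha\nbeta\ngamma\n")
hits, misses = io.StringIO(), io.StringIO()
split_matches(source, r"et", hits, misses)
print("matching:", hits.getvalue().split())
print("non-matching:", misses.getvalue().split())
```

With plain grep, getting both sets would normally mean two passes over the input (one with `-v`), or saving an intermediate file.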

The paper uses the same examples & diagrams as the website, but has much more discussion, a good history of the topic and 46 references.

The design & examples are about a very Unix-y thing: streaming data and processing it just once, not having to save intermediate files and reprocess them multiple times.
In a world of many cores and ‘Big Data’, being able to ‘naturally’ process data streams in parallel is an important new facility.
It’s even useful at the other end of the spectrum, where I/O bandwidth & storage space are limited: on low-power, low-performance “IoT” devices like Single Board Computers and low-end smartphones.
Will we see a version built for ‘busybox’? It’s possible because of the design’s “coupling and cohesion” choices.

They’ve thought about the design and implementation - restricting it to a small syntax change to the (bash) shell.
Not sure how well tested & debugged it is, but because of the design you’d think there wouldn’t be many bugs.

regards
steve

———————————

dgsh — directed graph shell
<https://www.spinellis.gr/sw/dgsh/#intro>
> The directed graph shell, dgsh (pronounced /dæɡʃ/ — dagsh), provides an expressive way to construct sophisticated and efficient big data set and stream processing pipelines using existing Unix tools as well as custom-built components.
> It is a Unix-style shell (based on bash) allowing the specification of pipelines with non-linear non-uniform operations.
> These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the operation's processing throughput.
>
> If you want to get a feeling on how dgsh works in practice, skip right down to the examples section.
>
> For a more formal introduction to dgsh or to cite it in your work, see:
> Diomidis Spinellis and Marios Fragkoulis. Extending Unix Pipelines to DAGs. IEEE Transactions on Computers, 2017. doi: 10.1109/TC.2017.2695447


Nuclear magnetic resonance processing - 12-stage pipeline run in parallel
<https://www.spinellis.gr/sw/dgsh/#NMRPipe>


Extending Unix Pipelines to DAGs
        Diomidis Spinellis, Senior Member, IEEE
        Marios Fragkoulis
<http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=7903579>
>
> Abstract—The Unix shell dgsh provides an expressive way to construct sophisticated and efficient non-linear pipelines using standard Unix tools, as well as third-party and custom-built components.
> Dgsh allows the specification of pipelines that perform non-uniform non-linear processing.
> These form a directed acyclic process graph, which is typically executed by multiple processor cores, thus increasing the processing task’s throughput.
> A number of existing Unix tools have been adapted to take advantage of the shell’s multiple pipe input/output capabilities.
> The shell supports visualization of the process graphs, which can also aid debugging.
> Dgsh was evaluated through a number of common data processing and domain-specific examples, and was found to offer an expressive way to specify processing topologies, while also generally increasing processing throughput.
>
> Index Terms—Process-level parallelism, Unix, pipeline, pipes and filters architecture

--
Steve Jenkin, IT Systems and Design
0412 786 915 (+61 412 786 915)
PO Box 38, Kippax ACT 2615, AUSTRALIA

mailto:[hidden email] http://members.tip.net.au/~sjenkin


--
linux mailing list
[hidden email]
https://lists.samba.org/mailman/listinfo/linux
Re: DGSH - directed graph shell. adding parallelism to shell & pipes

Steve,

Thanks for posting this.
I have been contemplating adding something similar to VICI.
I will have to read up on this to see how they manage the interaction
between the data flow and the flow of control.

Cheers
Brenton

On Wed, 2017-07-12 at 09:35 +1000, steve jenkin via linux wrote:

> [...]



Re: DGSH - directed graph shell. adding parallelism to shell & pipes

I've had a preliminary look at dgsh, and I'm not overly taken with the
approach they took.
They have replaced the normal Unix pipe interface for stdin and stdout
with sockets, which means that the core utilities (and anything else you
want to use via pipes) have to be the modified versions for dgsh. This
will mean having two versions of these programs, which is a bit
problematic. There is also the question of maintenance - over time the
two versions could drift apart as bug fixes and enhancements are applied.
If dgsh eventually becomes a normal part of a Unix/Linux distribution
then we could end up with two groups of incompatible programs, requiring
wrappers and other kludges to do something that has been easy since
about 1972.
I also note that their design only applies to stdin and stdout. The
stderr stream remains in its current form.

However, it got me wondering if there was another way, one that did not
require modifying the programs.
I think I could add a couple of extensions to VICI that would cover a
lot of dgsh's capabilities, and have some further advantages.

The first change would be to introduce named streams - the data flows
could be given a label. If a program connected to a named stream used
the name as a filename parameter, then VICI would substitute the label
with the path to a Unix named pipe. This would allow programs to connect
to multiple pipes. Of course it would not help for the cases where dgsh
has modified the actual interface to the program, such as grep having
multiple inputs and outputs, but you could create a modified grep with
that capability that would still be compatible with bash etc.
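
The named-stream substitution described above could look roughly like the following Python sketch. This is not VICI code: the function `run_with_named_streams`, the labels "left" and "right", and the use of `cat` as the unmodified program are all assumptions made for illustration. Each label found in the argument list is replaced by the path of a freshly created Unix named pipe, and a thread feeds data into that pipe.

```python
import os
import subprocess
import tempfile
import threading


def run_with_named_streams(argv, streams):
    """Run argv, replacing each stream label with the path of a named pipe.

    `streams` maps a (hypothetical) label to the bytes sent down that pipe.
    Sketch only: a program that never opens one of its pipes would leave
    the corresponding feeder thread blocked.
    """
    with tempfile.TemporaryDirectory() as tmp:
        paths = {label: os.path.join(tmp, label) for label in streams}
        for path in paths.values():
            os.mkfifo(path)

        def feed(path, data):
            # Opening a FIFO for writing blocks until a reader opens it.
            with open(path, "wb") as fifo:
                fifo.write(data)

        feeders = [
            threading.Thread(target=feed, args=(paths[label], data))
            for label, data in streams.items()
        ]
        for thread in feeders:
            thread.start()

        # Substitute labels with pipe paths; other arguments pass through.
        real_argv = [paths.get(arg, arg) for arg in argv]
        result = subprocess.run(real_argv, stdout=subprocess.PIPE, check=True)

        for thread in feeders:
            thread.join()
        return result.stdout


output = run_with_named_streams(
    ["cat", "left", "right"],
    {"left": b"1\n2\n", "right": b"a\nb\n"},
)
print(output.decode(), end="")
```

The program being run (`cat` here) is unmodified; it simply sees two filenames that happen to be pipes.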

The second change is to introduce what I call a "manifold". This object
can have any number of stdin and stdout streams. It would have several
modes of operation:

     1. Sequential, where it reads from its first stream until it's
        exhausted (closed), then reads from the second until that is
        finished, etc.
     2. Merge, where any input is sent immediately to the output (line
        by line)
     3. Parallel, where reading blocks until something is ready on all
        the input streams. This would help to synchronise processing.
     4. Copy, where each input is sent to all the output streams
     5. Distribute, where the input lines are sent to the output streams
        in round-robin fashion.

The manifold would start a new thread for each of its output streams to
achieve the multiprocessing capability of dgsh.
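
To make the modes concrete, here is a small Python sketch of four of them using plain iterators and lists in place of real pipes. It is only a model of the behaviour listed above, not a proposed VICI implementation; all function names are made up. Merge is left out because it needs concurrent readers to forward lines "immediately", and no threads are started here.

```python
import itertools


def sequential(*inputs):
    """Mode 1: drain the first input, then the second, and so on."""
    return itertools.chain(*inputs)


def parallel(*inputs):
    """Mode 3: emit one tuple only once every input has a line ready.

    Assumption: stops as soon as the shortest input is exhausted.
    """
    return zip(*inputs)


def copy(inputs, outputs):
    """Mode 4: send every input line to all of the outputs."""
    for line in itertools.chain(*inputs):
        for out in outputs:
            out.append(line)


def distribute(inputs, outputs):
    """Mode 5: deal input lines to the outputs in round-robin order."""
    targets = itertools.cycle(outputs)
    for line in itertools.chain(*inputs):
        next(targets).append(line)


a, b = ["a1", "a2"], ["b1", "b2"]
seq_result = list(sequential(a, b))
par_result = list(parallel(a, b))
copy_outs = [[], []]
copy([a, b], copy_outs)
dist_outs = [[], []]
distribute([a, b], dist_outs)
print(seq_result, par_result, copy_outs, dist_outs, sep="\n")
```

In a real manifold each output list would be a pipe with its own writer, which is where the thread per output stream would come in.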

Hence, I think it would have been possible to create dgsh without having
to fork the core utility programs to create a new set of incompatible
programs.

Brenton


On Wed, 2017-07-12 at 21:44 +1000, Brenton Ross via linux wrote:

> [...]



Re: DGSH - directed graph shell. adding parallelism to shell & pipes

On Fri, Jul 14, 2017 at 01:58:33PM +1000, Brenton Ross via linux wrote:
  | I've had a preliminary look at dgsh, and I'm not overly taken with the
  | approach they took.
  | They have replaced the normal Unix pipe interface for stdin and stdout
  | with sockets, which means that the core utilities (and anything else you
  | want to use via pipes) has to be the modified version for dgsh.

Does the "dgsh-wrap" tool they provide assist with interfacing with
existing stdin/stdout tools?
        https://www.spinellis.gr/sw/dgsh/dgsh-wrap.html

Could you just (ab)use socat to interface between stdin/stdout and
the dgsh sockets? I've used that technique elsewhere; socat is awesome,
(if complex to use):
        http://www.dest-unreach.org/socat/


  | However, it got me wondering if there was another way, one that did not
  | require modifying the programs.
  |
  | I think I could add a couple of extensions to VICI that would cover a
  | lot of dgsh's capabilities, and have some further advantages.
  |
  | The first change would be to introduce named streams - the data flows
  | could be given a label. If a program connected to a named stream used
  | the name as a filename parameter, then VICI would substitute the label
  | with the path to a Unix named pipe. This would allow programs to connect
  | to multiple pipes. Of course it would not help for the cases where dgsh
  | has modified the actual interface to the program, such as grep having
  | multiple inputs and outputs, but you could create a modified grep with
  | that capability that would still be compatible with bash etc.

If your platform provides /dev/fd/* (which Linux does), creative
use of shell redirection to fds in the invocation of the command,
and providing /dev/fd/.. as filenames may just work.
(This can fail when tools assume that a file is seekable.)
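
A minimal sketch of that /dev/fd trick, driven from Python rather than from shell redirections (the helper name `run_with_fd_inputs` and the choice of `cat` are assumptions for the example). Each payload is written into an anonymous pipe, and the read end is handed to the child as a /dev/fd/N filename. It assumes a platform that provides /dev/fd, and payloads small enough to fit in the pipe buffer.

```python
import os
import subprocess


def run_with_fd_inputs(argv_prefix, payloads):
    """Run a command with one /dev/fd/N filename argument per payload."""
    read_fds = []
    for data in payloads:
        read_end, write_end = os.pipe()
        # Sketch only: a large payload would block here, since nothing
        # is reading the pipe yet.
        os.write(write_end, data)
        os.close(write_end)
        read_fds.append(read_end)

    argv = list(argv_prefix) + [f"/dev/fd/{fd}" for fd in read_fds]
    try:
        # pass_fds keeps these descriptors open, with the same numbers,
        # in the child process.
        result = subprocess.run(
            argv, pass_fds=read_fds, stdout=subprocess.PIPE, check=True
        )
    finally:
        for fd in read_fds:
            os.close(fd)
    return result.stdout


out = run_with_fd_inputs(["cat"], [b"first\n", b"second\n"])
print(out.decode(), end="")
```

As noted above, a program that tries to seek on one of these "files" will fail, because pipes are not seekable.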


  | The second change is to introduce what I call a "manifold". This object
  | can have any number of stdin and stdout streams. It would have several
  | modes of operation:
  |
  |      1. Sequential, where it reads from its first stream until its
  |         exhausted (closed), then reads from the second until that is
  |         finished, etc
  |      2. Merge, where any input is sent immediately to the output (line
  |         by line)
  |      3. Parallel, where reading blocks until something is ready on all
  |         the input streams. This would help to synchronise processing.
  |      4. Copy, where each input is sent to all the output streams
  |      5. Distribute, where the input lines are sent to the output streams
  |         in round-robin fashion.
  |
  | The manifold would start a new thread for each of its output streams to
  | achieve the multiprocessing capability of dgsh.
  |
  | Hence, I think it would have been possible to create dgsh without having
  | to fork the core utility programs to create an new set of incompatible
  | programs.

That manifold idea is interesting.


As an implementation detail, personally I would probably experiment /
prototype that tool in python using an async I/O mechanism and some
generator trickery, rather than using a thread per stream.

(Or just write it in C++ and play with boost::asio; only using
threads as a thread pool behind the boost::asio io_service runner.
I digress :)

That's just a personal choice - YMMV.
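
For what it's worth, the async approach to the "merge" mode might be prototyped along these lines. This is a sketch, not a finished tool: `merge`, `pump` and `ticker` are made-up names, and the `ticker` generators merely stand in for input streams. A real version would presumably read from pipes via asyncio's stream support, and would need error handling for a source that fails mid-stream.

```python
import asyncio


async def merge(*sources):
    """Yield items from several async iterators as soon as any is ready."""
    queue = asyncio.Queue()
    done = object()  # sentinel marking the end of one source

    async def pump(source):
        async for item in source:
            await queue.put(item)
        await queue.put(done)

    # One lightweight task per input, all on a single thread.
    tasks = [asyncio.create_task(pump(source)) for source in sources]
    remaining = len(tasks)
    while remaining:
        item = await queue.get()
        if item is done:
            remaining -= 1
        else:
            yield item


async def ticker(name, count, delay):
    """Stand-in for an input stream that produces lines at its own pace."""
    for i in range(count):
        await asyncio.sleep(delay)
        yield f"{name}{i}"


async def main():
    streams = (ticker("a", 3, 0.01), ticker("b", 2, 0.015))
    return [line async for line in merge(*streams)]


merged = asyncio.run(main())
print(merged)
```

The interleaving between the two inputs depends on timing, but the lines from each individual input stay in order.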



cheers,
Luke.

Re: DGSH - directed graph shell. adding parallelism to shell & pipes

Luke,

Thanks for your comments.

It would appear that "dgsh-wrap" is for the purpose of interfacing
normal stdin/stdout programs to dgsh, as you suggested.

Good point about not working if the program does a seek. I will need to
add something to the user guide to warn users if this ever gets
implemented. It would be hard for users to know if a program was going
to seek on one of its files.

The manifold should be relatively straightforward to implement with the
existing internals of VICI - it's mostly just more threads and pipes,
which make up the bulk of the runtime anyway. I suspect creating an icon
for it will be the most time-consuming part.

Brenton

On Tue, 2017-07-25 at 13:41 +1000, Luke Mewburn wrote:

> [...]

