Argument filters and Streams
============================

This small example demonstrates three features of JIP.

* Any interpreted language can be used to write executable blocks
* The ``arg`` and ``else`` :ref:`template filters` can be handy when it
  comes to the tedious task of argument parsing and interpretation
* JIP's :ref:`stream dispatcher` allows you to write streams to files and
  other processes in parallel.

Other interpreters
------------------
The default interpreter used in JIP templates and scripts is ``bash``, but
you can change the interpreter easily. For this, you simply specify the name
or the path to the interpreter as a command block argument. For example,
``#%begin command perl`` will open a *perl* interpreted command block.

Here is a full example where we use ``OCaml`` to drive our tool:

.. code-block:: ocaml
   :emphasize-lines: 12,12

   #!/usr/bin/env jip
   # Send a greeting from ocaml
   #
   # usage:
   #     hello -n <name> [-o <output>]
   #
   # Options:
   #   -n, --name <name>      Your name
   #   -o, --output <output>  The output
   #                          [default: stdout]

   #%begin command ocaml
   let o = ${output|arg('open_out "', '"')|else("stdout")};;
   Printf.fprintf o "Dear ${name}\n\n";;
   Printf.fprintf o "From the happy chambers of my spirit,\n";;
   Printf.fprintf o "I send forth to you, my smile.\n";;
   Printf.fprintf o "And, I pray that you shall carry it with you,\n";;
   Printf.fprintf o "as you travel each and every mile.\n";;

In order to switch to OCaml, simply start the command block with
``#%begin command ocaml``.

Argument filtering
------------------
In the example above, the tool exposes the ``output`` option to the user and
defaults to ``stdout``. That means we have to either select ``stdout`` or
open the user-specified file. Here we use the ``arg`` and ``else``
:ref:`template filters` to solve the problem. The line that does the work
is::

    let o = ${output|arg('open_out "', '"')|else("stdout")};;

What happens here? First of all, everything outside of ``${}`` is pure
OCaml. Inside the braces, the following logic is applied: take the value of
the ``output`` option. If a value was assigned that is not a file stream and
does not evaluate to false, pass it to ``arg`` (see
:py:func:`jip.templates.arg_filter`) and insert the result. Otherwise, pass
it through ``else``. The argument of ``else`` is inserted if the passed
value is a file stream or evaluates to false.

In case the user specified an output file, we have to surround it with
quotes and pass it to OCaml's ``open_out`` function. Here we solve this by
specifying the ``arg`` filter's ``prefix`` and ``suffix`` arguments. This
results in the evaluated output shown below, depending on whether an output
file was given.

If the user did not specify an output file, ``arg`` takes the value and
passes it on unmodified. In this case, the ``else`` filter is the one that
generates the final result::

    let o = stdout;;

If an output file was specified, say ``myfile.txt``, the ``arg`` filter will
surround it with the specified prefix and suffix and create the final result
as::

    let o = open_out "myfile.txt"

Dispatching streams
-------------------
Let's take the ``hello`` tool from the previous example and create a small,
admittedly not very useful, counting pipeline::

    #!/usr/bin/env jip -p
    # Count words and lines
    #
    # usage:
    #     counter <name>

    greetings = run('hello', name=args['name'])
    line_count = bash('wc -l', input=greetings)
    full_count = bash('wc', input=greetings)

.. note:: In this example, we do not open a block with ``#%begin pipeline``
   to start implementing our pipeline. Instead, we pass the ``-p`` parameter
   to the *jip* interpreter in the shebang line. This switches to pipeline
   mode automatically.
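If you prefer the explicit block style mentioned in the note, the same
pipeline could be written roughly like this. This is only a sketch: it
assumes the pipeline block is closed with ``#%end``, mirroring how command
blocks are opened in the OCaml example above.

.. code-block:: python

   #!/usr/bin/env jip
   # Count words and lines
   #
   # usage:
   #     counter <name>

   #%begin pipeline
   greetings = run('hello', name=args['name'])
   line_count = bash('wc -l', input=greetings)
   full_count = bash('wc', input=greetings)
   #%end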
The pipeline in this example is rather straightforward. We take a single
``name`` argument. Then we call ``hello``, our OCaml tool from the previous
example. Next, the output of ``hello``, called ``greetings``, is passed as
input to two ``bash`` tools, one that does a full count and one that only
counts lines.

Let's perform a dry run to see what happens. Don't worry, we will not start
the pipeline, so there is no need to have OCaml installed::

    $> ./counter.jip Joe -- --dry
    ...
    #################################################################...
    |                                                                ...
    +--------------------------------+--------+---------------------...
    | Name                           | State  |                     ...
    +================================+========+=====================...
    | greetings|line_count|full_coun | Hold   |                     ...
    | t                              |        |                     ...
    +--------------------------------+--------+---------------------...
    ####################
    |  Job hierarchy   |
    ####################
    greetings
    ├─line_count
    └─full_count
    ####################

The dry run screen for this pipeline shows two things. First, the
dependencies are correct, as you can see from the *Job hierarchy*:
*greetings* is the primary job and is executed first, and the other two jobs
depend on it. Second, the *Job states* table shows only a single job called
``greetings|line_count|full_count``. This indicates that all three jobs have
*streams* connecting them and they all have to be executed in parallel. In
case you want to submit the pipeline to a compute cluster, this means all
three jobs have to be executed on the same node in a single *cluster job*.
The reason for this behavior is obvious: our ``hello`` tool prints its
output to ``stdout`` and the two counter tools read from ``stdin``.

Let's modify the pipeline a little bit and write the output of our call to
``hello`` into a result file. The rest of the pipeline stays untouched:

.. code-block:: python
   :emphasize-lines: 1,1

   greetings = run('hello', name=args['name'], output='result.txt')
   line_count = bash('wc -l', input=greetings)
   full_count = bash('wc', input=greetings)

Watch what happens when we perform a dry run::

    $> ./counter.jip Joe -- --dry
    ...
    #################################################################...
    |                                                                ...
    +--------------------------------+--------+---------------------...
    | Name                           | State  |                     ...
    +================================+========+=====================...
    | greetings                      | Hold   |                     ...
    | line_count                     | Hold   | result.txt          ...
    | full_count                     | Hold   | result.txt          ...
    +--------------------------------+--------+---------------------...
    ####################
    |  Job hierarchy   |
    ####################
    greetings
    ├─line_count
    └─full_count
    ####################

The job hierarchy stays untouched, as we did not modify any of the
dependencies, but instead of a single job, all three tools are now executed
in dedicated jobs. The table already shows the reason: the two counter jobs
now operate on the output file of the ``greetings`` job. ``greetings`` has
to finish first, but the two counters can then be executed in two separate
jobs.

Whether this is beneficial or not depends on the tasks and on your compute
infrastructure. For example, if one of the secondary jobs is able to work
multi-threaded while the other one uses only a single CPU, it might be nice
to split them into two dedicated jobs on your cluster. On the other hand, if
your data stream is quite large and the task is not very computationally
intense, it might be better to stream the data through all the jobs and
submit only a single job to your cluster.
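If you do go with dedicated jobs, you can also capture the counters' results
on disk rather than on their standard output. The following is only a
sketch: it assumes the ``bash`` tool accepts an ``output`` option in the
same way ``hello`` does, and the file names ``line_count.txt`` and
``full_count.txt`` are purely illustrative.

.. code-block:: python

   greetings = run('hello', name=args['name'], output='result.txt')
   # assumes bash() accepts an output option; file names are illustrative
   line_count = bash('wc -l', input=greetings, output='line_count.txt')
   full_count = bash('wc', input=greetings, output='full_count.txt')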
Now, what if you want the result of the initial job to be stored in the
``result.txt`` output file, but you would still like to stream the data
through to your other two jobs? In pure bash, you might solve this with
bash's ``tee`` command. In JIP, you can achieve the same goal by adding a
single line to your pipeline that ensures the stream pipe behaviour:

.. code-block:: python
   :emphasize-lines: 5,5

   greetings = run('hello', name=args['name'], output='result.txt')
   line_count = bash('wc -l', input=greetings)
   full_count = bash('wc', input=greetings)

   greetings | (line_count + full_count)

The last line of the pipeline recreates the pipes, but **keeps the result
file** as part of the stream. The output of ``greetings`` will be
**streamed** to ``result.txt`` and to the two counter jobs.
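Putting the pieces together, the complete ``counter`` script from this
section then looks roughly like this (the shebang and usage header are taken
from the first version of the pipeline):

.. code-block:: python

   #!/usr/bin/env jip -p
   # Count words and lines
   #
   # usage:
   #     counter <name>

   greetings = run('hello', name=args['name'], output='result.txt')
   line_count = bash('wc -l', input=greetings)
   full_count = bash('wc', input=greetings)

   greetings | (line_count + full_count)

Since the three tools are stream-connected again, they will be grouped into
a single cluster job on submission, just as in the first dry run, while the
``greetings`` output is additionally written to ``result.txt``.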