The JIP API
===========

The JIP platform is mostly written in Python, except for the stream
dispatcher, which is written in C and integrated as a Python extension. This
documentation covers the JIP API and describes the basic modules and classes
that make up the system. In a lot of cases it will not be necessary to read
and understand the full API reference. It might come in handy, though, when
you are extending the system, for example, adding support for your own
cluster, or when you want to dig deeper and see how things are created.

In addition to the full API reference linked below, this chapter contains an
overview and description of how to use the API for a few specific use cases.
This will hopefully cover the basics of using JIP as a library, rather than a
command line utility, in your own tools. We will go over the basic process of
loading and instantiating a pipeline with your own programs, configuring parts
of the system at runtime, and how to run jobs locally or submit them to a
remote cluster.

Creating pipelines
------------------

One of the first things you might want to do is to actually run one of your
tools or create a pipeline. Both are very similar in nature. Running any tool
always starts with adding the tool to a pipeline. The pipeline is then
expanded using its :py:meth:`~jip.pipelines.Pipeline.expand` method and
converted to a set of jobs. These jobs can then be executed in order or
submitted to a compute cluster.

Let's start with creating a pipeline instance. You can import the necessary
classes and functions directly from the ``jip`` module::

    >>> from jip import *

This will load most of the important parts of the API into your namespace.
Starting from here, you can create a pipeline instance and add any tools that
you want to execute. Usually you don't need to create instances of your tools,
but reference them by name. If the tool can not be found, a
:py:exc:`jip.tools.ToolNotFoundException` is raised. In that case, you either
misspelled the tool name or you have to configure the ``jip.scanner`` instance
in order to add custom search paths (see :ref:`api_scanner` on how to
customize the search paths)::

    >>> p = Pipeline()
    >>> p.run("unknown")
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "jip/pipelines.py", line 180, in run
        tool = find(_tool_name)
      File "jip/tools.py", line 410, in find
        raise ToolNotFoundException("No tool named '%s' found!" % name)
    ToolNotFoundException: No tool named 'unknown' found!

In case you want to be sure, catch and handle the ``ToolNotFoundException``,
but typically this is a serious issue and the exception should be raised up.

Now let's go through the process once more, this time adding a ``bash`` run to
the pipeline::

    >>> p = Pipeline()
    >>> p.bash('ls')
    bash
    >>> print len(p)
    1

We now have a pipeline graph with exactly one :class:`~jip.pipelines.Node`.

Running jobs locally
--------------------

Pipeline instances represent the execution graph and its properties, but they
are not meant to be executed directly. We have to convert the pipeline nodes
into :class:`jobs <jip.db.Job>` that can be executed either locally or sent to
a remote cluster. The first step here is to
:py:func:`create <jip.jobs.create_jobs>` the job instances::

    >>> jobs = create_jobs(p)
    >>> assert len(jobs) == 1

In the background, the pipeline was *expanded*, options were rendered, and the
jobs were validated. In case one of the tools in the pipeline was
misconfigured, the validation step would have raised a ``ValidationError``.
Such errors are rather common, especially when you pass along user input, so
you might want to run the ``create_jobs`` call in a ``try/except`` block to
catch any exceptions.
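A minimal sketch of such a guard could look like the following; we assume here
that ``ValidationError`` can be imported from :py:mod:`jip.tools` (adjust the
import if your version exposes it elsewhere):

.. code-block:: python

    from jip import Pipeline, create_jobs
    # Assumption: ValidationError is defined in jip.tools
    from jip.tools import ValidationError

    p = Pipeline()
    p.bash('ls')
    try:
        jobs = create_jobs(p)
    except ValidationError as err:
        # report the misconfigured tool instead of letting the
        # exception propagate up
        print "Pipeline validation failed: %s" % err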
Now that we have a list of jobs to execute, you might think we are ready to
go, but unfortunately that is not yet the case. The call to ``create_jobs``
returns an ordered list of *all* the jobs in the pipeline graph, but we do not
want to start all of them independently. We might also want to perform further
checks on the jobs. Some of them might already be completed and, unless we
want to force execution, we do not have to run them again.

We also want to *group* jobs. The main reason is that JIP allows you to create
data streams between jobs. That means the jobs involved have to run in
parallel and their input and output streams have to be handled appropriately.
These sets of jobs form *groups* and we only have to start the **primary job
of each group**. The other jobs will be started automatically in the right
order and with the right I/O setup.

The JIP API, specifically the :py:mod:`jip.jobs` module, provides a set of low
level functions to perform the grouping and additional checks, but it also
contains a few helper functions that wrap around common use cases. We are
implementing one of these common use cases, running a pipeline, so we are
lucky and can leverage some of the helpers::

    >>> for exe in create_executions(jobs, check_outputs=True):
    ...     print "Running %s:" % exe.name,
    ...     if exe.completed:
    ...         print "Skipped"
    ...     elif run_job(exe.job):
    ...         print "Success"
    ...     else:
    ...         print "Failure"
    ...         break
    ...
    Running bash: Success
    >>>

What happens is that we iterate over all available executions and run all jobs
that are not yet in completed state. In case of a failure, we break the loop
and stop executing.

Save and submit jobs
--------------------

The same rules that apply to running jobs locally also apply when you want to
*submit* jobs to a remote cluster, but we need to do a little bit more work.
In order to get jobs submitted, we have to store them in a JIP database that
is accessible by the cluster before we can use an instance of your compute
cluster to actually submit the job.

The JIP database location and the JIP cluster instance can both be configured
within the JIP configuration. For the command line application that is shipped
with JIP, the configuration is the way to modify both the database and the
cluster. When using the JIP API directly, you can leverage the same
functionality, and we provide examples of how you can get the pre-configured
database and cluster instance :ref:`below <api_jip_configuration>`. For this
example, we will go through the process of manually configuring both the
cluster instance as well as the database (a sketch of such a setup follows the
submission example below).

The initial process of creating the jobs for a given pipeline is the same as
if you want to run the jobs locally::

    >>> jobs = create_jobs(p)
    >>> assert len(jobs) == 1

If you want to use the pre-configured setup, you only have to modify the call
to ``create_executions`` to ensure that jobs are saved in the database, and
use ``submit_job`` rather than ``run_job``:

.. code-block:: python

    for exe in create_executions(jobs, check_outputs=True):
        print "Submitting %s:" % exe.name,
        if exe.completed:
            print "Skipped"
        elif submit_job(exe.job):
            print "Submitted job %s with remote id %s" % (exe.job.id,
                                                          exe.job.cluster_id)
        else:
            print "Failure"
            break
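As mentioned above, you can also set up the database and cluster explicitly
before submitting. The following is a minimal sketch of what such a script
could look like; the database path is a placeholder, we assume that
``jip.cluster.get()`` returns the cluster instance configured in your JIP
configuration, and we assume a ``save`` keyword on ``create_executions`` that
stores the jobs in the database (``jip.db.init`` is described in the next
section):

.. code-block:: python

    import jip
    import jip.db
    import jip.cluster

    # Point JIP at a job database that the cluster nodes can reach
    # (see the section on customizing the database location below).
    jip.db.init("/shared/path/jobs.db")

    # Assumption: jip.cluster.get() loads the cluster instance that is
    # configured in your JIP configuration.
    cluster = jip.cluster.get()

    # Build a small pipeline and convert it to jobs, as shown above.
    p = jip.Pipeline()
    p.bash('ls')
    jobs = jip.create_jobs(p)

    # Assumption: the 'save' keyword stores the jobs in the database
    # before they are submitted.
    for exe in jip.create_executions(jobs, save=True, check_outputs=True):
        if exe.completed:
            print "Skipping %s" % exe.name
        elif jip.submit_job(exe.job):
            print "Submitted %s with remote id %s" % (exe.name,
                                                      exe.job.cluster_id)
        else:
            print "Submission of %s failed" % exe.name
            break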
You might want to add a global profile to your run, for example, to specify
the queue that is used during submission. Use the ``jip.profiles`` module to
load a preconfigured profile:

.. code-block:: python

    import jip.profiles

    profile = jip.profiles.get("default")
    for exe in create_executions(jobs, check_outputs=True, profile=profile):
        ...

Customize the database location
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Without any further specification, the JIP database is configured by the user
in the JIP configuration files. You can, however, use the API to alter the
location. The database reference is stored globally in the ``jip.db`` module.
If you want to use a different location, make sure you initialize the
``jip.db`` module properly:

.. code-block:: python

    import jip.db

    jip.db.init("<path/to/database>")

.. _api_jip_configuration:

Use the JIP configuration
-------------------------

TODO: Add more docs for the api based configuration

.. _api_scanner:

The Scanner
-----------

TODO: Document the tool scanner

JIP modules and Classes
-----------------------

.. toctree::
   :maxdepth: 2

   cli
   cluster
   config
   db
   executils
   jobs
   profiles
   options
   pipelines
   templates
   tools
   utils