What is a workflow

A workflow is a collection of components. In jflow, it is represented by a Python class inheriting from jflow.workflow.Workflow. It lists all the inputs and parameters that should be requested from the final user and builds the execution process by adding components and linking them to each other.

Where to add a new workflow

A new workflow must be added as a new Python package in the workflows package. The implementation of a workflow must be written in the package's __init__.py file. The developer can also create:

  • a components package, where all the workflow specific components can be stored,
  • a lib package to import specific libraries within its workflow,
  • a bin folder with the binaries used in the workflow.

jflow/
├── bin/
├── docs/
├── src/
├── workflows/
│   ├── myworkflow/       [ the new workflow package ]
│   │   ├── components/   [ specific components ]
│   │   ├── lib/          [ specific libraries ]
│   │   ├── bin/          [ specific binaries ]
│   │   └── __init__.py   [ the workflow implementation ]
│   ├── components/
│   ├── extparsers/
│   ├── __init__.py
│   ├── formats.py
│   └── types.py
├── applications.properties
└── README

The Workflow class

In jflow, a workflow is a class defined in the __init__.py file. In order to add a new workflow, the developer has to:

  • implement a class inheriting from the jflow.workflow.Workflow class,
  • overload the get_description() method to provide to the final user a description of the workflow,
  • overload the define_parameters() method to add the workflow inputs and parameters,
  • overload the process() method by adding components and setting their arguments,
  • link the components inputs and outputs.

The class skeleton is given by:

from jflow.workflow import Workflow

class MyWorkflow(Workflow):

    def get_description(self):
        return "a description"

    def define_parameters(self, function="process"):
        # define the parameters
        pass

    def process(self):
        # add and link the components
        pass
Define parameters

The define_parameters() method is used to add workflow parameters and inputs. To do so, several methods are available. Once defined, the new parameters are available as object attributes, so they are accessible through self.parameter_name.

Several types of parameters can be added, all described in the following sections. All have two required positional arguments: name and help. The other arguments are optional and can be given to the method by using their keywords.

Parameters

Parameters can be added to handle a single element or a list of elements. Thus, the add_parameter() method can be used to force the final user to provide one and only one value, whereas the add_parameter_list() method allows the final user to give as many values as needed.

add_parameter()

Example

In the following example, a parameter named sequencer is added to the workflow. It has a list of choices and the default value is "HiSeq2000".

self.add_parameter("sequencer",
                   "The sequencer type.",
                   choices=["HiSeq2000", "ILLUMINA", "SLX", "SOLEXA", "454", "UNKNOWN"],
                   default="HiSeq2000")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help | string | true | None | The parameter help message.
default | - | false | None | The default parameter value. Its type depends on the parameter type.
type | string | false | "str" | The parameter type. The value provided by the final user is cast and checked against this type. The built-in Python types are available ("int", "str", "float", "bool", ...), as well as "date". To create customized types, refer to the Add a data type documentation.
choices | list | false | [] | A list of the allowed values.
required | boolean | false | false | Whether the parameter is required (if true, it cannot be omitted).
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both the command line and the GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
add_to | string | false | None | If this parameter is part of a multiple parameter, add_to defines the "parent" parameter it should be linked to.
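
The cast-and-check behaviour of the type and choices options can be pictured with the following minimal sketch. It does not use jflow: cast_value() and the CASTERS table are hypothetical names, and the date format is an assumption, so the real implementation may well differ.

```python
from datetime import datetime

# Hypothetical casters keyed by type name; jflow's own registry may differ.
CASTERS = {
    "str": str,
    "int": int,
    "float": float,
    "bool": lambda v: str(v).lower() in ("1", "true", "yes"),
    "date": lambda v: datetime.strptime(v, "%d/%m/%Y").date(),  # assumed format
}

def cast_value(raw, type_name="str", choices=()):
    """Cast a raw user string to the requested type, then check the choices."""
    try:
        value = CASTERS[type_name](raw)
    except KeyError:
        raise ValueError(f"unknown type: {type_name!r}")
    except ValueError:
        raise ValueError(f"{raw!r} is not a valid {type_name}")
    if choices and value not in choices:
        raise ValueError(f"{value!r} is not one of {list(choices)}")
    return value

depth = cast_value("42", "int")
sequencer = cast_value("HiSeq2000", choices=["HiSeq2000", "ILLUMINA"])
```

Casting before checking the choices means that the allowed values are compared with values of the declared type rather than with raw strings.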

add_parameter_list()

The add_parameter_list() method takes the same arguments as add_parameter(). However, with this method the final user is allowed to enter multiple values for the parameter, and the object attribute self.parameter_name is set to a Python list.
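
As an analogy only, the difference between a single-value parameter and a parameter list resembles what the standard argparse module does with and without nargs="+". The sketch below is not jflow's command line parser, and the --sample-names flag is a made-up example.

```python
import argparse

parser = argparse.ArgumentParser()
# One and only one value, restricted to a list of choices.
parser.add_argument("--sequencer", default="HiSeq2000",
                    choices=["HiSeq2000", "ILLUMINA", "UNKNOWN"])
# As many values as needed, collected into a Python list.
parser.add_argument("--sample-names", nargs="+", default=[])

args = parser.parse_args(["--sample-names", "s1", "s2", "s3"])
print(args.sequencer)      # the default value, since the flag was not given
print(args.sample_names)   # a Python list with the three values
```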

Inputs

Just like parameters, inputs can be added to handle a single file or a list of files. Thus, the add_input_file() method can be used to force the final user to provide one and only one file, whereas the add_input_file_list() method allows the final user to give as many files as needed.

add_input_file()

Example

In the following example, an input named reads is added to the workflow. The provided file is required and should be in fastq format. No file size limitation is specified.

self.add_input_file("reads",
                    "Which read file should be used",
                    file_format="fastq",
                    required=True)

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help | string | true | None | The parameter help message.
default | string | false | None | The default path value.
file_format | string | false | "any" | The file format is checked before running the workflow. To create a customized format, refer to the Add a file format documentation.
type | string | false | "inputfile" | The type can be "inputfile", "localfile", "urlfile" or "browsefile". An "inputfile" accepts a "localfile", an "urlfile" or a "browsefile". A "localfile" restricts the final user to a path to a file visible by jflow. An "urlfile" only accepts a URL, whereas a "browsefile" forces the final user to upload a file from their own computer. This last option is only available from the GUI and is treated as a "localfile" from the command line. The whole upload process is handled by jflow.
required | boolean | false | false | Whether the parameter is required (if true, it cannot be omitted).
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both the command line and the GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
add_to | string | false | None | If this parameter is part of a multiple parameter, add_to defines the "parent" parameter it should be linked to.
size_limit | string | false | "0" | The maximum file size allowed. If the value is "0", the file size is unlimited. The value must include the unit, one of "bytes", "Kb", "Mb", "Gb", "Tb", "Pb", "Eb" and "Zb". For instance, a value of "10Mb" restricts the final user to files of at most 10 megabytes.
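
To make the size_limit syntax concrete, here is a minimal sketch of how such a value could be turned into a number of bytes. parse_size_limit() and is_allowed() are hypothetical helpers, not jflow functions, and the use of binary multiples (1 Kb = 1024 bytes) is an assumption.

```python
import re

# Assumption: binary multiples (1 Kb = 1024 bytes); jflow may count differently.
UNITS = ["bytes", "Kb", "Mb", "Gb", "Tb", "Pb", "Eb", "Zb"]

def parse_size_limit(size_limit):
    """Return the limit in bytes, or None when the size is unlimited ("0")."""
    if size_limit == "0":
        return None
    match = re.fullmatch(r"(\d+(?:\.\d+)?)\s*([A-Za-z]+)", size_limit.strip())
    if match is None or match.group(2) not in UNITS:
        raise ValueError(f"invalid size limit: {size_limit!r}")
    number, unit = float(match.group(1)), match.group(2)
    return int(number * 1024 ** UNITS.index(unit))

def is_allowed(file_size, size_limit):
    """Tell whether a file of file_size bytes fits within size_limit."""
    limit = parse_size_limit(size_limit)
    return limit is None or file_size <= limit

ten_mb = parse_size_limit("10Mb")
```

A value without unit other than "0" is rejected here, which mirrors the requirement that the unit be part of the value.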

add_input_file_list()

This method takes the same arguments as add_input_file(). However, with this method the final user is allowed to provide multiple files, and the object attribute self.parameter_name is set to a Python list.

add_input_directory()

The add_input_directory() method allows the user to select files from a specific directory. This kind of input can be useful for tools outputting not only files but an organized directory. The get_files_fn parameter specifies the function that will be used to retrieve the files. This function can take as many arguments as required, but its first argument has to be a string representing the folder path. By default, all files are selected. From the workflow process() method, the files can be retrieved by using the get_files() method.

Example

In the following example, the add_input_directory() method is used to parse a directory and retrieve only fasta files inside this directory. get_files() will browse the directory and get all fasta files.

import os
from jflow.workflow import Workflow

def fasta_files(folder):
    # keep only the entries of the folder ending with ".fasta"
    res = []
    for file in os.listdir(folder):
        if file.endswith(".fasta"):
            res.append(file)
    return res

class MyWorkflow(Workflow):
    def define_parameters(self, function="process"):
        self.add_input_directory("fastadir", "Path to folder with fasta files",
            get_files_fn=fasta_files)

    def process(self):
        # to retrieve the files
        for fastafile in self.fastadir.get_files():
            pass  # do something

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the parameter. The parameter value is accessible within the workflow object through the attribute named self.parameter_name.
help | string | true | None | The parameter help message.
default | string | false | None | The default path value.
get_files_fn | function | false | - | The function called when executing param.get_files(). All arguments given to get_files() are passed on to get_files_fn.
required | boolean | false | false | Whether the parameter is required (if true, it cannot be omitted).
flag | string | false | None | The command line flag (if the value is None, the flag will be --name).
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both the command line and the GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.
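
The way get_files() hands its arguments over to get_files_fn, with the folder path always in first position, can be sketched as follows. DirectoryParameter is a hypothetical stand-in for the object jflow stores in self.parameter_name, and files_with_extension() is an example function; neither is part of jflow.

```python
import os
import tempfile

def files_with_extension(folder, extension=""):
    """Return the sorted paths of the files in folder ending with extension."""
    return sorted(os.path.join(folder, name)
                  for name in os.listdir(folder)
                  if name.endswith(extension))

class DirectoryParameter:
    """Hypothetical stand-in for the object stored in self.parameter_name."""

    def __init__(self, path, get_files_fn=files_with_extension):
        self.path = path
        self.get_files_fn = get_files_fn

    def get_files(self, *args):
        # The folder path always comes first, then the caller's arguments.
        return self.get_files_fn(self.path, *args)

with tempfile.TemporaryDirectory() as folder:
    for name in ("a.fasta", "b.fasta", "notes.txt"):
        open(os.path.join(folder, name), "w").close()
    param = DirectoryParameter(folder)
    fasta_names = [os.path.basename(p) for p in param.get_files(".fasta")]
    all_names = [os.path.basename(p) for p in param.get_files()]
```

In this sketch the function returns full paths rather than bare file names, which is usually more convenient when the files are passed on to components.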

Multiple parameters

Jflow offers the developer the possibility to structure the input data by using the notion of multiple parameters. A multiple parameter is a collection of parameters linked together. Just like parameters and inputs, it can be added to handle a single collection or a list of collections. Thus, the add_multiple_parameter() method can be used to force the final user to provide one and only one collection, whereas the add_multiple_parameter_list() method allows the final user to give as many collections as needed. To add a parameter to a multiple parameter, simply set the add_to option of any of the methods previously described. The object attribute self.multi_parameter_name is then a Python dictionary gathering the values of the different parameters in the form {"sub_parameter1": value}.

add_multiple_parameter()

Example

The following example creates a multiple parameter named library which contains two input files R1 and R2 and a sequencer parameter.

self.add_multiple_parameter("library", "Library.", required=True)
self.add_input_file("R1", "Path to R1 file.", required=True, add_to="library")
self.add_input_file("R2", "Path to R2 file.", add_to="library")
self.add_parameter("sequencer", "The sequencer type.", choices=["HiSeq2000", 
    "ILLUMINA", "UNKNOWN"], default="HiSeq2000", add_to="library")

Options

There are two positional arguments: name and help. All other options are keyword options.

Name | Type | Required | Default value | Description
name | string | true | None | The name of the multiple parameter. The parameter value is accessible within the workflow object through the attribute named self.multi_parameter_name, and its sub parameters through self.multi_parameter_name["sub_parameter_name"].
help | string | true | None | The parameter help message.
required | boolean | false | false | Whether the parameter is required (if true, it cannot be omitted).
flag | string | false | None | The command line flag (if the value is None, the flag will be --name). The sub parameters can be set as follows: --name sub1=... sub2=...
group | string | false | "default" | The value is used to group a list of parameters in sections. The group is used in both the command line and the GUI.
display_name | string | false | None | The parameter name that should be displayed on the final form.

add_multiple_parameter_list()

This method takes the same arguments as add_multiple_parameter(). However, with this method the final user is allowed to provide multiple collections, and the object attribute self.multi_parameter_name is set to a Python list of Python dictionaries.
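
The shape of the resulting values (a dictionary for a multiple parameter, a list of dictionaries for a multiple parameter list) can be illustrated with the sketch below. parse_multiple() is a hypothetical helper written for this example, the file names are made up, and jflow's own parsing of the sub1=... tokens may differ.

```python
def parse_multiple(tokens, sub_parameters):
    """Turn ["R1=a.fastq", "sequencer=SLX"] into a dictionary of sub parameters.

    sub_parameters maps each allowed sub parameter name to its default value.
    """
    values = dict(sub_parameters)
    for token in tokens:
        name, sep, value = token.partition("=")
        if not sep or name not in sub_parameters:
            raise ValueError(f"unexpected sub parameter: {token!r}")
        values[name] = value
    return values

defaults = {"R1": None, "R2": None, "sequencer": "HiSeq2000"}

# Multiple parameter: one collection, hence one dictionary.
library = parse_multiple(["R1=s1_R1.fastq", "R2=s1_R2.fastq"], defaults)

# Multiple parameter list: one dictionary per collection.
libraries = [parse_multiple(tokens, defaults) for tokens in (
    ["R1=s1_R1.fastq", "R2=s1_R2.fastq"],
    ["R1=s2_R1.fastq", "sequencer=ILLUMINA"],
)]
```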

Exclusion rules

Jflow offers the possibility to make some parameters mutually exclusive. To do so, the method add_exclusion_rule() is available.

add_exclusion_rule()

Example

In the following example, the final user will not be allowed to provide both fasta_file and fastq_file parameters.

self.add_input_file("fasta_file", "Path to the fasta file.", file_format="fasta")
self.add_input_file("fastq_file", "Path to the fastq file.", file_format="fastq")
self.add_exclusion_rule("fasta_file", "fastq_file")

Options

The method accepts the following options:

Name | Type | Required | Default value | Description
*args2exclude | string | true | None | The names of the parameters that exclude each other.
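
The effect of an exclusion rule can be sketched as a simple check over the values given by the final user. check_exclusion_rule() is a hypothetical function written for illustration; it only mirrors the *args2exclude signature described above and says nothing about how jflow enforces the rule internally.

```python
def check_exclusion_rule(provided, *args2exclude):
    """Raise ValueError when more than one of the excluded parameters is set.

    provided maps parameter names to the values given by the user
    (None meaning "not provided").
    """
    given = [name for name in args2exclude if provided.get(name) is not None]
    if len(given) > 1:
        raise ValueError("parameters " + ", ".join(given)
                         + " cannot be used together")
    return given

# Only one of the two excluded parameters is provided: the check passes.
given = check_exclusion_rule({"fasta_file": "genome.fasta", "fastq_file": None},
                             "fasta_file", "fastq_file")
```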

Process

The process() method is in charge of building the workflow by adding components (using the method add_component()) and linking their inputs and their outputs. A component is a class representing a workflow step. See the component documentation for more information.

add_component()

The add_component() method adds a component to the workflow by building a jflow.component.Component object and returning it. All attributes defined within this object, such as the outputs, are then available from the workflow and can be used as inputs of other components.

Example

In the following example, the first component BWAIndex is built and returned in the bwaindex object. The output bwaindex.databank is accessible as an object attribute and can be used as input of the BWAmem component. This example is extracted from the Quick start.

# index the reference genome
bwaindex = self.add_component("BWAIndex", [self.reference_genome])
# align reads against the indexed genome
bwamem = self.add_component("BWAmem", [bwaindex.databank, self.reads])

Options

There is one positional argument: component_name. All other options are keyword options.

Name | Type | Required | Default value | Description
component_name | string | true | None | The component class name to add to the workflow.
args | list | false | [] | The component's arguments (see the component documentation for more details).
kwargs | dict | false | {} | The component's keyword arguments (see the component documentation for more details).
component_prefix | string | false | "default" | The prefix is used to name the component at execution. The prefix makes it possible to add multiple components of the same class within the same workflow.
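
Why a prefix is needed when the same component class is added several times can be pictured with the sketch below. MiniWorkflow is a hypothetical stand-in, and both the "ClassName.prefix" naming and the error raised on a duplicate are assumptions made for the example, not documented jflow behaviour.

```python
class MiniWorkflow:
    """Hypothetical stand-in, only tracking the names of the added components."""

    def __init__(self):
        self.components = {}

    def add_component(self, component_name, args=None, kwargs=None,
                      component_prefix="default"):
        # Assumption: a component is identified by its class name and prefix.
        key = f"{component_name}.{component_prefix}"
        if key in self.components:
            raise ValueError(f"{key} was already added, use another prefix")
        self.components[key] = (list(args or []), dict(kwargs or {}))
        return key

workflow = MiniWorkflow()
# The same component class is added once per sample, with distinct prefixes.
workflow.add_component("BWAmem", ["index", "sample1.fastq"],
                       component_prefix="sample1")
workflow.add_component("BWAmem", ["index", "sample2.fastq"],
                       component_prefix="sample2")
```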

Other methods

Pre process

pre_process() is executed before running the process method. Unlike process, this method cannot be used to add components, but it can be useful when implementing an application that needs to prepare some data before running the workflow (insert / recover information from a database, add metadata to the workflow, ...).

Post process

post_process() is executed right after the process method and cannot be used to add components. This method can be useful to perform some database transactions and to synchronize data.
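
The order in which the three methods are called can be summarized by the following sketch. HookedWorkflow and its run() method are hypothetical names used for illustration only; the sketch does not reproduce how jflow actually schedules the workflow.

```python
class HookedWorkflow:
    """Hypothetical stand-in for a workflow; run() is an assumed entry point."""

    def __init__(self):
        self.calls = []

    def pre_process(self):
        # e.g. fetch information from a database; no component is added here
        self.calls.append("pre_process")

    def process(self):
        # the only place where components are added and linked
        self.calls.append("process")

    def post_process(self):
        # e.g. commit database transactions, synchronize data
        self.calls.append("post_process")

    def run(self):
        # order described above: pre_process, then process, then post_process
        self.pre_process()
        self.process()
        self.post_process()
        return self.calls

calls = HookedWorkflow().run()
print(calls)
```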

Set to address

set_to_address() overwrites the to_address value defined in the jflow configuration file.

Options

There is one required argument: to_address.

Name | Type | Required | Default value | Description
to_address | string | true | None | The address to which an email is sent once the workflow is completed.

Set email subject

set_subject() overwrites the subject value defined in the jflow configuration file.

Options

There is one required argument: subject.

Name | Type | Required | Default value | Description
subject | string | true | None | The subject of the email sent to the user once the workflow is completed.

Set email message

set_message() overwrites the message value defined in the jflow configuration file.

Options

There is one required argument: message.

Name | Type | Required | Default value | Description
message | string | true | None | The body of the email sent to the user once the workflow is completed.

Get shared resources

Given a resource name, the get_resource() method returns the value defined for it in the resource section of the jflow configuration file.

Options

There is one required argument: resource.

Name | Type | Required | Default value | Description
resource | string | true | None | The name of the resource whose configured value is requested.
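
As an illustration of the lookup, the sketch below reads a resource from an INI-style configuration with the standard configparser module. The [resources] section name, the adapters entry and its path are assumptions made for the example, and this get_resource() function is a stand-alone stand-in, not the jflow method.

```python
import configparser

# Assumption: an INI-style configuration with a [resources] section;
# the resource name and path below are made up for the example.
CONFIG_TEXT = """
[resources]
adapters = /path/to/adapters.fasta
"""

def get_resource(config, resource):
    """Return the configured value of a resource, or None when it is undefined."""
    return config.get("resources", resource, fallback=None)

config = configparser.ConfigParser()
config.read_string(CONFIG_TEXT)

adapters = get_resource(config, "adapters")
```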