bioprov.src package

Submodules

bioprov.src.config module

Contains the Config class and other package-level settings.

Define your configurations in the ‘config’ variable at the end of the module.

class bioprov.src.config.BioProvDB(path)

Bases: tinydb.database.TinyDB

Inherits from tinydb.TinyDB

Class to hold database configuration and methods.

__init__(path)

Create a new instance of TinyDB.

clear_db(confirm=False)

Deletes the local BioProv database.

Parameters

confirm –

class bioprov.src.config.Config(db_path=None, threads=0)

Bases: object

Class to define package level variables and settings.

__init__(db_path=None, threads=0)
Parameters
  • db_path – Path to database file. Default is bioprov_directory/db.json

  • threads – Number of threads. Default is half of processors.
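The thread default described above can be pictured with a minimal sketch. `resolve_threads` is a hypothetical helper, not part of BioProv; it assumes that a falsy value (the signature default is 0) means "use half of the available processors".

```python
import os


def resolve_threads(threads=0):
    """Hypothetical helper: fall back to half of the processors when unset."""
    if threads:
        return int(threads)
    # os.cpu_count() may return None, so guard with 1 before halving.
    cpus = os.cpu_count() or 1
    return max(1, cpus // 2)


print(resolve_threads())   # half of the local processors, at least 1
print(resolve_threads(4))  # an explicit value is kept as-is
```

The `max(1, ...)` guard is a design choice of this sketch so that single-core machines still get one thread.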

clear_db(confirm=False)

Deletes the local BioProv database.

Parameters

confirm –

create_provstore_file(user=None, token=None)
property db
db_all()
Returns

List all items in BioProv database.

property db_path
property logger
property provstore_api
property provstore_file
property provstore_token
property provstore_user
read_provstore_file()

Attempts to read self.provstore_file. Will prompt to create one if unable to retrieve credentials.

Returns

Updates self.provstore_user and self.provstore_token.

serializer()
class bioprov.src.config.Environment

Bases: object

Class containing provenance information about the current environment.

__init__()

Class constructor. All attributes are empty and are initialized with self.update()

property actedOnBehalfOf
serializer()
update()

Checks the current environment and updates attributes using os.environ.

Returns

Sets attributes to self.

bioprov.src.files module

Contains the File and SeqFile classes and related functions.

class bioprov.src.files.Directory(path, tag=None)

Bases: object

Class for holding information about directories.

__init__(path, tag=None)
add_files_to_object(object_, kind='files')

Add files or subdirectories in self to object_, which can be either a Sample or a Project.

Parameters
  • object_ – bioprov.Project or bioprov.Sample

  • kind – Whether to add files or subdirectories.

Returns

Updates object_.files

property exists
get_files()
get_subdirs()
replace_path(old_terms, new, warnings=False)

Replace the current File path.

Usually used for switching between users.

Parameters
  • old_terms – Terms to be replaced in the path.

  • new – New term.

  • warnings – Whether to warn if sha256 checksum differs or file does not exist.

Returns

Updates self.

serializer()
class bioprov.src.files.File(path, tag=None, attributes=None, _get_hash=True)

Bases: object

Class for holding files and file information.

__init__(path, tag=None, attributes=None, _get_hash=True)
Parameters
  • path – A UNIX-like file path.

  • tag – optional tag describing the file.

  • attributes – Miscellaneous attributes.

property entity
property exists
property raw_size
replace_path(old_terms, new, warnings=False)

Replace the current File path.

Usually used for switching between users.

Parameters
  • old_terms – Terms to be replaced in the path.

  • new – New term.

  • warnings – Whether to warn if sha256 checksum differs or file does not exist.

Returns

Updates self.
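The path replacement can be sketched with plain string substitution. This is only an illustration under stated assumptions: `replace_path` below is a standalone hypothetical function (not the File method), and it only warns about missing files, without the sha256 comparison.

```python
import os
import warnings


def replace_path(path, old_terms, new, warn=False):
    """Hypothetical helper: replace each old term found in path with new."""
    if isinstance(old_terms, str):
        old_terms = [old_terms]
    for term in old_terms:
        path = path.replace(term, new)
    if warn and not os.path.exists(path):
        warnings.warn(f"File does not exist: {path}")
    return path


# The same record opened by a different user:
print(replace_path("/home/alice/project/s1.fasta", ["/home/alice"], "/home/bob"))
# /home/bob/project/s1.fasta
```

Accepting several old terms lets one call cover records created on different machines, for example both `/home/alice` and `/Users/alice`.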

serializer()
property sha256
property size
class bioprov.src.files.SeqFile(path, tag=None, format='fasta', parser='seq', document=None, import_records=False, calculate_seqstats=False)

Bases: bioprov.src.files.File

Class for holding sequence file and sequence information. Inherits from File.

This class supports records parsed with Biopython's Bio.SeqIO module.

__init__(path, tag=None, format='fasta', parser='seq', document=None, import_records=False, calculate_seqstats=False)
Parameters
  • path – A UNIX-like file path.

  • tag – optional tag describing the file.

  • format – Format to be parsed by SeqIO.parse()

  • parser – Bio parser to be used. Can be ‘seq’ (default) to be parsed by SeqIO or ‘align’ to be parsed with AlignIO.

  • document – prov.model.ProvDocument.

  • import_records – Whether to import sequence data as Bio objects

  • calculate_seqstats – Whether to calculate SeqStats

property generator
import_records(**kwargs)
Parameters

kwargs – Parameters to pass to the SeqFile._calculate_seqstats() function.

Returns

Import records into self.

property max_seq
property min_seq
seqfile_formats = ('fasta', 'clustal', 'fastq', 'fastq-sanger', 'fastq-solexa', 'fastq-illumina', 'genbank', 'gb', 'nexus', 'stockholm', 'swiss', 'tab', 'qual')
property seqstats
serializer()
class bioprov.src.files.SeqStats(number_seqs: int, total_bps: int, mean_bp: float, min_bp: int, max_bp: int, N50: int, GC: float)

Bases: object

Dataclass to describe sequence statistics.

GC: float
N50: int
__init__(number_seqs: int, total_bps: int, mean_bp: float, min_bp: int, max_bp: int, N50: int, GC: float) → None
max_bp: int
mean_bp: float
min_bp: int
number_seqs: int
total_bps: int
bioprov.src.files.calculate_N50(array)

Calculate N50 from an array of contig lengths. https://github.com/vikas0633/python/blob/master/N50.py

Based on the Broad Institute definition: https://www.broad.harvard.edu/crd/wiki/index.php/N50

Parameters

array – list of contig lengths

Returns

N50 value
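For orientation, here is a minimal sketch of one common N50 definition: the length L such that contigs of length L or longer contain at least half of all bases. It is an assumption that this agrees with calculate_N50 in every case; the linked script may use a different variant, so treat `n50` below as a hypothetical illustration rather than the library's implementation.

```python
def n50(lengths):
    """Return the length L such that contigs >= L hold at least half the bases."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    half = sum(lengths) / 2
    running = 0
    # Walk from the longest contig down, accumulating bases.
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half:
            return length


print(n50([2, 3, 4, 5, 6, 7, 8, 9, 10]))  # 8
```

In the example the total is 54 bases, and the three longest contigs (10 + 9 + 8 = 27) reach half of it, so the result is 8.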

bioprov.src.files.deserialize_files_dict(files_dict)

Deserialize a dictionary of files in JSON format.

Parameters

files_dict – dict of dicts.

Returns

dict of File instances.
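The "dict of dicts in, dict of instances out" shape can be sketched as follows. `FileSketch` is a hypothetical stand-in for File, and the record keys (`path`, `tag`, `attributes`) are assumptions chosen to mirror the File constructor arguments, not a documented serialization format.

```python
from dataclasses import dataclass, field


@dataclass
class FileSketch:
    """Hypothetical stand-in for bioprov.src.files.File."""
    path: str
    tag: str = None
    attributes: dict = field(default_factory=dict)


def deserialize_files(files_dict):
    """Turn {tag: {field: value}} into {tag: FileSketch}."""
    files = {}
    for tag, record in files_dict.items():
        files[tag] = FileSketch(
            path=record["path"],
            tag=record.get("tag", tag),
            attributes=record.get("attributes") or {},
        )
    return files


serialized = {"assembly": {"path": "/data/s1.fasta", "tag": "assembly"}}
files = deserialize_files(serialized)
print(files["assembly"].path)  # /data/s1.fasta
```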

bioprov.src.files.seqrecordgenerator(path, format, parser='seq', warnings=False)
Parameters
  • path – Path to file.

  • format – format to pass to SeqIO.parse().

  • parser – Whether to import records with SeqIO (default) or AlignIO

  • warnings – Whether to warn if sha256 checksum differs or file does not exist.

Returns

A generator of SeqRecords.

bioprov.src.main module

Main source module. Contains the main BioProv classes.

Activity classes:
  • Program

  • Parameter

  • Run

Entity classes:
  • Project

  • Sample

This module also contains functions to read and write objects in JSON and tab-delimited formats.

class bioprov.src.main.Parameter(key=None, value='', tag=None, cmd_string=None, description=None, kind=None, keyword_argument=True, position=-1)

Bases: object

Class holding information for parameters.

__init__(key=None, value='', tag=None, cmd_string=None, description=None, kind=None, keyword_argument=True, position=-1)
Parameters
  • key – Key of the parameter, e.g. ‘-h’ for help command.

  • value – Value of the parameter.

  • tag – A tag of the parameter.

  • cmd_string – String representation of the parameter in a command.

  • description – description of the parameter.

  • kind – Kind of parameter. May be ‘input’, ‘output’, ‘misc’, or None.

  • keyword_argument – Whether the parameter is a keyword argument. Keyword arguments have a key, which is used to build the program’s command. If this is False, the parameter is assumed to be a positional argument, and ‘position’ indicates its index when the command line is split into a list.

  • position – Index of insertion of parameter in command-line if it is a positional argument.
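The difference between keyword and positional parameters can be illustrated with a short sketch. `ParamSketch` and `build_cmd` are hypothetical names, and treating a negative position as "append at the end" is an assumption of this example, not confirmed behaviour of Parameter.

```python
from dataclasses import dataclass


@dataclass
class ParamSketch:
    """Hypothetical, reduced version of a parameter record."""
    key: str = None
    value: str = ""
    keyword_argument: bool = True
    position: int = -1


def build_cmd(program, params):
    parts = [program]
    # Keyword arguments contribute their key, followed by a value if present.
    for p in params:
        if p.keyword_argument:
            parts.append(p.key)
            if p.value != "":
                parts.append(str(p.value))
    # Positional arguments are inserted at their index in the split command;
    # negative positions are appended at the end (assumption of this sketch).
    positional = [p for p in params if not p.keyword_argument]
    positional.sort(key=lambda p: p.position if p.position >= 0 else float("inf"))
    for p in positional:
        if p.position < 0:
            parts.append(str(p.value))
        else:
            parts.insert(p.position, str(p.value))
    return " ".join(parts)


cmd = build_cmd("grep", [
    ParamSketch(key="-i"),
    ParamSketch(value="pattern", keyword_argument=False, position=1),
    ParamSketch(value="file.txt", keyword_argument=False),
])
print(cmd)  # grep pattern -i file.txt
```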

serializer()
class bioprov.src.main.PresetProgram(name=None, params=None, sample=None, input_files=None, output_files=None, preffix_tag=None, extra_flags=None)

Bases: bioprov.src.main.Program

Class for holding a preset program and related functions.

A PresetProgram instance inherits from Program and consists of a Program with an associated instance of Sample or Project.

__init__(name=None, params=None, sample=None, input_files=None, output_files=None, preffix_tag=None, extra_flags=None)
Parameters
  • name – Instance of bioprov.Program

  • params – Dictionary of parameters.

  • sample – An instance of Sample or Project.

  • input_files – A dictionary consisting of Parameter keys as keys and a File.tag as value, where File.tag is a string that must be a key in self.sample.files with a corresponding existing file.

  • output_files – A dictionary consisting of Parameter keys as keys and a tuple consisting of (File.tag, suffix) as value. File.tag will become a key in self.sample.files and its value will be the sample_name + suffix.

  • preffix_tag – A value in the input_files argument, which corresponds to a key in self.sample.files. All file names of output files will be stemmed from this file, hence ‘preffix’.

  • extra_flags – A list of command line parameters (strings), known as flags or switches, to add to the program’s command.
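How output file names could be derived from `output_files` and `preffix_tag` is sketched below. `parse_output_files` is a hypothetical helper, and the naming rule (stem of the file tagged by `preffix_tag` plus the suffix) is an assumption drawn from the parameter descriptions above, not from the BioProv source.

```python
from pathlib import PurePosixPath


def parse_output_files(sample_files, output_files, preffix_tag):
    """Hypothetical helper: name each output after the file tagged preffix_tag."""
    # "/data/s1.fasta" -> "/data/s1"
    stem = PurePosixPath(sample_files[preffix_tag]).with_suffix("")
    new_files, params = {}, {}
    for key, (tag, suffix) in output_files.items():
        out_path = f"{stem}{suffix}"
        new_files[tag] = out_path   # becomes a key in the sample's files
        params[key] = out_path      # value passed to the command-line key
    return new_files, params


sample_files = {"assembly": "/data/s1.fasta"}
output_files = {"-a": ("proteins", "_proteins.faa"), "-d": ("genes", "_genes.fna")}
new_files, params = parse_output_files(sample_files, output_files, "assembly")
print(new_files["proteins"])  # /data/s1_proteins.faa
print(params["-d"])           # /data/s1_genes.fna
```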

create_func(sample, preffix_tag=None)
Parameters
  • sample – Instance of Sample to create the function for.

  • preffix_tag – Argument to be passed to self._parse_output_files()

Returns

Creates Program function for Sample.

generate_cmd()

TODO: improve this function

Generates a wildcard command string, independent of samples.

Returns

Updates self.cmd.

run(sample=None, preffix_tag=None, **kwargs)

Runs PresetProgram for sample.

Parameters
  • sample – Instance of bioprov.Sample.

  • preffix_tag – Prefix tag passed to self.create_func()

  • kwargs – See help of Program.run()

validate_program()

Checks type of self.

validate_sample()

Checks type of self.sample.

class bioprov.src.main.Program(name=None, params=None, tag=None, path_to_bin=None, version=None, cmd=None, sample=None)

Bases: object

Class for holding information about programs.

__init__(name=None, params=None, tag=None, path_to_bin=None, version=None, cmd=None, sample=None)
Parameters
  • name – Name of the program being called.

  • params – Dictionary of parameters.

  • tag – Tag to call the program if different from name. Default: self.name

  • path_to_bin – A full path to the program’s binary. Default: get from self.name.

  • cmd – A command string to call the program. Default: build from self._path and self.params.

  • version – Version of the program.

  • sample – Instance of bioprov.Sample

add_parameter(parameter, _generate_cmd=True)

Adds a parameter to the current instance and updates the command.

Parameters
  • parameter – an instance of the Parameter class.

  • _generate_cmd – Refreshes self.cmd when a Parameter is added.

Returns

Updates self.params and self.cmd if _generate_cmd is True.

add_runs(runs)

Sample method to add runs.

Parameters

runs – See input to add_runs function.

Returns

Adds runs to Sample

property duration
property end_time
property finished
generate_cmd()

Generates command string to execute.

Returns

command string

run(sample=None, suppress_stdout=True, suppress_stderr=True, force_print=False)

Runs the process.

Parameters
  • sample – An instance of bioprov.Sample.

  • suppress_stdout – Whether to suppress the stdout of the program.

  • suppress_stderr – Whether to suppress the stderr of the program.

  • force_print – Whether to force printing the output of the program.

Returns

An instance of the Run class.
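The capture-then-optionally-print behaviour of such a run can be sketched with the standard library. `run_cmd` is a hypothetical function; it takes the command as a list and returns a `subprocess.CompletedProcess`, whereas Program.run() works from the program's command string and returns a Run instance.

```python
import subprocess
import sys


def run_cmd(cmd, suppress_stdout=True, suppress_stderr=True):
    """Hypothetical helper: run cmd, capture its output, print it on request."""
    proc = subprocess.run(cmd, capture_output=True, text=True)
    if not suppress_stdout:
        print(proc.stdout, end="")
    if not suppress_stderr:
        print(proc.stderr, end="", file=sys.stderr)
    return proc


# Use the current interpreter as a portable example program.
result = run_cmd([sys.executable, "-c", "print('hello')"])
print(result.returncode)  # 0
```

Capturing output even when it is suppressed keeps it available afterwards, which is in the spirit of the stdout and stderr properties listed for Program.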

property runs
serializer()
property start_time
property status
property stderr
property stdin
property stdout
class bioprov.src.main.Project(tag=None, samples=None, db=None, auto_update=False, log_to_file=False)

Bases: object

Class which holds a dictionary of Sample instances, where each key is the sample name.

__init__(tag=None, samples=None, db=None, auto_update=False, log_to_file=False)

Initiates the object by creating a sample dictionary.

Parameters
  • samples – An iterator of Sample objects.

  • tag – A tag to describe the Project.

  • db – Path to TinyDB to store project in JSON format.

  • auto_update – Whether to auto-update the BioProvDB record. Disabled by default.

  • log_to_file – Whether to log the Project to a File. You can define this later with the self.start_logging() method.

add_files(files)

Adds Files to self.files. See documentation to bioprov.src.main.add_files().

Parameters

files – Dict, list, or File instance.

Returns

Updates self.files

add_programs(programs)

Add programs to self. See documentation to bioprov.src.main.Program.

Parameters

programs – Dict, list, or Program instance.

Returns

Updates self.programs

auto_update_db()

Updates the database if auto_update is True.

build_sample_dict(constructor)

Build sample dictionary from passed constructor.

Parameters

constructor – Iterable or NoneType

Returns

dictionary of sample instances.

property bundle
property entity
static is_iterator(constructor)

Checks if the constructor passed is a valid iterable, or None.

Parameters

constructor – constructor object used to build a Project instance.

static is_sample_and_name(sample)

Checks if an object is of the Sample class. Names the sample if it isn’t named.

Parameters

sample – an object of the Sample class.

items()
keys()
property name
property namespace_preffix
query_db(db=None)
replace_paths(old_terms, new, warnings=False)

Runs File.replace_path(old_terms, new) on all Files in the project and each Sample.

For more information see File.replace_path() documentation.

Parameters
  • old_terms – old terms to be replaced.

  • new – new term.

  • warnings – whether to activate warnings.

Returns

Updates all Files associated with self.

run_programs()

Runs all programs in self.programs in order.

Returns

property samples
serializer()
property sha256
start_logging(log_file=None, level=20, _custom_start_message=None)

Starts logging Project to File.

Parameters
  • log_file – Path to log file. If None will be defined automatically.

  • level – Logging level.

  • _custom_start_message – Custom starting message to start the log.

Returns

Creates logger attributes and refreshes bp.config.logger

to_csv(path_=None, sep=',', **kwargs)

Writes a delimited file (comma-separated by default) of sample files and attributes using the to_df method.

to_df()

Creates a Pandas DataFrame from Sample files and attributes.

Returns

pd.DataFrame

to_json(_path=None)

Exports the Project as JSON. Similar to Sample.to_json()

Parameters

_path – JSON output file path.

update_db(db=None)
values()
class bioprov.src.main.Run(program, sample=None)

Bases: object

Class for holding Run information about a selected Program.

__init__(program, sample=None)
Parameters
  • program – An instance of bioprov.Program.

  • sample – An instance of bioprov.Sample

run()

Runs process for the Run instance. Will update attributes accordingly.

Returns

self

serializer()
property status
class bioprov.src.main.Sample(name=None, tag=None, files=None, directory=None, attributes=None)

Bases: object

Class for holding sample information and related files and programs.

__init__(name=None, tag=None, files=None, directory=None, attributes=None)
Parameters
  • name – Sample name or ID.

  • tag – optional tag describing the sample.

  • files – Dictionary of files associated with the sample.

  • directory – A bioprov.Directory associated with the sample

  • attributes – Dictionary of any other attributes associated with the sample.

add_files(files)

Sample method to add files.

Parameters

files – See input to add_files function.

Returns

Adds files to Sample

add_programs(programs)

Adds program(s) to self. Must be an instance or iterable of bioprov.Program.

Parameters

programs – bioprov.Program iterator or instance, or a dictionary where each key is the program name and each value is a bp.Program instance.

Returns

Updates self by adding the programs to object.

auto_update_db()
property directory
run_programs()

Runs self._programs in order.

serializer()

Custom serializer for Sample class. Serializes runs, programs, and files attributes.

to_json(_path=None)

Exports the Sample as JSON. Similar to Project.to_json()

Parameters

_path – JSON output file path.

to_series()

Creates a pd.Series object from the sample files and attributes.

Returns

pd.Series

bioprov.src.main.deserialize_programs_dict(programs_dict, sample)

Deserialize programs from JSON format

Parameters
  • programs_dict – dictionary of serialized Programs in JSON format

  • sample – instance of bioprov.Sample

Returns

dictionary of Program instances

bioprov.src.main.deserialize_runs_dict(runs_dict, programs_dict, tag, sample)

Deserialize runs in JSON format.

Parameters
  • runs_dict – dictionary of Runs in JSON format.

  • programs_dict – dictionary of Program instances to be updated.

  • tag – Tag of each program.

  • sample – Sample to be updated.

Returns

bioprov.src.main.dict_to_sample(json_dict)

Converts a JSON dictionary to a Sample instance.

Parameters

json_dict – output of sample_from_json.

Returns

a Sample instance.

bioprov.src.main.from_df(df, index_col=0, file_cols=None, sequencefile_cols=None, tag=None, source_file=None, import_data=False)

Pandas-like function to build a Project object.

By default, assumes the sample names or ids are in the first column, else they should be specified by ‘index_col’ arg.

Example:

    samples = from_df(df)
    type(samples)  # bioprov.src.main.Project

You can select columns to be added as Files or SequenceFile instances.

Parameters
  • df – A pandas DataFrame

  • index_col – A column to be used as index. Must be in df.columns. If an int is passed, the column at that position is used.

  • file_cols – Columns containing Files.

  • sequencefile_cols – Columns containing SequenceFiles.

  • tag – A tag to describe the Project.

  • source_file – The source file used to read the dataframe.

  • import_data – Whether to import data when importing SequenceFiles

Returns

a Project instance

bioprov.src.main.from_json(json_file, kind='Project', replace_path=None, replace_home=False)

Imports Sample or Project from JSON file.

Parameters
  • json_file – A JSON file created by Sample.to_json()

  • kind – Whether to create a Sample or Project instance.

  • replace_path – A tuple or list with two strings. The first will be the old path to be replaced, and the second will be the new.

  • replace_home – If True, will run replace_path automatically for previous HOME paths.

Returns

a Sample or Project instance.

bioprov.src.main.generate_param_str(params)

TODO: improve this docstring

Generates a string from a dictionary of parameters.

Parameters

params – Dictionary of parameters.

bioprov.src.main.json_to_dict(json_file)

Reads dict from a JSON file.

Parameters

json_file – A JSON file created by Sample.to_json()

Returns

a dictionary (input to dict_to_sample())

bioprov.src.main.load_project(tag, db=None, import_records=False)

Loads Project from the BioProvDatabase set in the config.

Parameters
  • tag – Tag of the Project to be loaded.

  • db – Path to BioProvDB file. Default is set in the config module (use the bioprov --show_db command).

  • import_records – Whether to import the sequence records. Unnecessary if this data is already recorded in the Project.

Returns

Instance of Project.

bioprov.src.main.parse_params(params)

Function used to parse parameter input.

Parameters

params – An instance or iterator of Parameter instances or a dictionary.

Returns

Parsed parameters to serve as attribute to a Program or Run instance.

bioprov.src.main.read_csv(df_path, sep=',', **kwargs)
Parameters
  • df_path – Path of dataframe.

  • sep – Separator of dataframe.

  • kwargs – Any kwargs to be passed to from_df()

Returns

A Project instance.

bioprov.src.main.to_json(object_, dictionary, _path=None)

Exports the Sample or Project as JSON.

Returns

Writes JSON output

bioprov.src.main.write_json(dict_, _path)

Writes dictionary to JSON file.

Parameters
  • dict_ – JSON dictionary.

  • _path – String with path to JSON file.

Returns

Writes JSON file.
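The write and read halves of the JSON round trip can be sketched with the standard `json` module. The two functions below are hypothetical re-implementations for illustration only; the indentation level and the example record layout are assumptions, not the format BioProv writes.

```python
import json
import os
import tempfile


def write_json_sketch(dict_, path):
    """Write a JSON-serializable dictionary to path."""
    with open(path, "w") as handle:
        json.dump(dict_, handle, indent=2)


def json_to_dict_sketch(path):
    """Read a dictionary back from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


record = {"name": "sample1", "files": {"assembly": {"path": "/data/s1.fasta"}}}
with tempfile.TemporaryDirectory() as tmp:
    out = os.path.join(tmp, "sample1.json")
    write_json_sketch(record, out)
    restored = json_to_dict_sketch(out)

print(restored == record)  # True
```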

bioprov.src.prov module

Module containing base provenance attributes.

This module extracts system-level information, such as user and environment settings, and stores them. It is invoked to export provenance objects.

class bioprov.src.prov.BioProvDocument(project, add_attributes=False, add_users=True, _add_project_namespaces=True, _iter_samples=True, _iter_project=True)

Bases: object

Class containing base provenance information for a Project.

__init__(project, add_attributes=False, add_users=True, _add_project_namespaces=True, _iter_samples=True, _iter_project=True)

Constructs the W3C-PROV document for a project.

Parameters
  • project (Project) – instance of bioprov.src.Project.

  • add_attributes (bool) – whether to add object attributes.

  • add_users (bool) – whether to add users and environments.

  • _add_project_namespaces (bool) –

  • _iter_samples (bool) –

  • _iter_project (bool) –

property dot
property provn
property provstore_document
upload_to_provstore(api=None)

Uploads self.ProvDocument to ProvStore (https://openprovenance.org/store/)

Parameters

api – provstore.api.Api

Returns

Sends POST request to ProvStore API and updates self.ProvDocument if successful.

write_provn(path=None)

Writes PROVN output of document.

Parameters

path – Path to write file.

Returns

Writes file.

bioprov.src.workflow module

Contains the Workflow class and related functions.

class bioprov.src.workflow.Step(preset_program, default=False, description='', kind='Sample')

Bases: bioprov.src.main.PresetProgram

Class for holding workflow steps.

Steps are basically PresetProgram instances but they do not have any Sample associated with them, and always generate command strings.

__init__(preset_program, default=False, description='', kind='Sample')
Parameters
  • preset_program – Instance of bioprov.PresetProgram.

  • default – Whether the Step runs by default.

  • description – Description of the step program.

  • kind – Whether the Step is associated with a Sample or Project.

class bioprov.src.workflow.Workflow(name=None, description=None, input=None, input_type='dataframe', index_col='sample-id', file_columns=None, file_extensions=None, steps=None, parser=None, tag=None, verbose=None, threads=None, sep='\t', log=None, _log_to_file=True, update_db=False, upload_to_provstore=False, write_provn=False, write_pdf=False, **kwargs)

Bases: object

Workflow class. Used to build workflows for BioProv command line.

A workflow runs a series of steps (bioprov.Program) on a set of samples (bioprov.Project).

__init__(name=None, description=None, input=None, input_type='dataframe', index_col='sample-id', file_columns=None, file_extensions=None, steps=None, parser=None, tag=None, verbose=None, threads=None, sep='\t', log=None, _log_to_file=True, update_db=False, upload_to_provstore=False, write_provn=False, write_pdf=False, **kwargs)
Parameters
  • name – Name of the workflow, with no spaces.

  • description – A brief (one sentence) description of the workflow.

  • input – Input of workflow. May be a directory or a tab-delimited file.

  • input_type – Input type of the workflow. Choose from (‘directory’, ‘dataframe’, ‘both’)

  • index_col – Name of index column which will define sample names if input_type is ‘dataframe’.

  • file_columns – Name of columns containing files if input_type is ‘dataframe’. Name of file tag if input_type is ‘directory’.

  • file_extensions – Extension of files if input_type is ‘directory’.

  • steps – Dictionary of steps. May also receive a list, tuple or None.

  • parser – argparse.ArgumentParser object used to construct the workflow’s command-line application.

  • tag – Tag of the Project being run.

  • verbose – Verbose output of workflow.

  • threads – Number of threads in workflow. Defaults to bioprov.config.threads

  • sep – Separator if input_type is ‘dataframe’.

  • _log_to_file – Whether to write log to file.

  • log – Path of the file to write the log to. Default is f’{workflow.tag}.log’.

  • update_db – Whether to automatically update the BioProv DB when running the workflow.

  • write_provn – Write PROVN output at the end of the workflow.

  • write_pdf – Write graphical output at the end of the workflow.

  • upload_to_provstore – Upload BioProvDocument to ProvStore at the end of the workflow.

  • kwargs – Other keyword arguments to be passed to workflow.

add_step(step)

Updates self.parser and self.steps with an instance of Step.

Parameters

step – An instance of Step containing a PresetProgram.

property bioprovdocument
create_provenance()
generate_parser()
generate_project()

Generate Project instance from input.

Returns

Project instance.

run_steps(steps_to_run)

Runs steps for each sample.

Parameters

steps_to_run – Comma-delimited string of steps to run.
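Turning a comma-delimited string into an ordered list of steps might look like the sketch below. `select_steps` is a hypothetical helper; the fallback to steps marked as default and the error on unknown names are assumptions of this example, and the step names are invented.

```python
def select_steps(steps, steps_to_run=None):
    """Hypothetical helper. steps maps a step name to its 'default' flag."""
    if steps_to_run is None:
        # Nothing requested: run only the steps marked as default.
        return [name for name, default in steps.items() if default]
    requested = [s.strip() for s in steps_to_run.split(",") if s.strip()]
    unknown = [s for s in requested if s not in steps]
    if unknown:
        raise ValueError(f"Unknown steps: {unknown}")
    return requested


steps = {"align": True, "annotate": False}
print(select_steps(steps, "align, annotate"))  # ['align', 'annotate']
print(select_steps(steps))                     # ['align']
```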

start_logging()

Module contents

Init file for the src/ package.