bioprov.src package¶
Submodules¶
bioprov.src.config module¶
Contains the Config class and other package-level settings.
Define your configurations in the ‘config’ variable at the end of the module.
- class bioprov.src.config.BioProvDB(path)¶
Bases:
tinydb.database.TinyDB
Inherits from tinydb.TinyDB
Class to hold database configuration and methods.
- __init__(path)¶
Create a new instance of TinyDB.
- clear_db(confirm=False)¶
Deletes the local BioProv database.
- Parameters
confirm –
- class bioprov.src.config.Config(db_path=None, threads=0)¶
Bases:
object
Class to define package level variables and settings.
- __init__(db_path=None, threads=0)¶
- Parameters
db_path – Path to database file. Default is bioprov_directory/db.json
threads – Number of threads. Default is half of processors.
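The two defaults described above can be sketched with the standard library. This is an illustration only: `default_threads` and `default_db_path` are hypothetical helpers, and both the rounding and the minimum of one thread are assumptions rather than confirmed BioProv behavior.

```python
import os
from pathlib import Path


def default_threads():
    """Return half of the available processors, but never fewer than one."""
    cpus = os.cpu_count() or 1  # os.cpu_count() may return None
    return max(1, cpus // 2)


def default_db_path(bioprov_directory):
    """Return the db.json path inside the given package directory."""
    return Path(bioprov_directory) / "db.json"


print(default_threads(), default_db_path("/opt/bioprov"))
```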
- clear_db(confirm=False)¶
Deletes the local BioProv database.
- Parameters
confirm –
- create_provstore_file(user=None, token=None)¶
- property db¶
- db_all()¶
- Returns
List all items in BioProv database.
- property db_path¶
- property logger¶
- property provstore_api¶
- property provstore_file¶
- property provstore_token¶
- property provstore_user¶
- read_provstore_file()¶
Attempts to read self.provstore_file. Will prompt to create one if unable to retrieve credentials.
- Returns
Updates self.provstore_user and self.provstore_token.
- serializer()¶
- class bioprov.src.config.Environment¶
Bases:
object
Class containing provenance information about the current environment.
- __init__()¶
Class constructor. All attributes are empty and are initialized with self.update()
- property actedOnBehalfOf¶
- serializer()¶
- update()¶
Checks the current environment and updates attributes using os.environ.
- Returns
Sets attributes to self.
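As a rough illustration of reading environment information from `os.environ`, the sketch below collects a few variables into a dictionary. The source does not say which variables BioProv records, so the function name `snapshot_environment` and the default keys are assumptions.

```python
import os


def snapshot_environment(keys=("USER", "HOME", "SHELL")):
    """Return a dict with the selected environment variables (None if unset)."""
    # os.environ behaves like a mapping, so .get() avoids a KeyError.
    return {key: os.environ.get(key) for key in keys}


env = snapshot_environment()
print(sorted(env))
```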
bioprov.src.files module¶
Contains the File and SeqFile classes and related functions.
- class bioprov.src.files.Directory(path, tag=None)¶
Bases:
object
Class for holding information about directories.
- __init__(path, tag=None)¶
- add_files_to_object(object_, kind='files')¶
Add files or subdirectories in self to object_, which can be either a Sample or a Project.
- Parameters
object_ – bioprov.Project or bioprov.Sample.
kind – Whether to add files or subdirectories.
- Returns
Updates object_.files
- property exists¶
- get_files()¶
- get_subdirs()¶
- replace_path(old_terms, new, warnings=False)¶
Replace the current File path.
Usually used for switching between users.
- Parameters
old_terms – Terms to be replaced in the path.
new – New term.
warnings – Whether to warn if sha256 checksum differs or file does not exist.
- Returns
Updates self.
- serializer()¶
- class bioprov.src.files.File(path, tag=None, attributes=None, _get_hash=True)¶
Bases:
object
Class for holding files and file information.
- __init__(path, tag=None, attributes=None, _get_hash=True)¶
- Parameters
path – A UNIX-like file path.
tag – optional tag describing the file.
attributes – Miscellaneous attributes.
- property entity¶
- property exists¶
- property raw_size¶
- replace_path(old_terms, new, warnings=False)¶
Replace the current File path.
Usually used for switching between users.
- Parameters
old_terms – Terms to be replaced in the path.
new – New term.
warnings – Whether to warn if sha256 checksum differs or file does not exist.
- Returns
Updates self.
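The idea of swapping path terms, for example when a project moves between users, can be sketched with plain string replacement. This standalone `replace_path` function is a hypothetical stand-in that works on strings; it does not reproduce the method's checksum or existence warnings.

```python
def replace_path(path, old_terms, new):
    """Replace every occurrence of each old term in path with the new term."""
    if isinstance(old_terms, str):
        old_terms = [old_terms]  # accept a single term as well as a list
    for term in old_terms:
        path = path.replace(term, new)
    return path


print(replace_path("/home/alice/project/genome.fasta", ["/home/alice"], "/home/bob"))
```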
- serializer()¶
- property sha256¶
- property size¶
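A SHA-256 checksum and a raw size for a file can be obtained with the standard library alone. The sketch below is one common way to do it and is not taken from the BioProv source; `file_sha256`, `file_size`, and the chunk size are assumptions.

```python
import hashlib
import os


def file_sha256(path, chunk_size=65536):
    """Return the hexadecimal SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        # Reading in chunks keeps memory use flat for large sequence files.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def file_size(path):
    """Return the file size in bytes."""
    return os.path.getsize(path)
```

Comparing a stored digest with a freshly computed one is how a changed file can be detected after a path is replaced.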
- class bioprov.src.files.SeqFile(path, tag=None, format='fasta', parser='seq', document=None, import_records=False, calculate_seqstats=False)¶
Bases:
bioprov.src.files.File
Class for holding sequence file and sequence information. Inherits from File.
This class supports records parsed with Biopython’s Bio.SeqIO module.
- __init__(path, tag=None, format='fasta', parser='seq', document=None, import_records=False, calculate_seqstats=False)¶
- Parameters
path – A UNIX-like file path.
tag – optional tag describing the file.
format – Format to be parsed by SeqIO.parse()
parser – Bio parser to be used. Can be ‘seq’ (default) to be parsed by SeqIO or ‘align’ to be parsed with AlignIO.
document – prov.model.ProvDocument.
import_records – Whether to import sequence data as Bio objects
calculate_seqstats – Whether to calculate SeqStats
- property generator¶
- import_records(**kwargs)¶
- Parameters
kwargs – Parameters to pass to the SeqFile._calculate_seqstats() function.
- Returns
Import records into self.
- property max_seq¶
- property min_seq¶
- seqfile_formats = ('fasta', 'clustal', 'fastq', 'fastq-sanger', 'fastq-solexa', 'fastq-illumina', 'genbank', 'gb', 'nexus', 'stockholm', 'swiss', 'tab', 'qual')¶
- property seqstats¶
- serializer()¶
- class bioprov.src.files.SeqStats(number_seqs: int, total_bps: int, mean_bp: float, min_bp: int, max_bp: int, N50: int, GC: float)¶
Bases:
object
Dataclass to describe sequence statistics.
- GC: float¶
- N50: int¶
- __init__(number_seqs: int, total_bps: int, mean_bp: float, min_bp: int, max_bp: int, N50: int, GC: float) None ¶
- max_bp: int¶
- mean_bp: float¶
- min_bp: int¶
- number_seqs: int¶
- total_bps: int¶
- bioprov.src.files.calculate_N50(array)¶
Calculate N50 from an array of contig lengths. https://github.com/vikas0633/python/blob/master/N50.py
Based on the Broad Institute definition: https://www.broad.harvard.edu/crd/wiki/index.php/N50
- Parameters
array – list of contig lengths.
- Returns
N50 value.
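For orientation, here is a minimal sketch of the widely used N50 definition: the length of the shortest contig such that contigs of that length or longer cover at least half of the total assembly. It is not the linked implementation, and the two may disagree when the cumulative sum lands exactly on the halfway point; the name `n50` is hypothetical.

```python
def n50(lengths):
    """Return N50 for a non-empty list of contig lengths."""
    if not lengths:
        raise ValueError("lengths must not be empty")
    ordered = sorted(lengths, reverse=True)
    half = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        # The first contig that pushes the running total past half is N50.
        if running >= half:
            return length


print(n50([10, 20, 30, 40, 50]))
```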
- bioprov.src.files.deserialize_files_dict(files_dict)¶
Deserialize a dictionary of files in JSON format.
- Parameters
files_dict – dict of dicts.
- Returns
dict of File instances.
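The round trip from objects to a JSON dict of dicts and back can be sketched as follows. `FileRecord` is a hypothetical two-field stand-in, not bioprov's File class, and `deserialize_records` only mirrors the shape of the operation described above.

```python
import json
from dataclasses import asdict, dataclass


@dataclass
class FileRecord:
    """Hypothetical stand-in for a file object with a path and a tag."""
    path: str
    tag: str


def deserialize_records(files_dict):
    """Turn a dict of dicts (as loaded from JSON) into a dict of FileRecord."""
    return {tag: FileRecord(**fields) for tag, fields in files_dict.items()}


original = {"assembly": FileRecord("/data/s1.fasta", "assembly")}
as_json = json.dumps({tag: asdict(record) for tag, record in original.items()})
restored = deserialize_records(json.loads(as_json))
print(restored)
```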
- bioprov.src.files.seqrecordgenerator(path, format, parser='seq', warnings=False)¶
- Parameters
path – Path to file.
format – format to pass to SeqIO.parse().
parser – Whether to import records with SeqIO (default) or AlignIO
warnings – Whether to warn if sha256 checksum differs or file does not exist.
- Returns
A generator of SeqRecords.
bioprov.src.main module¶
Main source module. Contains the main BioProv classes.
- Activity classes:
Program
Parameter
Run
- Entity classes:
Project
Sample
This module also contains functions to read and write objects in JSON and tab-delimited formats.
- class bioprov.src.main.Parameter(key=None, value='', tag=None, cmd_string=None, description=None, kind=None, keyword_argument=True, position=-1)¶
Bases:
object
Class holding information for parameters.
- __init__(key=None, value='', tag=None, cmd_string=None, description=None, kind=None, keyword_argument=True, position=-1)¶
- Parameters
key – Key of the parameter, e.g. ‘-h’ for help command.
value – Value of the parameter.
tag – A tag of the parameter.
cmd_string – String representation of the parameter in a command.
description – description of the parameter.
kind – Kind of parameter. May be ‘input’, ‘output’, ‘misc’, or None.
keyword_argument – Whether the parameter is a keyword argument. Keyword arguments have a key, which is used to build the program’s command. If this is False, the parameter is assumed to be a positional argument, and ‘position’ indicates its index when the command line is split into a list.
position – Index of insertion of parameter in command-line if it is a positional argument.
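To make the difference between keyword and positional parameters concrete, the sketch below builds a command string from both kinds. `Param` and `build_command` are hypothetical and are not the library's classes; treating a negative position as "append to the end" is an assumption made for this example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Param:
    """Hypothetical, reduced version of a command-line parameter."""
    key: Optional[str] = None
    value: str = ""
    keyword_argument: bool = True
    position: int = -1


def build_command(program, params):
    """Build a command string from keyword and positional parameters."""
    tokens = [program]
    # Keyword arguments contribute their key, followed by a value if present.
    for param in params:
        if param.keyword_argument:
            tokens.append(param.key)
            if param.value != "":
                tokens.append(str(param.value))
    positional = [p for p in params if not p.keyword_argument]
    # Non-negative positions are indexes into the split command line.
    for param in sorted((p for p in positional if p.position >= 0),
                        key=lambda p: p.position):
        tokens.insert(param.position, str(param.value))
    # Assumption: negative positions mean "place at the end".
    for param in positional:
        if param.position < 0:
            tokens.append(str(param.value))
    return " ".join(tokens)


command = build_command("prog", [
    Param("-i", "in.txt"),
    Param(value="out.txt", keyword_argument=False),
    Param(value="sub", keyword_argument=False, position=1),
])
print(command)
```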
- serializer()¶
- class bioprov.src.main.PresetProgram(name=None, params=None, sample=None, input_files=None, output_files=None, preffix_tag=None, extra_flags=None)¶
Bases:
bioprov.src.main.Program
Class for holding a preset program and related functions.
A WorkflowStep instance inherits from Program and consists of an instance of Program with an associated instance of Sample or Project.
- __init__(name=None, params=None, sample=None, input_files=None, output_files=None, preffix_tag=None, extra_flags=None)¶
- Parameters
name – Instance of bioprov.Program
params – Dictionary of parameters.
sample – An instance of Sample or Project.
input_files – A dictionary consisting of Parameter keys as keys and a File.tag as value, where File.tag is a string that must be a key in self.sample.files with a corresponding existing file.
output_files – A dictionary consisting of Parameter keys as keys and a tuple consisting of (File.tag, suffix) as value. File.tag will become a key in self.sample.files and its value will be the sample_name + suffix.
preffix_tag – A value in the input_files argument, which corresponds to a key in self.sample.files. All file names of output files will be stemmed from this file, hence ‘preffix’.
extra_flags – A list of command line parameters (strings), known as flags or switches, to add to the program’s command.
- create_func(sample, preffix_tag=None)¶
- Parameters
sample – Instance of Sample to create the function for.
preffix_tag – Argument to be passed to self._parse_output_files()
- Returns
Creates Program function for Sample.
- generate_cmd()¶
TODO: improve this function
Generates a wildcard command string, independent of samples.
- Returns
Updates self.cmd.
- run(sample=None, preffix_tag=None, **kwargs)¶
Runs PresetProgram for sample.
- Parameters
sample – Instance of bioprov.Sample.
preffix_tag – Prefix tag passed to self.create_func()
kwargs – See help of Program.run()
- validate_program()¶
Checks the type of self.
- validate_sample()¶
Checks the type of self.sample.
- class bioprov.src.main.Program(name=None, params=None, tag=None, path_to_bin=None, version=None, cmd=None, sample=None)¶
Bases:
object
Class for holding information about programs.
- __init__(name=None, params=None, tag=None, path_to_bin=None, version=None, cmd=None, sample=None)¶
- Parameters
name – Name of the program being called.
params – Dictionary of parameters.
tag – Tag to call the program if different from name. Default: self.name
path_to_bin – A full path to the program’s binary. Default: get from self.name.
cmd – A command string to call the program. Default: build from self._path and self.params.
version – Version of the program.
sample – Instance of bioprov.Sample
- add_parameter(parameter, _generate_cmd=True)¶
Adds a parameter to the current instance and updates the command.
- Parameters
parameter – an instance of the Parameter class.
_generate_cmd – Refreshes self.cmd when a Parameter is added.
- Returns
Updates self.params and self.cmd if _generate_cmd is True.
- add_runs(runs)¶
Sample method to add runs.
- Parameters
runs – See input to add_runs function.
- Returns
Adds runs to Sample
- property duration¶
- property end_time¶
- property finished¶
- generate_cmd()¶
Generates command string to execute.
- Returns
command string
- run(sample=None, suppress_stdout=True, suppress_stderr=True, force_print=False)¶
Runs the process.
- Parameters
sample – An instance of bioprov.Sample.
suppress_stdout – Whether to suppress the program’s stdout.
suppress_stderr – Whether to suppress the program’s stderr.
force_print – Whether to force printing the output of the program.
- Returns
An instance of the Run class.
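Running an external command while capturing its output and timing can be sketched with `subprocess`. The function `run_command`, the returned dictionary, and the status labels are hypothetical; they only echo the properties listed for Program (stdout, stderr, status, start_time, end_time, duration) and do not reproduce the Run class.

```python
import subprocess
import sys
from datetime import datetime


def run_command(args, suppress_stdout=True, suppress_stderr=True):
    """Run a command, capture its output, and record timing information."""
    start = datetime.now()
    completed = subprocess.run(args, capture_output=True, text=True)
    end = datetime.now()
    # Output is always captured; the flags only control whether it is echoed.
    if not suppress_stdout and completed.stdout:
        print(completed.stdout, end="")
    if not suppress_stderr and completed.stderr:
        print(completed.stderr, end="", file=sys.stderr)
    return {
        "stdout": completed.stdout,
        "stderr": completed.stderr,
        # Hypothetical status labels chosen for this sketch.
        "status": "Finished" if completed.returncode == 0 else "Error",
        "start_time": start,
        "end_time": end,
        "duration": end - start,
    }


result = run_command([sys.executable, "-c", "print('hello')"])
print(result["status"], result["duration"])
```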
- property runs¶
- serializer()¶
- property start_time¶
- property status¶
- property stderr¶
- property stdin¶
- property stdout¶
- class bioprov.src.main.Project(tag=None, samples=None, db=None, auto_update=False, log_to_file=False)¶
Bases:
object
Class which holds a dictionary of Sample instances, where each key is the sample name.
- __init__(tag=None, samples=None, db=None, auto_update=False, log_to_file=False)¶
Initiates the object by creating a sample dictionary.
- Parameters
tag – A tag to describe the Project.
samples – An iterator of Sample objects.
db – Path to the TinyDB used to store the project in JSON format.
auto_update – Whether to automatically update the BioProvDB record. Disabled by default.
log_to_file – Whether to log the Project to a file. You can define this later with the self.start_logging() method.
- add_files(files)¶
Adds Files to self.files. See documentation to bioprov.src.main.add_files().
- Parameters
files – Dict, list, or File instance.
- Returns
Updates self.files
- add_programs(programs)¶
Add programs to self. See documentation to bioprov.src.main.Program.
- Parameters
programs – Dict, list, or Program instance.
- Returns
Updates self.programs
- auto_update_db()¶
Updates the database if auto_update is True.
- build_sample_dict(constructor)¶
Build sample dictionary from passed constructor.
- Parameters
constructor – Iterable or NoneType.
- Returns
dictionary of sample instances.
- property bundle¶
- property entity¶
- static is_iterator(constructor)¶
Checks if the constructor passed is a valid iterable, or None.
- Parameters
constructor – constructor object used to build a Project instance.
- static is_sample_and_name(sample)¶
Checks if an object is of the Sample class. Names the sample if it isn’t named.
- Parameters
sample – an object of the Sample class.
- items()¶
- keys()¶
- property name¶
- property namespace_preffix¶
- query_db(db=None)¶
- replace_paths(old_terms, new, warnings=False)¶
Runs File.replace_path(old_terms, new) on all Files in the project and each Sample.
For more information see File.replace_path() documentation.
- Parameters
old_terms – old terms to be replaced.
new – new term.
warnings – whether to activate warnings.
- Returns
Updates all Files associated with self.
- run_programs()¶
Runs all programs in self.programs in order.
- Returns
- property samples¶
- serializer()¶
- property sha256¶
- start_logging(log_file=None, level=20, _custom_start_message=None)¶
Starts logging Project to File.
- Parameters
log_file – Path to log file. If None will be defined automatically.
level – Logging level.
_custom_start_message – Custom starting message to start the log.
- Returns
Creates logger attributes and refreshes bp.config.logger
- to_csv(path_=None, sep=',', **kwargs)¶
Writes a delimited file (comma-separated by default) of sample files and attributes using the to_df method.
- to_df()¶
Creates a Pandas DataFrame from Sample files and attributes.
- Returns
pd.DataFrame
- to_json(_path=None)¶
Exports the Project as JSON. Similar to Sample.to_json()
- Parameters
_path – JSON output file path.
- update_db(db=None)¶
- values()¶
- class bioprov.src.main.Run(program, sample=None)¶
Bases:
object
Class for holding Run information about a selected Program.
- __init__(program, sample=None)¶
- Parameters
program – An instance of bioprov.Program.
sample – An instance of bioprov.Sample
- run()¶
Runs the process for the Run instance. Will update attributes accordingly.
- Returns
self
- serializer()¶
- property status¶
- class bioprov.src.main.Sample(name=None, tag=None, files=None, directory=None, attributes=None)¶
Bases:
object
Class for holding sample information and related files and programs.
- __init__(name=None, tag=None, files=None, directory=None, attributes=None)¶
- Parameters
name – Sample name or ID.
tag – optional tag describing the sample.
files – Dictionary of files associated with the sample.
directory – A bioprov.Directory associated with the sample
attributes – Dictionary of any other attributes associated with the sample.
- add_files(files)¶
Sample method to add files.
- Parameters
files – See input to add_files function.
- Returns
Adds files to Sample
- add_programs(programs)¶
Adds program(s) to self. Must be an instance or iterable of bioprov.Program.
- Parameters
programs – A bioprov.Program instance, an iterator of Program instances, or a dictionary where each key is the program name and each value is a bp.Program instance.
- Returns
Updates self by adding the programs to object.
- auto_update_db()¶
- property directory¶
- run_programs()¶
Runs self._programs in order.
- serializer()¶
Custom serializer for Sample class. Serializes runs, programs, and files attributes.
- to_json(_path=None)¶
Exports the Sample as JSON. Similar to Project.to_json()
- Parameters
_path – JSON output file path.
- to_series()¶
Creates a pd.Series object from the sample files and attributes.
- Returns
pd.Series
- bioprov.src.main.deserialize_programs_dict(programs_dict, sample)¶
Deserialize programs from JSON format
- Parameters
programs_dict – dictionary of serialized Programs in JSON format
sample – instance of bioprov.Sample
- Returns
dictionary of Program instances
- bioprov.src.main.deserialize_runs_dict(runs_dict, programs_dict, tag, sample)¶
Deserialize runs in JSON format.
- Parameters
runs_dict – dictionary of Runs in JSON format.
programs_dict – dictionary of Program instances to be updated.
tag – Tag of each program.
sample – Sample to be updated.
- Returns
- bioprov.src.main.dict_to_sample(json_dict)¶
Converts a JSON dictionary to a sample instance.
- Parameters
json_dict – output of sample_from_json.
- Returns
a Sample instance.
- bioprov.src.main.from_df(df, index_col=0, file_cols=None, sequencefile_cols=None, tag=None, source_file=None, import_data=False)¶
Pandas-like function to build a Project object.
By default, assumes the sample names or ids are in the first column, else they should be specified by ‘index_col’ arg.
Example:
    samples = from_df(df)
    type(samples)  # bioprov.src.main.Project
You can select columns to be added as File or SeqFile instances with the file_cols and sequencefile_cols arguments.
- Parameters
df – A pandas DataFrame
index_col – A column to be used as index. Must be in df.columns. If an int is passed, the column at that position is used.
file_cols – Columns containing Files.
sequencefile_cols – Columns containing SequenceFiles.
tag – A tag to describe the Project.
source_file – The source file used to read the dataframe.
import_data – Whether to import data when importing SequenceFiles
- Returns
a Project instance
- bioprov.src.main.from_json(json_file, kind='Project', replace_path=None, replace_home=False)¶
Imports Sample or Project from JSON file.
- Parameters
json_file – A JSON file created by Sample.to_json()
kind – Whether to create a Sample or Project instance.
replace_path – A tuple or list with two strings. The first will be the old path to be replaced, and the second will be the new.
replace_home – If True, will run replace_path automatically for previous HOME paths.
- Returns
a Sample or Project instance.
- bioprov.src.main.generate_param_str(params)¶
TODO: improve this docstring
Generates a string from a dictionary of parameters.
- Parameters
params – Dictionary of parameters.
- bioprov.src.main.json_to_dict(json_file)¶
Reads dict from a JSON file.
- Parameters
json_file – A JSON file created by Sample.to_json()
- Returns
a dictionary (input to dict_to_sample())
- bioprov.src.main.load_project(tag, db=None, import_records=False)¶
Loads Project from the BioProvDatabase set in the config.
- Parameters
tag – Tag of the Project to be loaded.
db – Path to BioProvDB file. Default is set in the config module (use the bioprov --show_db command).
import_records – Whether to import the sequence records. Unnecessary if this data is already recorded in the Project.
- Returns
Instance of Project.
- bioprov.src.main.parse_params(params)¶
Function used to parse parameter input.
- Parameters
params – An instance or iterator of Parameter instances or a dictionary.
- Returns
Parsed parameters to serve as attribute to a Program or Run instance.
- bioprov.src.main.read_csv(df_path, sep=',', **kwargs)¶
- Parameters
df_path – Path of dataframe.
sep – Separator of dataframe.
kwargs – Any kwargs to be passed to from_df()
- Returns
A Project instance.
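The way a delimited table maps onto samples, files, and attributes can be sketched without pandas. The example uses the standard `csv` module as a stand-in for the DataFrame step; `read_samples` and the nested-dictionary layout are hypothetical and do not reproduce the Project or Sample classes.

```python
import csv
import io


def read_samples(handle, sep=",", index_col=0, file_cols=()):
    """Read a delimited table into {sample_name: {"files", "attributes"}}."""
    reader = csv.reader(handle, delimiter=sep)
    header = next(reader)
    # index_col may be a column position or a column name.
    index_name = header[index_col] if isinstance(index_col, int) else index_col
    samples = {}
    for row in reader:
        record = dict(zip(header, row))
        name = record.pop(index_name)
        # Columns listed in file_cols hold file paths; the rest are attributes.
        files = {col: record.pop(col) for col in file_cols if col in record}
        samples[name] = {"files": files, "attributes": record}
    return samples


table = io.StringIO(
    "sample-id,assembly,habitat\n"
    "s1,/data/s1.fasta,soil\n"
    "s2,/data/s2.fasta,marine\n"
)
project = read_samples(table, file_cols=["assembly"])
print(project["s1"])
```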
- bioprov.src.main.to_json(object_, dictionary, _path=None)¶
Exports the Sample or Project as JSON.
- Returns
Writes JSON output
bioprov.src.prov module¶
Module containing base provenance attributes.
This module extracts system-level information, such as user and environment settings, and stores them. It is invoked to export provenance objects.
- class bioprov.src.prov.BioProvDocument(project, add_attributes=False, add_users=True, _add_project_namespaces=True, _iter_samples=True, _iter_project=True)¶
Bases:
object
Class containing base provenance information for a Project.
- __init__(project, add_attributes=False, add_users=True, _add_project_namespaces=True, _iter_samples=True, _iter_project=True)¶
Constructs the W3C-PROV document for a project.
- Parameters
project (Project) – instance of bioprov.src.Project.
add_attributes (bool) – whether to add object attributes.
add_users (bool) – whether to add users and environments.
_add_project_namespaces (bool) –
_iter_samples (bool) –
_iter_project (bool) –
- property dot¶
- property provn¶
- property provstore_document¶
- upload_to_provstore(api=None)¶
Uploads self.ProvDocument to ProvStore (https://openprovenance.org/store/)
- Parameters
api – provstore.api.Api
- Returns
Sends POST request to ProvStore API and updates self.ProvDocument if successful.
- write_provn(path=None)¶
Writes PROVN output of document.
- Parameters
path – Path to write file.
- Returns
Writes file.
bioprov.src.workflow module¶
Contains the Workflow class and related functions.
- class bioprov.src.workflow.Step(preset_program, default=False, description='', kind='Sample')¶
Bases:
bioprov.src.main.PresetProgram
Class for holding workflow steps.
Steps are basically PresetProgram instances but they do not have any Sample associated with them, and always generate command strings.
- __init__(preset_program, default=False, description='', kind='Sample')¶
- Parameters
preset_program – Instance of bioprov.PresetProgram.
default – Whether the Step runs by default.
description – Description of the step program.
kind – Whether the Step is associated with a Sample or Project.
- class bioprov.src.workflow.Workflow(name=None, description=None, input=None, input_type='dataframe', index_col='sample-id', file_columns=None, file_extensions=None, steps=None, parser=None, tag=None, verbose=None, threads=None, sep='\t', log=None, _log_to_file=True, update_db=False, upload_to_provstore=False, write_provn=False, write_pdf=False, **kwargs)¶
Bases:
object
Workflow class. Used to build workflows for BioProv command line.
A workflow runs a series of steps (bioprov.Program) on a set of samples (bioprov.Project).
- __init__(name=None, description=None, input=None, input_type='dataframe', index_col='sample-id', file_columns=None, file_extensions=None, steps=None, parser=None, tag=None, verbose=None, threads=None, sep='\t', log=None, _log_to_file=True, update_db=False, upload_to_provstore=False, write_provn=False, write_pdf=False, **kwargs)¶
- Parameters
name – Name of the workflow, with no spaces.
description – A brief (one sentence) description of the workflow.
input – Input of workflow. May be a directory or a tab-delimited file.
input_type – Input type of the workflow. Choose from (‘directory’, ‘dataframe’, ‘both’)
index_col – Name of index column which will define sample names if input_type is ‘dataframe’.
file_columns – Name of columns containing files if input_type is ‘dataframe’. Name of file tag if input_type is ‘directory’.
file_extensions – Extension of files if input_type is ‘directory’.
steps – Dictionary of steps. May also receive a list, tuple or None.
parser – argparse.ArgumentParser object used to construct the workflow’s command-line application.
tag – Tag of the Project being run.
verbose – Verbose output of workflow.
threads – Number of threads in workflow. Defaults to bioprov.config.threads
sep – Separator if input_type is ‘dataframe’.
_log_to_file – Whether to write log to file.
log – Path of the file to write the log to. Default is f’{workflow.tag}.log’.
update_db – Whether to automatically update the BioProv DB when running the workflow.
write_provn – Write PROVN output at the end of the workflow.
write_pdf – Write graphical output at the end of the workflow.
upload_to_provstore – Upload BioProvDocument to ProvStore at the end of the workflow.
kwargs – Other keyword arguments to be passed to workflow.
- add_step(step)¶
Updates self.parser and self.steps with an instance of Step.
- Parameters
step – An instance of Step containing a PresetProgram.
- property bioprovdocument¶
- create_provenance()¶
- generate_parser()¶
- generate_project()¶
Generate Project instance from input.
- Returns
Project instance.
- run_steps(steps_to_run)¶
Runs steps for each sample.
- Parameters
steps_to_run – Comma-delimited string of steps to run.
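Selecting steps from a comma-delimited string can be sketched as below. `select_steps` is hypothetical: falling back to the default steps when no string is given, rejecting unknown names, and preserving workflow order are all assumptions made for this illustration.

```python
def select_steps(steps, steps_to_run=None):
    """Pick step names to run.

    steps maps each step name to whether it runs by default.
    steps_to_run is a comma-delimited string such as "assemble,annotate".
    """
    if steps_to_run is None:
        # Assumption: with no explicit request, only default steps run.
        return [name for name, default in steps.items() if default]
    requested = [name.strip() for name in steps_to_run.split(",") if name.strip()]
    unknown = [name for name in requested if name not in steps]
    if unknown:
        raise ValueError(f"Unknown steps: {', '.join(unknown)}")
    # Keep the order in which the steps were defined in the workflow.
    return [name for name in steps if name in requested]


steps = {"assemble": True, "annotate": False, "classify": True}
print(select_steps(steps))
print(select_steps(steps, "classify, annotate"))
```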
- start_logging()¶
Module contents¶
Init file for the src/ package.