The liquorice python package¶
liquorice.LIQUORICE module¶
- class liquorice.LIQUORICE.FullPaths(option_strings, dest, nargs=None, const=None, default=None, type=None, choices=None, required=False, help=None, metavar=None)[source]¶
Bases:
argparse.Action
Expand user- and relative-paths. From: https://gist.github.com/brantfaircloth/1252339. Does not convert the path to the blacklist if the value is ‘hg38’ or ‘10k_random, which is interpreted as an instruction to use the shipped list instead.
- liquorice.LIQUORICE.main()[source]¶
Main function for the LIQUORICE command line tool. Performs the following steps: Sanity checks for user input, setup of environment variables and general setup-steps, handles file paths, selects which steps need to be performed based on user input, logs the progress and performs the actual LIQUORICE analysis by calling
liquorice.utils.GlobalFragmentSize.get_list_of_fragment_lengths_and_avg_readlength()
,liquorice.utils.MeanSequencingDepth.sample_bam_to_get_sequencing_depth()
,liquorice.utils.Workflows.train_biasmodel_for_sample()
,liquorice.utils.Workflows.run_liquorice_on_regionset_with_pretrained_biasmodel()
, and/orliquorice.utils.Workflows.run_liquorice_train_biasmodel_on_same_regions()
after generating the corresponding objects.
liquorice.LIQUORICE_summary module¶
- liquorice.LIQUORICE_summary.boxplot_score_summary(summary_df, comparison_col, use_uncorrected_coverage)[source]¶
Summarize scores via boxplots, with one box per group (case/control) - region-set combination.
- Parameters
summary_df (
DataFrame
) – The output of the functioncreate_summary_table_LIQUORICE()
.comparison_col (
str
) – Use this column for the y axis of the boxplots.use_uncorrected_coverage (
bool
) – If True, indicate in the output filename that the uncorrected coverage scores are shown.
- liquorice.LIQUORICE_summary.check_difference_to_control(row, control_df, col, prediction_interval_control_col, negative_is_strong)[source]¶
For a row in a pandas.DataFrame, find control values for the same regionset and return if sample is significantly different compared to the controls, as measured by metric col. A sample is deemed significantly different if
its score in metric col lies outside the prediction interval of the control group.
- Parameters
row (
Series
) – A pandas.Series with the columns ‘region-set’, ‘is control’, and col, corresponding to a single sample.control_df (
DataFrame
) – A pandas.DataFrame with at least the columns ‘region-set’, and col. Should only contain data of control samples.col (
str
) – Use this column as a metric to determine differences.prediction_interval_control_col (
str
) – Name of the column that contains the prediction interval of the control group.negative_is_strong (
bool
) – Set to True if a very negative value of the metric is associated with a strong dip signal, such as for the dip area.
- Return type
str
- Returns
A string indicating the result of the comparison, or np.nan if row[col] is NaN.
- liquorice.LIQUORICE_summary.create_summary_table_LIQUORICE(dirname, control_name_list, these_regionsets_only, use_uncorrected_coverage=False, prediction_interval_alpha=0.05)[source]¶
For a LIQUORICE result directory, creates and writes to csv a pd.DataFrame summarizing the coverage drop metrics for all samples and region-sets. If control_name_list is given, compares the case samples to control samples and assesses significant differences, separately for each region-set.
- Parameters
use_uncorrected_coverage (
bool
) – If True, summarize the coverage profile that is not corrected for biases by LIQUORICE instead of the corrected coverage.these_regionsets_only (
List
[str
]) – Summarize only data for these regionsets.dirname (
str
) – Output directory of LIQUORICE in which to search for fitted_gaussians_parameter_summary.csv (or uncorrected_fitted_gaussians_parameter_summary.csv) filescontrol_name_list (
List
[str
]) – Sample names of the controls, which will be used as reference for generating z-scores.prediction_interval_alpha (
int
) – Alpha level for the prediction interval. Default 0.05: 95% prediction interval
- Return type
DataFrame
- Returns
A pandas.DataFrame in which all parameters saved in the ‘fitted_gaussians_parameter_summary.csv’ (or uncorrected_fitted_gaussians_parameter_summary.csv) files are summarized over all samples and region-sets. If control_name_list is not empty, additional columns of the DataFame contain metric comparisons to the the control samples in the form of z-scores of dip area and depth.
- liquorice.LIQUORICE_summary.get_binsize_and_extendto_from_saved_settings(dirname, regionset)[source]¶
Calls
verify_consistant_binning_settings()
and returns the binsize and extend_to settings used by LIQUORICE for a given regionset, if no error is raised.- Parameters
dirname (
str
) – Path to LIQUORICE’s working directory.regionset (
str
) – Name of the regionset of interest.
- Return type
Tuple
[int
,int
]- Returns
A tuple with two integers: binsize and extend_to.
- liquorice.LIQUORICE_summary.get_list_of_coveragefiles(dirname, regionset, use_uncorrected_coverage=False)[source]¶
Return a list of ‘corrected_coverage_mean_per_bin.csv’ or ‘uncorrected_coverage_mean_per_bin.csv’ files, or return an error if no such files could be found.
- Parameters
dirname (
str
) – path to LIQUORICE output directoryregionset (
str
) – Name of the regionset of interestuse_uncorrected_coverage (
bool
) – If True, load ‘uncorrected_coverage_mean_per_bin.csv’ files, else, load ‘corrected_coverage_mean_per_bin.csv files
- Return type
List
[str
]- Returns
A list of ‘corrected_coverage_mean_per_bin.csv’ or ‘uncorrected_coverage_mean_per_bin.csv’ file paths
- liquorice.LIQUORICE_summary.get_prediction_interval(values, alpha=0.05)[source]¶
Returns the upper and lower prediction interval for a list of values.
- Parameters
values (
List
) – Scores for which the prediction interval shall be calculatedalpha (
int
) – Alpha level for the prediction interval. Default 0.05: 95% prediction interval
- Return type
Tuple
[float
,float
]- Returns
A tuple (lower_prediction_interval, upper_prediction_interval)
- liquorice.LIQUORICE_summary.get_prediction_interval_per_row(row, control_df, col, alpha=0.05)[source]¶
For a row in a pandas.DataFrame, find control values for the same regionset and return the prediction interval.
- Parameters
row (
Series
) – A pandas.Series with the columns ‘region-set’, ‘is control’, and col, corresponding to a single sample.control_df (
DataFrame
) – A pandas.DataFrame with at least the columns ‘region-set’, and col. Should only contain data of control samples.col (
str
) – Use this column as a metric to determine differences.alpha (
int
) – Alpha level for the prediction interval. Default 0.05: 95% prediction interval
- Return type
Tuple
[float
,float
]- Returns
A tuple (lower prediction interval,higher prediction interval), or (np.nan,np.nan) if calculation is not possible.
- liquorice.LIQUORICE_summary.get_ymin_ymax_over_all_samples(coverage_filelist, normalize_by_intercept, regionset, summary_df)[source]¶
Get the the upper and lower limits for the y-axis for a given regionset, such that the same scale is used for all samples.
- Parameters
coverage_filelist (
List
[str
]) – List of ‘corrected_coverage_mean_per_bin.csv’ (or ‘uncorrected_coverage_mean_per_bin.csv’) filesnormalize_by_intercept (
bool
) – f True, extract the intercept from the fitted model to normalize the dips between samples (intercept of each sample is positioned at y=0). Otherwise, the mean coverage of each sample is positioned at y=0.regionset (
str
) – Name of the regionset of intest as given in the summary_dfsummary_df (
DataFrame
) – pd.DataFrame with columns ‘sample’, ‘region-set’, and ‘Intercept’.
- Return type
Tuple
[float
,float
]- Returns
A tuple of floats: y_min,y_max
- liquorice.LIQUORICE_summary.main()[source]¶
Main function for the LIQUOIRCE_summary tool. Calls the argument parser, checks user input, and calls the functions
verify_consistant_binning_settings()
create_summary_table_LIQUORICE()
(saving output to .csv),plot_overlay()
,plot_as_subplots()
,boxplot_score_summary()
andplot_control_distributions_and_test_normality()
.
- liquorice.LIQUORICE_summary.parse_args()[source]¶
Parses the arguments from the command line. For a full list of arguments, see the documentation of the LIQUORICE_summary command line tool.
- Returns
An argparse.ArgumentParser object storing the arguments.
- liquorice.LIQUORICE_summary.plot_as_subplots(dirname, summary_df, control_name_list, extend_to=None, binsize=None, normalize_by_intercept=True, smooth_sigma=0, significance_col='Dip area: interpretation vs controls in same region set', use_uncorrected_coverage=False, y_min_fixed=None, y_max_fixed=None)[source]¶
Summarize plots, create one set of plots per regionset. Case samples with significant differences to controls (based on significance_col) are marked by color.
smooth_sigma can be used to make the plots smoother and easier to compare, this is a visual effect only.
- Parameters
dirname (
str
) – Path to LIQUORICE’s working directory.summary_df (
DataFrame
) – The output of the functioncreate_summary_table_LIQUORICE()
.control_name_list (
List
[str
]) – List of names of samples that should be plotted as controls. Can be empty.extend_to (
Optional
[int
]) – extend_to setting that was used when running LIQUORICE. If None, infer from the binning_settings.json files that are saved by LIQUORICE by default.binsize (
Optional
[int
]) – binsize setting that was used when running LIQUORICE. If None, infer from the binning_settings.json files that are saved by LIQUORICE by default.normalize_by_intercept (
bool
) – If True, extract the intercept from the fitted model to normalize the dips between samples (intercept of each sample is positioned at y=0). Otherwise, the mean coverage of each sample is positioned at y=0.smooth_sigma (
float
) – Visually smooth the coverage signals, using this strength of smoothing. Higher values indicate stronger smoothing. Set to 0 for no smoothing.significance_col (
str
) – Use this columns as an indicator for significant differences between case and control samples. Can contain the following values: “Significantly stronger coverage drop”, “Significantly weaker coverage drop”, “n.s.”, and “N/A”.use_uncorrected_coverage (
bool
) – If True, plot the coverage profile that is not corrected for biases by LIQUORICE.y_min_fixed (
Optional
[float
]) – If specified, use this value as the minimum value for the y axis.y_min_fixed – If specified, use this value as the maximum value for the y axis.
- liquorice.LIQUORICE_summary.plot_control_distributions_and_test_normality(summary_df, col, alpha=0.05, use_uncorrected_coverage=False)[source]¶
For each region-set in summary_df, plot the distribution of the control sample’s score measured by metric col as histograms and probability plots (https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.probplot.html). Also run tests the Shapiro-Wilk test for normal distributions and prints a warning if this test detects significant deviations from the normal distribution. Plotting and testing is skipped for every region-set where n<4.
- Parameters
summary_df (
DataFrame
) – Input pandas.DataFrame with the columns “is_control” and colcol (
str
) – Name of the column that should be used for plottinguse_uncorrected_coverage (
bool
) – If True, indicate in the file name that the uncorrected coverage has been used to generate the underlying data.
- Returns
- liquorice.LIQUORICE_summary.plot_overlay(dirname, summary_df, control_name_list, extend_to=None, binsize=None, normalize_by_intercept=True, smooth_sigma=3, significance_col='Dip area: interpretation vs controls in same region set', use_uncorrected_coverage=False, alpha=0.5, linewidth=3)[source]¶
Summarize plots, create one plot per regionset. Case samples with significant differences to controls (based on significance_col) are marked by color. smooth_sigma can be used to make the plots smoother and easier to compare - this is a visual effect only.
- Parameters
dirname (
str
) – Path to LIQUORICE’s working directory.summary_df (
DataFrame
) – The output of the functioncreate_summary_table_LIQUORICE()
.control_name_list (
List
[str
]) – List of names of samples that should be plotted as controls. Can be empty.extend_to (
Optional
[int
]) – extend_to setting that was used when running LIQUORICE. If None, infer from the binning_settings.json files that are saved by LIQUORICE by default.binsize (
Optional
[int
]) – binsize setting that was used when running LIQUORICE. If None, infer from the binning_settings.json files that are saved by LIQUORICE by default.normalize_by_intercept (
bool
) – If True, extract the intercept from the fitted model to normalize the dips between samples (intercept of each sample is positioned at y=0). Otherwise, the mean coverage of each sample is positioned at y=0.smooth_sigma (
float
) – Visually smooth the coverage signals, using this strength of smoothing. Higher values indicate stronger smoothing. Set to 0 for no smoothing.significance_col (
str
) – Use this columns as an indicator for significant differences between case and control samples. Can contain the following values: “Significantly stronger coverage drop”, “Significantly weaker coverage drop”, “n.s.”, and “N/A”.use_uncorrected_coverage (
bool
) – If True, plot the coverage profile that is not corrected for biases by LIQUORICE.alpha (
float
) – Alpha (transparency) parameter for plotting.linewidth (
float
) – Linewidth parameter for plotting
- liquorice.LIQUORICE_summary.verify_consistant_binning_settings(dirname, regionset)[source]¶
Asserts that, for the given regionset, all samples have the same binning settings. Calls sys.exit() with an error message if settings are inconsistent or cannot be found.
- Parameters
regionset_list – A list of region-sets to be analyzed.
- liquorice.LIQUORICE_summary.zscore_to_control(row, control_df, col)[source]¶
For a row in a dataframe, find control values for the same regionset and return the zscore of the sample compared to these controls, as measured by metric “col”.
- Parameters
row (
Series
) – A pd.Series with the columns ‘region-set’, ‘is control’, and col, corresponding to a single sample.control_df (
DataFrame
) – A pandas.DataFrame with at least the columns ‘region-set’, and col. Should only contain data of control samples.col (
str
) – Calculate z-score by comparing this metric between the sample and controls.
- Return type
float
- Returns
z-score, rounded to 2 decimals
Subpackage: liquorice.utils¶
liquorice.utils.AggregateAcrossRegions module¶
- liquorice.utils.AggregateAcrossRegions.aggregate_across_regions(df, column_of_interest)[source]¶
- Parameters
df (
DataFrame
) – Apandas.DataFrame
, containing the columns “bin nr.” and column_of_interestcolumn_of_interest (
str
) – Column name for which the mean per bin nr. should be returned
- Return type
Series
- Returns
A
pandas.Series
with mean values of column_of_interest per bin, aggregated across all regions in thepandas.DataFrame
.
Example:
mean_corrected_coverage_leftmost_bin = aggregate_across_regions(df,"corrected coverage").loc[0]
liquorice.utils.BiasFactorWeights module¶
- liquorice.utils.BiasFactorWeights.get_GC_weights_binsize1(fragments)[source]¶
For a given fragment size distribution, returns a list corresponding to the GC weight vector for a bin of size 1
- Parameters
fragments (
List
[int
]) – A list of fragment lengths that is representative of the sample’s global fragment length distribution- Return type
List
[float
]- Returns
A list corresponding to the GC weight vector for a bin of size 1. Its length is max(fragments)*2, the weight corresponding to the bin itself is at index max(fragments). Values represent influence of a nucleotide at a given position on the bin’s coverage.
- liquorice.utils.BiasFactorWeights.get_GC_weights_binwide(binsize, fragments, dont_slide=False)[source]¶
For a given binsize, calculate the influence of a nucleotide at a given position on the bin’s coverage
- Parameters
binsize (
int
) – Size of the binfragments (
List
[int
]) – A list of fragment lengths that is representative of the samples global fragment length distributiondont_slide (
bool
) – Set to False to return weights of 1 in the bin area and 0 outside of it
- Return type
List
[float
]- Returns
A list of length (2*max(fragments)+binsize), the weight corresponding to the first position within the bin itself is at index max(fragments). Values represent influence of a nucleotide at a given position on the bin’s coverage.
- liquorice.utils.BiasFactorWeights.get_dinuc_weights_binwide(GC_weights)[source]¶
For a given binsize, calculate the influence of a given dinucleotide starting at a given position on the bin’s coverage. This simply averages the GC weight of the position of interest and the following position. For the last position, a weight of 0 is returned.
- Parameters
GC_weights (
List
[float
]) – result ofget_GC_weights_binwide()
for the appropriate binsize and fragments of interest.- Return type
List
[float
]- Returns
A list of length (2*max(fragments)+binsize), the weight corresponding to the first position within the bin itself is at index max(fragments). Values represent influence of a dinucleotide starting at a given position on the bin’s coverage.
- liquorice.utils.BiasFactorWeights.get_mapp_weights_binwide(binsize, fragments, dont_slide=False)[source]¶
For a given binsize, calculate the influence a fragment starting/ending at a given position on the bin’s coverage
- Parameters
binsize (
int
) – Size of the binfragments (
List
[int
]) – A list of fragment lengths that is representative of the samples global fragment length distributiondont_slide (
bool
) – Set to False to return weights of 1 in the bin area and 0 outside of it
- Return type
Tuple
[List
[float
]]- Returns
Two lists of length (2*max(fragments)+binsize). Values represent influence a fragment starting/ending at a given position on the bin’s coverage. First list is for forward mappability (fragments starting at the position), second list is for reverse mappability (fragments ending at the position)
- liquorice.utils.BiasFactorWeights.get_nucleotide_dicts()[source]¶
Prepare dictionaries with di-and trinucleotides, as well as a dictionary for forward/reverse complement translation.
- Return type
Tuple
[Dict
[str
,int
]]- Returns
Three dictionaries: All dinucleotides excluding reverse complements, all trinucleotides excluding reverse complements, and a dictonary that allows translation of reverse complements to their forward complement counterparts (in this order). Keys are all 0.
- liquorice.utils.BiasFactorWeights.get_trinuc_weights_binwide(GC_weights)[source]¶
For a given binsize, calculate the influence of a given trinucleotide starting at a given position on the bin’s coverage. This simply averages the GC weight of the position of interest and the following 2 positions. For the last two positions, a weight of 0 is returned.
- Parameters
GC_weights (
List
[float
]) – result of get_GC_weights_binwide for the binsize and fragments of interest.- Return type
List
[float
]- Returns
A list of length (2*max(fragments)+binsize), the weight corresponding to the first position within the bin itself is at index max(fragments). Values represent influence of a trinucleotide starting at a given position on the bin’s coverage.
liquorice.utils.BiasModel module¶
- class liquorice.utils.BiasModel.BiasModel(training_df, df_to_correct, biasmodel_path='trained_biasmodel.joblib', features='all', nr_of_bins_for_training_and_testing=10000, sklearn_model=None, n_jobs=1, filename_performance_metrics='biasmodel_performance_metrics.csv', filename_feature_importances=None, use_binsize_as_feature=False)[source]¶
Bases:
object
An object that can be used to train a machine learning model that predicts coverage based on bias factors. The performance of the trained model can be tested, and it can be used to correct coverage values by regressing out the influence of the bias factors.
- Parameters
training_df (
Optional
[DataFrame
]) – pandas.DataFrame used for training (and optionally for performance assessment) of the model. Must contain the column coverage and all columns specified under features. Can be None iftrain_biasmodel()
won’t be called. Ignored in :func`train_biasmodel_2fold_CV_and_predict`.df_to_correct (
Optional
[DataFrame
]) – pandas.DataFrame for which coverage should be corrected by the trained model. Must contain the column coverage and all columns specified under features. Can be None ifget_table_with_corrected_coverage_using_trained_biasmodel()
won’t be called.biasmodel_path (
str
) – Path to which the trained biasmodel should be saved to and/or loaded from. Must be a .joblib file. Can be None ifget_table_with_corrected_coverage_using_trained_biasmodel()
won’t be called, in that case, no biasmodel will be saved.features (
Union
[str
,List
[str
]]) – A list of bias factors that should be used as features for the machine learning model. Default “all” sets all bias-factors as features: forward,reverse, and max mappability, di/trinucleotide factors and GC-content. Bin size is not included by default, it can be added as a feature via use_binsize_as_feature.nr_of_bins_for_training_and_testing (
Optional
[int
]) – Subset the training_df to this many bins. Can speed up the computation time, but using too few bins will make the model less precise. Can be None, then all bins will be used.sklearn_model (
Optional
[SklearnStyleRegressor
]) – A regressor that implements to functions .fit() and .predict() (e.g. from sklearn). Default of None means using sklearn.ensemble.HistGradientBoostingRegressor with default settings.n_jobs (
int
) – How many jobs to run in parallel when training the modelfilename_performance_metrics (
Optional
[str
]) – If test_fraction or cross_validate_k is set, save a .csv containing the performance metrics (r2, MSE) to this path.filename_feature_importances (
Optional
[str
]) – If set, save a .csv file containing the feature importances inferred from the trained model to this path.use_binsize_as_feature (
bool
) – If True, include “bin size” as a feature for the model.
- get_table_with_corrected_coverage_using_trained_biasmodel()[source]¶
Predicts coverage based on
features
in the DataFramedf_to_correct
and the biasmodel underbiasmodel_path
. Subtracts this prediction from the observed coverage to regress out the effect of the bias-factors (i.e.features
) on the coverage.- Return type
DataFrame
- Returns
Returns
df_to_correct
with an additional column “corrected coverage”.
- train_biasmodel()[source]¶
Train a machine learning model that predicts the values of the ‘coverage’ column based on the given features. If
test_fraction
is set, this fraction of thetraining_df
is set aside to evaluate R^2 and MSE of the model. Prior to training and a potential train/test split, thetraining_df
is subsetted tonr_of_bins_for_training_and_testing
if it is not None. Writes writes a .joblib file containing the biasmodel tobiasmodel_path
(unless it is None)- Return type
None
- train_biasmodel_2fold_CV_and_predict(exclude_these_bin_nrs=[])[source]¶
Train a machine learning model that predicts the values of the ‘coverage’ column based on the given features. Will use each half of the
df_to_correct
to train the model for predictions of the other half. Ignoresnr_of_bins_for_training_and_testing
,cross_validate_k
and writes returns the performancemetrics and returns the dataframe with predictions. Important: does not use :attr`training_df`.
- Return type
None
liquorice.utils.BinTableBiasFactors module¶
- class liquorice.utils.BinTableBiasFactors.BiasFactorHandler(binsize, fragments, readlength, df, n_cores=1, skip_these_biasfactors=[])[source]¶
Bases:
object
Object used for calculation of per-bin bias factors. Typically, after creation of the object a user would call its method
get_table_with_bias_factors()
on it, and save the returned DataFrame for subsequent analysis and correction of associations of bias factors with coverage.- Parameters
binsize (
int
) – Size of the bins. Higher values to reduce noise, lower values increase spatial resolution.fragments (
List
[int
]) – A list containing fragment lengths that are representative of the sample’s global fragment size distribution. Typically a few hundred fragments will suffice here.readlength (
int
) – Average length of reads in the .bam file.df (
DataFrame
) – pandas.DataFrame with one row per bin, containing columns “chromosome”, “start”, “end”, “bin nr.”, “coverage”, “sequence”, and “mappability”. Suitable input is the output of theget_complete_table()
method of theliquorice.utils.CoverageAndSequenceTablePreparation
class object.n_cores (
int
) – Max number of cores to use for calculations.skip_these_biasfactors (
List
[str
]) – Do not calculate these bias factors. Only these entries are allowed: [“di and trinucleotides and GC content”,”mappability”, “di and trinucleotides”]
- get_GC_and_di_and_trinuc_weights(sequence)[source]¶
Calculate bias factors for GC-content as well as bias factors for all di- and trinucleotides (reverse and forward complements merged into single factors) for a given sequence. Factors are scaled between 0 and 1, 1 is highest (e.g. GC content: 0.. no G or C, 0.461… average GC content, 1… only G and C).
- Parameters
sequence (
str
) – Genomic sequence of the bin extended by max(fragments
) in both directions, such that its length matchesGC_weights
.- Return type
Dict
[str
,float
]- Returns
A dictionary with entries corresponding to the bias factors for GC-content as well as bias factors for all di- and trinucleotides (reverse and forward complements merged into single factors).
- get_GC_bias_factor(sequence)[source]¶
Returns a number in the range 0-1 that represents the bin’s overall GC bias. Factors are scaled between 0 and 1, 1 is highest (e.g. GC content: 0.. no G or C, 0.461… average GC content, 1… only G and C).
- Parameters
sequence (
str
) – Genomic sequence of the bin extended by max(fragments
) in both directions, such that its length matchesGC_weights
.- Return type
float
- Returns
A number in the range 0-1 that represents the bin’s overall GC bias.
- get_fwd_mappability_bias_factor(mappability)[source]¶
Returns a number representing the bin’s mappability for fragments on the forward strand.
- Parameters
mappability (
List
[float
]) – A vector of values between 0 and 1, representing the mappbility of positions in and around the bin of interest. Must correspond to the coordinates <Bin start coord> - max(fragments
) -readlength
to <Bin end coord> + max(fragments
) +readlength
- Return type
float
- Returns
A number in range 0-1 that represents the bin’s overall forward mappability. 1 … highest
- get_rev_mappability_bias_factor(mappability)[source]¶
Returns a number representing the bin’s mappability for fragments on the reverse strand.
- Parameters
mappability (
List
[float
]) – A vector of values between 0 and 1, representing the mappbility of positions in and around the bin of interest. Must correspond to the coordinates <Bin start coord> - max(fragments
) -readlength
to <Bin end coord> + max(fragments
) +readlength
- Return type
float
- Returns
A number in range 0-1 that represents the bin’s overall reverse mappability. 1 … highest
- get_table_with_bias_factors()[source]¶
Main method to retrieve bias factor information.
- Return type
DataFrame
- Returns
A pandas.DataFrame with one row per bin, containing columns of the input DataFrame
df
(“chromosome”, “start”, “end”, “bin nr.”, “coverage”, “sequence”) as well as newly added columns for bias factors for mappability, GC-content, and all di- and trinucleotides (with reverse and forward complements merged into single factors). Bias factors are scaled between 0 and 1. Higher values correspond to higher nucleotide content, mappability etc. Example: An average bin is expected to have a bias factor of 0.461 for humans (i.e. genomic GC content). Note that the input columns “mappability” and “sequence” will not be part of the returned dataframe in order to reduce the memory usage of this function.
liquorice.utils.CorrectForCNAs module¶
- liquorice.utils.CorrectForCNAs.correct_coverage_per_bin_for_cnas(df, cna_seg_filepath, n_cores=1)[source]¶
Extracts the log2(observed/expected) read depth ratio for the corresponding segment for each bin, and corrects each bin’s coverage accordingly.
- Parameters
df (
DataFrame
) – A pandas.DataFrame with at least the following columns: “chromosome”, “start”,”end”,”coverage”cna_seg_filepath (
str
) – A .seg file that should be used to correct the coverage by the values specified in this file. File must be tab-separated, with column names as first line. The second, third and fourth column must be chromosome, start, and end of the segment, and the last column must be the log2-ratio of observed/expected read depth.n_cores (
int
) – How many cores to use to parallelize the operation.
- Return type
DataFrame
- Returns
A pandas.DataFrame similar to the input df, with the values in the column “coverage” changed according to the CNA correction factor of each bin, and an additional column “CNA-uncorrected coverage” that contains the original coverage values.
- liquorice.utils.CorrectForCNAs.get_CNV_corrected_coverage_for_bin(bin_chrom, bin_start, bin_end, bin_coverage, seg_df)[source]¶
For a given bin, returns the coverage corrected for copy number aberrations.
- Parameters
bin_chrom (
str
) – Chromosome of the bin.bin_start (
int
) – Genomic start position of the bin.bin_end (
int
) – Genomic end position of the bin.bin_coverage (
float
) – Bin’s coverage before CNA correctionseg_df (
DataFrame
) – A pandas.DataFrame of a .seg file, with the following columns: “chromosome”, “start”,”end”, “1/log2_correction_factor”.
- Return type
float
- Returns
A float indicating the CNV-corrected coverage for the bin.
liquorice.utils.CoverageAndSequenceTablePreparation module¶
- class liquorice.utils.CoverageAndSequenceTablePreparation.CoverageAndSequenceTablePreparation(bam_filepath, bins_bed_filepath, refgenome_filepath, refgenome_chromsizes_filepath, refgenome_mappability_bigwig_path, readlength, longest_fraglen, mean_seq_depth, n_cores=1, min_mapq=20, skip_these_steps=[], **additional_crpb_kwargs)[source]¶
Bases:
object
Object used to set up a pandas.DataFrame containing basic information about each bin: Coverage (normalized for overall sequencing depth), genomic sequence & mappability (both incl. surrounding regions; used for bias correction). Usually, a user would want to call
get_complete_table()
on this object, and save the resulting pandas.DataFrame for subsequent bias-correction.- Parameters
bam_filepath (
str
) – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.bins_bed_filepath (
str
) – Path to the .bed file containing chromosome, start, end, and bin nr. of every bin to be analyzed (in that order). Output ofliquorice.utils.RegionFilteringAndBinning.Binning.write_bin_bedfile()
is a suitable input here.refgenome_filepath (
str
) – Path to the reference genome .fa(.gz) file. Must have a .fai index in the same dirname.refgenome_chromsizes_filepath (
str
) – Path to a tab-delimited file containing the chromosome sizes for the reference genome. The first column must be the chromosome name, the second column its size. Can be the .fa.fai file associated to the reference genome.refgenome_mappability_bigwig_path (
str
) – Path to a .bigWig file containing (forward) mappability values scaled between 0 and 1.readlength (
int
) – (Average) length of the reads in the .bam file.longest_fraglen (
int
) – Length by which sequencing and mappability information is extended beyond a bin’s borders. Typically, this should be the longest sampled fragment length.min_mapq (
int
) – Minimum mapping quality for a fragment to still be counted when calculating the coverage.n_cores (
int
) – Number of cores to be used by deeptool’s CountReadsPerBin, which is used for coverage calculations. Set to higher values to speed up the process.mean_seq_depth (
float
) – A float that quantifies the global coverage/sequencing depth. Coverages per bin will be normalized (i.e. divided) by this value.skip_these_steps (
List
[str
]) – A list containing the steps which should be skipped. May contain: “coverage”,”sequence”, “mappability”. Default []: Don’t skip any steps.**additional_crpb_kwargs –
Additional keywords to be used for deeptool’s CountReadsPerBin. Can be used to override the default settings: “ignoreDuplicates”:True, “samFlag_exclude”:256, “extendReads”:True
- get_complete_table()[source]¶
Main method to retrieve a pandas.DateFrame that can be used to calculate bias factors on. Calls
read_bed', :func:()
.get_coverage_per_bin_mean_over_per_basepair_values_chunked`,get_sequence_per_bin()
, andget_mappability_per_bin', constructs :attr:()
.df`, a pandas.DataFrame, from the results, and returns it.- Return type
DataFrame
- Returns
A pandas.DataFrame where rows correspond to the genomic bins in
bins_bed_filepath
. This DataFrame has the following columns: “chromosome”,”start”,”end”,”bin nr.”,”coverage”,”sequence”, “mappability”. Entries in “sequence” and “mappability” columns are extended bylongest_fraglen
(and downstream additionally byreadlength
for “mappability”).
- get_coverage_per_bin()[source]¶
Calculates the normalized coverage per bin, based on
bins_bed_filepath
andbam_filepath
. Uses deeptool’s countReadsPerBin for calculating the coverage in every bin indf
. The obtained coverage value is then divided bymean_seq_depth
. This function is faster thanget_coverage_per_bin_mean_over_per_basepair_values_chunked()
, but provides less accurate results: Rather than the mean over the coverage of each base-pair in the bin, the total number of reads mapping to the bin (even partially mapping reads) are reported (after normalization for :attr:.mean_seq_depth).- Return type
Series
- Returns
A pandas.Series with the coverage per bin, normalized for total sequencing coverage.
- get_coverage_per_bin_mean_over_per_basepair_values_chunked()[source]¶
Calculates the normalized coverage per bin by averaging the per-base coverage over all positions in the bin and then deviding the result by
mean_seq_depth
. Requires the following columns indf
: “chromosome”,”start”,”end”,”bin nr.”,”region nr.”,”bin size”. Uses deeptool’s countReadsPerBin for calculating the coverage in every bin indf
. The obtained coverage value is then divided bymean_seq_depth
.- Return type
Series
- Returns
A pandas.Series with the coverage per bin, normalized for total sequencing coverage.
- get_mappability_for_bin(row, mapp_bw)[source]¶
For a single bin, get a list of (forward) mappability values in and around the bin based on its genomic coordinates and a pyBigWig.bigWig file that contains the genome-wide mappability.
- Parameters
row (
Series
) – pandas.Series containing columns “chromosome”,”start”, and “end”.mapp_bw (
bigWigFile
) – pyBigWig.pyBigWig object with mappability info (i.e. result of pyBigWig.open())
- Return type
List
[float
]- Returns
Mappability for the bin at base resolultion. Extended downstream by (
longest_fraglen
+readlength
) and upstream bylongest_fraglen
.
- get_mappability_per_bin()[source]¶
Opens a .bigWig file using pyBigWig and calls
get_mappability_for_bin()
for every bin.- Return type
Series
- Returns
A pandas.Series with mappability information per bin. Extended downstream by (
longest_fraglen
+readlength
) and upstream bylongest_fraglen
.
- get_sequence_per_bin()[source]¶
Get a list of genomic sequences, one per entry in the .bed file attr:bins_bed_filepath. Sequences are extended by
longest_fraglen
- Return type
List
[str
]- Returns
A list of the genomic sequences of every bin in the attr:bins_bed_filepath.
- read_bed()[source]¶
Reads in a .bed file and return corresponding pandas.DataFrame. 4th column in the .bed file is interpreted as bin nr. Layout of the .bed file MUST be [“chromosome”,”start”,”end”,”bin nr.”].
- Return type
DataFrame
- Returns
A pandas.DataFrame with the columns [“chromosome”,”start”,”end”,”bin nr.”].
- liquorice.utils.CoverageAndSequenceTablePreparation.parallelize_get_coverage_of_region(df, n_cores, bam_filepath, min_mapq, additional_crpb_kwargs)[source]¶
Splits the provided pandas.DataFrame into chunks of the size :param`n_cores`, and runs :func`run_get_coverage_of_region_per_chunk` in parallel on every chunk. From: https://towardsdatascience.com/make-your-own-super-pandas-using-multiproc-1c04f41944a1
- Parameters
df – Full pandas.DataFrame with the columns “chromosome”,”region start”,”region end”.
n_cores – Run this many processes in parallel.
bam_filepath – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.
min_mapq – Minimum mapping quality for a fragment to still be counted when calculating the coverage.
additional_crpb_kwargs – Additional keywords to be used for deeptool’s CountReadsPerBin. Can be used to override the default settings: “ignoreDuplicates”:True, “samFlag_exclude”:256, “extendReads”:True
- Returns
Full pandas.DataFrame with the columns “chromosome”,”region start”,”region end” and “region coverage array”.
- liquorice.utils.CoverageAndSequenceTablePreparation.run_get_coverage_of_region_per_chunk(dataframe_chunk, bam_filepath, min_mapq, n_cores, additional_crpb_kwargs)[source]¶
Uses :func`deeptools.countReadsPerBin.get_coverage_of_region()` to calculate the coverage in every bin in :param:`dataframe_chunk`. Adds the result as a column “region coverage array” and returns param:dataframe_chunk. This function can receive only picklable arguments, and can therefore be called multiple times in parallel.
- Parameters
dataframe_chunk – A pandas.DataFrame with the columns “chromosome”,”region start”,”region end”. Usually, this would be only a chunk of the full DataFrame.
bam_filepath – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.
min_mapq – Minimum mapping quality for a fragment to still be counted when calculating the coverage.
n_cores – Number of cores used by deeptools.countReadsPerBin. Should probably be 1 if the function is called in parallel
additional_crpb_kwargs – Additional keywords to be used for deeptool’s CountReadsPerBin. Can be used to override the default settings: “ignoreDuplicates”:True, “samFlag_exclude”:256, “extendReads”:True
- Returns
A pandas.DataFrame with the columns “chromosome”,”region start”,”region end” and “region coverage array”.
liquorice.utils.FitGaussians module¶
- class liquorice.utils.FitGaussians.FitGaussians(unfitted_y_values, extend_to, binsize, avg_centersize, samplename=None, regionset_name=None)[source]¶
Bases:
object
An object that can be used for quantification of coverage drops by fitting gaussian functions and an intercept to the coverage values.
- Parameters
unfitted_y_values (
List
[float
]) – A pandas Series of coverage values, one value per bin (and sorted by increasing bin nr.). A suitable input is the result ofliquorice.utils.AggregateAcrossrRegions.aggregate_across_regions()
.extend_to (
int
) – extend_to setting that was used when calculating the y values. Needed to calculate relative bin positions.binsize (
int
) – binsize setting that was used when calculating the y values. Needed to calculate relative bin positions.samplename (
Optional
[str
]) – If not None, add this in a column “sample” to the result DataFrame when callingfit_gaussian_models()
regionset_name (
Optional
[str
]) – If not None, add this in a column “region-set” to the result DataFrame when callingfit_gaussian_models()
avg_centersize (
float
) – Average size (i.e. width in bp) of the regions of interest (before extension by extend_to).
- fit_gaussian_models(g1_min_sigma=20, g1_max_sigma=200, g2_min_sigma=200, g2_max_sigma=3000, g3_min_sigma=3000, g3_max_sigma=40000, method='leastsq')[source]¶
Fits three gaussian functions to
unfitted_y_values
(usually aggregated coverage values), based on given limits. Setsx_values
,g1_y_values
,g2_y_values
,g3_y_values
,intercept_y_values
andcombined_model_y_values
for the object.- Parameters
g1_min_sigma (
Union
[int
,float
]) – Min σ value for the smallest gaussiang1_max_sigma (
Union
[int
,float
]) – Max σ value for the smallest gaussiang2_min_sigma (
Union
[int
,float
]) – Min σ value for the middle gaussiang2_max_sigma (
Union
[int
,float
]) – Max σ value for the middle gaussiang3_min_sigma (
Union
[int
,float
]) – Min σ value for the widest gaussiang3_max_sigma (
Union
[int
,float
]) – Max σ value for the widest gaussianmethod (
str
) – A method for fitting the combined model, used as parameter in lmfit’s .fit() function.
- Return type
DataFrame
- Returns
A pandas.DataFrame containing several summary metrics based on the fitted model: The amplitude and σ values for each of the three gaussians, the total depth of the fitted model at the central bin, the Bayesian Information Criterion (measures model fit), and the area under the curve of the fitted model, calculated as the area between the fitted intercept and the fitted model. If
samplename
orregionset_name
are not None, columns “sample” or “region-set” are added, respectively.
liquorice.utils.GlobalFragmentSize module¶
- liquorice.utils.GlobalFragmentSize.getFragmentLength_worker(chrom, start, end, bamFile, distanceBetweenBins)[source]¶
This is a function from deeptools, modified from https://github.com/deeptools/deepTools/blob/ac42d29c298c026aa0c53c9db2553087ebc86b97/deeptools/getFragmentAndReadSize.py#L14 Queries the reads at the given region for the distance between reads and the read length. As opposed to the original version, does not disregard pairs that are flagged as “not properly paired”, as this behaviour excludes fragments from the two-nucleosome peak of the cfDNA-distribution.
- Parameters
chrom (
str
) – chromosome namestart (
int
) – region startend (
int
) – region endbamFile (
str
) – BAM file namedistanceBetweenBins (
int
) – The number of bases at the end of each bin to ignore
- :return an np.array, where first column is fragment length, the
second is for read length
- Return type
array
- liquorice.utils.GlobalFragmentSize.get_list_of_fragment_lengths_and_avg_readlength(bam_filepath, n_cores=1, n=1000, upper_limit=800)[source]¶
Sample fragments from the given .bam file to obtain a representative distribution of fragment lenghts.
- Parameters
bam_filepath (
str
) – Path to the .bam file for which fragments should be sampled.n_cores (
int
) – Number of cores to use bydeeptools.getFragmentAndReadSize.get_read_and_fragment_length()
.n (
int
) – Number of randomly sampled fragment lengths to generateupper_limit (
int
) – Fragment lengths exceeding this limit will be excluded. Rarely, fragment size are wrongly inferred and therefore huge. Sampling one of those incorrect lengths would unnecessarily blow up the sequence and mapping information stored for each bin. As a default, 500 is used as a reasonable upper limit of relevant fragment lengths for cfDNA.
- Return type
Tuple
[List
[int
],int
]- Returns
A tuple, consisting of a list of the randomly samples fragment lengths, and an integer value of the average read length.
- liquorice.utils.GlobalFragmentSize.get_read_and_fragment_length(bamFile, return_lengths=False, blackListFileName=None, binSize=50000, distanceBetweenBins=1000000, numberOfProcessors=None, verbose=False)[source]¶
This is a function from deeptools. It was included in LIQUORICE’s source code to allow it to (indirectly) call the modified version of :func`getFragmentLength_worker`. Estimates the fragment length and read length through sampling
- Parameters
bamFile (
str
) – BAM file namereturn_lengths (
bool
) –numberOfProcessors (
Optional
[int
]) –verbose (
bool
) –binSize (
int
) –distanceBetweenBins (
int
) –
- .:return A tuple of two dictionaries, one for the fragment length and the other
for the read length. The dictionaries summarise the mean, median etc. values
- Return type
Tuple
[dict
]
liquorice.utils.MeanSequencingDepth module¶
- liquorice.utils.MeanSequencingDepth.sample_bam_to_get_sequencing_depth(bam_filepath, n_sites_to_sample=10000, n_cores=1, min_mapq=20, chromosomes_list=None, **additional_crpb_kwargs)[source]¶
Estimates the overall sequencing coverage of the .bam file by sampling sites that are regularily distributed across the genome (using
deeptools.countReadsPerBin.CountReadsPerBin()
in a slightly modified version that allows fragments without the SAM flag is_proper_pair.).- Parameters
bam_filepath (
str
) – Path to the .bam file for which mean coverage should be calculated.n_sites_to_sample (
int
) – Number of sites (length 1) to sample.n_cores (
int
) – Number of cores to be used bydeeptools.countReadsPerBin.CountReadsPerBin()
.min_mapq (
int
) – minMappingQuality setting for deeptools.countReadsPerBin.CountReadsPerBin.chromosomes_list (
Optional
[List
[str
]]) – A list of chromosomes that should be analyzed. Regions on chromosomes that are not in this list will be excluded from analysis. Default None means [“chr1”, …, “chr22”].additional_crpb_kwargs (
dict
) – Use to override the default parameters fordeeptools.countReadsPerBin.CountReadsPerBin()
. If not specified otherwise in additional_crpb_kwargs, the following attributes are passed to CountReadsPerBin: ignoreDuplicates=True, samFlag_exclude=256, extendReads=True.
- Return type
float
- Returns
The average coverage determined from the sampled regions.
liquorice.utils.Plotting module¶
- class liquorice.utils.Plotting.Plotting(samplename, out_dir)[source]¶
Bases:
object
An object used to generate plots of the aggragated coverage, differences between corrected and uncorrected coverage, as well as correlations between bias-factors and coverage.
- Parameters
samplename (
str
) – Name of the sample as it should appear in the plotsout_dir (
str
) – Path to the directory to which plots should be saved.
- none_or_list_of_float¶
alias of
Union
[None
,List
[float
]]
- none_or_list_of_int¶
alias of
Union
[None
,List
[float
]]
- none_or_list_of_str¶
alias of
Union
[None
,List
[str
]]
- plot_corrected_vs_uncorrected_coverage(x_values, uncorrected_y_values, corrected_y_values, uncorrected_y_intercept, corrected_y_intercept, filename='auto', return_figure=False)[source]¶
Plots the aggregated coverages, both corrected and uncorrected.
- Parameters
x_values (
List
[int
]) – List of x coordinates - the coordinates of the bins’ centers relative to the center of the regions of interest.uncorrected_y_values (
List
[float
]) – List of aggregated, uncorrected coverage values.corrected_y_values (
List
[float
]) – List of aggregated, uncorrected coverage values.uncorrected_y_intercept (
float
) – Intercept value that should be subtracted from every value in uncorrected_y_valuescorrected_y_intercept (
float
) – Intercept value that should be subtracted from every value in corrected_y_valuesfilename (
Optional
[str
]) – Filename to which the figure should be saved. “auto” sets this to ”out_dir
/corrected_vs_uncorrected_coverage.pdf “. Set to None to avoid saving as file. File extension determines in which format the plot is saved.return_figure (
bool
) – If True, return a matplotlib figure that can be altered, plotted, or saved.
- Return type
Optional
[Figure
]- Returns
A matplotlib figure if return_figure is True, nothing otherwise.
Example:
plotting.plot_corrected_vs_uncorrected_coverage( x_values=fit_gaussians_obj.x_values, uncorrected_y_values=fit_gaussians_obj_uncorrected.unfitted_y_values, corrected_y_values=fit_gaussians_obj.unfitted_y_values, uncorrected_y_intercept=fit_gaussians_obj_uncorrected.intercept_y_values[0], corrected_y_intercept=fit_gaussians_obj.intercept_y_values[0])
- plot_correlations_biasfactors_coverage_bin_nr(df, x_varnames_list=None, y_varnames_list=None, bins=(50, 50), return_figure=False, filename='bias_plots.pdf', percentile_cutoffs=(0.1, 0.99, 0.1, 0.99), fit_regression=True)[source]¶
Plot the correlation between coverage and a bias factor as a heatmap, and fit a trendline.
- Parameters
df (
DataFrame
) – pandas.DataFrame that contains the variables in x_varnames_list and y_varnames_list.x_varnames_list (
Optional
[List
[str
]]) – List of the columns that should be plotted as independent variables. Default None uses all columns in the df except for [“chromosome”,”start”,”end”,”sequence”,”mappability”].y_varnames_list (
Optional
[List
[str
]]) – List of the columns that should be plotted as dependent variables. Default None uses “coverage” and “bin nr.”, plus “corrected coverage” if this column is present in the df.bins (
Tuple
[int
,int
]) – Tuple specifying the number of bins to be used for the x and y axisreturn_figure (
bool
) – If True, return a matplotlib figure that can be altered, plotted, or saved.fit_regression (
bool
) – If True, fit a qadratic polynomial and plot the regression function and R^2.percentile_cutoffs (
Tuple
[int
,int
,int
,int
]) – Tuple containing the min and max percentile of values that should be displayed on the x and y axes: (x_min,x_max,y_min,y_max). This setting is ignored for “bin nr.”, for this column, all values are shown. Percentiles must be in the interval [0,1].filename (
str
) – Filename to which the figure should be saved. File extension determines in which format the plot is saved.
- Return type
Optional
[Figure
]- Returns
A matplotlib figure if return_figure is True, nothing otherwise.
- plot_coverage_bias_correlation(df, biasfactor_column, coverage_column, percent=True, xmin=0, xmax=100, ymin=0, ymax=2, filename='auto', return_figure=False)[source]¶
Plot the correlation between coverage and a bias factor as a heatmap, and fit a trendline.
- Parameters
df (
DataFrame
) – pandas.DataFrame that contains biasfactor_column and coverage_column.biasfactor_column (
str
) – Name of the column that contains data for the bias-factor of interestcoverage_column (
str
) – Name of the column that contains the coverage datapercent (
bool
) – Whether the output plot should multiply the x values by the factor 100xmin (
Union
[int
,float
]) – min x coordinate to show in plotxmax (
Union
[int
,float
]) – max y coordinate to show in plotymin (
Union
[int
,float
]) – min x coordinate to show in plotymax (
Union
[int
,float
]) – max y coordinate to show in plotfilename (
Optional
[str
]) – Filename to which the figure should be saved. “auto” sets this to ”out_dir
/biasfactor_column**__vs__**coverage_column.pdf ” (with spaces in column names replaced by ‘_’). Set to None to avoid saving as file. File extension determines in which format the plot is saved.return_figure (
bool
) – If True, return a matplotlib figure that can be altered, plotted, or saved.
- Return type
Optional
[Figure
]- Returns
A matplotlib figure if return_figure is True, nothing otherwise.
- plot_fitted_model(fit_gaussians_obj=None, x_values=None, unfitted_y_values=None, g1_y_values=None, g2_y_values=None, g3_y_values=None, intercept_y_values=None, combined_model_y_values=None, unfitted_y_values_label='Corrected, aggregated coverage', filename='auto', return_figure=False)[source]¶
Plots the fitted model, i.e. the raw data and lines for G1, G2, G3, the intercept, and the combined model.
- Parameters
fit_gaussians_obj (
Optional
[FitGaussians
]) – object of class FitGaussians, on which fit_gaussian_models() has been called. This contains all nessesary information for plotting, so all other parameters except filename are ignored if this is set.x_values (
Optional
[List
[float
]]) – List of x coordinates - the coordinates of the bins’ centers relative to the center of the region of interest. Required if fit_gaussians_obj is not set, ignored otherwise.unfitted_y_values (
Optional
[List
[float
]]) – List of y coordinates of the unfitted coverage data. Required if fit_gaussians_obj is not set, ignored otherwise.g1_y_values (
Optional
[List
[float
]]) – List of y coordinates for the first, smallest gaussian. Required if fit_gaussians_obj is not set, ignored otherwise.g2_y_values (
Optional
[List
[float
]]) – List of y coordinates for the second, middle gaussian. Required if fit_gaussians_obj is not set, ignored otherwise.g3_y_values (
Optional
[List
[float
]]) – List of y coordinates for the first, largest gaussian. Required if fit_gaussians_obj is not set, ignored otherwise.intercept_y_values (
Optional
[List
[float
]]) – List of y coordinates for the intercept. Required if fit_gaussians_obj is not set, ignored otherwise.combined_model_y_values (
Optional
[List
[float
]]) – List of y coordinates for the fitted, combined model. Required if fit_gaussians_obj is not set, ignored otherwise.unfitted_y_values_label (
Optional
[List
[float
]]) – Label for the unfitted_y_values. By default it is assumed that the corrected, aggregated, unfitted coverage is given.filename (
Optional
[str
]) – Filename to which the figure should be saved. “auto” sets this to ”out_dir
/fitted_gaussians.pdf “. Set to None to avoid saving as file. File extension determines in which format the plot is saved.return_figure (
bool
) – If True, return a matplotlib figure that can be altered, plotted, or saved.
- Return type
Optional
[Figure
]- Returns
A matplotlib figure if return_figure is True, nothing otherwise.
- liquorice.utils.Plotting.polyfit(x, y, degree)[source]¶
fit a trendline to data, see https://stackoverflow.com/questions/893657/how-do-i-calculate-r-squared-using-python-and-numpy
- Parameters
x (
List
[float
]) – x valuesy (
List
[float
]) – y valuesdegree (
int
) – 1 for linear, 2 for quadratic, etc.
- Return type
dict
- Returns
A Dictionary, containing lists with the keys ‘polynomial’ (polynomial coefficients) and ‘determination’ (The coefficient of determination).
liquorice.utils.RegionFilteringAndBinning module¶
- class liquorice.utils.RegionFilteringAndBinning.RegionFilteringAndBinning(bed_filepath, binsize, extend_to, refgenome_chromsizes_filepath, chromosomes_list=['chr1', 'chr2', 'chr3', 'chr4', 'chr5', 'chr6', 'chr7', 'chr8', 'chr9', 'chr10', 'chr11', 'chr12', 'chr13', 'chr14', 'chr15', 'chr16', 'chr17', 'chr18', 'chr19', 'chr20', 'chr21', 'chr22'], blacklist_bed_filepath=None)[source]¶
Bases:
object
A Binning object, which can be used to create a .bed file with the coordinates and bin numbers of every bin. The method
write_bin_bedfile()
of this object can be used to create output.- Parameters
bed_filepath (
str
) – path to a .bed file containing regions that should be extended and split into bins. This could be e.g. a list of DNase I hypersensitivity sites or enhancer peaks.binsize (
int
) – Size of the bins. Use higher values to reduce noise, lower values to increase spatial resolution.extend_to (
int
) – The regions will be extended by this value in both directions. Outmost bins will have their center at <center of the region>*+-*<extend_to>.refgenome_chromsizes_filepath (
str
) – Path to a tab-delimited file containing the chromosome sizes for the reference genome. The first column must be the chromosome name, the second column its size. Can be the .fa.fai file associated to the reference genome. Must include all chromosomes given in chromosomes_list.chromosomes_list (
List
[str
]) – A list of chromosomes that should be analyzed. Regions on chromosomes that are not in this list will be excluded from analysis. Default is [“chr1”, …, “chr22”].blacklist_bed_filepath (
Optional
[str
]) – Optional: .bed file of a black list, such as the one from `here <https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz> `_ for hg38 (unzip first). Regions that overlap any of the regions in this blacklist after extension by extend_to will be excluded from further analysis.
- filter_regions(df)[source]¶
Remove regions that do not pass filters.
- Parameters
df (
DataFrame
) – A pandas.DataFrame of a .bed file (with columns “chromosome”, “start”, “end”)- Return type
DataFrame
- Returns
Filtered DataFrame, with regions excluded if (after extension by
extend_by
) they : i) are not on standard chromosomes (chromosomes_list
), ii) fall withinblacklist_bed_filepath
(if defined), iii) extend beyond chromosome ends, or iv) if their start coordinate is larger than their end coordinate.
- get_binstarts(center)[source]¶
Gets the bin start coordinates for a given center
- Parameters
center (
int
) – Coordinate of the center of the region of interest- Return type
List
[int
]- Returns
A list with start coordinates, one per bin. Length / number of created bins depends on
extend_to
andbinsize
.
- get_binstarts_percentile_split_central_roi(start, end)[source]¶
Gets the bin start coordinates for a given core region, splitting the core region into 5 bins with a length of 10,15,50,15,and 10 % of the core region length, respectively. The other bins, outside the core region, have a length of
binsize'. The most upstream bin starts :attr:
.extend_to` bp upstream of the core region start, and the most downstream bin endsextend_to
bp downstream of the core region end.- Parameters
start – Coordinate of the start of the region of interest
end – Coordinate of the end of the region of interest
- Returns
A list with start coordinates, one per bin. Length / number of created bins depends on
extend_to
andbinsize
.
- is_within_chromsome_borders(chrom, start, end, chromsize_dict)[source]¶
Check if a region is within its chromosome’s borders.
- Parameters
chrom (
str
) – Chromosome namestart (
int
) – Start coordinateend (
int
) – End coordinatechromsize_dict (
Dict
[str
,int
]) – Dictionary with chromosome names as keys and lengths as values. MUST contain chrom as a key, otherwise sys.exit() is called.
- Return type
bool
- Returns
True if the region is within the borders of the chromosome, False otherwise
- write_bin_bedfile(out_bedfile_path_bins, out_bedfile_path_regions_that_passed_filter)[source]¶
Splits every region in the
bins_bed_filepath
file into bins, such that the central bin’s center is at the center of the region, and such that the outermost bins each extendbinsize
/2 over the edges of (<the region’s center> -extend_to
) or (<the region’s center> +extend_to
), respectively. Also assigns a bin nr. to every bin, starting with the most downstream bin.- Parameters
out_bedfile_path_bins (
str
) – Path to which the output .bed file with bin coordinates should be written to.out_bedfile_path_regions_that_passed_filter (
Optional
[str
]) – If not None, write the regions that passed all filters and that are the basis of the bins to this path (should end with .bed).
- Return type
None
- write_bin_bedfile_percentile_split_central_roi(out_bedfile_path_bins, out_bedfile_path_regions_that_passed_filter)[source]¶
Splits every region in the
bins_bed_filepath
file into bins. The core region (given inbed_filepath
) is splitted into 5 bins with a length of 10,15,50,15,and 10 % of the core region length, respectively. The other bins, outside the core region, have a length ofbinsize'. The most upstream bin starts :attr:
.extend_to` bp upstream of the core region start, and the most downstream bin endsextend_to
bp downstream of the core region end. Also assigns a bin nr. to every bin, starting with the most downstream bin. :type out_bedfile_path_bins:str
:param out_bedfile_path_bins: Path to which the output .bed file with bin coordinates should be written to. :type out_bedfile_path_regions_that_passed_filter:Optional
[str
] :param out_bedfile_path_regions_that_passed_filter: If not None, write the regions that passed all filtersand that are the basis of the bins to this path (should end with .bed).
- Return type
None
liquorice.utils.Workflows module¶
- liquorice.utils.Workflows.add_biasfactors_percentile_split(avg_readlength, liq_table, n_cores, sampled_fragment_lengths, skip_these_biasfactors=None)[source]¶
Returns a table with added bias-factors, taking into account different bin sizes. Also adds a “bin size” column.
- Parameters
avg_readlength – Average length of reads in the .bam file.
liq_table – pandas.DataFrame` with one row per bin, containing columns “chromosome”, “start”, “end”, “bin nr.”, “coverage”, “sequence”, and “mappability”. Suitable input is the output of the
get_complete_table()
method of theliquorice.utils.CoverageAndSequenceTablePreparation
class object.n_cores – Max number of cores to use for calculations.
sampled_fragment_lengths – A list containing fragment lengths that are representative of the sample’s global fragment size distribution. Typically a few hundred fragments will suffice here.
skip_these_biasfactors – Do not calculate these bias factors. Only these entries are allowed: [“di and trinucleotides and GC content”,”mappability”, “di and trinucleotides”]
- Returns
A pandas.DataFrame with added bias-factors, taking into account different bin sizes, and a “bin size” column.
- liquorice.utils.Workflows.run_liquorice_on_regionset_with_pretrained_biasmodel(samplename, regionset_name, bam_filepath, bed_filepath, biasmodel_path, refgenome_filepath, refgenome_chromsizes_filepath, refgenome_mappability_bigwig_path, blacklist_bed_filepath, sampled_fragment_lengths, avg_readlength, cna_seg_filepath, mean_seq_depth, n_cores, binsize=500, extend_to=20000, use_default_fixed_sigma_values=True, save_biasfactor_table=False, save_corrected_coverage_table=False, no_chr_prefix=False, percentile_split_core_rois=False, use_this_roi_biasfactortable=None)[source]¶
Run the complete LIQUORICE workflow on a given region-set, using a pre-trained bias model. Main steps of this workflow include: Filtering regions in the input .bed and splitting remaining regions into bins; calculating sequence, coverage, and mappability for every bin; calculating bias factors for every bin, using the pre-trained model and the inferred bias-factors to correct the coverage; aggregating information across regions and fitting gaussian functions and an intercept to the corrected, aggregated coverage data. Also creates plots and result tables.
- Parameters
samplename (
str
) – Name of the sample (to be used in plots and output tables).regionset_name (
str
) – Name of the region-set (to be used in plots and output tables).bam_filepath (
str
) – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.bed_filepath (
str
) – path to a .bed file containing regions-of-interest that should be extended and split into bins. This could be e.g. a list of DNase I hypersensitivity sites or enhancer peaks.biasmodel_path (
str
) – Path to a trained biasmodel.refgenome_filepath (
str
) – Path to the reference genome .fa(.gz) file. Must have a .fai index in the same dirname.refgenome_chromsizes_filepath (
str
) – Path to a tab-delimited file containing the chromosome sizes for the reference genome. The first column must be the chromosome name, the second column its size. Can be the .fa.fai file associated to the reference genome.refgenome_mappability_bigwig_path (
str
) – Path to a .bigWig file containing (forward) mappability values scaled between 0 and 1.blacklist_bed_filepath (
Optional
[str
]) – .bed file of a black list, such as the one from `here <https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz> `_ for hg38 (unzip first). Regions that overlap any of the regions in this blacklist after extension by extend_to will be excluded from further analysis. Set to None to skip this step.sampled_fragment_lengths (
List
[int
]) – A list containing fragment lengths that are representative of the sample’s global fragment size distribution. Typically a few hundred fragments will suffice here.avg_readlength (
int
) – (Average) length of the reads in the .bam file.cna_seg_filepath (
Optional
[str
]) – If specified, use this .seg file to correct the coverage by the values specified in this file prior to correcting coverage with the bias model. Use this if you want to normalize out the effects of copy number aberrations (CNAs) on the coverage. File must be tab-separated, with column names as first line. The second, third and fourth column must be chromosome, start, and end of the segment, and the last column must be the log2-ratio of observed/expected read depth.mean_seq_depth (
float
) – A float that quantifies the global coverage/sequencing depth. Coverages per bin will be normalized (i.e. divided) by this value.binsize (
int
) – Size of the bins. Use higher values to reduce noise, lower values to increase spatial resolution. Using the same binsize for the biasmodel generation as for the region-set-of-interest is probably preferable.extend_to (
int
) – The regions will be extended by this value in both directions. Outmost bins will have their center at * <center of the region> * +- * <extend_to> *.use_default_fixed_sigma_values (
bool
) – If True, use the following contraints for the sigma values of the small, medium, and large Gaussian, repectively: 149-150 bp, 757-758 bp, and 6078-6079 bp.n_cores (
int
) – Maximum number of cores to use during steps that allow multiprocessing/multithreading.save_biasfactor_table (
bool
) – If True, save a table of bin coordinates, bin number, coverage, corrected coverage and biasfactor values under “coverage_and_biasfactors_per_bin.csv”.save_corrected_coverage_table (
bool
) – If True, save a table of bin coordinates, bin number, coverage, and corrected coverage under “coverage_per_bin.csv”.no_chr_prefix (
bool
) – If True, set the list of allowed chromosomes to [str(i) for i in range(1,23)] instead of [“chr”+str(i) for i in range(1,23)]percentile_split_core_rois (
bool
) – If set, split the central region into 5 bins of variable size instead of always using a fixed binsize. extend_to should not be set to 0 if this is used.
- Return type
None
- liquorice.utils.Workflows.run_liquorice_train_biasmodel_on_same_regions(samplename, regionset_name, bam_filepath, bed_filepath, refgenome_filepath, refgenome_chromsizes_filepath, refgenome_mappability_bigwig_path, blacklist_bed_filepath, sampled_fragment_lengths, avg_readlength, cna_seg_filepath, mean_seq_depth, n_cores, binsize=500, extend_to=20000, biasmodel_output_path='trained_biasmodel.joblib', nr_of_bins_for_training_and_testing=10000, skip_central_n_bins_for_training=0, save_training_table=False, use_default_fixed_sigma_values=True, save_biasfactor_table=False, save_corrected_coverage_table=False, no_chr_prefix=False, percentile_split_core_rois=False, use_cross_validated_predictions=False, use_this_roi_biasfactortable=None, speed_mode=False)[source]¶
Run the complete LIQUORICE workflow on a given region-set, and train the bias-model on data from the same region- set. Main steps of this workflow include: Filtering regions in the input .bed and splitting remaining regions into bins; calculating sequence, coverage, and mappability for every bin; calculating bias factors for every bin, training a bias model, using the trained model and the inferred bias-factors to correct the coverage; aggregating information across regions and fitting gaussian functions and an intercept to the corrected, aggregated coverage data. Also creates plots and result tables.
- Parameters
samplename (
str
) – Name of the sample (to be used in plots and output tables).regionset_name (
str
) – Name of the region-set (to be used in plots and output tables).bam_filepath (
str
) – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.bed_filepath (
str
) – path to a .bed file containing regions-of-interest that should be extended and split into bins. This could be e.g. a list of DNase I hypersensitivity sites or enhancer peaks.refgenome_filepath (
str
) – Path to the reference genome .fa(.gz) file. Must have a .fai index in the same dirname.refgenome_chromsizes_filepath (
str
) – Path to a tab-delimited file containing the chromosome sizes for the reference genome. The first column must be the chromosome name, the second column its size. Can be the .fa.fai file associated to the reference genome.refgenome_mappability_bigwig_path (
str
) – Path to a .bigWig file containing (forward) mappability values scaled between 0 and 1.blacklist_bed_filepath (
Optional
[str
]) – .bed file of a black list, such as the one from `here <https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz> `_ for hg38 (unzip first). Regions that overlap any of the regions in this blacklist after extension by extend_to will be excluded from further analysis. Set to None to skip this step.sampled_fragment_lengths (
List
[int
]) – A list containing fragment lengths that are representative of the sample’s global fragment size distribution. Typically a few hundred fragments will suffice here.avg_readlength (
int
) – (Average) length of the reads in the .bam file.cna_seg_filepath (
Optional
[str
]) – If specified, use this .seg file to correct the coverage by the values specified in this file prior to correcting coverage with the bias model. Use this if you want to normalize out the effects of copy number aberrations (CNAs) on the coverage. File must be tab-separated, with column names as first line. The second, third and fourth column must be chromosome, start, and end of the segment, and the last column must be the log2-ratio of observed/expected read depth.mean_seq_depth (
float
) – A float that quantifies the global coverage/sequencing depth. Coverages per bin will be normalized (i.e. divided) by this value.binsize (
int
) – Size of the bins. Use higher values to reduce noise, lower values to increase spatial resolution. Using the same binsize for the biasmodel generation as for the region-set-of-interest is probably preferable.extend_to (
int
) – The regions will be extended by this value in both directions. Outmost bins will have their center at * <center of the region> * +- * <extend_to> *.use_default_fixed_sigma_values (
bool
) – If True, use the following contraints for the sigma values of the small, medium, and large Gaussian, repectively: 149-150 bp, 757-758 bp, and 6078-6079 bp.n_cores (
int
) – Maximum number of cores to use during steps that allow multiprocessing/multithreading.save_biasfactor_table (
bool
) – If True, save a table of bin coordinates, bin number, coverage, corrected coverage and biasfactor values under “coverage_and_biasfactors_per_bin.csv”.save_corrected_coverage_table (
bool
) – If True, save a table of bin coordinates, bin number, coverage, and corrected coverage under “coverage_per_bin.csv”.no_chr_prefix (
bool
) – If True, set the list of allowed chromosomes to [str(i) for i in range(1,23)] instead of [“chr”+str(i) for i in range(1,23)]biasmodel_output_path (
str
) – Path to which the trained biasmodel should be saved to. Must have a .joblib extension.nr_of_bins_for_training_and_testing (
Optional
[int
]) – Subset the training_df to this many bins. Can speed up the computation time of the model training, but using too few bins will make the model less precise. To speed up computations, we would recommend decreasing the number of regions in the .bed file rather than altering this parameter, as this is more efficient.save_training_table (
bool
) – If True, save the table that was used to train the biasmodel (coverage and biasfactors per bin) as “training_table.csv”.skip_central_n_bins_for_training (
int
) – The n most central bins will not be used for training the bias model.no_chr_prefix – If True, set the list of allowed chromosomes to [str(i) for i in range(1,23)] instead of [“chr”+str(i) for i in range(1,23)]
percentile_split_core_rois (
bool
) – If set, split the central region into 5 bins of variable size instead of always using a fixed binsize. extend_to should not be set to 0 if this is used.use_cross_validated_predictions (
bool
) – Instead of training on the full dataset, train twice on half of the dataset and predict the other half. Ignores nr_of_bins_for_training_and_testinguse_this_roi_biasfactortable (
Optional
[str
]) – If set use the specified biasfactor table and only train/apply the biasmodel.speed_mode (
bool
) – Only perform GC correction, don’t correct using mappability or di/trinucleotides. Setting this flag makes LIQUORICE considerably faster, but may lead to less accurate results.
- Return type
None
- liquorice.utils.Workflows.train_biasmodel_for_sample(samplename, bam_filepath, bed_filepath, refgenome_filepath, refgenome_chromsizes_filepath, refgenome_mappability_bigwig_path, blacklist_bed_filepath, sampled_fragment_lengths, avg_readlength, mean_seq_depth, cna_seg_filepath, n_cores, extend_to=0, binsize=500, biasmodel_output_path='trained_biasmodel.joblib', nr_of_bins_for_training_and_testing=None, save_training_table=False, no_chr_prefix=False, percentile_split_core_rois=False)[source]¶
Go through all steps of LIQUORICE up to the biasmodel training. The resulting biasmodel under biasmodel_output_path can then be used when LIQUORICE is run for region-sets of interest.
- Parameters
samplename (
str
) – Name of the sample (to be used in plots and output tables).bam_filepath (
str
) – Path to the .bam file containing the mapped reads for the sample. Does not need to be duplicate-filtered, this is done by the function that calculates the coverage.bed_filepath (
str
) – path to a .bed file containing regions that should be used to build the biasmodel. A .bed file that contains many random regions across the genome is recommended.refgenome_filepath (
str
) – Path to the reference genome .fa(.gz) file. Must have a .fai index in the same dirname.refgenome_chromsizes_filepath (
str
) – Path to a tab-delimited file containing the chromosome sizes for the reference genome. The first column must be the chromosome name, the second column its size. Can be the .fa.fai file associated to the reference genome.refgenome_mappability_bigwig_path (
str
) – Path to a .bigWig file containing (forward) mappability values scaled between 0 and 1.blacklist_bed_filepath (
Optional
[str
]) – .bed file of a black list, such as the one from `here <https://github.com/Boyle-Lab/Blacklist/blob/master/lists/hg38-blacklist.v2.bed.gz> `_ for hg38 (unzip first). Regions that overlap any of the regions in this blacklist after extension by extend_to will be excluded from further analysis. Set to None to skip this step.sampled_fragment_lengths (
List
[int
]) – A list containing fragment lengths that are representative of the sample’s global fragment size distribution. Typically a few hundred fragments will suffice here.avg_readlength (
int
) – (Average) length of the reads in the .bam file.mean_seq_depth (
float
) – A float that quantifies the global coverage/sequencing depth. Coverages per bin will be normalized (i.e. divided) by this value.cna_seg_filepath (
Optional
[str
]) – If specified, use this .seg file to correct the coverage by the values specified in this file prior to model training. Use this if you want to normalize out the effects of copy number aberrations (CNAs) on the coverage. File must be tab-separated, with column names as first line. The second, third and fourth column must be chromosome, start, and end of the segment, and the last column must be the log2-ratio of observed/expected read depth.binsize (
int
) – Size of the bins. Use higher values to reduce noise, lower values to increase spatial resolution. Using the same binsize for the biasmodel generation as for the region-set-of-interest is probably preferable.extend_to (
int
) – The regions will be extended by this value in both directions. Outmost bins will have their center at <center of the region>*+-*<extend_to>. Here, the default is 0, meaning only a single bin will be generated per region in the .bed file. This speeds up the computation - for more precise biasmodels, we recommend running for a larger set of regions rather than increasing the extend_to parameter.n_cores (
int
) – Maximum number of cores to use during steps that allow multiprocessing/multithreading.biasmodel_output_path (
str
) – Path to which the trained biasmodel should be saved to. Must have a .joblib extension.nr_of_bins_for_training_and_testing (
Optional
[int
]) – Subset the training_df to this many bins. Can speed up the computation time of the model training, but using too few bins will make the model less precise. To speed up computations, we would recommend decreasing the number of regions in the .bed file rather than altering this parameter, as this is more efficient.save_training_table (
bool
) – If True, save the table that was used to train the biasmodel (coverage and biasfactors per bin) as “training_table.csv”.no_chr_prefix (
bool
) – If True, set the list of allowed chromosomes to [str(i) for i in range(1,23)] instead of [“chr”+str(i) for i in range(1,23)]percentile_split_core_rois (
bool
) – If set, split the central region into 5 bins of variable size instead of always using a fixed binsize. extend_to should not be set to 0 if this is used.
- Return type
None