The LIQUORICE command-line-tool

Arguments

LIQUORICE: A tool for bias correction and quantification of changes in coverage around regions of interest in cfDNA WGS datasets

usage: LIQUORICE [-h] --bamfile BAMFILE --refgenome_fasta REFGENOME_FASTA
                 --mappability_bigwig MAPPABILITY_BIGWIG
                 [--bedpathlist BEDPATHLIST [BEDPATHLIST ...]]
                 [--bedpath_biasmodel BEDPATH_BIASMODEL] [--binsize BINSIZE]
                 [--extend_to EXTEND_TO] [--blacklist BLACKLIST]
                 [--cna_seg_file CNA_SEG_FILE] [--detect_existing_biasmodel]
                 [--use_this_biasmodel USE_THIS_BIASMODEL]
                 [--extend_to_biasmodel EXTEND_TO_BIASMODEL] [--no_chr_prefix]
                 [--use_this_roi_biasfactortable USE_THIS_ROI_BIASFACTORTABLE]
                 [--speed_mode] [--all_bins_same_size]
                 [--dont_crossvalidate_if_train_on_rois] [--n_cpus N_CPUS]
                 [--tmpdir TMPDIR] [--samplename SAMPLENAME] [--quiet]
                 [--save_training_table] [--save_biasfactor_table]

Required named arguments

--bamfile

.bam file containing the mapped reads of the sample. Used to infer coverage, fragment size, and read length.

--refgenome_fasta

Path to a .fa file of the reference genome. Must have a .fa.fai index in the same directory.

--mappability_bigwig

Path to a bigWig file that contains (forward) mappability values for every base in the reference genome. Can be calculated with gem-mappability for the appropriate read length.

Optional named arguments - General settings

--bedpathlist

List of paths to BED files, one for each region-set of interest. If unspecified, only the biasmodel will be trained (if indicated by the –bedpath_biasmodel, –detect_exisiting_biasmodel, and –use_this_biasmodel settings).

Default: []

--bedpath_biasmodel

.bed file containing regions that are used for generating the bias model for the sample. E.g. random regions should work well. Incompatible with use_provided_biasmodel. If ‘10k_random’ is specified, a set of 10k random regions for hg38 shipped with the package is used for training unless an existing biasmodel can be used. If not specified / None (default), and if –use_this_biasmodel is also not specified, train a seperate biasmodel for each region-set that is specified in –bedpathlist, using the flanking regions (+- extend_to) for each region in the set.

--binsize

Bin size is important for the resolution of the output plots & data, and for the bias model itself. Smaller bin sizes give higher resolution, but take longer to calculate and may result in more noise

Default: 500

--extend_to

Size of the flanking region, in bp. Must be devidable by –binsize and must be a multiple of 2. The regions will be extended by this value in both directions. The most upstream bin starts <extend_to> bp upstream of the core region start, and the most downstream bin ends <extend_to> bp downstream of the core region end. If –all_bins_same_size is set, insteadthe outmost bins will have their center at <center of the region>+-<extend_to>.

Default: 20000

--blacklist

Exclude regions if they overlap with the regions in this .bed file (after extension by –extend_to). Default: None. Set to ‘hg38’ to use the Boyle lab’s hg38-blacklist.v2 that is shipped with LIQUORICE.

--cna_seg_file

If specified, use this .seg file to correct the coverage by the values specified in this file prior to model training or bias correction. Use this if you want to normalize out the effects of copy number aberrations (CNAs) on the coverage. File must be tab-separated, with column names as first line. The second,third,and fourth column must be chromosome, start, and end of the segment, and the last column must be the log2-ratio of observed / expected read depth. This file can be generated e.g. by running ichorCNA on the sample first.

--detect_existing_biasmodel

Check if a bias-model has already been built and saved under ./<samplename>/biasmodel/trained_biasmodel.joblib. If so, use it, otherwise build one using –on a new biasmodel, overwriting files with the same name.

Default: False

--use_this_biasmodel

Use this bias model instead of training a model. IMPORTANT: This model has to come from the same sample/patient as the current one, otherwise the bias correction makes no sense.

--extend_to_biasmodel

Ignored unless –bedpath_biasmodel is set. Size of the flanking region, in bp, to be used for the bias-model. Must be devidable by –binsize and must be a multiple of 2. The regions will be extended by this value in both directions. Outmost bins will have their center at <center of the region>+-<extend_to>. Default 0: Only place a single bin at the center of each provided region to speed up the training process.

Default: 0

--no_chr_prefix

Specify this if the reference genome your .bam files are aligned to uses a chromosome naming scheme such as “1,2,3,..,X,Y” instead of “chr1,chr2,chr3,..,chrX,chrY”, which is the default. Note that if your chromosomes are not named like the default, you must not use the “10k_random” setting for –bedpath_biasmodel or the “hg38” setting for –blacklist. Also, all other input files (refgenome_fasta, mappability_bigwig, bedpathlist, and cna_seg_file must follow the same notation.

Default: False

--use_this_roi_biasfactortable

If set, use the specified biasfactor table and only train/apply the biasmodel, skipping the calculation of coverage and bias factors.

--speed_mode

Only perform GC correction, don’t correct using mappability or di/trinucleotides. Setting this flag makes LIQUORICE considerably faster, but may lead to less accurate results. Currently respected only if –bedpath_biasmodel is not specified.

Default: False

--all_bins_same_size

If set, use always the same bin size (for both the core region provided with –bedpath_list and the flanking regions defined by –extend_to), instead of splitting the core region into bins with sizes corresponding to 10,15,25,15,and 10% of the core region’s length.

Default: False

--dont_crossvalidate_if_train_on_rois

Unless set, if –train_on_rois is specified, train two seperate bias models, each on half of the dataset, and use the trainedmodel to predict the other half.

Default: False

Optional named arguments - Technical settings

--n_cpus

Number of processors to be used whereever multiprocessing/multithreading is used.

Default: 1

--tmpdir

Use this directory as a temporary directory. Default None: search environment variables $TMPDIR,$TEMP,$TMP, and paths /tmp,/var/tmp and /usr/tmp, as well as the current working directory (in this order) until a suitable directory is found.

Optional named arguments - Output settings

--samplename

Name of the sample that is being processed. This will be used for output plots and the names of directories. Default None: Infer from –bamfile by removing .bam extension

--quiet

If set, the log level is set to “warning”, making LIQUORICE less chatty.

Default: False

--save_training_table

If set, save the training DataFrame of the bias model under ./<samplename>/biasmodel/training_table.csv (or ./<samplename>/<region-set name>/training_table.csv if –bedpath_biasmodel is not specified)

Default: False

--save_biasfactor_table

If set, for each region-set, save a table of coverage and biasfactor values per bin under ./<samplename>/<region-set name>/coverage_and_biasfactors_per_bin.csv

Default: False

LIQUORICE’s output

LIQUORICE creates a folder named after the samplename in the current working directory, and places all its output there. For every region-set, a subfolder is created. Within these folders, the following files are generated by default:

  • bins.bed: Genomic coordinates of all bins that passed the filtering and were used to generate the result. The fourth column contains the bin number (0 = most upstream bin of the region).

  • regions.bed: Genomic coordinates of all regions that passed the filtering and were used to generate the bins.

  • corrected_coverage_mean_per_bin.csv: The bin-wise, aggregated, bias-corrected coverage values.

  • uncorrected_coverage_mean_per_bin.csv: The bin-wise, aggregated, un-corrected coverage values.

  • corrected_vs_uncorrected_coverage.pdf: A plot showing the aggregated coverage profiles, comparing corrected and uncorrected values.

  • fitted_gaussians.pdf: A plot showing the fitted model, its seperate elements, and the bin-wise, aggregated, bias-corrected coverage values.

  • fitted_gaussians_parameter_summary.csv: A table summarizing the parameters of the fitted model. This includes the dip area and depth, the two main metrics that quantify the epigenetic signal.

  • GC_content__vs__corrected_coverage.pdf: A heatmap-style correlation plot, showing the correlation between the GC content and the corrected coverage value in each bin.

  • GC_content__vs__coverage.pdf: A heatmap-style correlation plot, showing the correlation between the GC content and the uncorrected coverage value in each bin.

LIQUORICE’s summary tool

Useful for generating summaries across samples for multiple region-sets:

Arguments

Summarize output of LIQUORICE for multiple samples and regions of interests in a table and plots. Please make sure that for for any given region-set LIQUORICE has been run with consistent ‘binsize’ and ‘extend_to’ settings across samples.)

usage: LIQUORICE_summary [-h] [--dirname DIRNAME]
                         [--control_name_list CONTROL_NAME_LIST [CONTROL_NAME_LIST ...]]
                         [--binsize BINSIZE] [--extend_to EXTEND_TO]
                         [--smooth_sigma SMOOTH_SIGMA]
                         [--use_uncorrected_coverage]
                         [--these_regionsets_only THESE_REGIONSETS_ONLY [THESE_REGIONSETS_ONLY ...]]
                         [--prediction_interval_alpha PREDICTION_INTERVAL_ALPHA]

Optional named arguments

--dirname

Directory in which LIQUORICE’s output is contained. Can contain output of multiple samples.

Default: “.”

--control_name_list

List of samples that serve as reference control samples. Used toinfer z-scores. Please do not surround the list with quotation marks: Example: –control_name_list sample1 sample2 would be correct, –control_name_list “sample1 sample2” would be incorrect.

Default: []

--binsize

–binsize setting that was used for LIQUORICE. Default: infer automatically

--extend_to

–extend_to that was used for LIQUORICE. Default: infer automatically

--smooth_sigma

Determines how strong the coverage drops should be smoothed for the overlay plot. Set to 0 for no smoothing.

Default: 2.0

--use_uncorrected_coverage

Per default the corrected coverages will be plotted. Set if you want to plot/summarize the uncorrected coverage instead.

Default: False

--these_regionsets_only

List of region sets for which a summary should be calculated. Default: summarize all detected region-sets

Default: False

--prediction_interval_alpha

Alpha level for the prediction interval. Samples are deemed significantly different from the controls if their score lies outside the prediction interval of the control group. Note: Significance testing assumes a normaldistribution of the scores of the control group. If tests for normality fail, assessment of significant differences is unavailable. Default 0.05: 95 precent prediction interval.

Default: 0.05

Output of LIQUORICE’s summary tool

In the current working directory, LIQUORICE_summary generates a file “summary_across_samples_and_ROIs.csv”, in which the individual fitted_gaussians_parameter_summary.csv files are summarized. If --control_name_list is specified, this also includes information on the significance of difference between a given case sample and the control samples. Two violinplots, “boxplot_summary_by_dip_<comparison_metric=[area,depth]>.pdf” are generated, in which the scores across groups of samples (if defined) and region-sets are summarized. Additionally, for each region-set, the following files are generated:

  • <regionset>_summary_plot_overlay_of_samples_by_Dip_<comparison_metric=[area,depth]>.pdf: Two side-by-side overlay plots (separate for control and case samples), showing the aggregated, bias-corrected coverage profile. Lines are colored by significance of differences between a given case sample and the control samples.

  • <regionset>_summary_plot_subplots_of_samples_by_Dip_<comparison_metric=[area,depth]>.pdf: A pdf document (multi-page if necessary) in which the aggregated, bias-corrected coverage profile is shown as a seperate plot for every sample. Lines are colored by significance of differences between a given case sample and the control samples.

  • <regionset>_control_distribution_dip_<comparison_metric=[area,depth]>.pdf: Shows the distribution of scores of the control samples as histograms and probability plots. This can help assess whether the control samples follow a normal distribution.

If the option --use_uncorrected_coverage is chosen, all output files will have the suffix “_using_uncorrected_coverage.<[pdf,csv]>”.

Assessment of significant differences

“Case” samples are classified as “significantly different” from the control group if they have a score (dip area or dip depth) that is outside of the prediction interval of the control group.

Here, the prediction interval is an estimate of an interval in which a future observation of a control sample will fall, with a certain probability (default: 95%), given the control samples that have already been observed. The prediction interval depends on the mean, standard deviation and number of samples of the control group - the smaller the standard deviation and the higher the sample number, the narrower the prediction interval will be (and therefore, the better significantly different “case” samples can be detected).

The calculation assumes that the control samples’s scores follow a normal distribution. LIQUORICE_summary performs the Shapiro-Wilk test for normal distribution: If this test detects significant deviations from a normal distribution, LIQUORICE_summary will display a warning. Note that test for normality may fail to detect relevant deviations from normal distributions when the sample size (number of control samples) is low. So, generally you should treat the results of the significance testing with care if you have only very few samples in the control group. You can also assess the distribution of control scores by looking at the plots named <regionset>_control_distribution_dip_<comparison_metric=[area,depth]>.pdf. If the histogram looks very much unlike a normal distribution, and if the data points in the probability plot strongly/systematically deviate from the reference line, the results of LIQUORICE_summary’s significance testing may be invalid.

Usage, Parameters, and Examples

Advice on parameter settings

Apart from the input data, the parameters with the strongest influence on LIQUORICE’s results and runtime are:

  • binsize (speed, memory & outcome)

  • extend_to (speed, memory & outcome)

  • n_cpus (speed, memory)

  • bedpath_biasmodel (speed, memory & outcome)

  • all_bins_same_size (outcome)

  • speed_mode (speed, memory & outcome)

A few words of advice on how to set these parameters properly:

binsize and extend_to

We haven chosen a default binsize of 500bp, and a default extend_to setting of 20kb, because these settings have worked best for us - for the epigenetic signatures we have studied so far. The optimal settings may be different for your application. If you want to analyze signals that are very wide, and you observe that the coverage profile is not fully flat at the edges of the plot (say, if the profile is not flat in the outermost 5kb on both sides), you can try to increase the extend_to parameter. Likewise, if you are observing very narrow signals, you can try decreasing extend_to accordingly, as well as try to decrease the binsize parameter.

Note that the larger the –extend_to parameter, and the smaller the binsize parameter, the longer the runtime (and memory usage) of LIQUORICE will be.

bedpath_biasmodel, use_this_biasmodel and detect_existing_biasmodel

If --bedpath_biasmodel is not specified (default), LIQUORICE does the following:

  • Train a seperate bias-model for each region-set (i.e. for each entry in --bedpathlist).

  • For a given region-set, each region is extended by --extend_to in both directions and split into bins.

  • As training data for the bias-model, all bins are used except (i) the 5 central bins that cover the core region, if --all_bins_same_size is not specified; or (ii) the one central bin, if --all_bins_same_size is specified.

  • All bins are used to determine the composite, bias-corrected coverage signature.

If --bedpath_biasmodel is specified, LIQUORICE does the following:

  • Train a single, common bias-model for every region-set (i.e. for each entry in --bedpathlist).

  • Each region in the .bed file specified under --bedpath_biasmodel (or each region in LIQUORICE’s own set of 10000 random regions if --bedpath_biasmodel 10k_random is specified) is extended by --extend_to_biasmodel (default: 0) in both directions and split into bins.

  • The resulting bins are used as training data for the bias-model.

Note that if --use_this_biasmodel is specified, neither of the above workflows is executed. Instead, the provided pre-trained model is used for the correction of all bins of every region-set (i.e. for each entry in --bedpathlist). The same applies if --detect_existing_biasmodel is specified and a valid model is present under <samplename>/biasmodel/trained_biasmodel.joblib.

We have decided to set the former option as default because it yielded somewhat better results for our own samples. However, we do encourage you to try both options for your own cohort. The latter option will also likely be a bit faster in case you are running LIQUORICE on a large number of region-sets.

all_bins_same_size

By default (i.e. if --all_bins_same_size is not specified), LIQUORICE does the following:

  • Split each region-of-interest into five bins with sizes corresponding to bins of 10%, 15%, 50%, 15%, and 10% of the total length of the region, respectively. This is done in order to facilitate comparisons between regions of different lengths within the same region set. After splitting, every site consists of five bins, regardless of the initial length of the region.

  • Next, the adjacent genomic region (-- extend_to basepairs to both sides) is split into bins of --binsize bp size. The most upstream bin starts extend_to bp upstream of the core region start, and the most downstream bin ends extend_to bp downstream of the core region end.

If --all_bins_same_size is specified, LIQUORICE does the following:

  • Use a size of --binsize bp for all bins, also the ones at the center.

  • The central bin is centered around the center of the region-of-interest. The other bins are tiled such that no gaps arise. Outmost bins will have their center at <center of the region>+-extend_to.

Also here, we have decided to set the former option as default because it yielded somewhat better results for our own samples. We do note, however, that differences in bin-size might introduce some slight biases in the coverage profile. While we have nevertheless found that this option works well for us, we do encourage you to try both options for your own cohort.

Parallelization

Increasing the n_cpus parameter will cause LIQUORICE to use more threads during the steps that are parallelized, and speed up the analysis. A (potentially faster) alternative to using this setting is to parallelize at the sample level, using GNU parallel (http://dx.doi.org/10.5281/zenodo.16303), which is automatically installed together with LIQUORICE :

SAMPLES="Sample1 Sample2 Sample3 Sample4 Sample5 Sample6 Sample7 Sample8 Sample9 Sample10"
NR_OF_CORES_TO_BE_USED=5

# Write a simple bash file that contains all required parameters for liquorice, and takes the sample name as an argument
# Replace the paths and file name according to your file locations.
echo 'LIQUORICE --bamfile "PATH_TO_BAMFILES/${1}.bam" --refgenome_fasta PATH_TO_REFERENCE_GENOME/hg38.fa --mappability_bigwig PATH_TO_MAPPABILITY_BW/hg38_mappability_75bp.bigwig --bedpathlist "PATH_TO_REGIONSETS/YOUR_REGIONSET_OF_INTEREST.bed" --blacklist hg38 --n_cpus 1 --cna_seg_file "PATH_TO_SEGFILES/${1}.seg"' >LIQUORICE_command.sh

parallel --results logs -j ${NR_OF_CORES_TO_BE_USED} bash LIQUORICE_command.sh  ::: ${SAMPLES}

Note that the memory usage will increase with the number of parallel jobs (set by the -j parameter of parallel). We usually allow for 3GB of RAM for each job executed in parallel and set the -j parameter accordingly ( j = <Total available Memory on the Computer/Server>/3 GB) when running LIQUORICE with default settings on a region-set of 6000 regions. Note that memory usage also depends on extend_to, binsize, speed_mode, and scales linearly with the number of regions in your region-sets. Finally: LIQUORICE’s results will slightly differ based on whether you use --n_cores 1 or --n_cores <anything larger than 1>. This is due to differences in the sampling of fragment lengths and nothing to worry about - both results are equally valid.

Sources for input files

  • bamfiles: Use your own (or publically available) paired-end whole genome sequencing data from liquid biopsies here. Data should be quality-controlled and trimmed (e.g. using fastp with default settings) as well as mapped (we have used bwa mem). We have found that higher sequencing depth improves results - from our own experience, we would recommend using a depth of at least 1x (or higher, if possible). For details, see Figure 6 of our recent publication.

  • mappability .bigwig files: This reference file should match i) your reference genome (e.g. hg38, hg19, …) and ii) the read-length of your samples. We provide pre-calculated files for hg38/hg19 and readlengths 35,50,75,100,150 and 250 here. If you require a different read length or reference genome, you can run create_mappability_bigwigs.sh like so:

    wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/create_mappability_bigwigs.sh
    bash create_mappability_bigwigs.sh PATH_TO_GENOME_FASTA READLENGTH NR_OF_CORES_TO_BE_USED
    

This will create a mappability bigwig file in the current directory.

  • regionsets-of-interest (bedpathlist): See Region-sets.

Test LIQUORICE with provided test data

To test whether your installation of LIQUORICE works as expected, you can test it on a small test dataset from a healthy control sample that we provide. Just follow the example below:

# Set desired nr. of cpus
N_CPUS=5

# download and unzip the reference genome and reference mappability file
wget https://hgdownload.soe.ucsc.edu/goldenPath/hg38/bigZips/p12/hg38.p12.fa.gz
gunzip hg38.p12.fa.gz
wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/data/hg38.p12.fa.fai
wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/data/hg38.fa.mappability_100bp.subsetted_for_testdata.bw

# download .bam file of a healthy control liquid biopsy sample (pre-processed to keep the size small)
wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/data/Ctrl_17_testdata.bam
wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/data/Ctrl_17_testdata.bam.bai


# download .bed file for universally accessible DHSs
wget https://github.com/epigen/LIQUORICE/raw/master/liquorice/data/universal_DHSs.bed

# run LIQUORICE
LIQUORICE --bamfile Ctrl_17_testdata.bam --refgenome_fasta "hg38.p12.fa" \
        --mappability_bigwig "hg38.fa.mappability_100bp.subsetted_for_testdata.bw" \
        --bedpathlist "universal_DHSs.bed" \
        --blacklist "hg38" --n_cpus "${N_CPUS}" --extend_to 15000

Example usage of LIQUORICE and the summary tool

# Run LIQUORICE for 4 samples and 3 region-sets, and summarize the results:

SAMPLES="sample1 sample2 sample3 sample4"
CONTROLS="sample1 sample2"
BAMS="./bams"
BEDS="./regionsets"
HG38="./hg38"

for SAMPLE in "$SAMPLES"
do
LIQUORICE --bamfile "${BAMS}/${SAMPLE}.bam" --refgenome_fasta "${HG38}/hg38.fa" \
    --mappability_bigwig "${HG38}/hg38_mappability_75bp.bw" \
    --bedpathlist "${BEDS}/regionset1.bed" "${BEDS}/regionset2.bed" "${BEDS}/regionset3.bed" \
    --blacklist "hg38" --n_cpus 8
done

LIQUORICE_summary --control_name_list ${CONTROLS}
# Please make sure to not put quotation marks around the list specified for --control-name-list.
# Example: --control_name_list sample1 sample2 would be correct, --control_name_list "sample1 sample2" would be incorrect.