Discovery and Quantification

The main core modules from MicroExonator correspond to the Discovery and Quantification modules. On a regular MicroExonator run, both modules are run however is also posible to run these modules independely. By default, all the downloaded files are temporarely stored in MicroExonator/FASTQ folder, which means that several GBs of disk space may be required.

Configuration

Before running Discovery and Quantification modules is necesary to set all the required parameters inside the config.yaml file.

Minimal configuration

In this seccetion we describe the minimal set of parameters that needs to be defined before running MicroExonator. Since these modules correspond to the core of the pipeline, these parameters are compusory, even if they are not used by the rest of the modules. An example of a config.yaml file that contain all these parameters would look like this:

Genome_fasta : /path/to/Genome.fa
Gene_anontation_bed12 : /path/to/ensembl.bed12
GT_AG_U2_5 : /path/to/GT_AG_U2_5.good.matrix
GT_AG_U2_3 : /path/to/GT_AG_U2_3.good.matrix
conservation_bigwig : /path/to/conservation.bw
working_directory : /path/to/MicroExonator/
ME_len : 30
Optimize_hard_drive : T
min_number_files_detected : 3

Here:

  • Genome_fasta is a multifasta file containg the cromosomes.
  • Gene_anontation_bed12 is a BED file containing the transcript annotation. A collection of these files can be found at UCSC Table Browser.
  • GT_AG_U2_5 and GT_AG_U2_5 are splice site PWMs that can come from SpliceRack. Examples of these inmput files can be found at /PWM Folder. In th case there are not any splice site PWMs available for your specie of interest, you cam assing them NA as a value and MicroExonator will generate the PWM internally based on annotated splice sites.
  • conservation_bigwig is a bigwig file containing genome-wide conservation scores generated by Pylop or PhastCons, which can be downloaded from UCSC genome browser for some species. If you do not have a conservation_bigwig file, you need to create an bigwig which has the value 0 for every position in your genome assembly.
  • working_directory is the path to the MicroExonator folder that you are working with
  • ME_len is microexon maximum lenght. The default value is 30.
  • Optimize_hard_drive can be set as T or F (true of false). When it is set as F, fastq files will be downloaded or copied only once at FASTQ/ directory. This can be inconvenient when you are analysing large amout of data (for instance when your input data is larger than 1TB), because the copy will not be deleted until MicroExonator finish completely. When Optimize_hard_drive is set as T instead, two independet local copies will be generated for the discovery and quantification moduled. rapAs these are temporary copies, every copied fastq file will be deleted as soon as is mapped to the splice junction tags, which means the disk space usage will be set to the minimum while MicroExonator is runing.
  • min_number_files_detected correspond to the minumun number of in which a microexon needs to be found in order to concider it as high confidence. Setting this number to at least 2 is recommended single-end FASTQ files and 3 if paired end files are also present.

Note

If the corresponding files for GT_AG_U2_5, GT_AG_U2_5 or conservation_bigwig are not available for the specie you are currently working on, you can set any of these parameters as NA.

Optional configuration

The following parameters can be specified to further optimize the Discovery and Quantification. The can also be omited, in case there are not relevant or available.

  • ME_DB is a path to a known Microexon annotation file, such as the one that can be obtained from Vast DB. The input format must be in bed12 (additional formats will be supported in future versions)
  • min_reads_PSI is the minimun number of reads that needs supoort the existence of a novel microexon to consider it as high confidence. The default value is 3, but 5 or more is recommended if enough RNA-seq samples are provided.
  • paired_samples is a path to a tab-delimited file with two columns that indicate the correspndace between paired end samples names. This will enable MicroExonator to report a single quantification output per paired-end sample.

Note

In order to access Vast DB annotation files as bed12, you can try by connecting Vast DB track hub to UCSC Genome browser (instructions). Then by using the UCSC Table Browser you will be able to access their alternative splicing event annotation as bed12.

Run

If you are working remotely, we highly recommend creating a screen before running MicroExonator:

screen -S session_name  #choose a meaning full name to you

To activate snakemake enviroment

conda activate snakemake_env

In case you already have an older version of conda use instead:

source activate snakemake_env

Then run

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster {cluster system params} --use-conda -k  -j {number of parallel jobs}

Warning

In order to run the command above you need to replace {cluster system params} and {number of parallel jobs} with the appropiate values.

You should use --cluster only if you are working in a computer cluster that uses a queuing systems. We provide an example of cluster.json to work with lsf, in this case cluster system params should be replaced with the specific parameters of lsf. The number of parallel jobs can be a positive integer, the appropriate value depends on the capacity of your machine but for most users a value between 5 and 50 is appropriate. As an example the following command would be able to run MicroExonator using lsf and allowing for 500 parallel jobs.

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster "bsub -n {cluster.nCPUs} -R {cluster.resources} -c {cluster.tCPU} -G {cluster.Group} -q {cluster.queue} -o {cluster.output} -e {cluster.error} -M {cluster.memory}" --use-conda -k  -j 500 -np

Before running it is recommended to check if SnakeMake can corretly generate all the steps given your input. To do this you can carry out a dry-run using the -np parameter:

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster {cluster system params} --use-conda -k  -j {number of parallel jobs} -np

The dry-run will display all the steps and commands that will be excecuted. If the dry-run cannot be initiated, make sure that you are running MicroExonator from inside the folder you cloned from this repository. Also make sure you have the right configuration inside config.yaml.

Note

If you are working remotelly, the connection is likely to die before MicroExonator finish. However, as long as you are working within an screen, you can re attach the screen and see MicroExonator progress. To list your active screens you can do:

screen -ls

To reattach and detach screens just use:

screen -r session_name  # only detached screen can be reattached
screen -d session_name

Warning

If you have any errors while you are running MicroExonator is useful to read the logs that are reported by the queuing system. Some errors may occur because when not enough memory has been allocated for a given step. Resources for each step can be this can be configured inside cluster.json (check example file at MicroExonator/Examples/Cluster_config/lsf/)

Running large datasets

Since MicroExonator wad developed as an snakemake workfolow, its possible to scale the analysis to big datasets. however there are a copule of recommendations you should keep in mind when you are running a large quantituity of samples.

Limit your downloading jobs

If you are fetching multiple fastq files from NCBI, is posible that some of the download processes will fail. To avoid overloading the connection with NCBI you can limmit the amout of dowloading jobs by assinging a maximun value to a resource called get_data. For example if you want to limmit the downloading process to only run one job at the time, the running command would be:

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster {cluster system params} --use-conda -k  -j {number of parallel jobs} --resources get_data=50

Optimize your hardrive space

If you want to process a large dataset, is likely that you will not have enough disk space to temporarely store all the FASTQ files at the same time. In this case we recommend to run the Discovery and Quantification module independely and set Optimize_hard_drive as T on the config.yaml file. In order to do this you use discovery as a target and snakemake will only excecute the Discovery module:

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster {cluster system params} --use-conda -k  -j {number of parallel jobs} discovery

By doing this, as Optimize_hard_drive is set as T, downloaded FASTQ files will be deleted as soon as they processed on this module. Once the pipeline finish the discovery module successfuly, you can resume the analysis by running MicroExonator with quant as a target:

snakemake -s MicroExonator.smk  --cluster-config cluster.json --cluster {cluster system params} --use-conda -k  -j {number of parallel jobs} quant

Note that in this case, previously deleted FASTQ files will be downloaded again (or copied if you are using samples that are stored locally). The advantage of doing this is to reduce the amout of space required to to run the analysis as FASTQ files deleted as soon as they are processed these two modules.

Mind the number of files being generated

HPC systems often have a disk qouta limit, but they may also have quota for the maximun number of files that can be generated by each user. In the case you want delete uncesserary files after a MicroExonator run is completed, you can delete the .snakemake/ and logs/ folder.

rm -rf .snakemake logs

Note

If the pipeline gets interrupted or you encounter any error, you can re-intiate MicroExonator by submiting a running command again. This will not run again processes that finished successfuly, but it will only submit the jobs that are required to generate the files that are missing to complete the run. If at any point you want to start from scratch using the same path, you can delete /download folder to ensure every sample is processed again.

Output

The main results of MicroExonator discovery and quantification modules can be found at the Results folder. All the detected microexons that passed though the quantitative filters can be found at out.high_quality.txt. This is a tabular separated file with 14 columns that contain the folowing information:

out.high_quality.txt
Column Description
ME Microexon Coordinates
Transcript Transcript where the microexon was detected
Total_coverage Total coverage across all microexon splice junctions
Total_SJs Splice junctions where the micrexon was detected in
ME_coverages Coma-separated coverage values for each microexon splice junction
ME_length Microexon length
ME_seq Microexon sequence
ME_matches Microexon number of matches inside the intron
U2_score U2 splicing score
Mean_conservation Mean conservation values (if phylop score was provided)
P_MEs Microexon confidence score
Total_ME ME coordinates, U2 score and conservation for all microexon matches
ME_P_value Value used for the final microexon filters
ME_type Microexon type (IN, RESCUED or OUT)

MicroExonator also reports microexons that do not meet the confidence filtering criteria. Detected microexons that are equal or shorter than 3 nt are reported at out_shorter_than_3_ME.txt. Microexons that are longer than 3 nt, but did not have sufficiently low ME_P_value are reported at out_low_scored_ME.txt. Finally, microexons that had ME_P_values below the threshold, but they can also correspond to alternative splicing acceptors or donors (as the microexon sequence matches either at the begining or the end of an intron) are reported at out.ambiguous.txt.

On the other hand, microexon quantification is provided as out_filtered_ME.PSI.txt file. This file contains the following information:

out_filtered_ME.PSI.txt
Column Description
File Sample name
ME_coords Microexon Coordinates
SJ_coords Splice junctions where the micrexon was detected in
ME_coverages Coma-separtated values corresponding to microexon converage on each splice junction
SJ_coverages Coma-separtated values corresponding to converage on each splice junction (exon skiping)
ME_coverages Computed PSI values for a given microexon on a given sample
PSI Lower bound of the PSI confidence interval
CI_Lo Lower bound of the PSI confidence interval
CI_Hi Upper bound of the PSI confidence interval
Alt5 Alternative donor coordinate
Alt3 Alternative aceptor coordinate
Alt5_coverages Alternative donor coverage
Alt3_coverages Alternative donor coverage