TopPG: A customized protein database generation tool

1. Overview

TopPG builds different kinds of customized protein databases containing genetic variations and/or splicing variations to satisfy the requirements of different applications. To run TopPG, you need Python 3.x and compatible versions of the pandas, gffutils and biopython packages. For databases containing splicing events, you also need to install bedtools and its corresponding Python package.
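
Before running the pipeline, it can help to confirm that the dependencies listed above are importable. The following is a minimal sketch, not part of TopPG itself. It assumes that biopython is imported as "Bio" and that the Python package corresponding to bedtools is "pybedtools"; the latter is an assumption, since this README does not name the package.

```python
import importlib.util
import shutil

# Import names of the packages listed in the overview.
# "Bio" is the import name of biopython.
REQUIRED = ["pandas", "gffutils", "Bio"]
# Assumption: the bedtools Python package is pybedtools.
SPLICING_ONLY = ["pybedtools"]


def missing_modules(names):
    """Return the subset of top-level module names that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def bedtools_on_path():
    """Return True if a bedtools executable is found on PATH."""
    return shutil.which("bedtools") is not None


print("Missing core packages:", missing_modules(REQUIRED))
print("Missing splicing packages:", missing_modules(SPLICING_ONLY))
print("bedtools executable found:", bedtools_on_path())
```

An empty list for the core packages means the basic (non-splicing) mode should at least be able to import what it needs.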

2. Parameters

-f [ --file ], the ANNOVAR variant annotation file, with the file extension "exonic_variant_function".

-f1 [ --file1 ], an rMATS output file in "AS_Event.MATS.JCEC.txt" format.

-f2 [ --file2 ], an rMATS output file in "fromGTF.AS_Event.txt" format.

-g [ --gff ], the annotation file in GFF3 format.

-r [ --rna ], the FASTA file of reference transcript sequences. You can download the three reference transcript databases from the download page.

-o [ --output ], the output name for the database, default="customized_db".

-h [ --het ] <0|1|2>, the maximum number of heterozygous genetic variants per sequence, default=0.

-s [ --splicing ], add splicing variations to the sequences.

-e [ --exclude ], exclude sequences without any variations.
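
To illustrate what the -h option controls, the sketch below enumerates every sequence that carries at most a given number of heterozygous variants. It is only one way to read the option and is not TopPG's actual implementation. The function name, the toy sequence and the restriction to single-residue substitutions are all assumptions made for the example.

```python
from itertools import combinations


def enumerate_variant_sequences(reference, het_variants, max_het):
    """Yield (applied_variants, sequence) for every subset of at most
    max_het heterozygous variants.

    het_variants is a list of (position, alt_residue) pairs with 0-based
    positions. Only substitutions are handled in this sketch.
    """
    for k in range(max_het + 1):
        for subset in combinations(het_variants, k):
            residues = list(reference)
            for position, alt in subset:
                residues[position] = alt
            yield subset, "".join(residues)


# Hypothetical toy data: a five-residue sequence with two heterozygous sites.
reference = "MKTAY"
het_variants = [(1, "R"), (3, "V")]

for applied, sequence in enumerate_variant_sequences(reference, het_variants, 1):
    print(applied, sequence)
```

With a limit of 0 only the unmodified sequence is produced. With a limit of 1 each variant also appears alone, and with a limit of 2 the pairwise combinations are added as well, so the number of sequences grows quickly as the limit increases.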

3. Examples

Generate a customized database with at most one heterozygous variant per sequence.

python generate_db_pipeline.py -f DLD.exonic_variant_function -g gencode.v28.basic.annotation.gff3 -r gencode.v28.transcripts.basic.fasta -h 1

Generate a customized database with splicing variations (exon-skipping events) and at most one heterozygous variant per sequence.

python generate_db_pipeline.py -f1 SE.MATS.JCEC.txt -f2 fromGTF.SE.txt -f DLD.exonic_variant_function -g gencode.v28.basic.annotation.gff3 -r gencode.v28.transcripts.basic.fasta -h 1 -s
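
When the pipeline is run for many samples, the command lines above can be assembled from Python. The helper below is a hypothetical wrapper, not part of TopPG. It uses only the options documented in the Parameters section and assumes, as in both examples, that the ANNOVAR file is always supplied and that the two rMATS files are given together.

```python
import subprocess


def build_command(variants, gff, rna, het=0, rmats_jcec=None,
                  rmats_gtf=None, output=None, exclude=False):
    """Build the argument list for generate_db_pipeline.py."""
    if het not in (0, 1, 2):
        raise ValueError("het must be 0, 1 or 2")
    if (rmats_jcec is None) != (rmats_gtf is None):
        raise ValueError("the two rMATS files must be given together")

    splicing = rmats_jcec is not None
    cmd = ["python", "generate_db_pipeline.py"]
    if splicing:
        cmd += ["-f1", rmats_jcec, "-f2", rmats_gtf]
    cmd += ["-f", variants, "-g", gff, "-r", rna, "-h", str(het)]
    if splicing:
        cmd.append("-s")  # add splicing variations
    if output is not None:
        cmd += ["-o", output]
    if exclude:
        cmd.append("-e")  # drop sequences without any variations
    return cmd


cmd = build_command(
    "DLD.exonic_variant_function",
    "gencode.v28.basic.annotation.gff3",
    "gencode.v28.transcripts.basic.fasta",
    het=1,
)
print(" ".join(cmd))
# To actually run the pipeline, uncomment the next line:
# subprocess.run(cmd, check=True)
```

Passing the arguments as a list avoids shell quoting problems when file names contain spaces.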