In addition to this tutorial, you can find video tutorials on TopPIC Suite (link) and the interpretation of TopPIC and TopMG identifications (link). We thank Dr. David Tabb for making these video tutorials.
In this tutorial, we use TopPIC Suite to analyze two top-down LC-MS/MS data files on a computer with a Windows Operating System. Annotated proteoform spectrum matches (PrSMs) identified by TopPIC from the data files can be browsed here.
Create the folders below for software packages and data sets used in this tutorial.
toppic_tutorial on the C: drive of your system.
toppic in the folder C:\toppic_tutorial\ for the TopPIC Suite software.
tutorial_1 in the folder C:\toppic_tutorial\.
tutorial_2 in the folder C:\toppic_tutorial\.
tutorial_3 in the folder C:\toppic_tutorial\.
The resulting folder structure is shown in the screenshot below.
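For readers who prefer to script the setup, the folder layout above can be sketched in Python. This is only an illustration: the function name create_tutorial_folders is hypothetical and not part of TopPIC Suite, and the root drive is passed in as a parameter.

```python
from pathlib import Path


def create_tutorial_folders(root):
    """Create toppic_tutorial and its four subfolders under root.

    Returns the list of subfolder paths. Existing folders are left untouched.
    """
    base = Path(root) / "toppic_tutorial"
    created = []
    for name in ["toppic", "tutorial_1", "tutorial_2", "tutorial_3"]:
        folder = base / name
        folder.mkdir(parents=True, exist_ok=True)
        created.append(folder)
    return created


# Example call on Windows (not executed here):
# create_tutorial_folders("C:/")
```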
Msconvert is a software tool in ProteoWizard that converts raw files into various spectrum file formats.
Follow the steps below to download ProteoWizard:
C:\toppic_tutorial\toppic\
.C:\toppic_tutorial\toppic\
.
In the MS experiment, the protein extract of S. typhimurium was reduced with dithiothreitol and alkylated with iodoacetamide. The protein mixture was first separated by gas-phase fractionation, resulting in 7 fractions. Each fraction was separated by an HPLC system coupled with an LTQ-Orbitrap mass spectrometer (Thermo Fisher Scientific). MS and MS/MS spectra were collected at a resolution of 60,000 and 30,000, respectively. In this tutorial, we use only the data files of two fractions (st_1.raw and st_2.raw).
Click here to download the data set, save it in the folder C:\toppic_tutorial\tutorial_1\, and unzip it in the same folder.
A S. typhimurium proteome database of 4,533 proteins was downloaded from the UniProt database.
Click here to download the protein database and save it in the folder C:\toppic_tutorial\tutorial_1\.
The folder C:\toppic_tutorial\tutorial_1\ is shown in the screenshot below.
We use TopIndex to generate index files from the protein database. The index files speed up the database search of TopPIC and TopMG. This step is optional: skipping it only slows the database search in Section 4.5. TopIndex supports multithreading, but on a spinning hard disk it usually runs faster with one thread than with multiple threads. TopIndex generates very large index files. For example, the index files generated for the target-decoy concatenated UniProt human proteome database are about 240 GB. For fast index generation, we suggest using a computer with an SSD (Solid State Drive) of at least 1 TB.
topindex_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_1\uniprot-st.fasta.
Carbamidomethylation on cysteine as the fixed modification.
Decoy database.
The screenshot of topindex_gui is shown below.
TopIndex generates a folder C:\toppic_tutorial\tutorial_1\uniprot-st.fasta_idx containing index files.
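Because index files can be very large, it may help to check the free disk space before running TopIndex. The sketch below uses Python's standard shutil.disk_usage; the helper name has_enough_space and the threshold passed to it are illustrative choices, not a requirement stated by TopPIC Suite.

```python
import shutil


def has_enough_space(folder, required_gb):
    """Return True if the drive holding folder has at least required_gb GiB free."""
    free_bytes = shutil.disk_usage(folder).free
    return free_bytes >= required_gb * 1024 ** 3


# Example call (not executed here): the human proteome index mentioned above
# is about 240 GB, so one might check
# has_enough_space("C:/toppic_tutorial/tutorial_1", 240)
```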
In the analysis, carbamidomethylation is selected as the fixed modification because proteins were reduced with dithiothreitol and alkylated with iodoacetamide before the MS experiment. When proteins are not reduced, no fixed modification should be selected.
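To make the effect of the fixed modification concrete, the sketch below computes the total mass shift it adds to a protein sequence. It assumes the standard monoisotopic mass of carbamidomethylation (57.021464 Da per cysteine); the function name fixed_mod_shift is hypothetical and the calculation only illustrates the idea, it is not how TopPIC applies the modification internally.

```python
# Monoisotopic mass shift of carbamidomethylation on cysteine, in Da.
CARBAMIDOMETHYL_SHIFT = 57.021464


def fixed_mod_shift(sequence):
    """Return the total mass shift (Da) when every cysteine is carbamidomethylated."""
    return sequence.upper().count("C") * CARBAMIDOMETHYL_SHIFT


# A sequence with two cysteines gains 2 * 57.021464 Da;
# a sequence without cysteine is unchanged.
```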
We use MSConvertGUI to convert the raw files st_1.raw and st_2.raw to mzML files.
C:\toppic_tutorial\tutorial_1\st_1.raw and C:\toppic_tutorial\tutorial_1\st_2.raw as input files.
The screenshot of MSConvertGUI is shown below.
In the above file format conversion, the peak picking filter (step 3) is used to generate centroid, not profile, mzML data files, which are required by the spectral deconvolution tool TopFD.
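If you are unsure whether a converted file is centroided, a quick check is to look for the PSI-MS controlled vocabulary terms that mzML files use to describe spectra: MS:1000127 (centroid spectrum) and MS:1000128 (profile spectrum). The sketch below is a plain text scan, not a full mzML parser, and the function name spectrum_mode is hypothetical.

```python
def spectrum_mode(mzml_path):
    """Classify an mzML file as 'centroid', 'profile', 'mixed', or 'unknown'.

    The file is scanned line by line for the PSI-MS accessions
    MS:1000127 (centroid spectrum) and MS:1000128 (profile spectrum).
    """
    has_centroid = False
    has_profile = False
    with open(mzml_path, encoding="utf-8", errors="replace") as handle:
        for line in handle:
            if "MS:1000127" in line:
                has_centroid = True
            if "MS:1000128" in line:
                has_profile = True
            if has_centroid and has_profile:
                break  # both terms seen, no need to read further
    if has_centroid and has_profile:
        return "mixed"
    if has_centroid:
        return "centroid"
    if has_profile:
        return "profile"
    return "unknown"
```

A result other than "centroid" suggests that the peak picking filter was not applied to all spectra during conversion.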
The resulting mzML files are C:\toppic_tutorial\tutorial_1\st_1.mzML and C:\toppic_tutorial\tutorial_1\st_2.mzML. The sizes of the two files are about 41 MB and 47 MB, respectively. They can be downloaded here. The running time for the file format conversion is less than one minute.
We use topfd_gui for top-down mass spectral deconvolution.
topfd_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_1\st_1.mzML and C:\toppic_tutorial\tutorial_1\st_2.mzML as input files.
The screenshot of topfd_gui is shown below.
TopFD reports eight text files and four folders.
C:\toppic_tutorial\tutorial_1\st_1_ms2.msalign
C:\toppic_tutorial\tutorial_1\st_2_ms2.msalign
C:\toppic_tutorial\tutorial_1\st_1_ms1.feature
C:\toppic_tutorial\tutorial_1\st_1_ms2.feature
C:\toppic_tutorial\tutorial_1\st_2_ms1.feature
C:\toppic_tutorial\tutorial_1\st_2_ms2.feature
C:\toppic_tutorial\tutorial_1\st_1_feature.xml
C:\toppic_tutorial\tutorial_1\st_2_feature.xml
C:\toppic_tutorial\tutorial_1\st_1_html
C:\toppic_tutorial\tutorial_1\st_2_html
C:\toppic_tutorial\tutorial_1\st_1_file
C:\toppic_tutorial\tutorial_1\st_2_file
The output files and folders can be downloaded here.
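The naming pattern in the list above can be written down as a small helper that derives the expected TopFD outputs from an mzML file name. The pattern is inferred only from the file names shown in this tutorial, and the function name topfd_outputs is hypothetical.

```python
from pathlib import PureWindowsPath


def topfd_outputs(mzml_path):
    """Return the output paths expected for one mzML input file.

    The suffixes follow the file and folder names listed in this tutorial.
    """
    path = PureWindowsPath(mzml_path)
    stem = path.stem  # for example "st_1" for "st_1.mzML"
    suffixes = [
        "_ms2.msalign",
        "_ms1.feature",
        "_ms2.feature",
        "_feature.xml",
        "_html",  # folder
        "_file",  # folder
    ]
    return [str(path.parent / (stem + suffix)) for suffix in suffixes]
```

Calling the helper for both st_1.mzML and st_2.mzML gives the eight files and four folders listed above, which can be used to check that a run completed.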
We use toppic_gui to search the MS/MS spectra in st_1_ms2.msalign and st_2_ms2.msalign against the protein database uniprot-st.fasta to identify PrSMs with a variable PTM file var_mods.txt, in which oxidation on methionine is set as a variable PTM. The variable PTM file can be downloaded here.
toppic_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_1\uniprot-st.fasta as the protein database file.
C:\toppic_tutorial\tutorial_1\st_1_ms2.msalign and C:\toppic_tutorial\tutorial_1\st_2_ms2.msalign as mass spectrum data files.
Carbamidomethylation on cysteine as the fixed modification.
Decoy database.
FDR as the spectrum level cutoff type.
FDR as the proteoform level cutoff type.
The screenshots of toppic_gui are shown below.
For each input msalign file, TopPIC reports four TSV files, two XML files, and collections of HTML files for identified proteoforms. For example, the output files for st_1_ms2.msalign are
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_prsm.tsv
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_prsm_single.tsv
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_proteoform.tsv
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_proteoform_single.tsv
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_proteoform.xml
C:\toppic_tutorial\tutorial_1\st_1_ms2_toppic_prsm.xml
C:\toppic_tutorial\tutorial_1\st_1_html\toppic_prsm_cutoff
C:\toppic_tutorial\tutorial_1\st_1_html\toppic_proteoform_cutoff
C:\toppic_tutorial\tutorial_1\st_1_html\topmsv
In addition, the identifications reported for st_1_ms2.msalign and st_2_ms2.msalign are combined, and filtered by a 1% spectrum-level FDR and a 1% proteoform-level FDR. The combined results are reported in the following files.
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_prsm.tsv
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_prsm_single.tsv
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_proteoform.tsv
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_proteoform_single.tsv
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_proteoform.xml
C:\toppic_tutorial\tutorial_1\combined_ms2_toppic_prsm.xml
C:\toppic_tutorial\tutorial_1\combined_html\toppic_prsm_cutoff
C:\toppic_tutorial\tutorial_1\combined_html\toppic_proteoform_cutoff
C:\toppic_tutorial\tutorial_1\combined_html\topmsv
In the analysis, carbamidomethylation is selected as the fixed modification because proteins were reduced with dithiothreitol and alkylated with iodoacetamide before the MS experiment. When proteins are not reduced, no fixed modification should be selected.
A shuffled decoy database is concatenated to the target database to estimate spectrum-level and proteoform-level FDRs. All identified PrSMs are first filtered by a 1% spectrum-level FDR, and the resulting PrSMs are reported in the file combined_ms2_toppic_prsm.tsv. The proteoforms corresponding to the PrSMs are further filtered using a 1% proteoform-level FDR, and the resulting proteoforms and their corresponding best PrSMs are reported in the file combined_ms2_toppic_proteoform.tsv. Microsoft Excel can be used to open these two files.
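The idea behind target-decoy FDR estimation can be illustrated with a short sketch. It uses the common estimate of the number of decoy matches divided by the number of target matches that pass a score threshold. This is a generic illustration under that assumption; the exact procedure used inside TopPIC may differ, and the function name fdr_at_threshold is hypothetical.

```python
def fdr_at_threshold(matches, threshold):
    """Estimate the FDR among matches whose score is at most threshold.

    matches is a list of (score, is_decoy) pairs, where a lower score is
    better (for example an E-value). The estimate is decoys / targets.
    """
    targets = sum(1 for score, is_decoy in matches
                  if score <= threshold and not is_decoy)
    decoys = sum(1 for score, is_decoy in matches
                 if score <= threshold and is_decoy)
    if targets == 0:
        return 0.0
    return decoys / targets
```

To filter at a 1% FDR, one would choose the loosest threshold for which this estimate stays at or below 0.01 and keep the target matches that pass it.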
To browse the PrSM identifications, go to the folder combined_html\topmsv and open the file index.html in Google Chrome (Microsoft Edge and Firefox are not recommended).
The output files can be downloaded here.
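Besides Microsoft Excel, the TSV result files can be loaded in a script. The sketch below makes one assumption about the file layout that you should verify against your own output: any parameter block at the top of the file is delimited by lines starting with an asterisk, and the table header row follows the last such line. The function name read_toppic_tsv is hypothetical, and no column names are assumed.

```python
import csv


def read_toppic_tsv(path):
    """Read a tab-separated result file into a list of dictionaries.

    Assumption: lines up to and including the last line that starts with
    '*' form a parameter block and are skipped. The first non-empty line
    after it is treated as the header row.
    """
    with open(path, encoding="utf-8") as handle:
        lines = handle.read().splitlines()
    last_marker = max(
        (i for i, line in enumerate(lines) if line.startswith("*")),
        default=-1,
    )
    table = [line for line in lines[last_marker + 1:] if line.strip()]
    return list(csv.DictReader(table, delimiter="\t"))
```

Each returned dictionary maps the column names found in the header row to the values of one row, so the actual column names of your TopPIC version can be inspected with rows[0].keys().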
We use topindex to generate index files from the protein database uniprot-st.fasta to speed up database search of TopPIC and TopMG.
C:\toppic_tutorial\toppic\topindex.exe
C:\toppic_tutorial\tutorial_1\uniprot-st.fasta
cd C:\toppic_tutorial\tutorial_1
..\toppic\topindex -f C57 -d uniprot-st.fasta
We use topfd for top-down mass spectral deconvolution.
C:\toppic_tutorial\toppic\topfd.exe
C:\toppic_tutorial\tutorial_1\st_1.mzML
C:\toppic_tutorial\tutorial_1\st_2.mzML
cd C:\toppic_tutorial\tutorial_1
..\toppic\topfd st_*.mzML
We use toppic to search the MS/MS spectra in st_1_ms2.msalign and st_2_ms2.msalign against the protein database uniprot-st.fasta to identify PrSMs.
C:\toppic_tutorial\toppic\toppic.exe
C:\toppic_tutorial\tutorial_1\uniprot-st.fasta
C:\toppic_tutorial\tutorial_1\st_1_ms2.msalign
C:\toppic_tutorial\tutorial_1\st_2_ms2.msalign
C:\toppic_tutorial\tutorial_1\st_1_ms1.feature
C:\toppic_tutorial\tutorial_1\st_1_ms2.feature
C:\toppic_tutorial\tutorial_1\st_2_ms1.feature
C:\toppic_tutorial\tutorial_1\st_2_ms2.feature
C:\toppic_tutorial\tutorial_1\var_mods.txt
cd C:\toppic_tutorial\tutorial_1
..\toppic\toppic -f C57 -d -t FDR -T FDR -b var_mods.txt -c combined uniprot-st.fasta st_*_ms2.msalign
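When the search is launched from a script rather than the command prompt, the same command line can be assembled in Python. The sketch below only builds the argument list, using the options shown in the command above. The function name build_toppic_command is hypothetical, and the wildcard is expanded with glob as a scripting convenience.

```python
import glob


def build_toppic_command(toppic_exe, database, spectrum_pattern,
                         var_mods=None, combined_name=None):
    """Build the toppic argument list used in this tutorial.

    spectrum_pattern is a wildcard such as "st_*_ms2.msalign"; it is
    expanded here into a sorted list of matching files.
    """
    spectra = sorted(glob.glob(spectrum_pattern))
    if not spectra:
        raise FileNotFoundError("no msalign files match " + spectrum_pattern)
    command = [toppic_exe, "-f", "C57", "-d", "-t", "FDR", "-T", "FDR"]
    if var_mods:
        command += ["-b", var_mods]  # variable PTM file
    if combined_name:
        command += ["-c", combined_name]  # name for the combined results
    return command + [database] + spectra
```

The returned list can be passed to subprocess.run(command, check=True) from the folder C:\toppic_tutorial\tutorial_1.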
We will use TopMG to analyze the data set st_1.raw described in Tutorial 1. TopMG is still in the development stage. Please let us know if you find any bugs in it.
Download the data set, save it in the folder C:\toppic_tutorial\tutorial_2\, and unzip it. It includes the following files.
C:\toppic_tutorial\tutorial_2\uniprot-st.fasta
C:\toppic_tutorial\tutorial_2\st_1_ms2.msalign
C:\toppic_tutorial\tutorial_2\st_1_ms1.feature
C:\toppic_tutorial\tutorial_2\st_1_ms2.feature
C:\toppic_tutorial\tutorial_2\variable_mods.txt
C:\toppic_tutorial\tutorial_2\st_1_html
C:\toppic_tutorial\tutorial_2\st_1_file
To speed up database search, follow the steps in Section 4.2.1 to generate index files for the database file uniprot-st.fasta. If index files have been generated, it is not necessary to regenerate index files. You can copy the index folder to the folder C:\toppic_tutorial\tutorial_2\.
topmg_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_2\uniprot-st.fasta as the protein database file.
C:\toppic_tutorial\tutorial_2\st_1_ms2.msalign as a mass spectrum data file.
C:\toppic_tutorial\tutorial_2\variable_mods.txt as the file of variable PTMs.
Carbamidomethylation on cysteine as the fixed modification.
Decoy database.
FDR as the spectrum level cutoff type.
FDR as the proteoform level cutoff type.
The screenshots of topmg_gui are shown below.
TopMG reports four TSV files, two XML files, and collections of HTML files for identified proteoforms.
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_prsm.tsv
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_prsm_single.tsv
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_proteoform.tsv
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_proteoform_single.tsv
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_proteoform.xml
C:\toppic_tutorial\tutorial_2\st_1_ms2_topmg_prsm.xml
C:\toppic_tutorial\tutorial_2\st_1_html\topmg_prsm_cutoff
C:\toppic_tutorial\tutorial_2\st_1_html\topmg_proteoform_cutoff
C:\toppic_tutorial\tutorial_2\st_1_html\topmsv
The output files can be downloaded here.
To browse the PrSM identifications, go to the folder st_1_html\topmsv and open the file index.html in Google Chrome (Microsoft Edge and Firefox are not recommended).
C:\toppic_tutorial\toppic\topmg.exe
C:\toppic_tutorial\tutorial_2\uniprot-st.fasta
C:\toppic_tutorial\tutorial_2\st_1_ms2.msalign
C:\toppic_tutorial\tutorial_2\st_1_ms1.feature
C:\toppic_tutorial\tutorial_2\st_1_ms2.feature
C:\toppic_tutorial\tutorial_2\variable_mods.txt
cd C:\toppic_tutorial\tutorial_2
..\toppic\topindex -f C57 -d uniprot-st.fasta
..\toppic\topmg -f C57 -d -t FDR -v 0.05 -T FDR -V 0.05 -i variable_mods.txt uniprot-st.fasta st_1_ms2.msalign
We will use TopPIC and TopDiff to compare the abundance of proteoforms and find differentially expressed proteoforms using two MS data files of Escherichia coli cells (ecoli_1.raw and ecoli_2.raw).
In the MS experiment, the protein extract of E. coli was reduced with dithiothreitol and alkylated with iodoacetamide. The protein mixture was separated by capillary zone electrophoresis and analyzed by an LTQ-Orbitrap mass spectrometer (Thermo Fisher Scientific). Technical duplicates were generated for testing proteoform quantification in two runs of the same sample.
Download the data set, save it in the folder C:\toppic_tutorial\tutorial_3\, and unzip it. It includes the following files.
C:\toppic_tutorial\tutorial_3\uniprot-ecoli.fasta
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_1_ms1.feature
C:\toppic_tutorial\tutorial_3\ecoli_2_ms1.feature
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.feature
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.feature
C:\toppic_tutorial\tutorial_3\ecoli_1_feature.xml
C:\toppic_tutorial\tutorial_3\ecoli_2_feature.xml
C:\toppic_tutorial\tutorial_3\ecoli_1_html
C:\toppic_tutorial\tutorial_3\ecoli_2_html
C:\toppic_tutorial\tutorial_3\ecoli_1_file
C:\toppic_tutorial\tutorial_3\ecoli_2_file
To speed up database search, follow the steps in Section 4.2.1 to generate index files for the database file uniprot-ecoli.fasta. If index files have been generated, it is not necessary to regenerate index files.
We use toppic_gui to search the MS/MS spectra in ecoli_1_ms2.msalign and ecoli_2_ms2.msalign against the protein database uniprot-ecoli.fasta to identify PrSMs.
toppic_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_3\uniprot-ecoli.fasta as the protein database file.
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.msalign and C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.msalign as mass spectrum data files.
Carbamidomethylation on cysteine as the fixed modification.
Decoy database.
FDR as the spectrum level cutoff type.
FDR as the proteoform level cutoff type.
The screenshots of toppic_gui are shown below.
For each input msalign file, TopPIC reports four TSV files, two XML files, and collections of HTML files for identified proteoforms. The output files for ecoli_1_ms2.msalign and ecoli_2_ms2.msalign are
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_prsm.tsv
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_prsm.tsv
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_prsm_single.tsv
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_prsm_single.tsv
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_proteoform.tsv
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_proteoform.tsv
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_proteoform_single.tsv
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_proteoform_single.tsv
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_proteoform.xml
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_proteoform.xml
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_prsm.xml
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_prsm.xml
C:\toppic_tutorial\tutorial_3\ecoli_1_html\toppic_prsm_cutoff
C:\toppic_tutorial\tutorial_3\ecoli_2_html\toppic_prsm_cutoff
C:\toppic_tutorial\tutorial_3\ecoli_1_html\toppic_proteoform_cutoff
C:\toppic_tutorial\tutorial_3\ecoli_2_html\toppic_proteoform_cutoff
C:\toppic_tutorial\tutorial_3\ecoli_1_html\topmsv
C:\toppic_tutorial\tutorial_3\ecoli_2_html\topmsv
The output files can be downloaded here.
topdiff_gui.exe in the folder C:\toppic_tutorial\toppic.
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.msalign and C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.msalign as mass spectrum data files.
The screenshots of topdiff_gui are shown below.
TopDiff reports one TSV file for identified proteoforms with their abundances in the input mass spectrum data files.
C:\toppic_tutorial\tutorial_3\sample_diff.tsv
The output file can be downloaded here.
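Once the abundances of a proteoform in the two runs are known, a common way to compare them is the log2 fold change. The sketch below shows only that calculation. It does not read sample_diff.tsv, whose column layout is not described here, and the function name log2_fold_change is hypothetical.

```python
import math


def log2_fold_change(abundance_a, abundance_b):
    """Return log2(abundance_b / abundance_a), or None if either value is not positive.

    A value near 0 means similar abundance in both runs, which is what one
    would expect for most proteoforms in technical duplicates.
    """
    if abundance_a <= 0 or abundance_b <= 0:
        return None
    return math.log2(abundance_b / abundance_a)
```

A proteoform observed in only one run has no defined ratio, which is why the helper returns None instead of a number in that case.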
C:\toppic_tutorial\toppic\toppic.exe
C:\toppic_tutorial\tutorial_3\uniprot-ecoli.fasta
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_1_ms1.feature
C:\toppic_tutorial\tutorial_3\ecoli_2_ms1.feature
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.feature
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.feature
cd C:\toppic_tutorial\tutorial_3
..\toppic\topindex -f C57 -d uniprot-ecoli.fasta
..\toppic\toppic -f C57 -d -t FDR -T FDR uniprot-ecoli.fasta ecoli_*_ms2.msalign
C:\toppic_tutorial\toppic\topdiff.exe
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2.msalign
C:\toppic_tutorial\tutorial_3\ecoli_1_ms2_toppic_proteoform.xml
C:\toppic_tutorial\tutorial_3\ecoli_2_ms2_toppic_proteoform.xml
cd C:\toppic_tutorial\tutorial_3
..\toppic\topdiff ecoli_1_ms2.msalign ecoli_2_ms2.msalign