
\documentclass[10pt,a4paper]{article}
\usepackage[utf8x]{inputenc}
\usepackage{ucs}
\usepackage{amsmath}
\usepackage{amsfonts}
\usepackage{amssymb}
\usepackage[colorlinks=true,urlcolor=blue,linkcolor=black]{hyperref}
\usepackage{graphicx}
\usepackage{fancyhdr}
\usepackage{geometry}

\newcommand{\xtp}{\textbf{X!TandemPipeline}}
\newcommand{\xt}{\textbf{X!Tandem}}

%\usepackage{enumitem}
%\setdescription{labelsep=\textwidth}

\author{Olivier Langella and Benoit Valot\\
\texttt{langella@moulon.inra.fr; valot@moulon.inra.fr}\\
PAPPSO - \url{http://pappso.inra.fr/}\\
\includegraphics[width=1cm]{images/pappso}
}
\title{$\xtp$\\Automated analyses, filtering and export of X!Tandem MS/MS results}
\date{15 November 2013}

%Modification des entetes et pied de page + marges
\geometry{top=3cm, bottom=3cm, left=2cm, right=2cm}
%\pagestyle{headings}
\pagestyle{fancy}
%\fancyhead{}
\fancyfoot{}
\rfoot{\thepage}
\lfoot{\includegraphics[width=1cm]{images/pappso}}


\begin{document}
\maketitle


\begin{abstract}
\href{http://www.thegpm.org/tandem/index.html}{X!Tandem} is open-source software that identifies peptides and proteins from MS/MS mass spectra. X!Tandem is fast and accurate, but the Global Proteome Machine (\href{http://www.thegpm.org/}{GPM}) is relatively limited when it comes to processing identification results.
$\xtp$ is an alternative to installing the GPM on local servers.

\paragraph*{}
$\xtp$ performs database searching and matching on a list of MS/MS runs in one shot, using a list of user-selected parameters and databases.

\paragraph*{}
$\xtp$ also filters data according to statistical values at the peptide and protein levels. Moreover, redundancy in protein databases is fully filtered as follows:
\begin{itemize}
\item proteins identified without specific peptides compared to others are eliminated;
\item proteins identified with the same pool of peptides are assembled;
\item proteins are grouped by function (identified with at least one common peptide), and the specific peptides for each sub-group of proteins are indicated.
\end{itemize}

\paragraph*{}
$\xtp$ lets you view and edit the filtered results, compute the false discovery rate, and so on. The results can be exported to TSV (Tab-Separated Values) files or directly to a spreadsheet format, ODS (OpenDocument Spreadsheet).
\end{abstract}

\tableofcontents

\pagebreak 

\section{Installation}

\subsection{License}
\paragraph*{}
Copyright (C) 2010  Olivier Langella and Benoit Valot\\
$\xtp$ program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.\\
This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the \href{http://www.gnu.org/licenses/gpl.html}{GNU General Public License} for more details.

\subsection{Requirements}
$\xtp$ works on all platforms (Linux, Windows and Mac). Java 1.6 must be installed (it can be downloaded
\href{http://java.com/fr/download/index.jsp}{here}).

\subsection{Third-party software for Windows and Mac}
Download and install the \xt\ executable from the   
\href{http://www.thegpm.org/tandem/}{\xt\ site}.

\subsection{Third-party software for Linux}
\subsubsection*{Debian or Ubuntu}
\begin{itemize}
\item Follow instructions on how to install the PAPPSO Debian repository \\
\href{http://pappso.inra.fr/bioinfo/install_ppa_debian.php}{http://pappso.inra.fr/bioinfo/install\_ppa\_debian.php}.
\item Install the \textit{tandem-mass} package.
\item You can also install the \textit{xtandempipeline} package to run \xtp\
instead of using the jnlp link.
\end{itemize}
\subsubsection*{Other distributions}
\begin{itemize}
\item Please visit the \href{http://www.thegpm.org/tandem/}{\xt\ site}, and
follow the instructions for obtaining and compiling the source code.
\end{itemize}


\subsection{Start X!Tandem pipeline}
\paragraph*{}
To run \xtp:
\begin{itemize}
\item Open X!Tandem pipeline using this \href{http://pappso.inra.fr/bioinfo/xtandempipeline/xtandempipeline.jnlp}{link}.
\item Wait for the program to start.
\item The main window appears (Fig~\ref{principal}).
\end{itemize}

\subsection{Configuration}
At the first start, the application opens the configuration path window:
\begin{itemize}
\item Open the menu \textit{Option $\rightarrow$ Configuration Path} (Fig~\ref{configuration}).
\item Define the path to the X!Tandem executable.
\item Choose the folder in which to store the X!Tandem parameters (or use the default one).
\item Choose the folder where the MS/MS data, the protein databases and the X!Tandem results are stored.
\end{itemize}

\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/tandem_configuration}
\caption{Configuration window}
\label{configuration}
\end{figure}

\pagebreak 

\section{X!Tandem analysis}
\paragraph*{}
$\xtp$ allows you to analyze peak-list files by searching a list of protein
databases using the X!Tandem software.
Three successive dialog boxes help you select first the mzXML files or other peak lists, then the protein databases, and finally the folder where the results will be stored. The databases must be protein databases; X!Tandem does not work on DNA databases.

\subsection{Parameters}
\label{parameter}
\paragraph*{}
To perform database searching, you must create or edit a model XML file (stored in the xtandem models folder). Open the menu \textit{Option $\rightarrow$ X!Tandem preset} (Fig~\ref{xtandem_parameter}).

\begin{figure}[!ht]
\center \includegraphics[scale=0.4]{images/tandem_parameter}
\caption{X!Tandem preset window}
\label{xtandem_parameter}
\end{figure}

\paragraph*{}
To make full use of your computer, specify the number of CPU threads in the model: \textit{spectrum $\rightarrow$ threads}.
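
\paragraph*{}
As a hedged illustration, this setting corresponds to a single \texttt{note} element in an X!Tandem XML parameter file. The value \texttt{4} below is only an example for a hypothetical four-core machine; how the preset editor lays out the rest of the model file is not shown here.
\begin{verbatim}
<note type="input" label="spectrum, threads">4</note>
\end{verbatim}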

\subsection{Running analysis}
\paragraph*{}
To run an analysis, open the menu \textit{File $\rightarrow$ X!Tandem $\rightarrow$ Analysis}. In the window (Fig~\ref{xtandem_analysis}), select:
\begin{enumerate}
\item the peak-list files to be analyzed (see~\ref{peak});
\item the database files to be searched (see~\ref{database});
\item the search parameter model (see~\ref{parameter});
\item the folder in which to store the result files.
\end{enumerate}

\subsection{Peak-lists}
\label{peak}
\paragraph*{}
X!Tandem works with open peak-list formats such as mzXML, mgf, mzData, mzML or pkl.

\subsection{Databases}
\label{database}
\paragraph*{}
The X!Tandem software uses only protein databases in fasta format. It does not work with EST\footnote{Expressed Sequence Tag} sequences. You can convert your database using our application \textit{Protein database manager}, available \href{http://pappso.inra.fr/bioinformatique.html}{here}, or you can run it directly \href{http://pappso.inra.fr/documents/bioinformatique/database_manager.jnlp}{here}.

\begin{figure}[!ht]
\center \includegraphics[scale=0.4]{images/tandem_analysis}
\caption{X!Tandem analysis window}
\label{xtandem_analysis}
\end{figure}

\pagebreak

\section{Processing the results}
\paragraph*{Warning:}
To process results, $\xtp$ needs X!Tandem result files (.xml) or Mascot
result files (.dat). The file names are used as \textbf{sample names}.

\subsection{Three modes of analysis}
\label{mode}
\paragraph*{}
You can filter the MS/MS identification results and export them in three different modes (menu \textit{File $\rightarrow$ Load Result}):
\begin{description}
\item[Individual mode] \hfill  \\ Each MS/MS result file is processed individually.\\
No comparison between results is possible in this mode.
\item[Combined mode] \hfill  \\ The MS/MS result files are combined into one result file, which is then filtered and exported.\\
This mode is useful to compare different results.
\item[Phosphopeptide mode] \hfill  \\ Same as the combined mode, except that only phosphopeptides are retained and the results are organized to validate phosphosites.
\end{description} 

\paragraph*{}
In all modes, you have to define the filter parameters.

\subsection{Filter parameters}
\label{filters}
The filter window (Fig~\ref{filter}) defines the parameters of the automated filtering process:
\small
\begin{description}
\item[Add files] \hfill \\At this stage, you can add other MS/MS result files to the analysis. If two files have the same name, they are combined into one result file. This is useful if you want to combine X!Tandem and/or Mascot results of the same LC-MS/MS run obtained with different modification parameters or protein databases.
\item[Peptide E-value] \hfill \\Defines the E-value below which a peptide is considered valid.
\item[Peptide number] \hfill \\Defines the number of valid unique\footnote{Unique peptides are defined as peptides with different sequences. Peptides that differ only in their modifications are not counted separately.} peptides necessary to validate a protein.
\item[Protein E-value] \hfill \\Defines the E-value below which a protein is considered valid.
\begin{itemize}
\item The protein E-value is the product of its valid unique peptide E-values and it is different from the protein E-values determined by X!Tandem. 
\item The values are expressed in log(E-value).
\end{itemize}
\item[Sum to all] \hfill \\Defines how the protein filter is applied when MS/MS results are combined:
\begin{description}
\item[No] To validate a protein, the two parameters (peptide number and protein E-value) must be satisfied in at least one result.
This is useful if you want to compare SDS-PAGE-LC-MS/MS results, where the peptides from a protein are in the same LC-MS/MS run.
\item[Yes] To validate a protein, the two parameters (peptide number and protein E-value) must be satisfied in the sum of all results.
This is useful if you want to compare 2DLC-MS/MS results, where the peptides from a protein are split across different LC-MS/MS runs.
\end{description}
\item[Contaminants] \hfill \\When you perform an analysis using several fasta databases, you can remove the results from one database by selecting it.
This is useful because it lets you always include the same contaminant proteins in the database search while removing them from the results.
\end{description}
\normalsize
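\paragraph*{}
As a worked illustration of the protein E-value described above (the symbols $E_{\mathrm{prot}}$, $E_i$ and $n$ are notation introduced here, and a base-10 logarithm is assumed), a protein validated by $n$ unique peptides with E-values $E_1, \dots, E_n$ has
\[
E_{\mathrm{prot}} = \prod_{i=1}^{n} E_i ,
\qquad
\log_{10} E_{\mathrm{prot}} = \sum_{i=1}^{n} \log_{10} E_i .
\]
For example, two valid unique peptides with hypothetical E-values $10^{-2}$ and $10^{-3}$ give $E_{\mathrm{prot}} = 10^{-5}$, that is $\log_{10} E_{\mathrm{prot}} = -5$.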
\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/tandem_filter}
\caption{Filter window}
\label{filter}
\end{figure}

\pagebreak 


\section{View and edit the results}
\label{viewing}

After loading the results, select the result to view in the main window (see~\ref{main_window}). You can then browse this result in the four windows listed in the \textit{Windows} menu.

\subsection{Main window}
\label{main_window}

\begin{itemize}
\item First frame ``Identification Results'': lets you choose the result to edit and displays the current number of samples and groups.
\item False Discovery Rate: estimates an FDR using a reverse/decoy database (see Fig~\ref{fdr2}).
\item Mass precision: computes the standard deviation of the difference between the theoretical and observed peptide masses (see~\ref{standard_deviation}).
\item Filter identification results: lets you choose the criteria used to validate identifications, as described in~\ref{filters}.
\end{itemize}


\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/tandem_principal}
\caption{Main window}
\label{principal}
\end{figure}


\subsection{Proteins List}
View the list of proteins identified in the result. For details on the columns, see Fig~\ref{prot}.
\begin{itemize}
\item Filter the proteins by description;
\item Click on a protein to view the corresponding peptides list
(see~\ref{peptide_list}) and protein details (see~\ref{protein_details});
\item The checkbox on each protein line lets you validate or invalidate the
corresponding peptides;
\item Click \textbf{Apply modification} to apply your edits.
\end{itemize}

\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/window_protein}
\caption{Proteins List}
\end{figure}

\subsection{Protein Details}
\label{protein_details}
View the sequence and coverage of an identified protein. To view this window, open it from the menu \textit{Windows $\rightarrow$ Protein details}.

\begin{figure}[!ht]
\center \includegraphics[scale=0.7]{images/window_protein_detail}
\caption{Protein details}
\end{figure}

\subsection{Peptides List}
\label{peptide_list}
View the peptides identifying a protein. For details on the columns, see
Fig~\ref{pep}.

\begin{itemize}
\item Filter the peptides using various options;
\item Click on a peptide to view the corresponding MS/MS spectrum
(see~\ref{peptide_detail});
\item Uncheck a peptide to invalidate it.
\end{itemize}

\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/window_peptide}
\caption{Peptides List}
\end{figure}

\subsection{Peptides Details}
\label{peptide_detail}
View the MS/MS spectrum of an identified peptide.

\begin{itemize}
\item Click on the spectrum to zoom.
\item Save the annotated MS/MS spectrum as PNG or SVG.
\end{itemize}

\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/window_peptide_detail}
\caption{Peptides Details}
\end{figure}

\section{Save and Load X!Tandem Pipeline project}
\paragraph{}
You can save all the current results using the menu \textit{File $\rightarrow$ Save Project}, or load a previous project using the menu \textit{File $\rightarrow$ Load Project}. The created files have the extension \textit{*.xpip}.

\pagebreak

\section{Exporting the results}
You can export the results in different formats from the menu \textit{File $\rightarrow$ Export}.

\subsection{Export parameters}
\label{exporting}
The export window (Fig~\ref{export}) shows the available export types:
\small
\begin{description}
\item[Default] \hfill \\Creates TSV files containing identification results for proteins (*protein.txt) and peptides (*peptide.txt). When you perform a combined analysis, a *compar.txt file is created that contains the results of comparison between samples.
\item[Fasta] \hfill \\Creates a fasta file for valid proteins.
\item[PepNovo] \hfill \\Creates an XML file listing the peptide results to be removed before an automated \textit{de novo}
sequence interpretation with our
\href{http://pappso.inra.fr/bioinfo/denovopipeline}{DeNovo pipeline}.
\item[FDR] \hfill \\Creates two tab-separated files containing the number of valid peptides or valid proteins at different
E-values in each database. This lets you determine the E-value threshold below which the FDR is acceptable.
\item[Protic] \hfill \\Creates a PROTICdb-compatible XML file, so that you can store
the results in the \href{http://pappso.inra.fr/bioinfo/proticdb}{PROTICdb} proteomics
database.
\item[MassChroQ] \hfill \\Creates a MassChroQ-compatible XML file, so that you can
perform quantitative analysis using our \textbf{MassChroQ} software.
\end{description}
\normalsize
\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/tandem_export}
\caption{Export window}
\label{export}
\end{figure}

\subsection{Files *protein.txt}
The identified proteins are listed by sample (individual mode) or for all
samples (combined and phosphopeptide modes) (Fig~\ref{prot}). Proteins are generally
grouped by function.
\small
\begin{description}
\item[Group] Group to which the protein belongs. All the proteins in a group have at least one peptide in common.
\item[Sub-group] Sub-group to which the protein belongs. All the proteins in a sub-group are identified with the same valid peptides.
\item[Description] Protein description as it appears in the header of the fasta file.
\item[log(E value)] Protein E-value expressed in log.
\begin{itemize}
\item Statistical value representing the number of times this protein would be identified randomly.
\item Calculated as the product of unique peptide E-values in the sample.
\end{itemize}
\item[Coverage] \% of protein coverage.
\item[MW] Molecular weight of the protein expressed in kDa.
\item[Spectra] Total number of MS/MS spectra identified for the protein.
\item[Specifics] Number of MS/MS spectra that are specific to the protein, compared to the other proteins of the same group (individual and phosphopeptide modes, see~\ref{mode}).
\item[Specific uniques] Number of unique peptide sequences specific to the protein, compared to the other proteins of the same group (combined mode, see~\ref{mode}).
\item[Uniques] Number of unique peptide sequences identified for the protein.
\item[PAI] Protein Abundance Index\label{pai} :
\begin{itemize}
\item PAI estimates the relative abundance of the protein.
\item PAI is calculated as the number of identified spectra divided by the number of theoretical peptides\footnote{Theoretical peptides correspond to the peptides resulting from the theoretical digestion of the protein sequence by trypsin and that are visible in mass spectrometry ($800<MH<2500$)} of the protein.
\end{itemize}
\item[Redundancy] Number of proteins identified with the same pool of spectra. When there is redundancy, the parameters described above are shown only for the first protein of the sub-group (arbitrarily chosen). Only the description of the other members of the sub-group is shown.
\item[Position] Position(s) of the phosphosite in the protein. This value is only reported in phosphopeptide mode (see~\ref{mode}).
\end{description}
\normalsize
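\paragraph*{}
As a worked illustration of the PAI definition above (the symbols $N_{\mathrm{spectra}}$ and $N_{\mathrm{theo}}$ are notation introduced here, for the number of identified spectra and the number of theoretical peptides of the protein):
\[
\mathrm{PAI} = \frac{N_{\mathrm{spectra}}}{N_{\mathrm{theo}}} .
\]
For example, a hypothetical protein identified by 12 spectra and yielding 8 theoretical peptides has $\mathrm{PAI} = 12/8 = 1.5$.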

\begin{figure}[!ht]
\center \includegraphics[width=1.0\textwidth]{images/tandem_prot}
\caption{Protein results}
\label{prot}
\end{figure}

\subsection{Files *peptide.txt}
Identified peptides are listed by group (Fig~\ref{pep}). One line corresponds to
one MS/MS spectrum identifying one peptide that can be present in one or more proteins.
\small
\begin{description}
\item[Group] Group of the proteins containing this peptide.
\item[Description] Protein description if the peptide is specific to this protein.
\item[Sample] Name of the MS/MS run file.
\item[Scan] Scan number of the MS/MS run analysis.
\item[Rt] Retention time of the peptide.
\item[Sequence] Sequence of the peptide.
\item[Modifs] Modifications on the peptide.\footnote{For example, M2:+15.99 means that the mass of the second amino acid, which is a methionine, is increased by 15.99~Da. This mass increase indicates that the methionine is oxidized.}
\item[Valid] Indicates whether the peptide was validated by the filter parameters or not.
\item[Used] Number of protein sub-groups in which the peptide is present.
\item[on a total of] Total number of protein sub-groups in the group.\\
\textit{Note:} if the peptide is specific, only `-' is shown.
\item[Sub-groups] Protein sub-groups where the peptide is present.
\item[E-value] Peptide E-value.
\begin{itemize}
\item Statistical value representing the number of times this peptide would be identified randomly.
\item Calculated by X!Tandem with an empirical model.
\end{itemize}
\item[Charge] Charge level of the precursor.
\item[MH+ Obs] Observed monoisotopic mass of the peptide plus one proton (MH$^{+}$).
\item[MH+ Theo] Calculated monoisotopic mass of the peptide plus one proton (MH$^{+}$).
\item[DeltaMH+] Error in the precursor mass between observed and theoretical data (Da).
\item[Delta-ppm] Error in the precursor mass between observed and theoretical data (ppm).
\item[Position] Position(s) of the phosphosite in the protein. This value is only reported in phosphopeptide mode (see~\ref{mode}).
\end{description}
\normalsize
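\paragraph*{}
As a hedged illustration of the two mass-error columns, the usual definitions are given below. The sign convention (observed minus theoretical) and the normalization by the theoretical mass are assumptions; the exact convention used by the software is not stated here.
\[
\Delta_{\mathrm{Da}} = \mathrm{MH}^{+}_{\mathrm{obs}} - \mathrm{MH}^{+}_{\mathrm{theo}} ,
\qquad
\Delta_{\mathrm{ppm}} = 10^{6} \times \frac{\mathrm{MH}^{+}_{\mathrm{obs}} - \mathrm{MH}^{+}_{\mathrm{theo}}}{\mathrm{MH}^{+}_{\mathrm{theo}}} .
\]
For example, with hypothetical values $\mathrm{MH}^{+}_{\mathrm{obs}} = 1500.7530$ and $\mathrm{MH}^{+}_{\mathrm{theo}} = 1500.7500$, the error is $0.0030$~Da, or about $2.0$~ppm.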

\begin{figure}[!ht]
\center \includegraphics[width=1.0\textwidth]{images/tandem_peptide}
\caption{Peptide results}
\label{pep}
\end{figure}

\subsection{Files *compar.txt}
All identified proteins are represented in a list: one protein per row, and one sample per column (Fig~\ref{compar}).
The list of proteins is repeated 4 times, corresponding to the 4 parameters that are used to compare samples (see Type for details).
\small
\begin{description}
\item[Group] Protein group. Groups roughly correspond to the different functions.
\item[Sub-group] Protein sub-group. All the proteins of a sub-group are identified with the same valid peptides.
\item[Description] Protein description extracted from the fasta file.
\item[MW] Molecular weight of the protein (kDa).
\item[log(E value)] Log of the protein E-value.
\begin{itemize}
\item Statistical value representing the number of times this protein would be identified randomly.
\item Calculated as the product of unique peptide E-values across all samples.
\end{itemize}
\item[Type] The item that is compared between samples.
\begin{description}
\item[Spectra] Number of MS/MS spectra identified for the protein.
\item[Specifics] Number of specific MS/MS spectra identified for the protein compared to the other proteins belonging to the same group.
\item[Uniques] Number of unique peptide sequences identified for this protein.
\item[PAI] Protein Abundance Index (see~\ref{pai}).
\end{description}
\item[Position] Position(s) of the phosphosite in the protein. This value is only reported in phosphopeptide mode (see~\ref{mode}).
\end{description}
\normalsize

\begin{figure}[!ht]
\center \includegraphics[width=1.0\textwidth]{images/tandem_comparaison}
\caption{Comparison results}
\label{compar}
\end{figure}

\subsection{Files *fdr.txt}
This result file gives the number of peptides with an E-value lower than the E-value in the first column (Fig~\ref{fdr2}). To obtain the false discovery rate at each E-value level, divide the number of peptides in the reverse or decoy database by the number of peptides in the normal database.\\
This method can be used if:
\begin{itemize}
\item the normal and reverse databases are saved in different fasta files; or
\item the X!Tandem analysis was performed with the reverse option.\\
In this case, the columns corresponding to the normal and reverse searches are labelled \textit{xtandem normal} and \textit{xtandem reverse}, respectively.
\end{itemize}
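
\paragraph*{}
The division described above can be written as follows, where $N_{\mathrm{reverse}}(E)$ and $N_{\mathrm{normal}}(E)$ (notation introduced here) are the numbers of peptides with an E-value lower than $E$ in the reverse (or decoy) and normal databases:
\[
\mathrm{FDR}(E) = \frac{N_{\mathrm{reverse}}(E)}{N_{\mathrm{normal}}(E)} .
\]
For example, with hypothetical counts of 5 reverse peptides and 1000 normal peptides at a given E-value level, $\mathrm{FDR} = 5/1000 = 0.5\%$.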

\begin{figure}[!ht]
\center \includegraphics[scale=0.5]{images/tandem_fdr}
\caption{FDR results}
\label{fdr2}
\end{figure}

\pagebreak

\section{Changelog}
\subsection{``Myosine'' branch}
\begin{description}
  \item[3.3.1] Better performance. Problems with the per-sample PAI computation
  are fixed. Updated documentation.
  \item[3.3.0] Sub-group grouping has been changed for better performance and to fix over-grouping on large datasets (thanks to M. Blein).\\
 If you have a very large dataset, we recommend reloading the X!Tandem results to fix errors.
\end{description}

\subsection{``Tubuline'' branch}
\begin{description}
 \item[3.2.2] Corrected the report of input parameters in the X!Tandem output results (thanks to T. Greko).
 \item[3.2.1] Added new X!Tandem parameters for multiple modification searches in one analysis. Calculations can now be performed for $z > 3$.
 \item[3.2.0] Identifications from Mascot dat files can now be imported and filtered. Everything works as for X!Tandem results, except that protein sequences cannot be retrieved: PAI and coverage are absent.\\
  Corrected the FDR calculation from Reverse/Decoy searches.
\end{description}

\subsection{``Kératine'' branch}
\begin{description}
  \item[3.1.5] Added support for phosphorylation neutral loss and enhanced ETD detection on MS2 spectra.\\
  Corrected the MassChroQ export.
  \item[3.1.4] Added support for viewing ETD spectra after automatic detection.
  \item[3.1.3] Fixed a bug in the X!Tandem preset: the refine analysis was never started even though the refine parameter was set to yes.\\
  Added a new annotated spectrum renderer and fixed a bug in the ODS export.
  \item[3.1.2] Added export of results to Open Document Spreadsheet (.ods) files.\\
  Fixed bugs (grouping, PepNovo export, ...).
  \item[3.1.1] FDR computation is now compatible with the reverse option of X!Tandem.
  \item[3.1.0] The grouping algorithm has been completely rewritten:
  \begin{itemize}
  \item Older projects must be refiltered to be properly grouped.
  \item Phosphopeptide filtering has been enhanced so that:
  \begin{itemize}
    \item SubGroup represents the number of phosphosites;
    \item Group represents the number of phosphoproteins.
  \end{itemize}
   \item The configuration file has been modified and must be set up again.
\end{itemize}
\end{description}

\end{document}