Commit 4a87a00f authored by Edlira Nano

manual added post-matching

git-svn-id: https://subversion.renater.fr/masschroq/trunk@2258 e4b6dbb4-9209-464b-83f7-6257456c460c
parent 57e1b627
......@@ -158,7 +158,7 @@
\vskip 4cm
\textbf{\Huge MassChroQ manual}\\[0.5cm]
\Large First edition for MassChroQ 1.0 \emph{Hungry Crocklet}\\[1cm]
\Large First edition for MassChroQ version 1.2 \emph{Hungry Crocklet}\\[1cm]
\small
Author: Edlira \textsc{Nano}\\
Contributors: Olivier \textsc{Langella}, Beno\^it \textsc{Valot},
......@@ -273,28 +273,41 @@ format data.
\ei
{\M} is developed in the C\verb!++! language using the Qt
framework. Version $1.0$ is its first public release.
framework. Version $1.2$ is its latest public release.
{\M} comes as a stand-alone command-line program and
also with a library for integration in other software or proteomic
pipelines.
On the {\M} homepage (\href{\sitemasschroq}{\sitemasschroq})
you can find download instructions, various documentation files and
you can find download and install instructions, various documentation files and
the latest news about this project.
On the {\M} development page hosted by SourceSup at \linebreak
\href{http://sourcesup.cru.fr/projects/masschroq/}{http://sourcesup.cru.fr/projects/masschroq/}, you will
find a subversion repository, a bug tracker and various forums.
find a subversion repository, a bug tracker and forums.
The source code is anonymously available via direct access to the
subversion repository from
\href{https://subversion.cru.fr/masschroq/}{https://subversion.cru.fr/masschroq/}.
\href{https://subversion.cru.fr/masschroq/}{https://subversion.cru.fr/masschroq/}.
Feel free to contribute to the {\M} project by directly contacting one
of its authors.
\section{What is new in {\Mv}}
The main novelty in {\Mv} is the introduction of the peptide peak post-matching feature.
Indeed, before version $1.2$, {\M} performed peak
matching during peptide quantification peak by peak. The post-matching mode
adds a peak matching step at the end of the quantification in each
group : for each peptide to be quantified, its previously unmatched
peaks are rematched by taking into account the
retention times of the previously matched peaks of this peptide in the
group. This gives a finer retention time computation for each
peptide, and allows matching of previously missed peaks in some cases.
For more details on the post-matching mode and when to use it, see section
\ref{peak_match}.
\section{{\M} features overview}\label{groups-sec}
......@@ -321,36 +334,12 @@ To achieve this {\M} can combine and perform the following features :
\item Detection of peaks on these XICs.
\item Quantification of the predefined items of interest (two
different quantification methods are proposed).
\item Peak matching during and after quantification.
\item Grouping of LC-MS data that present similarities (for
example grouping of the same LC fractions in an SCX fractionated analysis
in order to perform alignment on them).
\end{itemize}
{\M} uses the notion of \emph{groups} of LC-MS data according to their
technical similarities. Grouping affects alignments: all runs from the
same group will be aligned with the same method;
and quantification: peak detection and quantification will be
performed in all runs of the same group for peptides that were
identified in at least one run of this group.
Groups also give the user the
possibility to perform specialized analysis on several different sets of
data in one shot.
For example, in a peptide SCX separation experiment,
only the samples of the same LC fraction should be aligned to each
other. In that case, we form a group of these samples in
MassChroQ and we assign the desired alignment method to it.
We can do the same with quantification methods: suppose we have a set of runs
obtained with an HR Orbitrap spectrometer (which is known for producing
artifact signal spikes) and another set of runs obtained with
an LR LTQ spectrometer (which produces a certain baseline noise but no
spikes). We can group the Orbitrap runs together and put the LTQ ones in another
group. We then apply a quantification method containing an anti-spike
XIC filter to the first group, and another quantification method containing
a background filter to the latter and tell masschroq to perform
analysis in both groups in one shot.
{\M} accepts mzXML as well as mzML LC-MS data formats.
To include the identified peptides/isotopes in a {\M} analysis there
......@@ -383,7 +372,7 @@ On the {\M} homepage (\href{\sitemasschroq}{\sitemasschroq}) you can find :
\item masschroq's \href{http://pappso.inra.fr/downloads/masschroq/masschroq.xsd}{schema};
\item Dataset examples of masschroqML input files to {\M} for various
ordinary situations (fractionated sample, isotopic labeled ones, etc);
\item this user manual frequently updated;
\item this user manual, frequently updated;
\item the latest news and the upcoming features on this project;
\item BibTeX and text entries for MassChroQ citation.
\ei
......@@ -394,9 +383,11 @@ On the {\M} project page hosted on
\item a \href{https://subversion.cru.fr/masschroq}{subversion repository};
\item a bug tracker;
\item several user and developer forums.
The source code of {\M} contains C++ \emph{Doxygen} documentation, which you can generate and use for development needs.
\ei
The source code of {\M} contains C++ \emph{Doxygen} documentation,
which you can generate and use for development needs.
\chapter{Installing and running {\M}}\label{running-sec}
\section{Installation}
......@@ -706,20 +697,76 @@ know and we will try to implement it sooner than scheduled.
\section{Grouping of LC-MS runs}
In {\M} the user defines groups of LC-MS runs. As explained in section
\ref{groups-sec}, the user is supposed to group the runs presenting
technical similarities (for example a group of samples of
the same fraction, or a group of samples obtained from an LTQ low
resolution spectrometer, etc.).
The user can define several different
groups in the same analysis. He can then define different alignment
methods and different quantification methods for each of these
groups. This allows him to run specialized analysis on several
different sets of samples in one shot.
Groups not only offer flexibility, they also enable some
extra possibilities that {\M} implements:
{\M} uses the notion of \emph{groups} of LC-MS data according to their
technical similarities : for instance, in case of fractionation, samples of
the same fraction will be grouped together in order to be aligned together,
or a group of samples obtained from an LTQ low resolution spectrometer
can be grouped together so that the appropriate XIC extraction range
and XIC filtering can be applied to them, etc. It is up to the user to
define the different groups of LC-MS runs in their analysis.
Here are the main grouping possibilities that can be performed in
MassChroQ, according to the analysis needs.
\subsection*{Fractions grouping}
If peptide or protein pre-fractionation has been performed on your
samples, samples of the same fractions should be grouped together. For
example, in a peptide SCX fractionation experiment, suppose you have 3
samples $A$, $B$ and $C$, each fractionated into 10 fractions: $A_1$,
$A_2$, \ldots, $A_{10}$, $B_1$, \ldots, $B_{10}$, $C_1$, \ldots, $C_{10}$. You should
define 10 groups, the first one containing the samples $A_1$, $B_1$
and $C_1$, the second one containing samples $A_2$, $B_2$ and $C_2$
and so on. This way, only runs of the same fraction will be aligned to each other.
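
As a compact restatement of the example above (the symbol $G_i$ is
introduced here for illustration only and is not part of the {\M}
syntax), the ten groups can be written as
\[
  G_i = \{A_i,\ B_i,\ C_i\}, \qquad i = 1, \ldots, 10,
\]
so that alignment is only ever performed among the three runs of a
given $G_i$.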
\subsection*{Alignment grouping}
All runs of the same group will be aligned together with the same
alignment method. Thus, only samples presenting technical similarities should be
grouped and aligned together. In the fractionation example above, only
samples of the same fraction can be aligned with one another; it makes no
sense to align the runs of fraction $2$ with those of fraction $3$ or $1$.
Another use of alignment grouping is when you have two
experiments acquired differently, for instance some samples acquired
on a low resolution spectrometer and others acquired on a
high resolution one. The alignment parameters can be adjusted to each
of these experiments: for example, the high resolution samples have
many more MS level 2 acquisition points, so an MS2 alignment method
with larger smoothing window parameters than for the low resolution
experiment can be more appropriate and give better aligned retention
times.
You can run an analysis with {\M} on both experiments in one shot by
simply defining two different groups and two alignment methods, one
for each group.
\subsection*{Quantification grouping}
XIC extraction, XIC filtering, peak detection and quantification will be
performed in all the runs of the same group with the same
quantification method.
So, in the example of the high and low resolution experiments above,
the XIC extraction range parameter, which depends on the
spectrometer's range, should be different for each experiment. Also,
some low resolution spectrometers generate significant baseline
noise, whereas some high resolution ones generate spikes. Thus the XIC
filtering should be different for each experiment. By grouping the low
resolution samples together and the high resolution samples in a
different group, and by defining two different quantification methods, one
for each group of samples, we can perform specialized analysis on our
two groups in one shot. The results are then easier to compare and
to analyse statistically.
\subsection*{Extra features of the grouping}
Groups give the user the
possibility to perform specialized analysis on several different sets of
data in one shot.
But technically speaking, they also allow the following extra features in
{\M}:
\begin{description}
\item[Efficient XIC extraction:] XICs for a given identified
peptide will only be extracted in groups where the MS/MS allowed its
......@@ -736,7 +783,6 @@ run, associating to each identified peptide (or other chosen entity)
its quantitative value in every group and in every run of the
analysis. They allow easy statistical analysis without ambiguity.
\section{XIC extraction}
The underlying operating items in {\M} are the XICs. Whatever
......@@ -1146,48 +1192,156 @@ alignment files.
\section{Peak matching}\label{peak_match}
After alignment, xic extraction and peak detection, {\M} performs peak
matching: the detected peaks are assigned to the peptides or other
entities being quantified. Peak matching in {\M} is based on retention
times, it is performed as follows: the
quantitative value of a peak (i.e. the peak area)
is assigned to a peptide if and only if the RT of this peptide
is within the boundaries of this peak.
After alignment, for each peptide being quantified {\M} performs XIC
extraction and peak detection. Remember that an XIC extracted
for a given peptide is the intensity curve of the peptide's m/z over the
whole chromatographic retention time range. So, not all the peaks detected
on this XIC correspond to the retention time at which this peptide has been
identified in the MS run being quantified. Moreover, after retention
time alignment, the retention time of the peptide can change.
That is why {\M} performs peak matching on each detected peak: the detected peaks for a
peptide are assigned/matched to the peptide being quantified. This
peak matching is based on retention times and it is performed as
follows: the peak is assigned to a peptide if and only if the RT of this peptide
(after alignment) is within the boundaries of this peak.
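
Written as a condition (the symbols below are our own notation,
introduced only for illustration): let $RT_{p,r}$ be the retention time
of peptide $p$ in run $r$ after alignment, and let
$t^{\mathrm{begin}}_{k}$ and $t^{\mathrm{end}}_{k}$ be the boundaries of
a peak $k$ detected on the XIC of $p$ in $r$. Peak $k$ is matched to $p$
when
\[
  t^{\mathrm{begin}}_{k} \;\le\; RT_{p,r} \;\le\; t^{\mathrm{end}}_{k}.
\]
Whether the boundaries themselves are included is not stated here; the
closed interval is an assumption. For instance, with hypothetical
values, a peak spanning $1195$ to $1215$ seconds is matched to a peptide
whose aligned RT is $1202$ seconds, whereas a peak spanning $1300$ to
$1320$ seconds on the same XIC is not.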
What is the RT of a peptide in a given MS-run?
What is the RT of a peptide?
In a given run a peptide can be identified or not. In case it has been
identified, it can be identified at several retention times, this is
why its RT is computed with the \emph{best RT} method; in case it has
not been identified, its RT is computed with the \emph{smart
identified, it can be identified at several distinct chromatographic
retention times: in that case {\M} computes its \emph{real RT} following the
\emph{best RT} method explained below.
In case the peptide has not been identified in the run, its
\emph{mean RT} is computed following the \emph{smart
quantification} method which allows quantification of peptides even
in runs they have not been identified, provided they have been
identified in another run of the same group.
in the runs where they have not been identified (provided that they have been
identified in another run of the same group).
\subsection{The best RT method}\label{bestRT}
The user can choose between three different modes in {\M} to perform peak matching :
\begin{itemize}
\item the \emph{real\_or\_mean} mode;
\item the \emph{mean} mode;
\item the \emph{post\_matching} mode.
\end{itemize}
The first two modes correspond to the computation mode of the RT of
the peptide whose peaks are being matched. The last mode is more
complex. Here follows an explanation of each of these peak matching
modes.
\subsection{The \emph{real RT} mode and the best RT method}\label{bestRT}
A given peptide can be observed/identified at several different
retention times in a given run, with different intensities or charge
states. During parsing {\M} will get from the LC-MS run mzXML or mzML
states. During parsing, {\M} will get from the LC-MS run mzXML or mzML
file all the retention times the peptide has been identified in and
the corresponding precursor intensities. It will then retain only the
retention time corresponding to the most intense occurrence of this
peptide. This will be the retention time of this peptide for this run
during the rest of the analysis. We refer to this method as the \emph
{best RT} method.
\subsection{Smart Quantification}\label{smart_quanti}
If a given peptide has been identified in at least one sample of the
group, during peak matching in the samples where this peptide has not
been identified {\M} will nevertheless check for peaks corresponding
to this peptide.
Indeed {\M} computes the mean of this peptide's retention times in
the samples it has been identified in (more precisely it computes the
mean of all its best RTs). In the sample the peptide has
not been identified in, {\M} checks whether this mean RT belongs to a
detected peak area or not; if it does the peak and its quantification
value are assigned to this peptide.
during the rest of the analysis. We refer to this retention time as the
\emph{real retention time} of the peptide in the MS-run, meaning the
peptide has been really observed in this run. We refer to the
computation method above as the \emph{best RT} method.
\subsection{The \emph{mean RT} mode and smart quantification}\label{smart_quanti}
If a given peptide has never been observed/identified in a given run,
{\M} nevertheless computes a retention time for this peptide in this
run as the mean of its \emph{real retention times} in
the other runs of the same group.
To be more precise, {\M} computes the mean of all the best RTs of the
peptide in the runs of the group where the peptide was
observed/identified.
This way, {\M} is able to align and quantify a given peptide even in
runs where it has not been identified. It suffices that this peptide
has been identified in at least one run of the same group.
During peak matching in the samples where this peptide has not
been identified, {\M} will nevertheless try to assign the detected
peaks to the \emph{mean RT} of this peptide.
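
The \emph{mean RT} described above can be written as a formula. The
notation is ours and only meant as an illustration: let $I_p$ be the set
of runs of the group in which peptide $p$ has been identified, and let
$RT^{\mathrm{best}}_{p,r}$ be its best RT in run $r$. Then
\[
  \overline{RT}_{p} \;=\; \frac{1}{|I_p|} \sum_{r \in I_p} RT^{\mathrm{best}}_{p,r},
  \qquad |I_p| \ge 1 .
\]
For example, with hypothetical best RTs of $1200$, $1210$ and $1196$
seconds in three runs of the group, the \emph{mean RT} used in a fourth
run where the peptide was not identified is
$(1200 + 1210 + 1196)/3 = 1202$ seconds.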
\subsection{The \emph{real\_or\_mean RT} peak matching mode}
In the \emph{real\_or\_mean RT} mode the RT of a peptide in a given MS
run is its \emph{real RT} if the peptide has been identified in this
run, or its \emph{mean RT} if not. This RT is then used to assign the
detected peaks to this peptide as follows : for each peptide in each
MS run, for each detected peak on the XIC of this peptide, if the
peptide's \emph{real\_or\_mean RT} in this run is within the peak's RT
boundaries, then the peak is assigned to this peptide.
\subsection{Peak post-matching mode}
In both previous peak matching modes (\emph{real\_or\_mean} and \emph{mean} mode) the
peak matching is performed peak after peak during quantification: when
matching a new peak to a peptide, the previously matched
peaks of that same peptide in the same group are not considered. But
they could! Whenever a peak is matched to a peptide, we could consider
the retention time corresponding to the maximum intensity of this
peak. Let us call it the \emph{best matched RT}. When trying to match
a new peak to this peptide, instead of trying to match the \emph{real\_or\_mean RT} of
this peptide, we could try to match the previously computed \emph{best
matched RT} of the peptide. But we could do even better! If we could
have all the matched peaks of the peptide for a given group, we could
compute the mean of their \emph{best matched RTs} before trying to
rematch the previously unmatched peaks.
That is what the post-matching mode does. The post-matching process is
performed as follows :
\begin{itemize}
\item A first peak matching is performed during quantification, peak
after peak, as in the two previous modes, with the matched retention
time of the peptide being its \emph{real RT} in every MS-run it has
been identified in. This means that no peak matching is performed
during this step in the runs where the peptide has not been
identified.
\item During the previous first step, if a peak
is matched, we compute the retention time of the maximum intensity of
this peak. We will call it the \emph{best matched RT} of the peptide
in this MS run. If a peak is not matched, we
set it aside and keep it for a second peak matching round. We also
consider that for this MS run, the peptide's \emph{best matched RT}
is its real RT if any. If no real RT is possible (the peptide is not
identified in this run), there is no \emph{best matched RT} for this
peptide in this run.
\item After the first peak matching and quantification round is finished
in the group, we perform another peak matching and quantification round on the
previously kept aside unmatched peaks. For every peptide in every run of the
group, we try to rematch the unmatched peaks of this run to the
\emph{best matched RT} of this peptide in the group. The \emph{best
matched RT} of a peptide in a group is the mean value, over all the
MS runs of the group, of the \emph{best matched RTs} of this peptide in
each run of the group, if any.
\end{itemize}
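
The quantity used in the second round can be summarised as follows. This
is a sketch in our own notation, not taken from the {\M} source code.
For a peptide $p$ and a group $G$, let $R_p \subseteq G$ be the set of
runs in which $p$ has a \emph{best matched RT} after the first round,
and write $RT^{\mathrm{bm}}_{p,r}$ for that value in run $r$: it is the
retention time of the maximum intensity of the matched peak if a peak
was matched, or the real RT of $p$ in $r$ otherwise. The group-level
value is then
\[
  \overline{RT}^{\mathrm{bm}}_{p,G} \;=\; \frac{1}{|R_p|} \sum_{r \in R_p} RT^{\mathrm{bm}}_{p,r},
\]
and, in the second round, a peak $k$ that was set aside, with boundaries
$t^{\mathrm{begin}}_{k}$ and $t^{\mathrm{end}}_{k}$, is rematched to $p$
when
\[
  t^{\mathrm{begin}}_{k} \;\le\; \overline{RT}^{\mathrm{bm}}_{p,G} \;\le\; t^{\mathrm{end}}_{k}.
\]
As before, treating the boundaries as inclusive is an assumption made
for this illustration.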
One can see that the second matching round in the post-matching mode
is an optimisation of the \emph{real\_or\_mean} matching mode : the
peptides are matched during the first round via their real RTs in the
runs they have been identified in. Then, during the second round we
match the peptides in the runs they have not been identified in via a
finer computed mean RT, which takes into account the maximum intensity
RTs of the previously matched peaks of each peptide.
\subsection{What peak matching mode should I choose?}
The peak post-matching feature becomes
interesting for rich, complex samples acquired on high
resolution systems, without noise. Indeed, in these cases the peaks are dense and
well separated, so the chance of finding previously missed relevant peaks is
higher, without the risk of assigning noise peaks.
On the other hand, its use for complex samples acquired with low
resolution systems is not recommended. Indeed in low resolution
samples, peaks are not as well separated as in high resolution ones,
the background noise peaks are much more present and consecutive
overlapping or very close peaks appear often. This increases the risk
of peak misassignments during the peak post-matching process. In these
cases, the \emph{real\_or\_mean} mode is recommended.
According to our tests, the use of the post-matching feature for non-complex samples
(for example pre-fractionated ones) does not seem to have a
significant impact on the number of newly matched peaks, or on
the number of misassigned peaks.
\chapter{The masschroqML format}\label{xml-sec}
......
......@@ -12,249 +12,298 @@
#include <QStringList>
QuantiItemPeptide::QuantiItemPeptide(XicExtractionMethodBase & extraction_method,
const Peptide * p_peptide,
unsigned int z,
const msRunHashGroup & group,
const mcq_matching_mode & match_mode)
:
QuantiItemBase(extraction_method),
_p_peptide(p_peptide),
_z(z),
_group_to_quantify(group),
_match_mode(match_mode)
const Peptide * p_peptide,
unsigned int z,
const msRunHashGroup & group,
const mcq_matching_mode & match_mode)
:
QuantiItemBase(extraction_method),
_p_peptide(p_peptide),
_z(z),
_group_to_quantify(group),
_match_mode(match_mode)
{
this->setType();
this->setType();
if (isPostMatchingRequired()) {
_mismatched_peaks = new vector<xicPeak *>;
} else {
_mismatched_peaks = 0;
}
if (isPostMatchingRequired())
{
_mismatched_peaks = new vector<xicPeak *>;
}
else
{
_mismatched_peaks = 0;
}
}
QuantiItemPeptide::~QuantiItemPeptide() {
vector<xicPeak *>::iterator it_mis_peak;
for (it_mis_peak = _mismatched_peaks->begin();
it_mis_peak != _mismatched_peaks->end();
++it_mis_peak) {
if (*it_mis_peak != 0) {
delete (*it_mis_peak);
*it_mis_peak = 0;
}
}
QuantiItemPeptide::~QuantiItemPeptide()
{
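  // Guard against a null container: the constructor sets
  // _mismatched_peaks to 0 when post-matching is not required,
  // so it must not be dereferenced unconditionally.
  if (_mismatched_peaks != 0)
    {
      vector<xicPeak *>::iterator it_mis_peak;
      for (it_mis_peak = _mismatched_peaks->begin();
           it_mis_peak != _mismatched_peaks->end();
           ++it_mis_peak)
        {
          if (*it_mis_peak != 0)
            {
              delete (*it_mis_peak);
              *it_mis_peak = 0;
            }
        }
    }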
if (_mismatched_peaks != 0) {
delete (_mismatched_peaks);
_mismatched_peaks = 0;
}
if (_mismatched_peaks != 0)
{
delete (_mismatched_peaks);
_mismatched_peaks = 0;
}
}
void
QuantiItemPeptide::setType() {
_type = PEPTIDE_QUANTI_ITEM;
QuantiItemPeptide::setType()
{
_type = PEPTIDE_QUANTI_ITEM;
}
const Peptide *
QuantiItemPeptide::getPeptide() const {
return (_p_peptide);
QuantiItemPeptide::getPeptide() const
{
return (_p_peptide);
}
const unsigned int *
QuantiItemPeptide::getZ() const {
return (&_z);
QuantiItemPeptide::getZ() const
{
return (&_z);
}
QString
QuantiItemPeptide::getSimpleName(const Msrun * msrun) const {
QString name(_p_peptide->getXmlId()), tmp;
name.append("_mz_");
tmp.setNum(get_mz());
name.append(tmp).append("_rt_");
tmp.setNum(getMatchingRt(msrun));
name.append(tmp).append("_l_");
tmp.setNum(get_min_mz());
name.append(tmp).append("_h_");
tmp.setNum(get_max_mz());
name.append(tmp).append("_z_");
tmp.setNum(_z);
name.append(tmp);
name = name.replace(".", "-");
// name.append(suffixName());
return (name);
QuantiItemPeptide::getSimpleName(const Msrun * msrun) const
{
QString name(_p_peptide->getXmlId()), tmp;
name.append("_mz_");
tmp.setNum(get_mz());
name.append(tmp).append("_rt_");
tmp.setNum(getMatchingRt(msrun));
name.append(tmp).append("_l_");
tmp.setNum(get_min_mz());
name.append(tmp).append("_h_");
tmp.setNum(get_max_mz());
name.append(tmp).append("_z_");
tmp.setNum(_z);
name.append(tmp);
name = name.replace(".", "-");
// name.append(suffixName());
return (name);
}
QStringList
QuantiItemPeptide::getTracesTitle(const Msrun * msrun) const {
QList<QString> header;
QString pep("peptide = "), mz("mz = "), mz_value, rt_mode("rt_mode = "), rt("rt = "), rt_value;
pep.append(_p_peptide->getXmlId());
mz_value.setNum(get_mz());
mz.append(mz_value);
rt_mode.append(_match_mode);
rt_value.setNum(getMatchingRt(msrun));
rt.append(rt_value);
header << pep << mz << rt_mode << rt;
return QStringList(header);
QuantiItemPeptide::getTracesTitle(const Msrun * msrun) const
{
QList<QString> header;
QString pep("peptide = "), mz("mz = "), mz_value, rt_mode("rt_mode = "), rt("rt = "), rt_value;
pep.append(_p_peptide->getXmlId());
mz_value.setNum(get_mz());
mz.append(mz_value);
rt_mode.append(_match_mode);
rt_value.setNum(getMatchingRt(msrun));
rt.append(rt_value);
header << pep << mz << rt_mode << rt;
return QStringList(header);
}
const mcq_matching_mode
QuantiItemPeptide::getMatchMode() const {
return _match_mode;
QuantiItemPeptide::getMatchMode() const
{
return _match_mode;
}
const bool
QuantiItemPeptide::isPostMatchingRequired() const {
return (_match_mode == POST_MATCHING_MODE);
QuantiItemPeptide::isPostMatchingRequired() const
{
return (_match_mode == POST_MATCHING_MODE);
}
void
QuantiItemPeptide::printInfos(ostream & out) const {
out << "peptide quantification item :" << endl;
_p_peptide->printInfos(out);
out << "_z = " << _z << endl;
out << "_mz = " << _mz << endl;
out << "rt matching mode= " << _match_mode.toStdString() << endl;
out << "_low_mz = " << _low_mz << endl;
out << "_high_mz = " << _high_mz << endl;
QuantiItemPeptide::printInfos(ostream & out) const
{
out << "peptide quantification item :" << endl;
_p_peptide->printInfos(out);
out << "_z = " << _z << endl;
out << "_mz = " << _mz << endl;
out << "rt matching mode= " << _match_mode.toStdString() << endl;
out << "_low_mz = " << _low_mz << endl;
out << "_high_mz = " << _high_mz << endl;
}
const mcq_double
QuantiItemPeptide::getMatchingRt(const Msrun * msrun) const {
QuantiItemPeptide::getMatchingRt(const Msrun * msrun) const
{
if ( _match_mode == MEAN_MODE ) {
return (getMeanRt(_group_to_quantify));
}
else if ( _match_mode == POST_MATCHING_MODE ) {