Commit 38314778 authored by Filippo Rusconi's avatar Filippo Rusconi

Started writing some basis in bottom-up proteomics.

parent 3ae2f6a8
......@@ -31,6 +31,9 @@ xml:lang="en">
<xi:include href="generalities.xml" xmlns:xi="http://www.w3.org/2001/XInclude">
</xi:include>
<xi:include href="basics-in-bottom-up-proteomics.xml"
xmlns:xi="http://www.w3.org/2001/XInclude"> </xi:include>
<!--<xi:include href="main-program-window.xml"-->
<!--xmlns:xi="http://www.w3.org/2001/XInclude"> </xi:include>-->
......
<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE chapter [
<!ENTITY % entities SYSTEM "pappso-user-manuals.ent">
%entities;
<!ENTITY % xtpcpp-entities SYSTEM "xtpcpp-entities.ent">
%xtpcpp-entities;
<!ENTITY % sgml.features "IGNORE">
<!ENTITY % xml.features "INCLUDE">
<!ENTITY % dbcent PUBLIC "-//OASIS//ENTITIES DocBook Character Entities V4.5//EN" "/usr/share/xml/docbook/schema/dtd/4.5/dbcentx.mod">
%dbcent;
]>
<chapter xmlns="http://docbook.org/ns/docbook"
xmlns:xlink="http://www.w3.org/1999/xlink"
xml:id="chap_basics-in-bottom-up-proteomics" version="5.0">
<info>
<title>Fundamentals in bottom-up proteomics</title>
<keywordset>
<keyword>Fundamentals</keyword>
</keywordset>
</info>
<para>
This chapter is optional; the reader might be referred to it from other
parts of this manual.
</para>
<sect1 xml:id="sec_general-overview-bottom-up-proteomics">
<title>General overview of bottom-up proteomics</title>
<para>
Bottom-up proteomics is a field of endeavour in which the ultimate aim is to
identify the greatest possible number of proteins in a given sample.
Depending on the project at hand, this aim might be paired with another:
characterizing, at the finest possible level, the nature and the position of
the post-translational/chemical modifications borne by the proteins.
</para>
<para>
To achieve the best results, proteomics has developed over the years a set
of methods and techniques that, taken together, have allowed scientists to
obtain impressive protein identification results on highly complex
samples. These are listed below:
<itemizedlist>
<listitem>
<para>
<emphasis>Mass spectrometers: </emphasis>The development of mass
spectrometers of ever-greater resolving power has made it possible to
attain ever-lower false discovery rates over the years. In particular,
the development of the Orbitrap analyzers, along with the huge
improvements in time-of-flight (TOF) mass analyzer technology,
has strongly increased the reliability of identification results by
allowing the downstream data processing step to be more stringent in
the protein identification task (see below);
</para>
</listitem>
<listitem>
<para>
<emphasis>Chromatography: </emphasis>The development of highly
resolving chromatography resins, along with the elaboration of
hardware (columns, chromatography setups) that yields sensitivity
improvements, has played its part in the way proteomics has evolved
over the years;
</para>
</listitem>
<listitem>
<para>
<emphasis>Bioinformatics: </emphasis>The development and refinement of
software that can cope with extremely large data sets (think
metaproteomics) is one major field that has enabled significant advances
in proteomics. The refinement of algorithms that simulate isotopic
clusters and compare them with experimental data has also played its
part, as have the algorithms that detect the charge of ions based on
the analysis of the isotopic cluster peaks. Being able to single out,
without error, the monoisotopic peak of an isotopic cluster (whatever
the ion charge or &mz; ratio) is a big part of the challenge at the
root of successful proteomics data processing.
</para>
</listitem>
</itemizedlist>
</para>
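The charge detection mentioned above can be illustrated with a minimal sketch (a toy illustration, not the algorithm of any particular software package): adjacent isotopic peaks of a z-charged ion are separated by roughly 1.00335/z on the m/z axis (the 13C/12C mass difference divided by the charge), so the observed spacing betrays the charge.

```python
# Sketch: infer an ion's charge from the spacing of its isotopic peaks.
# Adjacent isotopologues differ by ~1.00335 Da (the 13C-12C mass
# difference), so on the m/z axis the peaks of a z-charged ion are
# ~1.00335/z apart.

C13_C12_DELTA = 1.0033548  # Da, mass difference between 13C and 12C


def infer_charge(isotope_mzs, max_charge=6):
    """Estimate the charge from consecutive isotopic peak m/z values."""
    spacings = [b - a for a, b in zip(isotope_mzs, isotope_mzs[1:])]
    mean_spacing = sum(spacings) / len(spacings)
    # Pick the charge whose theoretical spacing best matches the observed one.
    return min(range(1, max_charge + 1),
               key=lambda z: abs(C13_C12_DELTA / z - mean_spacing))


# A doubly charged ion shows ~0.5017 Th spacing:
print(infer_charge([785.84, 786.34, 786.84]))  # 2
```

Real deisotoping algorithms are considerably more involved (they also score peak intensities against theoretical isotopic patterns), but the spacing rule above is the core idea.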
<para>
In this section, we will review the bioinformatics-based mass spectrometry
data processing, as it is the core subject of this user manual. In
particular, we will provide an outline of how the major software packages on
the market perform protein identification on the basis of mass spectrometric
analyses of biological samples.
</para>
<sect2 xml:id="sec_from-the-sample-to-the-protein-identities">
<title>From the sample to the protein identities</title>
<para>
This section will outline in very rough terms how bottom-up proteomics
works from the protein sample to the protein identification list.
</para>
<sect3 xml:id="sec_protein-digestion">
<title>The first step: digestion of the sample's proteins</title>
<para>
The very first step in the bottom-up proteomics workflow is to digest all
the proteins in the initial biological sample with a site-specific
endoprotease: typically trypsin.
</para>
<para>
The sample is subjected to proteolysis with all its proteins unresolved.
This produces a highly complex mixture of peptides that share one
constant characteristic: each peptide has one predictable end, either
N-terminal or C-terminal (unless it is the protein's N-terminal or
C-terminal peptide, as detailed below):
<itemizedlist>
<listitem>
<para>
<emphasis>Predictable N-terminus: </emphasis>when the protease
cuts at the N-terminal end of the target residue. For example,
EndAspN cleaves left of Asp residues, thus producing peptides that
always have Asp as their N-terminal residue. The only exception is
when the peptide is the protein's N-terminal peptide and its first
residue is not Asp;
</para>
</listitem>
<listitem>
<para>
<emphasis>Predictable C-terminus: </emphasis>when the protease
cuts at the C-terminal end of the target residue. For example, the
most used enzyme, trypsin, cuts right of the basic residues Lys
and Arg. The generated peptides thus necessarily end with one of
these two residues. The only exception is when the peptide is the
protein's C-terminal peptide and its last residue is neither Lys nor Arg.
</para>
<tip>
<para>
One interesting feature of trypsinolysis is that it generates
peptides that&emdash;for the most part&emdash;will most
probably be protonated twice: on their N-terminal end (the
primary NH<subscript>2</subscript> amine
group<footnote><para>If not either converted to an amide group by
acetylation or formylation, or cyclised.</para></footnote>) and on
the basic side chain of the basic residue found at their
C-terminal position (the &lnepsilon;-amine group for Lys and the
guanidinium group for Arg). Upon fragmentation of the peptide's
precursor ion, both the left-hand side fragment and the right-hand
side fragment will bear a proton and will thus be detected,
potentially providing a better coverage of the peptide's
sequence during the MS/MS experiment.
</para>
</tip>
</listitem>
</itemizedlist>
</para>
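The site-specific cleavage rule described above can be sketched in a few lines of Python (a toy illustration, assuming the classical trypsin rule of cutting after Lys/Arg except before Pro; real digestion engines offer many more options, such as semi-specific cleavage):

```python
import re


def trypsin_digest(sequence, missed_cleavages=0):
    """Cleave after Lys (K) or Arg (R), except when the next residue is
    Pro -- the classical approximation of trypsin specificity."""
    # Zero-width split right after K or R not followed by P.
    fragments = [f for f in re.split(r'(?<=[KR])(?!P)', sequence) if f]
    peptides = list(fragments)
    # Optionally rejoin neighbouring fragments to model missed cleavages.
    for n in range(1, missed_cleavages + 1):
        for i in range(len(fragments) - n):
            peptides.append("".join(fragments[i:i + n + 1]))
    return peptides


# The KP site is not cleaved; both other K sites are:
print(trypsin_digest("MKWVTFISLLKPFK"))  # ['MK', 'WVTFISLLKPFK']
```

Allowing one or two missed cleavages, as sketched above, is routinely done by identification software because proteolysis is never perfectly complete.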
</sect3>
<sect3 xml:id="sec_peptidic-mixture-chromatographic-separation">
<title>Peptidic mixture chromatographic separation</title>
<para>
One major analytical step in bottom-up proteomics is the separation of the
peptides obtained by endoproteolysis of all the proteins in the sample.
Indeed, analyzing all the peptides in one single injection, without prior
chromatographic resolution, would yield catastrophic results, similar to
having injected nothing into the mass spectrometer.
</para>
<para>
The typical method for resolving peptides is to separate them on a
chromatographic column functionalized with a hydrophobic group (for
peptides, that would be a C<subscript>18</subscript> reversed-phase
column).
</para>
<para>
The chromatographic gradient, which elutes the peptides progressively
according to their increasing hydrophobicity, is developed over a
5&endash;95&nbsp;% range of acetonitrile (a non-protic organic solvent).
<tip>
<para>
Using acetonitrile as the non-protic organic solvent has the
huge benefit of not injecting protons into the mass
spectrometer as the chromatographic gradient develops.
</para>
</tip>
</para>
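As a toy illustration of how such a linear gradient develops (the 60-minute run length below is an arbitrary assumption for the sketch, not a value prescribed by this manual):

```python
def percent_acn(t_min, t_start=0.0, t_end=60.0, low=5.0, high=95.0):
    """Percentage of acetonitrile at time t_min (minutes) for a linear
    gradient running from `low` % to `high` % between t_start and t_end."""
    if t_min <= t_start:
        return low
    if t_min >= t_end:
        return high
    return low + (high - low) * (t_min - t_start) / (t_end - t_start)


# Halfway through a 60-minute 5-95 % gradient:
print(percent_acn(30.0))  # 50.0
```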
<para>
The effluent of the chromatographic column is directly injected into the
mass spectrometer's source. The role of the source device is to ensure
that the analytes are desolvated and ionized upon entering the core part
of the mass spectrometer. Most often, that source is an electrospray
source that is fed a liquid (typically, the effluent from the column),
evaporates the solvent and&emdash;having an electric potential applied
to it&emdash;ionizes the eluted analytes. The electrically charged
analytes in the gas phase are thus ions, the &mz; (mass-to-charge) ratio
of which can be measured by the mass spectrometer's analyzer.
</para>
<warning>
<para>
There are two main sources used in biological mass spectrometry:
the matrix-assisted laser desorption ionization (MALDI)
source and the electrospray ionization (ESI) source. One important
difference between the two is that the MALDI process mostly produces
mono-charged ions (&mh;), while the ESI process mostly produces
multi-charged ions (&mnh;). This has huge implications for the mass
data analysis.
</para>
<para>
The source that is mainly used in bottom-up proteomics is the ESI
source.
</para>
</warning>
</sect3>
<sect3 xml:id="sec_mass-spectrometric-analysis-of-the-peptides">
<title>Mass spectrometric analysis of the peptides</title>
<para>
Upon elution off the chromatographic column, the peptides are
desolvated, ionized and drawn into the mass spectrometer by an
electrical field. Once they have entered the mass spectrometer, they
are analyzed in the instrument's mass analyzer.
</para>
<note>
<para>
There are a variety of mass analyzers commonly used in bottom-up
proteomics. In fact, one single instrument might have as many as 4 or
5 mass analyzers. However, not all the analyzers in the instrument are
responsible for the &mz; measurement.
</para>
<para>
Sometimes, during the whole analysis cycle, two different mass
analyzers are used at different steps of the cycle. This will be
described later, when the gas-phase fragmentation of the peptides is
explained.
</para>
</note>
<para>
In bottom-up proteomics, two different kinds of mass spectrometric data
are required&emdash;ideally, for each peptide eluted from the
column&emdash;in order to effectively identify the proteins in the
initial sample:
<itemizedlist>
<listitem>
<para>
The mass-to-charge ratio (&mz;) of the peptide ion;
</para>
</listitem>
<listitem>
<para>
The &mz; values of the fragments (the product ions) of the
peptidic precursor ion that has undergone an MS/MS gas-phase
fragmentation<footnote><para>Most often, that fragmentation step
is performed using collision-induced dissociation (CID). In
this process, the peptidic precursor ion is first isolated in the
gas phase on the basis of its &mz; value and then accelerated
against a gas <quote>fog</quote> inside the collision cell of
the instrument. The ion hits gas molecules multiple times,
acquires a lot of energy and finally breaks apart.</para></footnote>.
</para>
</listitem>
</itemizedlist>
These two kinds of data are necessary because the protein identification
process is based on searches in protein databases using the precursor
ions' &mz; values and the &mz; values of each precursor ion's fragments.
The way the protein databases are used as the substrate of these
searches is described in the next section.
</para>
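The relation between a peptide's mass and the &mz; value actually measured follows directly from the number of protons carried by the ion: for an [M+nH]n+ ion, m/z = (M + n × 1.007276) / n. A minimal sketch:

```python
PROTON_MASS = 1.007276  # Da, mass of a proton


def mz(monoisotopic_mass, charge):
    """m/z of an [M+nH]n+ ion: (M + n * proton mass) / n."""
    return (monoisotopic_mass + charge * PROTON_MASS) / charge


# A 1569.74-Da peptide observed singly and doubly protonated:
print(round(mz(1569.74, 1), 4))  # 1570.7473
print(round(mz(1569.74, 2), 4))  # 785.8773
```

This is why the same peptide eluting from the column may show up at several distinct &mz; values, one per charge state, in an ESI experiment.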
</sect3>
<sect3 xml:id="sec_the-protein-databases-and-their-use">
<title>The protein databases and their use</title>
<para>
The previous section ended on the idea that the protein identification
process&emdash;based on the analysis of all the peptides of a peptidic
mixture resulting from the endoproteolysis of a sample containing many
proteins&emdash;requires searches in protein databases.
</para>
<para>
A bottom-up proteomics experiment typically needs at least one protein
database: a database listing all the known proteins of the organism from
which the initial protein sample was prepared. That organism might
be a bacterium or a eukaryote such as a fungus, a protist or a
mammal&ellipsis; Optional databases might also be used, such as
contaminant protein databases.
</para>
<para>
The protein databases store the protein data in files in the FASTA
format, illustrated below:
<literallayout class="monospaced" xml:space="preserve">
>GRMZM2G009506_P01 NP_001149383 serine/threonine-protein kinase receptor
MEEQHMAGPPYRYRLQHRRLMDIAPASASDDDSGHHGSNGMAIMVSILVVVIVCTLFYCV
YCWRWRKRNAVRRAQIERLRPMSSSDLPLMDLSSIHEATNSFSKENKLGEGGFGPVYRGV
MGGGAEIAVKRLSARSRQGAAEFRNEVELIAKLQHRNLVRLLGCCVERDEKMLVYEYLPN
RSLDSFLFDSRKSGQLDWKTRQSIVLGIARGMLYLHEDSCLKVIHRDLKASNVLLDNRMN
PKISDFGMAKIFEEEGNEPNTGPVVGTYGYMAPEYAMEGVFSVKSDVFSFGVLVLEILSG
QRNGSMYLQEHQHTLIQDAWKLWNEDRAAEFMDAALAGSYPRDEAWRCFHVGLLCVQESP
DLRPTMSSVVLMLISDQTAQQMPAPAQPPLFASSRLGRKASASDLSLAMKTETTKTQSVN
EVSISMMEPRFWADPGTSNGAATSHPATGACKKRGGQGGDRNVKDGLAARTPTHQPVARW
HHDRRIVD
</literallayout>
</para>
<para>
This format is really simple: it only contains three pieces of
information, in as many stanzas as there are proteins in the database:
<itemizedlist>
<listitem>
<para>
The protein's accession id in the database
(<code>GRMZM2G009506_P01</code>), which comes right after the '>'
character that signals a new protein stanza in the file;
</para>
</listitem>
<listitem>
<para>
The protein description (<code>NP_001149383
serine/threonine-protein kinase receptor</code>), which provides
some functional information about the protein at hand;
</para>
</listitem>
<listitem>
<para>
The protein sequence (the rest of the stanza above).
</para>
</listitem>
</itemizedlist>
The first (id) and second (description) information bits are used in
various places in the &xtpcpp; program.
</para>
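A reader for the stanza structure just described can be sketched in a few lines (a toy illustration; real identification software uses more robust parsers that also handle the many FASTA header dialects):

```python
def parse_fasta(text):
    """Parse FASTA stanzas into (accession, description, sequence) tuples."""
    records = []
    accession = description = None
    seq_lines = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A '>' line closes the previous stanza and opens a new one.
            if accession is not None:
                records.append((accession, description, "".join(seq_lines)))
            header = line[1:].split(None, 1)
            accession = header[0]
            description = header[1] if len(header) > 1 else ""
            seq_lines = []
        elif line:
            seq_lines.append(line)
    if accession is not None:
        records.append((accession, description, "".join(seq_lines)))
    return records


demo = ">P1 test protein\nMEEQHMAGPP\nYRYRLQHRRL\n"
acc, desc, seq = parse_fasta(demo)[0]
print(acc, desc, len(seq))  # P1 test protein 20
```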
<para>
The protein databases are used by the protein identification software as
the very first step of a bottom-up proteomics data analysis: each
protein in the database is digested <foreignphrase>in
silico</foreignphrase> in order to produce the following data bits in
memory:
<itemizedlist>
<listitem>
<para>
Digestion using a site-specific protease produces a list of
peptides for each protein. For each of these peptides, the
following data bits are generated:
</para>
<itemizedlist>
<listitem>
<para>
<emphasis>sequence: </emphasis>This peptide's sequence;
</para>
</listitem>
<listitem>
<para>
<emphasis>&mz; value: </emphasis>This peptide's &mz; value,
often computed for the mono-protonated (&mh;) ion;
</para>
</listitem>
</itemizedlist>
</listitem>
</itemizedlist>
</para>
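The in silico computation of each peptide's &mz; value can be sketched as follows: sum the monoisotopic residue masses of the sequence, add one water (for the N-terminal H and C-terminal OH) and one proton for the mono-protonated [M+H]+ ion. The residue mass table below is truncated to a few amino acids for brevity:

```python
# Monoisotopic residue masses (Da) -- truncated table for the demo.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "L": 113.08406, "K": 128.09496,
    "R": 156.10111, "E": 129.04259, "D": 115.02694, "F": 147.06841,
}
WATER = 18.010565   # Da, added once per peptide (N-terminal H + C-terminal OH)
PROTON = 1.007276   # Da


def peptide_mh(sequence):
    """[M+H]+ m/z of a peptide: residue masses + one water + one proton."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER + PROTON


print(round(peptide_mh("GASP"), 3))  # 331.161
```

These computed values are what the search engine compares against the measured precursor &mz; values, within a user-defined tolerance.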
</sect3>
</sect2>
</sect1>
</chapter>
......@@ -14,8 +14,37 @@
<!ENTITY xpc "<application>XpertCalc</application>">
<!ENTITY xpm "<application>XpertMiner</application>">
<!ENTITY emdash "—">
<!ENTITY ellip "&#8230;">
<!--<!ENTITY hyphen "‐">-->
<!ENTITY hyphen "&#x2010;">
<!--<!ENTITY nbhyphen "‐">-->
<!ENTITY nbhyphen "&#x2011;">
<!--<!ENTITY endash "–">-->
<!ENTITY endash "&#x2013;">
<!--<!ENTITY emdash "—">-->
<!ENTITY emdash "&#x2014;">
<!--<!ENTITY ellip "...">-->
<!ENTITY ellipsis "&#x2026;">
<!--This one is bold-->
<!ENTITY lbalpha "&#x1D6C2;">
<!ENTITY lnalpha "&#x03B1;">
<!ENTITY lnbeta "&#x03B2;">
<!ENTITY lngamma "&#x03B3;">
<!ENTITY ungamma "&#x0393;">
<!ENTITY lndelta "&#x03B4;">
<!--This one is fit for the normal text size-->
<!ENTITY lnepsilon "&#x025B;">
<!--This one is big-->
<!ENTITY lbepsilon "&#x1D700;">
<!-- nbnsp is non breaking *no* space, that is a non breaking no-width space-->
<!ENTITY nbnsp "&#xFEFF;">
......@@ -33,10 +62,12 @@
<!ENTITY emsp "&#x2003;">
<!ENTITY rt "rt">
<!ENTITY mz "m&nbnsp;/&nbnsp;z">
<!ENTITY mzi "(&mz;,&int;)">
<!ENTITY msn "MS<superscript>n</superscript>">
<!ENTITY mh "[M+H]<superscript>+</superscript>">
<!ENTITY mnh "[M+nH]<superscript>n+</superscript>">
<!ENTITY dt "dt">
<!ENTITY int "i">
<!ENTITY mzi "(&mz;,&int;)">
<!ENTITY mr "M<subscript>r</subscript>">
<!ENTITY c12 "[<superscript>12</superscript>C]">
<!ENTITY c13 "[<superscript>13</superscript>C]">
......