This format is simple: each stanza holds only three pieces of
information, and the file contains one stanza per protein in the database:
<itemizedlist>
<listitem>
<para>
The protein's accession id in the database
(<code>GRMZM2G009506_P01</code>), which comes right after the '>'
character that marks the start of a new protein stanza in the file;
</para>
</listitem>
<listitem>
<para>
The protein description
(<code>NP_001149383 serine/threonine-protein kinase receptor</code>),
which provides some functional information about the protein at hand;
</para>
</listitem>
<listitem>
<para>
The protein sequence (the rest of the stanza above).
</para>
</listitem>
</itemizedlist>
The first two pieces of information (id and description) are used in
various places in the &xtpcpp; program.
</para>
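<para>
As an illustration only, the three-part stanza layout described above can
be read with a few lines of Python. This is a minimal sketch and not the
code used by &xtpcpp;: the names <code>ProteinRecord</code> and
<code>parse_fasta</code> are hypothetical, the accession id is assumed to
end at the first whitespace of the header line, and the sequence used in
the example call is a made-up placeholder.
</para>
<programlisting language="python">
```python
from typing import Iterator, List, NamedTuple


class ProteinRecord(NamedTuple):
    """One stanza of the database file (hypothetical helper type)."""
    accession: str    # token that comes right after the '>' character
    description: str  # remainder of the header line
    sequence: str     # all following lines of the stanza, joined


def _make_record(header: str, chunks: List[str]) -> ProteinRecord:
    # Assumption: the accession id ends at the first whitespace.
    parts = header.split(None, 1)
    accession = parts[0] if parts else ""
    description = parts[1].strip() if len(parts) == 2 else ""
    return ProteinRecord(accession, description, "".join(chunks))


def parse_fasta(text: str) -> Iterator[ProteinRecord]:
    """Yield one ProteinRecord per '>' stanza found in text."""
    header = None
    chunks: List[str] = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line:
            continue  # blank lines carry no information
        if line.startswith(">"):
            # A new stanza starts: emit the previous one, if any.
            if header is not None:
                yield _make_record(header, chunks)
            header, chunks = line[1:], []
        elif header is not None:
            chunks.append(line)  # sequence line of the current stanza
    if header is not None:
        yield _make_record(header, chunks)


if __name__ == "__main__":
    # The header is the one discussed in the text; the sequence is a
    # made-up placeholder, not the real protein sequence.
    example = (
        ">GRMZM2G009506_P01 NP_001149383 "
        "serine/threonine-protein kinase receptor\n"
        "MKTAYIAK\n"
        "QRQISFVK\n"
    )
    for record in parse_fasta(example):
        print(record.accession, "|", record.description, "|", record.sequence)
```
</programlisting>
<para>
Splitting the header at the first whitespace keeps the id and the
description separate, which matches the two header fields listed above.
Databases that use another header convention would need a different
splitting rule.
</para>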
<para>
Protein identification software uses the protein databases in the very first step of a bottom-up proteomics data analysis: the proteins in the database are digested <foreignphrase>in silico</foreignphrase> in order to produce the following data in memory (for each protein):
<itemizedlist>
<listitem>
<para>
Digestion using a site-specific protease. This step produces a
list of peptides for each protein. For each of these peptides: the