Identifying Peptides
This post is a follow-up to
my introduction to mass spectrometry data. Here I'll go through a typical search pipeline composed of data acquisition and exploration, followed by a search using the
Comet search engine, and then the
Percolator rescoring software. I'll also go over error control by the target decoy approach (false discovery rate, TDA-FDR) and using ground truth peptide identities (false discovery proportion, FDP). For this demonstration, I will be taking 10000 spectra from the
Massive-KB dataset which are
not part of the ProteomeTools data. The exclusion is for demonstration purposes; the subset is for time and accessibility (searching all the spectra can take dozens of hours or even days depending on available hardware). Unfortunately, Massive-KB is not the best dataset for this kind of thing because it is too heterogeneous, but it is what I had on hand at the moment. It's also convenient because it is available from the source in mgf format instead of just proprietary files. We will use the full dataset rather than the "in vivo, HCD only" version to provide the opportunity to discuss data selection and cleaning. For real experiments, selecting the right subset is, of course, preferred.
Data and Software
Figure 1 - the workflow demonstrated in this post.
The first step, as in any other project, is to acquire the data and software we need. All requisite links can be found above for those who'd like to follow along: the Massive-KB dataset is available in MGF format, which means no need to find a converter for a proprietary format. Comet is one of the best search engines in practice
1 (although my own search engine is superior ;-), and is also very fast and easy to use. It works well on most platforms, unlike some C#-based or proprietary engines, so it's a good default choice. Percolator is the de facto standard rescoring algorithm. The downsides of rescoring will be discussed along with error control later. Speaking of which, we will not rely on an external tool for error control, but rather write our own (it's just a few lines of Python).
Finally, we need a dataset to perform the search against. This will be a protein database (other options include the likes of a database of open reading frames from RNA sequencing experiments or a reference genome with 6-frame translation). We will be using the
UniProt human proteome for this purpose.
Data Processing
After acquiring the
human proteome (download all, uncompressed) and the Massive-KB data (in MGF format), there are two main steps we want to take here: first, check what the data looks like. Second, subset and prepare the data for the search. For demonstration purposes, we will want to get rid of data in Massive-KB that is also in ProteomeTools. This won't be hard: rather than comparing the spectra across the two datasets to find the overlap, we'll rely on the dataset provenance field that Massive-KB provides to remove the offending spectra. The conceptual reason for doing this is that we can imagine wanting to use Massive-KB for algorithm development, and ProteomeTools for evaluation. It would do no good to suffer data contamination (e.g. overfitting)!
The search database is a standard proteome fasta that looks like this:
>sp|A0A0C5B5G6|MOTSC_HUMAN Mitochondrial-derived peptide MOTS-c OS=Homo sapiens OX=9606 GN=MT-RNR1 PE=1 SV=1
MRWQEMGYIFYPRKLR
>sp|A0A1B0GTW7|CIROP_HUMAN Ciliated left-right organizer metallopeptidase OS=Homo sapiens OX=9606 GN=CIROP PE=1 SV=1
MLLLLLLLLLLPPLVLRVAASRCLHDETQKSVSLLRPPFSQLPSKSRSSSLTLPSSRDPQ
PLRIQSCYLGDHISDGAWDPEGEGMRGGSRALAAVREATQRIQAVLAVQGPLLLSRDPAQ
YCHAVWGDPDSPNYHRCSLLNPGYKGESCLGAKIPDTHLRGYALWPEQGPPQLVQPDGPG
VQNTDFLLYVRVAHTSKCHQETVSLCCPGWSTAAQSQLTAALTSWAQRRGFVMLPRLCLK
LLGSSNLPTLASQSIRITGPSVIAYAACCQLDSEDRPLAGTIVYCAQHLTSPSLSHSDIV
MATLHELLHALGFSGQLFKKWRDCPSGFSVRENCSTRQLVTRQDEWGQLLLTTPAVSLSL
AKHLGVSGASLGVPLEEEEGLLSSHWEARLLQGSLMTATFDGAQRTRLDPITLAAFKDSG
WYQVNHSAAEELLWGQGSGPEFGLVTTCGTGSSDFFCTGSGLGCHYLHLDKGSCSSDPML
EGCRMYKPLANGSECWKKENGFPAGVDNPHGEIYHPQSRCFFANLTSQLLPGDKPRHPSL
TPHLKEAELMGRCYLHQCTGRGAYKVQVEGSPWVPCLPGKVIQIPGYYGLLFCPRGRLCQ
TNEDINAVTSPPVSLSTPDPLFQLSLELAGPPGHSLGKEQQEGLAEAVLEALASKGGTGR
CYFHGPSITTSLVFTVHMWKSPGCQGPSVATLHKALTLTLQKKPLEVYHGGANFTTQPSK
LLVTSDHNPSMTHLRLSMGLCLMLLILVGVMGTTAYQKRATLPVRPSASYHSPELHSTRV
PVRGIREV
...
>sp|A0PK11|CLRN2_HUMAN Clarin-2 OS=Homo sapiens OX=9606 GN=CLRN2 PE=1 SV=1
MPGWFKKAWYGLASLLSFSSFILIIVALVVPHWLSGKILCQTGVDLVNATDRELVKFIGD
IYYGLFRGCKVRQCGLGGRQSQFTIFPHLVKELNAGLHVMILLLLFLALALALVSMGFAI
LNMIQVPYRAVSGPGGICLWNVLAGGVVALAIASFVAAVKFHDLTERIANFQEKLFQFVV
VEEQYEESFWICVASASAHAANLVVVAISQIPLPEIKTKIEEATVTAEDILY
...
>sp|A1L190|SYCE3_HUMAN Synaptonemal complex central element protein 3 OS=Homo sapiens OX=9606 GN=SYCE3 PE=1 SV=1
MDDADPEERNYDNMLKMLSDLNKDLEKLLEEMEKISVQATWMAYDMVVMRTNPTLAESMR
RLEDAFVNCKEEMEKNWQELLHETKQRL
(Ellipses mine.) Pretty simple and straightforward: the entry name is preceded by a > and ends with a newline, then multiple lines of amino acids representing one full protein are listed. The proteins (and therefore, lines) are of varying sizes, some smaller than one fasta line, and some spanning quite a few, as one would expect: the first protein, which just looks like a tryptic peptide, is not a bug.
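The format described above can be parsed with a short generator. This is a sketch (the function name is my own, not part of any tool used in this post); it accepts any iterable of lines, such as an open file handle:

```python
def read_fasta(lines):
    """Yield (header, sequence) pairs from fasta-formatted lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)  # flush the previous record
            header, chunks = line[1:], []
        elif line:
            chunks.append(line)  # sequences span a variable number of lines
    if header is not None:
        yield header, "".join(chunks)  # flush the last record
```

Usage would look like `for header, protein in read_fasta(open("human_proteome.fasta")): ...` (the filename here is illustrative).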
As discussed in my previous post, MGF files tend to be a bit messier, with arbitrary field types and names, and varying formats for field values. Here's a sample from our Massive-KB mgf:
BEGIN IONS
PEPMASS=1221.531860351563
CHARGE=4
MSLEVEL=2
COLLISION_ENERGY=0.0
FILENAME=filtered_library_mgf_files/8158df9e16dc436bad91a659fc08c64c.mgf
SEQ=+229.163+42.011AAETQSLREQPEMEDANSEK+229.163SINEENGEVSEDQSQNK+229.163
PROTEIN=sp|Q13523|PRP4B_HUMAN
SCANS=1
SCAN=1
SCORE=1.0415726376102685
FDR=0.0
PROVENANCE_FILENAME=MSV000082644/ccms_peak/Fractionated_Phosphoproteome/TMTPlex04/BL20160823_FM_Medullo_Phosphoproteome_Plex04_Fxn01.mzML
PROVENANCE_SCAN=22878
PROVENANCE_DATASET_MSV=MSV000082644
PROVENANCE_DATASET_PXD=N/A
230.16993713378906 63768.40234375
244.09243774414062 10245.291015625
248.18014526367188 8222.767578125
...
END IONS
The
PROVENANCE_DATASET_PXD field refers to the
PRIDE proteomics data repository accession ID, if applicable; otherwise it holds N/A. The ProteomeTools ID is
PXD004732, so this is how we'll filter out data that comes from ProteomeTools. The analogous
MSV field is the Massive accession ID. Peaks (m/z - intensity pairs) are tab-separated, and the fields we care about are:
- PROVENANCE_DATASET_PXD for filtering
- PEPMASS, which is the precursor mass
- CHARGE, the precursor charge, unsurprisingly
- And the ground truth peptide sequence, SEQ, for evaluation
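A minimal reader for this flavor of MGF can be sketched as follows, assuming the key=value header lines and whitespace-separated peak pairs shown in the sample (the function name is mine):

```python
def read_mgf(lines):
    """Yield one dict per spectrum: header fields as strings, plus 'peaks'."""
    spec = None
    for line in lines:
        line = line.strip()
        if line == "BEGIN IONS":
            spec = {"peaks": []}
        elif line == "END IONS":
            yield spec
            spec = None
        elif spec is not None and line:
            if "=" in line:
                # Header field; partition keeps any '=' inside the value intact.
                key, _, value = line.partition("=")
                spec[key] = value
            else:
                # Peak line: m/z and intensity separated by whitespace.
                mz, intensity = line.split()
                spec["peaks"].append((float(mz), float(intensity)))
```

Filtering out ProteomeTools then reduces to dropping every spectrum whose PROVENANCE_DATASET_PXD field equals the ProteomeTools accession.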
As a reminder, the only reason we have a (confident) peptide identity is because the Massive-KB project uses a consensus-based method to assess the quality of the identification. Depending on the processing task involved or the experiment being examined, the
SCORE field may also be of use as a quality control step, although in usual scenarios this is only available (as is peptide identity) after a search, which should already be error controlled (more on that later). An alternative to the Massive-KB consensus strategy is to synthesize peptides based on
in-silico digestion products, as done in ProteomeTools.
On the other hand, Massive-KB is fairly problematic because its MGF contents do not properly describe the kind of spectrum they refer to, and the dataset was composed by combining many disparate experiments of varying types. Notably, it mixes quantification and identification experiments, and quantification experiments tend to chemically label peptides, which affects the spectrum. Some spectra in the collection are digested by trypsin, but not all. Not all experiments had carbamidomethyl cysteine, and the set of variable modifications present in each experiment differs widely. The only way to ascertain what's what is to use the dataset accession ID information from the mgf, look up the corresponding experiment, and take note of the parameters of interest to decide whether we would like to include a certain subset or not.
The Massive-KB project also offers a subset that is exclusively human and HCD-fragmented, with no synthetic peptides and no quantitative label tags or modifications, but we do not use it here for demonstration purposes (and because no available subset offers those same features together with ProteomeTools exclusion). In practice, that subset still includes a variety of modifications that are inconsistent between the combined experiments. In addition, the dataset does not, contrary to what is advertised, only contain tryptic peptides: for example, it includes the MSV000081607 subset, which contains LysC-digested HeLa cells. A hack we can make use of is to simply filter out all spectra that do not match a certain modification regime. Since, in our scenario, we're interested in preparing Massive-KB data as a development set for experiments related to ProteomeTools, we can take the subset which has C(CAM) as fixed modification (so we reject any spectrum with a sequence that has a C not followed by +57.021), and only oxidation of methionine as variable modification (so we accept all spectra with either M or M+15.995). The resulting file after ProteomeTools exclusion, modification selection, and a cut down to 10000 spectra is available
here.
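The modification regime above boils down to a couple of string checks on the SEQ field. A sketch, assuming the mass-tag notation shown in the mgf sample and carbamidomethyl's standard monoisotopic delta of +57.021 (keep_spectrum is my own hypothetical helper):

```python
import re

def keep_spectrum(seq: str) -> bool:
    """Accept only C(CAM)-fixed, M-oxidation-only modification regimes."""
    # Reject any cysteine missing the fixed carbamidomethyl tag.
    if re.search(r"C(?!\+57\.021)", seq):
        return False
    # Remove the allowed modifications; any remaining mass tag
    # (TMT labels, phospho, etc.) disqualifies the spectrum.
    stripped = seq.replace("C+57.021", "C").replace("M+15.995", "M")
    return "+" not in stripped
```

Applied to each spectrum's SEQ value, this yields the subset described above.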
Figure 2 - A spectrum from our dataset, and the corresponding peptide's b- and y-ions. Orange theoretical peaks are those that match within 100 ppm.
Visually examining the data by plotting is a useful way to get a feel for it. In Figure 2, the first spectrum in our Massive-KB mgf is shown (blue bars on top), while the corresponding peptide's theoretical spectrum (only the b- and y-ions) is displayed upside-down. This plot was realized with matplotlib, using a stem graph. Note the precursor mass, the charge, and the distribution of peaks relative to the precursor mass. It is not uncommon to clean a spectrum by removing all peaks beyond the exact mass (i.e. the precursor mass adjusted for a charge of 1), although I'm not a fan of most approaches that "destroy" the data like this. As shown in the figure, few of the theoretical peaks actually match between the experimental and theoretical (here I used a 100 ppm match window, i.e. peaks are considered a match if |peak mass - theoretical mass| / (peak mass) * 10^6 ≤ 100).
Figure 3 - Some basic statistics from Massive-KB before we cut it down to size. These can be useful for quality assessment, sanity checking, and overall understanding the data better. The outlier in the mass chart is almost certainly an incorrect charge value in the spectrum data which should be corrected before further processing.
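The 100 ppm matching criterion used in Figure 2 can be sketched as a small helper (the function name is mine; the relative error is computed against the peak mass, as in the formula above):

```python
def ppm_matches(exp_peaks, theo_masses, tol_ppm=100.0):
    """Return theoretical masses hit by at least one experimental peak
    within tol_ppm, with relative error taken against the peak mass."""
    matched = []
    for theo in theo_masses:
        if any(abs(mz - theo) / mz * 1e6 <= tol_ppm for mz, _ in exp_peaks):
            matched.append(theo)
    return matched
```

The matched masses are what would be highlighted in orange in a plot like Figure 2.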
Figure 4 - Rasterization of a mass spectrum.
The core search settings control two main parts of the Comet algorithm: theoretical spectrum generation and correlation settings. The former is fairly self-explanatory: select which ion series to generate, and whether to include some neutral losses. For the scoring function parameters, a quick explanation of the basic algorithm is required: Comet rasterizes the experimental and theoretical spectra onto fixed-width m/z bins (see Figure 4), then scores the pair with a cross-correlation (XCorr) from which the mean correlation over a range of mass offsets is subtracted, penalizing spurious background similarity.
Figure 6 - Representation of Percolator's operations. Percolator essentially tries to find a hyperplane that separates all decoys vs all high-scoring targets. It rescores the held-out fold using the SVM fitted on the other two folds, and uses a linear mapping function to adjust the scores between the 3 fitted SVMs such that they are comparable across folds.
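Theoretical b- and y-ion generation itself is simple. A minimal sketch for singly charged, unmodified peptides, using standard monoisotopic residue masses (only a subset of amino acids is included for brevity; the function name is mine, not Comet's):

```python
# Monoisotopic residue masses (Da); extend to all 20 amino acids as needed.
RESIDUE = {"G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
           "V": 99.06841, "L": 113.08406, "K": 128.09496, "E": 129.04259}
PROTON = 1.007276
WATER = 18.010565

def by_ions(peptide):
    """Singly charged b- and y-ion m/z values for an unmodified peptide."""
    masses = [RESIDUE[aa] for aa in peptide]
    b, running = [], PROTON
    for m in masses[:-1]:          # b-ions: prefixes, N-terminal fragments
        running += m
        b.append(running)
    y, running = [], PROTON + WATER
    for m in reversed(masses[1:]): # y-ions: suffixes, C-terminal fragments
        running += m
        y.append(running)
    return b, y
```

For a peptide of length n this yields n-1 b-ions and n-1 y-ions, which is the theoretical ladder plotted upside-down in Figure 2.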
Figure 7 - Peptide identification statistics with Comet, before rescoring. Total PSMs: 265045. Total peptides: 255911. Score bias can be seen when plotting against peptide length or charge: the Comet XCorr is well known to give higher scores to longer peptides and higher charges. Note also that TDA-FDR underestimates the FDP by a factor of more than 17x, demonstrating that TDA-FDR is not accurate here.
Thankfully, since we are using Massive-KB, we can also compute the analogous metric based on the true peptide identity as provided in the mgf. This metric is often referred to as the false discovery proportion (FDP). The process to compute it is the same as for the q-value on targets and decoys, except using peptide equality to ground truth as the dichotomy, rather than target vs decoy. This allows us to tell how accurate the FDR estimate is (remember, it's supposed to be as close to the FDP as possible) by computing the FDP at the score threshold selected for a fixed FDR value, as well as allowing us to compare identification rates under e.g. 1% FDR vs 1% FDP, and so forth. We will also be able to use this to compare how much Percolator improves identifications based on FDR vs FDP, i.e. to see if Percolator enriches for false hits as opposed to true hits.
Figure 8 - Identification results after rescoring by Percolator. Total PSMs: 26993. Total peptides: 25742. Percolator improves identification rates both under FDR and FDP cutoff metrics and displays different bias profiles than Comet (for example, it is biased toward charge 1 rather than toward higher charges). In this case, it does not seem that Percolator enriches for target false hits as opposed to true hits, since performance under fixed FDP or FDR improves similarly. The FDP underestimation by TDA-FDR hardly improves with Percolator.
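The promised "few lines of Python" for error control can be sketched like this. The function name is mine, and this uses the simplest decoys/targets FDR estimator (some implementations use (decoys+1)/targets):

```python
def qvalues(scores, is_target):
    """Q-values from PSM scores via the target-decoy approach (TDA).

    Walk the PSMs by descending score; at each threshold, estimate
    FDR as decoys/targets; the q-value is the minimum FDR over all
    thresholds at least as permissive as this one.
    """
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    targets = decoys = 0
    fdr = []
    for i in order:
        if is_target[i]:
            targets += 1
        else:
            decoys += 1
        fdr.append(decoys / max(targets, 1))
    # Monotonize from the most permissive threshold upward.
    q, best = [0.0] * len(fdr), float("inf")
    for rank in range(len(fdr) - 1, -1, -1):
        best = min(best, fdr[rank])
        q[rank] = best
    # Map q-values back to the original PSM order.
    out = [0.0] * len(scores)
    for rank, i in enumerate(order):
        out[i] = q[rank]
    return out
```

As described above, the same routine computes the FDP-based analogue by passing "PSM matches the ground-truth SEQ" in place of the target/decoy flag.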
Once done, we simply run the same error control schemes as before. That's it, we've identified us some peptides! We can plot the same kinds of statistics with our rescored results, as in Figure 8, and visually compare with those in Figure 7. Note that Percolator discards all but the top hit per spectrum, so we have about 1/10 as many PSMs and peptides left over. Here, the result plots show different score bias patterns before and after rescoring, a similar performance increase at the FDR and FDP levels, and no real improvement in false discovery misestimation after rescoring.