A Guide to Proteomics Based on Mass Spectrometry for Beginners
Genes are the basic units of inheritance, but they only come to life when they are translated into proteins. Proteins are the main participants in biological functions, involved in processes ranging from biochemical reactions and signal transduction to structural support. The proteome is the collection of all proteins present in biological fluids, cells, and tissues, and it reflects the functional state of biological systems. Proteomics is the qualitative and quantitative study of the proteome and is often used to compare differences between different cellular states. It is widely used in the biomedical field. For example, we can reveal the cellular pathways and proteins required for viral infection and replication by analyzing the differences in the proteome between virus-infected and uninfected cells. Drugs can then be developed against these proteins to slow down the infection process. Proteomics is especially suited to revealing potential biochemical mechanisms due to its ability to directly characterize all proteins at once. In this article, we focus on the systematic characterization of the proteome using mass spectrometry (MS), or more specifically, bottom-up proteomics (i.e., proteins are first digested into peptides and then analyzed by MS).
Basics of Mass Spectrometry
The mass spectrometer was invented in 1912, and after continuous development, its detection limit, speed, and applicability have greatly improved. The principle of the mass spectrometer is to use the basic characteristics of molecules (such as mass and net charge) to detect the presence and abundance of peptides (or other biomolecules, such as metabolites, lipids, and proteins). When peptides gain net charge (usually by gaining protons), they are called peptide ions.
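To make the link between mass, charge, and what the instrument actually measures concrete, the following minimal sketch computes the monoisotopic mass of a peptide and the mass-to-charge ratio (m/z) of its protonated ions. The function names and the example peptide are illustrative assumptions; the residue masses are standard monoisotopic values rounded to five decimals.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER = 18.01056    # H2O added for the free N- and C-termini
PROTON = 1.007276   # mass of one charge-carrying proton


def peptide_mass(sequence):
    """Neutral monoisotopic mass of an unmodified peptide."""
    return sum(RESIDUE_MASS[aa] for aa in sequence) + WATER


def mz(sequence, charge):
    """m/z of the peptide ion carrying `charge` extra protons."""
    return (peptide_mass(sequence) + charge * PROTON) / charge


if __name__ == "__main__":
    # Hypothetical example peptide; the same molecule appears at a
    # different m/z for every charge state.
    for z in (1, 2, 3):
        print(z, round(mz("PEPTIDE", z), 4))
```

The same peptide therefore shows up at several m/z values, one per charge state, which is why the charge has to be determined before a mass can be assigned.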
All mass spectrometers have three basic components: an ion source, a mass analyzer, and a detector. Since mass spectrometers can only analyze gaseous ions, methods such as electrospray ionization (ESI) are used to convert peptides from the liquid phase to gaseous ions. The liquid containing the peptides is pumped through a micrometer-sized emitter orifice held at high voltage (2-4 kilovolts). At the emitter tip, the stable liquid flow breaks up into extremely small, highly charged, rapidly evaporating droplets, leaving peptide ions in the gas phase. The ESI signal reflects the concentration of the peptides in the sprayed liquid rather than their total amount, so using the lowest practical flow rate improves detection sensitivity. In proteomics, peptide mixtures are therefore usually separated by high-performance liquid chromatography (HPLC) at flow rates of a few hundred nanoliters per minute, far lower than those of conventional HPLC.
The main function of the mass analyzer is to separate ions based on their mass-to-charge ratio (m/z). Fundamentally, all analyzers separate ions by manipulating their trajectories in electric fields, but the principles they use differ, and this determines what each type is best suited for. In proteomics, quadrupole mass analyzers are common and are often combined with time-of-flight (TOF) or Orbitrap analyzers. A quadrupole applies an oscillating electric field between four parallel cylindrical rods, with each pair of opposing rods generating a radio-frequency field that is phase-shifted relative to the other pair. Together these fields shape a pseudopotential surface that can be configured either to let all ions pass or to transmit only ions within a specific m/z window, thereby isolating the ions of interest.
TOF mass analyzers accelerate ions through a potential of about 20 kilovolts and separate them by the time they take to reach the detector: lighter ions fly faster and arrive earlier. TOF instruments can resolve arrival-time differences far below a microsecond and thus measure mass differences at the parts per million (ppm) level. In contrast, Orbitrap mass analyzers distinguish ions by their oscillation frequency. Ions are injected tangentially and trapped in the Orbitrap, where they orbit the central metal spindle while oscillating back and forth along its long axis. Although the Orbitrap is only a few centimeters long, ions travel several kilometers in it at high speed, which yields very high resolution (typically in the tens of thousands or more) and mass accuracy down to the ppm level.
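The time-of-flight principle follows from energy conservation: an ion of charge z accelerated through a potential U gains kinetic energy z·e·U, so its flight time over a field-free path grows with the square root of m/z. The sketch below works this through. The 1 m flight path is an assumed round number, not a specification of any real instrument, and the function names are illustrative.

```python
import math

DALTON_KG = 1.66053906660e-27        # kg per dalton
ELEMENTARY_CHARGE = 1.602176634e-19  # coulombs


def flight_time(mz, accel_voltage=20_000.0, path_length=1.0):
    """Flight time in seconds for an ion of the given m/z.

    From z*e*U = 1/2 * m * v^2 it follows that
    v = sqrt(2 * e * U / (m/z)), with m/z converted to kg per charge.
    path_length (metres) is an assumed value.
    """
    velocity = math.sqrt(
        2 * ELEMENTARY_CHARGE * accel_voltage / (mz * DALTON_KG)
    )
    return path_length / velocity


def ppm_difference(mz_a, mz_b):
    """Relative difference between two m/z values in parts per million."""
    return (mz_b - mz_a) / mz_a * 1e6


if __name__ == "__main__":
    a, b = 1000.0, 1000.01  # two ions about 10 ppm apart
    dt = flight_time(b) - flight_time(a)
    print(f"flight time at m/z {a}: {flight_time(a) * 1e6:.2f} microseconds")
    print(f"{ppm_difference(a, b):.1f} ppm apart -> {dt * 1e12:.0f} picoseconds")
```

Under these assumptions the whole flight takes tens of microseconds, while ions a few ppm apart arrive only a tiny fraction of a microsecond after one another, which is why fast timing electronics are essential.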
In proteomics instruments, a "collision cell", a multipole device dedicated to ion fragmentation, usually follows the quadrupole. Either intact peptide ions or their fragment ions then enter the final stage containing the detector. The resulting spectra are called MS1 or precursor ion spectra in the former case, and MS2, MS/MS, or product ion spectra in the latter. TOF instruments use microchannel plate (MCP) detectors to register ions: whenever an ion strikes the surface, electrons are released and amplified, so that even individual ions can be counted. This very high sensitivity comes with a drawback, however: at high signal levels the detector may saturate because too many ions arrive at once. In contrast, Orbitrap analyzers measure the "image current" induced by the rapidly oscillating ions, whose amplitude reflects the number of ions in each ion packet. The current is recorded in the time domain and converted to the frequency domain by Fourier transform. Although continuing advances in signal processing have multiplied the resolution achievable within a given transient length, Orbitrap acquisition remains far slower than TOF analysis: a single TOF pulse takes only about 100 microseconds, whereas an Orbitrap scan takes several tens to hundreds of milliseconds.
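The step from a time-domain image current to a frequency spectrum can be illustrated with a toy transient. The sketch below superimposes two cosine signals, standing in for two ion packets, and recovers their frequencies and relative amplitudes with a plain discrete Fourier transform. The frequencies, sampling rate, and function names are assumptions chosen for readability; real instruments record far higher frequencies and use optimized FFT-based processing.

```python
import cmath
import math


def transient(freqs_hz, amplitudes, n_samples, sample_rate_hz):
    """Simulated image current: a sum of cosines, one per ion packet."""
    return [
        sum(a * math.cos(2 * math.pi * f * k / sample_rate_hz)
            for f, a in zip(freqs_hz, amplitudes))
        for k in range(n_samples)
    ]


def magnitude_spectrum(signal):
    """Naive discrete Fourier transform (O(n^2)), magnitudes only.

    Returns one value per frequency bin up to the Nyquist frequency.
    """
    n = len(signal)
    spectrum = []
    for b in range(n // 2):
        acc = sum(x * cmath.exp(-2j * math.pi * b * k / n)
                  for k, x in enumerate(signal))
        spectrum.append(abs(acc) / n)
    return spectrum


if __name__ == "__main__":
    rate, n = 1024.0, 256          # toy values: bin width = rate / n
    signal = transient([100.0, 200.0], [1.0, 0.5], n, rate)
    spectrum = magnitude_spectrum(signal)
    top = sorted(range(len(spectrum)), key=spectrum.__getitem__)[-2:]
    for b in sorted(top):
        print(f"peak at {b * rate / n:.0f} Hz, magnitude {spectrum[b]:.2f}")
```

A longer transient gives narrower frequency bins and hence higher resolution, which is the trade-off between resolution and scan time described above.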
How does a mass spectrometer sequence or identify peptides? First, a quadrupole or another ion-selection device isolates precursor ions with a specific mass-to-charge ratio (m/z). These ions then collide with an inert gas (such as N2, He, or Ar) in the collision cell and fragment. The ions break preferentially at their weakest bonds, typically the amide (peptide) bonds connecting the amino acid residues. As a result, the MS/MS spectrum contains ladders of peaks in which the spacing between neighboring peaks corresponds to the mass of an amino acid residue. These ladders are highly specific and are the key to peptide identification. A short stretch of sequence read from the ladder, together with the masses flanking it on either side (a peptide sequence tag), is enough to identify a specific peptide in the human proteome. In practice, database searching is more common: theoretical fragment spectra are generated for all candidate peptides in a sequence database and compared with the experimental spectra using statistical scoring, which yields confident peptide identifications.
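The peak ladder can be reproduced with a few lines of code. The sketch below computes singly charged N-terminal (b) and C-terminal (y) fragment masses for an unmodified peptide and then reads residues back from the gaps between neighboring peaks. Function names, the example peptide, and the 0.01 Da tolerance are illustrative assumptions; real spectra also contain noise, missing peaks, and other ion types.

```python
# Monoisotopic residue masses (Da) of the 20 standard amino acids.
RESIDUE_MASS = {
    "G": 57.02146, "A": 71.03711, "S": 87.03203, "P": 97.05276,
    "V": 99.06841, "T": 101.04768, "C": 103.00919, "L": 113.08406,
    "I": 113.08406, "N": 114.04293, "D": 115.02694, "Q": 128.05858,
    "K": 128.09496, "E": 129.04259, "M": 131.04049, "H": 137.05891,
    "F": 147.06841, "R": 156.10111, "Y": 163.06333, "W": 186.07931,
}
WATER, PROTON = 18.01056, 1.007276


def b_ions(sequence):
    """Singly charged b ions: N-terminal prefixes of growing length."""
    masses, total = [], 0.0
    for aa in sequence[:-1]:
        total += RESIDUE_MASS[aa]
        masses.append(total + PROTON)
    return masses


def y_ions(sequence):
    """Singly charged y ions: C-terminal suffixes of growing length."""
    masses, total = [], 0.0
    for aa in reversed(sequence[1:]):
        total += RESIDUE_MASS[aa]
        masses.append(total + WATER + PROTON)
    return masses


def read_ladder(peaks, tol=0.01):
    """Translate gaps between consecutive peaks into residue letters.

    Isobaric residues (leucine and isoleucine) cannot be told apart by
    mass and are reported together; unexplained gaps are marked '?'.
    """
    residues = []
    for low, high in zip(peaks, peaks[1:]):
        gap = high - low
        matches = sorted(aa for aa, m in RESIDUE_MASS.items()
                         if abs(m - gap) <= tol)
        residues.append("/".join(matches) if matches else "?")
    return residues


if __name__ == "__main__":
    peptide = "PEPTIDE"  # hypothetical example
    print([round(m, 4) for m in b_ions(peptide)])
    print(read_ladder(b_ions(peptide)))
```

Reading the b-ion ladder recovers the sequence from the second residue onward, while the y-ion ladder gives the same information in the reverse direction; search engines score how well such theoretical ladders match each experimental spectrum.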
Chromatographic retention times are important information for matching datasets with previous measurements and are key to "targeted proteomics" techniques. In addition, ion mobility analysis, as another dimension of peptide ion separation, has been widely used in recent years. Ions can be filtered by their cross-sections (FAIMS, Field Asymmetric Ion Mobility Spectrometry), or they can be physically separated during the analysis process (T-Wave or TIMS, Trapped Ion Mobility Spectrometry). TIMS is the basis of Parallel Accumulation-Serial Fragmentation (PASEF) technology, which increases the sequencing speed by 10 times while improving sensitivity.
Sample Preparation and Specific Enrichment
MS-based proteomics can analyze the protein content of virtually any sample. In addition to primary samples such as cells, it can analyze formalin-fixed paraffin-embedded (FFPE) biopsy tissues and even fossils that are hundreds of thousands of years old, because proteins are much more stable than RNA. Depending on the purpose of the experiment, proteins are typically extracted after a suitable biochemical enrichment step, such as subcellular fractionation on gradients, affinity enrichment, or proximity labeling.
Proteomic sample preparation requires both practical skill and scientific rigor. Its end point is the digestion of proteins into peptides. The enzyme most commonly used is trypsin, which cleaves specifically on the C-terminal side of arginine and lysine residues. As a result, each tryptic peptide ends in a basic residue that carries a positive charge at its C-terminus, which benefits both ionization and fragmentation. Polymers and detergents must be avoided throughout sample preparation because they interfere with peptide ionization. At the end of sample preparation, tens of thousands of proteins have been converted into hundreds of thousands of purified peptides whose concentrations may differ by a factor of a million or more.
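The cleavage rule of trypsin is simple enough to simulate. The following sketch performs an in-silico digest, cutting after every lysine (K) or arginine (R). It also applies the commonly used convention that trypsin does not cut when the next residue is proline and optionally allows missed cleavages; both are assumptions added here for illustration, as are the function name and the example sequence.

```python
def trypsin_digest(protein, missed_cleavages=0):
    """In-silico tryptic digest of a protein sequence.

    Cuts C-terminal to K or R unless the next residue is P (a widely
    used convention, assumed here). With missed_cleavages > 0, peptides
    spanning up to that many uncut sites are returned as well.
    """
    # Step 1: split the sequence at every cleavage site.
    fragments, start = [], 0
    for i, aa in enumerate(protein):
        next_aa = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and next_aa != "P":
            fragments.append(protein[start:i + 1])
            start = i + 1
    if start < len(protein):
        fragments.append(protein[start:])

    # Step 2: join neighboring fragments to model missed cleavages.
    peptides = []
    for i in range(len(fragments)):
        last = min(i + missed_cleavages, len(fragments) - 1)
        for j in range(i, last + 1):
            peptides.append("".join(fragments[i:j + 1]))
    return peptides


if __name__ == "__main__":
    sequence = "MKWVTFRPLLKAA"  # hypothetical toy sequence
    print(trypsin_digest(sequence))
    print(trypsin_digest(sequence, missed_cleavages=1))
```

Search engines run this kind of digest over every protein in a database to build the list of candidate peptides against which spectra are matched.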
Monitoring Post-Translational Modifications
The amino acid sequence of a protein often carries chemical modifications. These post-translational modifications (PTMs) are an efficient and subtle regulatory mechanism that can markedly alter protein activity and even function. PTMs are usually substoichiometric, i.e., only a small fraction of the copies of a given protein carry the modification at any one site, so capturing and detecting them is challenging. Most strategies enrich modified peptides using either antibodies against the PTM or the distinctive chemical properties of the modifying group. For phosphorylation, the most studied PTM, titanium dioxide-based beads are often used for highly specific enrichment of phosphopeptides. With MS-based proteomics, more than 10,000 modification sites can be mapped at single-amino-acid resolution in just 2 hours, revealing extensive cellular signaling networks, which was difficult to achieve before. Proteomics has likewise become a routine tool for uncovering the roles of ubiquitination, SUMOylation, acetylation, and glycosylation in biological processes. Less common PTMs, however, especially those for which no highly specific antibodies exist, remain difficult to analyze.
Data Acquisition and Quantification Strategies
At any given moment during data collection, hundreds to thousands of peptides are ionized and enter the mass spectrometer simultaneously. Traditionally, these peptides were analyzed by data-dependent acquisition (DDA), in which the instrument selects individual precursor ions for MS/MS according to user-defined rules (based on, for example, mass-to-charge ratio, charge, intensity, and cross-section). Because there are far more peptides than there is time to analyze them, this selection is partly stochastic, and peptides that are not selected in a given run appear as missing values. Data-independent acquisition (DIA) takes a different approach. Here the quadrupole cycles continuously through the entire mass range in relatively wide isolation windows (20-40 m/z), and all ions in each window are fragmented and detected together, so that the ions in the sample are sampled without gaps or bias. This, however, makes the MS/MS spectra very complex. Modern software deconvolutes them by comparison with previously acquired "peptide libraries" to identify the many peptides in each spectrum, and increasingly it can do so without any library. New scanning modes are still being developed to address the "dynamic range problem": how to detect proteins of extremely low abundance in the presence of highly abundant ones. In blood, for example, the abundances of cytokines and albumin may differ by as much as 12 orders of magnitude, which makes the low-abundance proteins difficult to detect.
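The difference between the two acquisition strategies can be sketched in a few lines. In the example below, DDA is reduced to picking the N most intense precursors from a survey scan, while DIA tiles the mass range with fixed windows and assigns every precursor to one of them. The mass range, the 25 m/z window width, the precursor list, and all function names are assumptions for illustration; real methods add refinements such as dynamic exclusion, overlapping windows, or variable window widths.

```python
def dda_top_n(precursors, n):
    """DDA, simplified: keep the n most intense (mz, intensity) pairs."""
    return sorted(precursors, key=lambda p: p[1], reverse=True)[:n]


def dia_windows(mz_start, mz_end, width):
    """Tile [mz_start, mz_end) with contiguous isolation windows."""
    windows, low = [], mz_start
    while low < mz_end:
        windows.append((low, min(low + width, mz_end)))
        low += width
    return windows


def window_for(mz, windows):
    """Index of the window that isolates this m/z, or None if outside."""
    for index, (low, high) in enumerate(windows):
        if low <= mz < high:
            return index
    return None


if __name__ == "__main__":
    # Hypothetical survey scan: (m/z, intensity) of co-eluting precursors.
    scan = [(412.3, 9e5), (455.8, 2e4), (650.0, 4e6),
            (661.2, 7e3), (803.9, 1e5)]

    selected = dda_top_n(scan, n=2)
    print("DDA fragments:", selected)
    print("DDA skips:   ", [p for p in scan if p not in selected])

    windows = dia_windows(400.0, 1000.0, 25.0)
    for mz, _ in scan:
        print(f"DIA: m/z {mz} falls into window {window_for(mz, windows)}")
```

In the toy scan, two precursors land in the same DIA window and are therefore fragmented together, which is exactly what makes DIA spectra comprehensive but complex.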
Peptide quantification falls into two main categories: label-free and label-based. In label-free quantification (LFQ), the signal of each peptide is extracted from the raw data (usually at the MS1 level), normalized, and compared across the proteome states of interest. This approach is intuitive and economical and leaves great flexibility in project design. However, it has a relatively high quantitative variance, and without care, differences in sample purity and instrument performance between runs can distort the comparison of individual samples and thus the accuracy of the results.
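Normalization is what makes separately measured LFQ runs comparable. One simple and widely used option, shown here as an assumption rather than as the method of any particular software, is to log-transform the intensities and shift each sample so that all samples share the same median. The data layout and function name are illustrative.

```python
import math
from statistics import median


def median_normalize(samples):
    """Median-center log2 peptide intensities across samples.

    `samples` maps sample name -> {peptide: raw intensity}. Each sample
    is shifted so that its median log2 intensity equals the median of
    all sample medians. Non-positive intensities are dropped.
    """
    logged = {
        name: {pep: math.log2(v) for pep, v in peptides.items() if v > 0}
        for name, peptides in samples.items()
    }
    sample_medians = {name: median(vals.values())
                      for name, vals in logged.items()}
    target = median(sample_medians.values())
    return {
        name: {pep: v - sample_medians[name] + target
               for pep, v in vals.items()}
        for name, vals in logged.items()
    }


if __name__ == "__main__":
    # Hypothetical runs: run_b was loaded with twice as much material.
    raw = {
        "run_a": {"pep1": 100.0, "pep2": 400.0, "pep3": 1600.0},
        "run_b": {"pep1": 200.0, "pep2": 800.0, "pep3": 3200.0},
    }
    for name, values in median_normalize(raw).items():
        print(name, {p: round(v, 3) for p, v in values.items()})
```

After normalization the two runs report identical values, so the twofold loading difference no longer masquerades as a biological change. The approach assumes that most peptides do not change between conditions.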
In label-based methods, stable isotopes are used to mark the proteomes of the different conditions. The advantage is that the differently labeled forms of a peptide have the same physicochemical properties but differ predictably in mass. Isotopes can be introduced metabolically during cell growth or attached chemically to the peptides. A widely used chemical approach is isobaric labeling: the intact labels all have the same mass, but their isotopes are distributed differently within the tag, so that fragmentation releases reporter ions of distinct masses that distinguish the samples. Because 6 to 16 differently labeled samples are combined and measured together, the quantitative variance is usually lower than in LFQ, provided the samples are labeled and pooled consistently and reproducibly. However, isobaric methods such as TMT (tandem mass tag) also have limitations: peptides that are co-isolated and fragmented together with the target contribute to the same reporter ions and dampen the measured differences, a phenomenon known as "ratio compression" that reduces quantitative accuracy.
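Ratio compression is easy to see with a small calculation. In the sketch below, a target peptide has a true 4:1 ratio between two reporter channels, while a co-isolated background peptide contributes equally to both. All intensities and the function name are made-up values for illustration.

```python
def observed_reporters(target, interference):
    """Reporter intensities actually measured, channel by channel.

    Reporter ions from the target peptide and from any co-isolated,
    co-fragmented peptide are indistinguishable, so they add up.
    """
    return [t + i for t, i in zip(target, interference)]


def ratio(reporters, channel_a=0, channel_b=1):
    """Intensity ratio between two reporter channels."""
    return reporters[channel_a] / reporters[channel_b]


if __name__ == "__main__":
    target = [400.0, 100.0]        # true ratio 4:1 between two samples
    background = [100.0, 100.0]    # unchanged co-isolated peptide, 1:1

    print("true ratio:    ", ratio(target))
    print("observed ratio:", ratio(observed_reporters(target, background)))
```

The measured ratio is pulled from 4 toward 1, and the effect grows with the share of signal that comes from co-isolated peptides. Narrower isolation windows or additional separation steps reduce that share.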
Regardless of the quantification strategy and scanning mode used, the output of the mass spectrometer always consists of MS1 and MS/MS spectra. Numerous software programs have been developed to process these data. They first locate the peptide signals ("feature detection") and then use search engines to match the MS/MS spectra to peptide sequences in a database. Next, the software assembles the identified peptides into proteins, which requires solving the "protein inference problem": many peptides occur in more than one protein, so it is not always clear which proteins were actually present. Finally, quantification is performed at the peptide or protein level.
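One simple way to approach protein inference, used here purely as an illustrative assumption, is parsimony: report the smallest set of proteins that explains all identified peptides. The greedy sketch below repeatedly picks the protein that accounts for the most still-unexplained peptides. Real software uses more elaborate and often probabilistic models, and the peptide and protein names are hypothetical.

```python
def infer_proteins(peptide_to_proteins):
    """Greedy parsimony: a small protein set explaining all peptides.

    `peptide_to_proteins` maps each identified peptide to the proteins
    whose sequences contain it. Ties are broken alphabetically.
    """
    # Invert the mapping: protein -> peptides it could explain.
    protein_to_peptides = {}
    for peptide, proteins in peptide_to_proteins.items():
        for protein in proteins:
            protein_to_peptides.setdefault(protein, set()).add(peptide)

    unexplained = set(peptide_to_proteins)
    chosen = []
    while unexplained:
        best = max(
            sorted(protein_to_peptides),
            key=lambda pr: len(protein_to_peptides[pr] & unexplained),
            default=None,
        )
        if best is None or not protein_to_peptides[best] & unexplained:
            break  # remaining peptides map to no protein at all
        chosen.append(best)
        unexplained -= protein_to_peptides[best]
    return chosen


if __name__ == "__main__":
    # Hypothetical identifications; two peptides are shared.
    evidence = {
        "AAK": ["P1"],
        "BBK": ["P1", "P2"],
        "CCK": ["P2"],
        "DDK": ["P2", "P3"],
    }
    print(infer_proteins(evidence))
```

In the toy example, P3 is not reported because its only peptide is already explained by P2, which illustrates how shared peptides make the presence of some proteins undecidable from the data alone.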
In simple terms, the output is a matrix listing the proteins and their abundances in each sample, after the identifications have been filtered at a false discovery rate (FDR) threshold. Researchers increasingly go beyond this basic analysis by applying standard or proteomics-specific bioinformatics workflows (including machine learning) and by combining proteomics data with other types of omics data, such as those from the various next-generation sequencing (NGS) methods.
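FDR filtering is commonly implemented with a target-decoy strategy: spectra are searched against real ("target") sequences and against reversed or shuffled ("decoy") sequences, and the number of decoy hits above a score cutoff estimates the number of false target hits. The text does not specify which method is used, so the sketch below is an assumption showing the basic idea, with made-up scores and illustrative names.

```python
def filter_by_fdr(matches, threshold=0.01):
    """Keep target matches up to the lowest score that meets the FDR.

    `matches` is a list of (score, is_decoy) pairs, higher score being
    better. The FDR at each rank is estimated as decoys / targets among
    all matches scoring at least as well.
    """
    ranked = sorted(matches, key=lambda m: m[0], reverse=True)
    targets = decoys = 0
    cutoff = 0  # number of top-ranked matches to accept
    for rank, (_, is_decoy) in enumerate(ranked, start=1):
        if is_decoy:
            decoys += 1
        else:
            targets += 1
        if targets and decoys / targets <= threshold:
            cutoff = rank
    return [m for m in ranked[:cutoff] if not m[1]]


if __name__ == "__main__":
    # Hypothetical search results: decoys tend to score low.
    results = [(10, False), (9, False), (8, False), (7, False),
               (6, False), (5, False), (4, True), (3, False),
               (2, True), (1, False)]
    accepted = filter_by_fdr(results, threshold=0.2)
    print(len(accepted), "targets accepted:", [s for s, _ in accepted])
```

A typical threshold in practice is 1%, applied to peptide identifications and often again at the protein level; the loose 20% in the example is only there to make the small toy dataset informative.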
Multidimensional Readout of Functional Cell States
Mass spectrometry has reached a level of maturity that supports many cutting-edge applications, including proteome identification and quantification, the mapping of protein-protein interactions (interactomics), organelle proteomics, and the detection of post-translational modifications. The technology is now widely used in biomedicine, particularly for identifying biomarkers. Although MS-based proteomics workflows can be more complicated than antibody-based methods, their outstanding specificity and proteome-wide coverage more than make up for this.
Proteomics plays a crucial role in biological research because it bridges the gap between genotype and phenotype. A genetic abnormality does not necessarily affect cell function directly. Proteomics can assess the actual impact of such genomic alterations on proteins and their function, and thereby provide more specific biomarkers or new therapeutic targets for disease subtypes.
In recent years, marked improvements in the sensitivity of mass spectrometry have opened the door to single-cell proteomics. This approach allows individual cells to be studied in depth while retaining spatial information about their environment. Because proteins are more abundant than mRNA, single-cell measurements at the protein level can be more robust and reliable. MS-based single-cell proteomics can directly reveal dynamic processes between cells (such as receptor-ligand interactions between cells and their microenvironment), offering a new perspective on the complexity of cell communication and cell behavior.