ProteomicsML: An Online Platform for Community-Curated Data Sets and Tutorials for Machine Learning in Proteomics

Authors

Tobias G. Rehfeldt*

Ralf Gabriels*

Robbin Bouwmeester*

Siegfried Gessulat

Benjamin A. Neely

Magnus Palmblad

Yasset Perez-Riverol

Tobias Schmidt

Juan Antonio Vizcaíno§

Eric W. Deutsch§

Published

September 30, 2022

Abstract
Data set acquisition and curation are often the most difficult and time-consuming parts of a machine learning endeavor. This is especially true for proteomics-based liquid chromatography (LC) coupled to mass spectrometry (MS) data sets, due to the high levels of data reduction that occur between raw data and machine learning-ready data. Because predictive proteomics is an emerging field, labs that predict peptide behavior in LC-MS setups often use unique and complex data processing pipelines to maximize performance, at the cost of accessibility and reproducibility. For this reason we introduce ProteomicsML, an online resource for proteomics-based data sets and tutorials across most of the currently explored physicochemical peptide properties. This community-driven resource makes it simple to access data in easy-to-process formats, and contains easy-to-follow tutorials that allow new users to interact with even the most advanced algorithms in the field. ProteomicsML provides data sets that are useful for comparing state-of-the-art machine learning algorithms, as well as introductory material for teachers and newcomers to the field alike. The platform is freely available at proteomicsml.org, and we welcome the entire proteomics community to contribute to the project at github.com/proteomicsml.

Published in the Third Special Issue on Software Tools and Resources of Journal of Proteome Research.

ProteomicsML: An Online Platform for Community-Curated Data Sets and Tutorials for Machine Learning in Proteomics.
Tobias G. Rehfeldt*, Ralf Gabriels*, Robbin Bouwmeester*, Siegfried Gessulat, Benjamin A. Neely, Magnus Palmblad, Yasset Perez-Riverol, Tobias Schmidt, Juan Antonio Vizcaíno§, and Eric W. Deutsch§.
J. Proteome Res. 2023, 22, 2, 632–636. doi:10.1021/acs.jproteome.2c00629.

Introduction

Computational predictions of analyte behavior in the context of mass spectrometry (MS) data have been explored for nearly four decades, with early rudimentary predictions dating back to 1983. (Heijne 1983) With the rise of technology and computational power, machine learning (ML) approaches were introduced into the field of proteomics in 1998 (Nielsen, Brunak, and Heijne 1999) and ML-based models quickly overtook human accuracy. Since then, dozens of articles have described efforts to train models for a multitude of physicochemical properties associated with the field of high-throughput proteomics, as reviewed by Neely et al. (Neely et al. 2023) Some of the most commonly studied properties are retention time and fragmentation spectrum intensities, while a large range of less explored properties exists as well. For an exhaustive review of the current undertakings, see Wen et al. and Bouwmeester et al. (Wen et al. 2020; Bouwmeester et al. 2020) While many of these efforts are still in the realm of basic exploratory research, ML approaches are increasingly being incorporated into mainstream tools and standalone predictive resources. (Wen et al. 2020; Gessulat et al. 2019; Bouwmeester et al. 2021; Meyer 2021)

When training any ML model, it is crucial to obtain suitable training and evaluation data sets. In many fields of research where ML is applied, it is common to have a range of educational data sets, such as the MNIST (Modified National Institute of Standards and Technology) (Deng 2012) or Iris data sets, allowing newcomers to the field to easily learn common ML methodologies. Likewise, state-of-the-art models can use benchmark data sets such as ImageNet or those available on the UCI Machine Learning Repository to compare their predictive capabilities. In the spirit of widely used entry-level data sets, such as the Titanic survival data set, which has been modeled more than 54 000 times (kaggle.com/competitions/titanic), we seek to define proteomics data sets that can provide an entry point for ML modeling.

Although there have been numerous efforts to explore the predictive capabilities of models, there are barriers that limit widespread adoption in the field of predictive proteomics. First, there are considerable difficulties in accessing data sets in a suitable form for ML applications. A substantial effort is required to prepare raw proteomics data sets into a format usable for ML, as this demands extensive knowledge of the multitude of proteomics file formats and postprocessing methods. MS data also has a tendency to be fraught with missing metadata, making it challenging to compare across data sets. Furthermore, most ML frameworks in proteomics implement dedicated postprocessing pipelines to prepare the files for ML algorithms. Recently, tools such as ppx (Fondrie, Bittremieux, and Noble 2021) and MS2AI (Rehfeldt et al. 2021) were created to facilitate this process, but they are still limited to certain use cases due to the complex nature of liquid chromatography coupled to mass spectrometry (LC-MS) data.

Second, while some ML-ready data sets are available on platforms such as Kaggle (Kaggle.com, n.d.) or in supplementary tables of publications, they are often difficult to find and lack long-term maintenance and support after publication. While there is no formal consensus in the field, certain data sets, such as ProteomeTools, are often used for training. (Zolg et al. 2017) Nevertheless, there are no widely adopted data sets for comparing the performance of tools developed by different researchers, making it difficult for new algorithms to be evaluated and compared to older tools. This issue is further exacerbated by individual groups relying on different pre- and postprocessing protocols, such as differences in normalization of measurements or in the implementation of model performance metrics.
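As a small illustration of why such protocol differences matter, the sketch below applies two common normalizations to one invented fragment intensity vector and scores the same prediction under each. The intensity values, the function names, and the choice of base-peak versus unit-length scaling are assumptions made for illustration only; they are not the protocols of any particular tool.

```python
import math


def base_peak_normalize(intensities):
    """Scale intensities so that the most intense peak equals 1."""
    top = max(intensities)
    return [x / top for x in intensities]


def unit_length_normalize(intensities):
    """Scale intensities so that the vector has Euclidean length 1."""
    norm = math.sqrt(sum(x * x for x in intensities))
    return [x / norm for x in intensities]


def mean_absolute_error(observed, predicted):
    """Average absolute difference between paired values."""
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / len(observed)


# Invented example values (arbitrary units), not taken from any data set.
observed = [200.0, 50.0, 1000.0, 400.0]
predicted = [250.0, 40.0, 900.0, 500.0]

mae_base_peak = mean_absolute_error(
    base_peak_normalize(observed), base_peak_normalize(predicted)
)
mae_unit_length = mean_absolute_error(
    unit_length_normalize(observed), unit_length_normalize(predicted)
)

print(f"MAE after base-peak scaling:   {mae_base_peak:.4f}")
print(f"MAE after unit-length scaling: {mae_unit_length:.4f}")
```

The same pair of spectra yields two different error values, so scores reported by tools that normalize differently are not directly comparable unless the protocol is stated alongside the metric.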

As an outcome of the 2022 Lorentz Center Workshop on Proteomics and Machine Learning (Leiden, The Netherlands, March 2022), we have created a web platform to facilitate the application of ML approaches to the field of MS-based proteomics. The resource is intended to provide a central focal point for curating and disseminating data sets that are ready to use for ML research, and to encourage new entrants into the field through expert-driven tutorials. Here we describe how ProteomicsML has been developed using commonly available tools and designed for future ease of maintenance. We provide a brief overview of the data sets that are currently available at ProteomicsML and how it can be expanded in the future with more data. We also describe the initial set of tutorials that can be used as an introduction to the field of ML in proteomics.

The ProteomicsML Platform

The primary entry point for the resource is the ProteomicsML Web site (www.proteomicsml.org). It contains general introductory data sets that are already preprocessed and ready for training or evaluation, as well as educational resources in the form of tutorials for those new to ML in proteomics. The code base for the Web site is hosted in a GitHub repository, and is therefore easy to maintain and amenable to outside contributions from the community. On the GitHub repository, researchers can open pull requests (proposals for adding or changing information) for new data sets or tutorials. These pull requests are then reviewed by the maintainers, currently the authors of this paper, in line with the guidelines in the contributing section of the ProteomicsML Web site. Data sets and tutorials hosted as part of the GitHub repository fall under the CC BY 4.0 license, as indicated on both the repository and the Web site. The PRIDE database infrastructure (Perez-Riverol et al. 2022) is also used to store larger data sets on an FTP server dedicated to ProteomicsML.

A key goal of ProteomicsML is to advance with the field, which is why we provide a platform with detailed documentation, including a contributing guide on how to upload data sets and tutorials for specific ML workflows or algorithms. After curation by the maintainers, the contributions have to pass a build test in order to maintain integrity of the platform, and, if passed, are automatically published on the Web site and are freely accessible to other researchers.

For many LC-MS properties, such as retention time and fragmentation intensity, well-performing ML models have already been published. We aim to provide suitable data sets and tutorials to easily reproduce these results in an educational fashion. All data sets on the platform are organized by data type, and should ideally be provided in a simple data format that is suitable for direct import into ML toolkits. Each data type can contain one or more data sets for different purposes, and each data set should be sufficiently annotated with metadata (e.g., its origin, how it was processed, and the relevant literature citations). Along with well-annotated data sets, the platform provides users with in-depth tutorials on how to download, import, handle, and train various ML models. Many of the LC-MS data types require certain, sometimes complex, preprocessing steps in order to be fully compatible with ML frameworks. For this reason, we believe it is crucial to provide guidelines on these processes to ultimately lower the entry barriers for new users to the field. Tutorials on ProteomicsML can be attribute- or data set-specific, allowing new tutorial submissions to focus on either the direct interactions with specific ML models or methodologies, or on a certain aspect of data preprocessing.
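To make the idea of a format "suitable for direct import into ML toolkits" concrete, the following minimal sketch reads a small tab-separated table and one-hot encodes each peptide sequence. The column names `sequence` and `retention_time`, the example rows, the padding length of 30, and the helper names are all hypothetical; actual data sets on the platform define their own columns, and the tutorials may encode peptides differently.

```python
import csv
import io

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_length):
    """Return a max_length x 20 matrix; rows past the peptide end stay all-zero.

    Residues outside the 20 standard amino acids raise a KeyError.
    """
    if len(sequence) > max_length:
        raise ValueError(f"{sequence} is longer than {max_length} residues")
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_length)]
    for position, residue in enumerate(sequence):
        matrix[position][AA_INDEX[residue]] = 1
    return matrix


def load_table(handle, max_length=30):
    """Read a tab-separated table into (features, targets)."""
    features, targets = [], []
    for row in csv.DictReader(handle, delimiter="\t"):
        features.append(one_hot_encode(row["sequence"], max_length))
        targets.append(float(row["retention_time"]))
    return features, targets


# Hypothetical file content with invented values, used in place of a real download.
EXAMPLE_TSV = "sequence\tretention_time\nPEPTIDEK\t21.4\nLLSAVK\t35.0\n"

features, targets = load_table(io.StringIO(EXAMPLE_TSV))
print(len(features), len(features[0]), len(features[0][0]))
```

Padding every peptide to a fixed length keeps the encoded matrices the same shape, which most ML frameworks expect. Modified residues would need additional columns or a separate encoding, which this sketch does not attempt.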

Often when new modeling approaches are published, they are accompanied by data sets with novel pre- and postprocessing steps. Using ProteomicsML, the new data can be uploaded to the site along with a unified metadata entry and an accompanying tutorial that improves reproducibility of the work and facilitates benchmarking by the community.

Data Sets and Tutorials

The original raw data for proteomics data sets currently included in ProteomicsML have already been made publicly available through ProteomeXchange, (Deutsch et al. 2020) mostly via the PRIDE database. (Perez-Riverol et al. 2022) Here, the data hosted at ProteomicsML are provided in an ML-ready format, with links to original metadata and raw files for full provenance. Even though the data sets at ProteomicsML do not contain raw files, we do provide users with extensive tutorials on how to process raw data into ML-ready formats. ProteomicsML currently contains data sets and tutorials for fragmentation intensity, ion mobility (IM), retention time, and protein detectability. More data types can easily be added in the future, as the platform evolves along with the field.

  1. Retention time. Because retention time plays a major role in modern peptide identification workflows, it is one of the most explored properties in predictive proteomics. (Wen et al. 2020) While some data sets for predicting retention time already exist, such as the publicly available data set from Kaggle (kaggle.com/datasets/kirillpe/proteomics-retention-time-prediction) and the DLOmix data sets, we have also compiled new multitiered ML-ready data sets from the ProteomeTools synthetic peptide library, (Zolg et al. 2017) in three specific sizes: (i) 100 000 data points (small), well suited for new practitioners; (ii) 250 000 data points (medium); and (iii) 1 million data points (large), well suited for larger-scale ML training or benchmarking. As amino acid modifications can complicate the application of ML in proteomics, these three tiers do not contain any modified peptides except for carbamidomethylation of cysteine. Nevertheless, to train models for more real-life applications, we have also included an additional data set tier containing 200 000 oxidized peptides, as well as a mixed data set containing 200 000 oxidized and 200 000 unmodified peptides. These data sets require minimal data preparation, although we still provide two distinct tutorials on methods to incorporate these data sets into deep learning (DL)-based models. In addition to preprocessed data, we also provide a detailed tutorial that combines and aligns retention times between runs from MaxQuant evidence files. (Tyanova, Temu, and Cox 2016) The output of this tutorial is a fully ML-ready file for retention time prediction.

  2. Fragmentation intensity. While it is easy to calculate the m/z values of theoretical peptide spectra, fragment ion peak intensities follow complex patterns that can be hard to predict. Nevertheless, these intensities can play a key role in accurate peptide identification. (C Silva et al. 2019) For this reason, fragment ion intensity prediction is likely the second most explored topic for prediction purposes, for which comprehensive data sets and tutorials exist within ProteomicsML. As there are many attributes of peptides that affect their fragmentation patterns, the preprocessing steps of fragmentation data are more complex, and can be substantially different from lab to lab. For this reason, we have composed two separate tutorials, one that mimics the Prosit (Gessulat et al. 2019) data processing approach on the ProteomeTools (Zolg et al. 2017) data sets, which consist of 745 000 annotated spectra, and one that mimics the MS2PIP data processing approach on a consensus human spectral library from the National Institute of Standards and Technology, which consists of 270 440 annotated spectra. (Gabriels, Martens, and Degroeve 2019) For data sets in this category it is difficult to provide a simple format with unified columns, as the handling and preprocessing steps differ significantly from model to model. Currently, there is one tutorial available on ProteomicsML describing the data processing pipeline from raw file to Prosit-style annotation, and we believe that with future additions we can provide users with tutorials for additional processing approaches.

  3. Ion mobility. Ion mobility is a technique to separate ionized analytes based on their size, shape, and physicochemical properties. (Dodds and Baker 2019) Techniques for ion mobility are generally based on propelling or trapping ions with an electric field in an ion mobility cell. Peptides are then separated by colliding them with an inert gas without fragmentation. Peptides with a larger collision area are more affected by the collisions, resulting in a higher measured collisional cross section (CCS). Historically, most methods predicting ion mobility were based on molecular dynamics models that calculate the CCS from first principles in physics. (Larriba-Andaluz and Prell 2020) Lately the field has generated multiple ML and DL approaches for both peptide and metabolite CCS prediction. (Zhou, Xiong, and Zhu 2017; Broeckling et al. 2021; Meier et al. 2021) The tutorials made available in ProteomicsML use both trapping (trapped ion mobility, (Michelmann et al. 2015) TIMS) and propelling ion mobility (traveling wave ion mobility, (Shvartsburg and Smith 2008) TWIMS) data, where the large TIMS data set was sourced from Meier et al. (Meier et al. 2021) (718 917 data points) and the TWIMS data was sourced from Van Puyvelde et al. (Van Puyvelde et al. 2022) (6268 data points). The tutorial is a walkthrough for training various model types, ranging from simple linear models to more complex nonlinear models (e.g., DL-based networks) showing advantages and disadvantages of various learning algorithms for CCS prediction.

  4. Protein detectability. Modern proteomics methods and instrumentation are now routinely detecting and quantifying the majority of proteins thought to be encoded by the genome of a given species. (Hebert et al. 2014) Yet even after gathering enormous amounts of data, there is always a subset of proteins that remains refractory to detection. For example, even though tremendous effort has been focused on the human proteome, the fraction of unobserved proteins has been pushed just below 10%. (Adhikari et al. 2020; Omenn et al. 2021) It remains unclear why certain proteins remain undetected, although ML has been applied to explore which properties most strongly influence detectability (as reviewed within). (Dincer et al. 2022) One can compute a set of properties for a proteome and then train a model using those properties based on real world observations of the proteins that are detected and the proteins that are not detected. The model can be trained to learn which properties separate the detected from the undetected. Such a model has further utility to highlight proteins with properties that should sort them into the detected group, yet are not, as well as proteins that should belong to the undetected group, and yet they are detected. To facilitate this we have included the Arabidopsis PeptideAtlas data set, which is based on an extensive study of a single proteome. (Wijk et al. 2021) This data set is based on the 2021 build, which has 52 data sets reprocessed to yield 40 million peptide-spectrum matches and a good overall coverage of the Arabidopsis thaliana proteome. Proteins in the data set are categorized as either “canonical”, having the strongest evidence of detection, or “not observed”, for which no peptides are identified. Along with these class labels, the data set contains various protein properties such as molecular weight, hydrophobicity, and isoelectric point, which could be crucial for classification purposes. The data set has an accompanying tutorial that illustrates how to analyze the data with a classification model for the observability of proteins.
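The detectability workflow described in item 4 (compute protein properties, then train a model to separate detected from undetected proteins) can be sketched as follows. This is a minimal pure-Python logistic regression on invented toy values, not the method used in the ProteomicsML tutorial; the property values, the function names, and the hyperparameters are assumptions chosen only to keep the example small and self-contained.

```python
import math
import statistics

# Invented toy values: (molecular weight in kDa, hydrophobicity, isoelectric point, label).
# Label 1 stands for "canonical" and 0 for "not observed".
TOY_PROTEINS = [
    (45.0, -0.40, 6.1, 1),
    (62.5, -0.25, 5.4, 1),
    (30.2, -0.55, 7.0, 1),
    (88.0, -0.30, 6.6, 1),
    (12.1, 0.35, 9.2, 0),
    (9.8, 0.60, 8.8, 0),
    (15.4, 0.20, 9.9, 0),
    (11.0, 0.45, 8.1, 0),
]


def fit_scaler(rows):
    """Compute per-column mean and population standard deviation."""
    columns = list(zip(*rows))
    means = [statistics.mean(column) for column in columns]
    stdevs = [statistics.pstdev(column) for column in columns]
    return means, stdevs


def scale(row, means, stdevs):
    """Standardize one row so that properties are on comparable scales."""
    return [(value - m) / s for value, m, s in zip(row, means, stdevs)]


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def train_logistic_regression(features, labels, learning_rate=0.5, epochs=500):
    """Fit weights and bias by batch gradient descent on the log loss."""
    weights = [0.0] * len(features[0])
    bias = 0.0
    n = len(features)
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for x, y in zip(features, labels):
            error = sigmoid(bias + sum(w * v for w, v in zip(weights, x))) - y
            grad_b += error
            for j, value in enumerate(x):
                grad_w[j] += error * value
        weights = [w - learning_rate * g / n for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n
    return weights, bias


raw_features = [protein[:3] for protein in TOY_PROTEINS]
labels = [protein[3] for protein in TOY_PROTEINS]
means, stdevs = fit_scaler(raw_features)
features = [scale(row, means, stdevs) for row in raw_features]
weights, bias = train_logistic_regression(features, labels)


def predict_probability(raw_row):
    """Estimated probability that a protein with these properties is detected."""
    x = scale(raw_row, means, stdevs)
    return sigmoid(bias + sum(w * v for w, v in zip(weights, x)))


for row, label in zip(raw_features, labels):
    print(row, label, round(predict_probability(row), 3))
```

On real data, proteins whose predicted probability disagrees with their observed label are the interesting cases mentioned above: proteins that look detectable but are not observed, and the reverse. A real analysis would also hold out a test set rather than scoring the training data.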

Overall, these initial data set submissions and tutorials leave room for future expansion, until the community resource contains data sets for all properties previously and currently being explored in the field of proteomics. It is also open for user submissions, allowing researchers to upload their data in a standardized fashion, along with in-depth tutorials on their data handling and ML methodologies, resulting in more reproducible science. Our expectation is that this will shape the future of predictive proteomics, in favor of being more accessible, standardized, and reproducible.

Additionally, we have compiled a list of proteomics publications that utilize ML, along with a list of ProteomeXchange data sets used by each of the publications (Supplementary Table 1). Each of these ProteomeXchange data sets has been given a set of tags to indicate the nature of the usage in the publications (e.g., benchmarking, retention time, deep learning, etc.) as shown in Supplementary Table 2. Furthermore, these tags have also been added to the respective PRIDE data sets, which makes the tags easy to search and allows users to compile their ideal data set if ProteomicsML does not already contain one.

Conclusion

We have presented ProteomicsML, a comprehensive resource of data sets and tutorials for every ML practitioner in the field of MS-based proteomics. ProteomicsML contains multiple data sets on a range of LC-MS peptide properties, allowing computational proteomics researchers to compare new algorithms to state-of-the-art models, as well as providing newcomers to the field with an accessible starting point, without requiring immediate in-depth knowledge of the entire proteomics analysis pipeline. We believe that this resource will aid the next generation of ML practitioners, and provide a stepping stone for more open and more reproducible science in the field.

Supporting Information

The Supporting Information is available free of charge at pubs.acs.org/doi/10.1021/acs.jproteome.2c00629.

  • Supplementary Table 1: Proteomics ML publications along with links to the ProteomeXchange data sets used for training or testing (XLSX)

  • Supplementary Table 2: Public ProteomeXchange data sets that have been used for ML training or benchmarking (XLSX)

Notes

The authors declare the following competing financial interest(s): Tobias Schmidt and Siegfried Gessulat are employees of MSAID. MSAID makes ML-based software modules that are sold as part of Proteome Discoverer and also offers contract research. All other authors declare no competing financial interest.

Identification of certain commercial equipment, instruments, software, or materials does not imply recommendation or endorsement by the National Institute of Standards and Technology (NIST), nor does it imply that the products identified are necessarily the best available for the proposed purpose.

Acknowledgments

We thank Wassim Gabriel and Mathias Wilhelm for consultations on the Prosit annotation pipeline. The 2022 Lorentz Center workshop on Proteomics and Machine Learning was funded by the Dutch Research Council (NWO) with generous support from the Leiden University Medical Center, Thermo Fisher Scientific and Journal of Proteome Research (ACS). We also thank the staff at the Lorentz Center for helping make the hybrid workshop a success in pandemic times. T.G.R. acknowledges funding from the Velux Foundation [00028116]. R.G. acknowledges funding from the Research Foundation Flanders (FWO) [12B7123N]. R.B. acknowledges funding from the Vlaams Agentschap Innoveren en Ondernemen [HBC.2020.2205]. J.A.V. acknowledges funding from EMBL core funding, Wellcome [grant 223745/Z/21/Z], EU H2020 [823839], and BBSRC [BB/S01781X/1; BB/V018779/1]. E.W.D. acknowledges funding from the National Institutes of Health [R01 GM087221; R24 GM127667; U19 AG023122], and from the National Science Foundation [DBI-1933311; IOS-1922871].

References

Adhikari, Subash, Edouard C Nice, Eric W Deutsch, Lydie Lane, Gilbert S Omenn, Stephen R Pennington, Young-Ki Paik, et al. 2020. “A High-Stringency Blueprint of the Human Proteome.” Nat. Commun. 11 (1): 5301. https://doi.org/10.1038/s41467-020-19045-9.
Bouwmeester, Robbin, Ralf Gabriels, Niels Hulstaert, Lennart Martens, and Sven Degroeve. 2021. “DeepLC Can Predict Retention Times for Peptides That Carry as-yet Unseen Modifications.” Nat. Methods 18 (11): 1363–69. https://doi.org/10.1038/s41592-021-01301-5.
Bouwmeester, Robbin, Ralf Gabriels, Tim Van Den Bossche, Lennart Martens, and Sven Degroeve. 2020. “The Age of Data-Driven Proteomics: How Machine Learning Enables Novel Workflows.” PROTEOMICS 20 (21-22): 1900351. https://doi.org/10.1002/pmic.201900351.
Broeckling, Corey D, Linxing Yao, Giorgis Isaac, Marisa Gioioso, Valentin Ianchis, and Johannes P C Vissers. 2021. “Application of Predicted Collisional Cross Section to Metabolome Databases to Probabilistically Describe the Current and Future Ion Mobility Mass Spectrometry.” J. Am. Soc. Mass Spectrom. 32 (3): 661–69. https://doi.org/10.1021/jasms.0c00375.
C Silva, Ana S, Robbin Bouwmeester, Lennart Martens, and Sven Degroeve. 2019. “Accurate Peptide Fragmentation Predictions Allow Data Driven Approaches to Replace and Improve Upon Proteomics Search Engine Scoring Functions.” Bioinformatics 35 (24): 5243–48. https://doi.org/10.1093/bioinformatics/btz383.
Deng, Li. 2012. “The MNIST Database of Handwritten Digit Images for Machine Learning Research [Best of the Web].” IEEE Signal Processing Magazine 29 (6): 141–42. https://doi.org/10.1109/MSP.2012.2211477.
Deutsch, Eric W, Nuno Bandeira, Vagisha Sharma, Yasset Perez-Riverol, Jeremy J Carver, Deepti J Kundu, David García-Seisdedos, et al. 2020. “The ProteomeXchange Consortium in 2020: Enabling ‘Big Data’ Approaches in Proteomics.” Nucleic Acids Res. 48 (D1): D1145–52. https://doi.org/10.1093/nar/gkz984.
Dincer, Ayse B., Yang Lu, Devin K. Schweppe, Sewoong Oh, and William Stafford Noble. 2022. “Reducing Peptide Sequence Bias in Quantitative Mass Spectrometry Data with Machine Learning.” Journal of Proteome Research 21 (7): 1771–82. https://doi.org/10.1021/acs.jproteome.2c00211.
Dodds, James N, and Erin S Baker. 2019. “Ion Mobility Spectrometry: Fundamental Concepts, Instrumentation, Applications, and the Road Ahead.” J. Am. Soc. Mass Spectrom. 30 (11): 2185–95. https://doi.org/10.1007/s13361-019-02288-2.
Fondrie, William E, Wout Bittremieux, and William S Noble. 2021. “ppx: Programmatic Access to Proteomics Data Repositories.” J. Proteome Res. 20 (9): 4621–24. https://doi.org/10.1021/acs.jproteome.1c00454.
Gabriels, Ralf, Lennart Martens, and Sven Degroeve. 2019. “Updated MS²PIP Web Server Delivers Fast and Accurate MS² Peak Intensity Prediction for Multiple Fragmentation Methods, Instruments and Labeling Techniques.” Nucleic Acids Res. 47 (W1): W295–99. https://doi.org/10.1093/nar/gkz299.
Gessulat, Siegfried, Tobias Schmidt, Daniel Paul Zolg, Patroklos Samaras, Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute, et al. 2019. “Prosit: Proteome-Wide Prediction of Peptide Tandem Mass Spectra by Deep Learning.” Nat. Methods 16 (6): 509–18. https://doi.org/10.1038/s41592-019-0426-7.
Hebert, Alexander S, Alicia L Richards, Derek J Bailey, Arne Ulbrich, Emma E Coughlin, Michael S Westphall, and Joshua J Coon. 2014. “The One Hour Yeast Proteome.” Mol. Cell. Proteomics 13 (1): 339–47. https://doi.org/10.1074/mcp.M113.034769.
Heijne, G von. 1983. “Patterns of Amino Acids Near Signal-Sequence Cleavage Sites.” Eur. J. Biochem. 133 (1): 17–21. https://doi.org/10.1111/j.1432-1033.1983.tb07424.x.
Kaggle.com. n.d. Kaggle. https://www.kaggle.com/datasets?search=proteomics.
Larriba-Andaluz, Carlos, and James S Prell. 2020. “Fundamentals of Ion Mobility in the Free Molecular Regime. Interlacing the Past, Present and Future of Ion Mobility Calculations.” Int. Rev. Phys. Chem. 39 (4): 569–623. https://doi.org/10.1080/0144235X.2020.1826708.
Meier, Florian, Niklas D Köhler, Andreas-David Brunner, Jean-Marc H Wanka, Eugenia Voytik, Maximilian T Strauss, Fabian J Theis, and Matthias Mann. 2021. “Deep Learning the Collisional Cross Sections of the Peptide Universe from a Million Experimental Values.” Nat. Commun. 12 (1): 1185. https://doi.org/10.1038/s41467-021-21352-8.
Meyer, Jesse G. 2021. “Deep Learning Neural Network Tools for Proteomics.” Cell Rep Methods 1 (2): 100003. https://doi.org/10.1016/j.crmeth.2021.100003.
Michelmann, Karsten, Joshua A Silveira, Mark E Ridgeway, and Melvin A Park. 2015. “Fundamentals of Trapped Ion Mobility Spectrometry.” J. Am. Soc. Mass Spectrom. 26 (1): 14–24. https://doi.org/10.1007/s13361-014-0999-4.
Neely, Benjamin A., Viktoria Dorfer, Lennart Martens, Isabell Bludau, Robbin Bouwmeester, Sven Degroeve, Eric W. Deutsch, et al. 2023. “Toward an Integrated Machine Learning Model of a Proteomics Experiment.” Journal of Proteome Research 22 (3): 681–96. https://doi.org/10.1021/acs.jproteome.2c00711.
Nielsen, H, S Brunak, and G von Heijne. 1999. “Machine Learning Approaches for the Prediction of Signal Peptides and Other Protein Sorting Signals.” Protein Eng. 12 (1): 3–9. https://doi.org/10.1093/protein/12.1.3.
Omenn, Gilbert S, Lydie Lane, Christopher M Overall, Young-Ki Paik, Ileana M Cristea, Fernando J Corrales, Cecilia Lindskog, et al. 2021. “Progress Identifying and Analyzing the Human Proteome: 2021 Metrics from the HUPO Human Proteome Project.” J. Proteome Res. 20 (12): 5227–40. https://doi.org/10.1021/acs.jproteome.1c00590.
Perez-Riverol, Yasset, Jingwen Bai, Chakradhar Bandla, David García-Seisdedos, Suresh Hewapathirana, Selvakumar Kamatchinathan, Deepti J Kundu, et al. 2022. “The PRIDE Database Resources in 2022: A Hub for Mass Spectrometry-Based Proteomics Evidences.” Nucleic Acids Res. 50 (D1): D543–52. https://doi.org/10.1093/nar/gkab1038.
Rehfeldt, Tobias Greisager, Konrad Krawczyk, Mathias Bøgebjerg, Veit Schwämmle, and Richard Röttger. 2021. “MS2AI: Automated Repurposing of Public Peptide LC-MS Data for Machine Learning Applications.” Bioinformatics, October. https://doi.org/10.1021/acs.analchem.9b01262.
Shvartsburg, Alexandre A, and Richard D Smith. 2008. “Fundamentals of Traveling Wave Ion Mobility Spectrometry.” Anal. Chem. 80 (24): 9689–99. https://doi.org/10.1021/ac8016295.
Tyanova, Stefka, Tikira Temu, and Juergen Cox. 2016. “The MaxQuant Computational Platform for Mass Spectrometry-Based Shotgun Proteomics.” Nature Protocols 11 (12): 2301–19. https://doi.org/10.1038/nprot.2016.136.
Van Puyvelde, Bart, Simon Daled, Sander Willems, Ralf Gabriels, Anne Gonzalez de Peredo, Karima Chaoui, Emmanuelle Mouton-Barbosa, et al. 2022. “A Comprehensive LFQ Benchmark Dataset on Modern Day Acquisition Strategies in Proteomics.” Sci Data 9 (1): 126. https://doi.org/10.1038/s41597-022-01216-6.
Wen, Bo, Wen-Feng Zeng, Yuxing Liao, Zhiao Shi, Sara R Savage, Wen Jiang, and Bing Zhang. 2020. “Deep Learning in Proteomics.” Proteomics 20 (21-22). https://doi.org/10.1002/pmic.201900335.
Wijk, Klaas J van, Tami Leppert, Qi Sun, Sascha S Boguraev, Zhi Sun, Luis Mendoza, and Eric W Deutsch. 2021. “The Arabidopsis PeptideAtlas: Harnessing Worldwide Proteomics Data to Create a Comprehensive Community Proteomics Resource.” Plant Cell 33 (11): 3421–53. https://doi.org/10.1093/plcell/koab211.
Zhou, Zhiwei, Xin Xiong, and Zheng-Jiang Zhu. 2017. “MetCCS Predictor: A Web Server for Predicting Collision Cross-Section Values of Metabolites in Ion Mobility-Mass Spectrometry Based Metabolomics.” Bioinformatics 33 (14): 2235–37. https://doi.org/10.1093/bioinformatics/btx140.
Zolg, Daniel P, Mathias Wilhelm, Karsten Schnatbaum, Johannes Zerweck, Tobias Knaute, Bernard Delanghe, Derek J Bailey, et al. 2017. “Building ProteomeTools Based on a Complete Synthetic Human Proteome.” Nat. Methods 14 (3): 259–62. https://doi.org/10.1038/nmeth.4153.