February 18, 2024


Dataset Description

The dataset is split into a training set (4.87 GB) and a holdout set (250 MB) of annotated MS2 spectra.


  • title: ProteomeTools synthetic peptides
  • dataset tag: fragmentation/ProteomeTools_FI
  • data publication: ProteomeTools
  • machine learning publication: Prosit
  • data source identifier: PXD004732
  • data type: fragmentation intensity
  • format: hdf5
  • columns: sequence_integer, precursor_charge_onehot, intensities_raw, collision_energy_aligned_normed, collision_energy, precursor_charge, sequence_maxquant, sequence_length
  • instrument: Orbitrap Fusion ETD
  • organism: Homo sapiens (human)
  • fixed modifications:
  • variable modification: unmodified
  • dissociation method: CID and HCD
  • collision energy: 35 and 28
  • mass analyzer type: ion trap and Orbitrap
  • spectra encoding: prosit annotation pipeline
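
Since the data ships as HDF5 with a fixed set of columns, a quick sanity check after download is to compare the names found in a file against the list above. The following is a minimal sketch, not part of the dataset tooling: the helper `check_columns` and the example key list are hypothetical, and it assumes `precursor_charge` and `sequence_maxquant` are two separate columns. With h5py, and assuming the columns are stored as top-level datasets, the available names could come from `list(h5py.File(path, "r").keys())`.

```python
# Column names as listed in the dataset description. Treating
# `precursor_charge` and `sequence_maxquant` as two columns is an assumption.
EXPECTED_COLUMNS = (
    "sequence_integer",
    "precursor_charge_onehot",
    "intensities_raw",
    "collision_energy_aligned_normed",
    "collision_energy",
    "precursor_charge",
    "sequence_maxquant",
    "sequence_length",
)


def check_columns(available):
    """Return (missing, unexpected) column names, each as a sorted list."""
    available = set(available)
    expected = set(EXPECTED_COLUMNS)
    missing = sorted(expected - available)
    unexpected = sorted(available - expected)
    return missing, unexpected


if __name__ == "__main__":
    # Hypothetical key listing standing in for the names read from a file.
    keys = ["sequence_integer", "intensities_raw", "precursor_charge", "extra_column"]
    missing, unexpected = check_columns(keys)
    print("missing:", missing)
    print("unexpected:", unexpected)
```

An empty `missing` list suggests the file matches the description; anything reported as unexpected may simply be an extra field and is not necessarily an error.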

Sample protocol description

Tryptic peptides were individually synthesized by solid phase synthesis, combined into pools of ~1,000 peptides, and measured on an Orbitrap Fusion mass spectrometer. For each peptide pool, an inclusion list was generated to target peptides for fragmentation in further LC-MS experiments using five fragmentation methods (HCD, CID, ETD, EThCD, ETciD) with ion trap or Orbitrap readout. HCD spectra were recorded at 6 different collision energies.

Data analysis protocol

The ProteomeTools project aims to derive molecular and digital tools from the human proteome to facilitate biomedical and life science research. Here, we describe the generation and multimodal LC-MS/MS analysis of >350,000 synthetic tryptic peptides representing nearly all canonical human gene products. This resource will be extended to 1.4 million peptides within two years and all data will be made available to the public in ProteomicsDB. LC-MS runs were individually analyzed using MaxQuant