Arabidopsis PeptideAtlas Light and Dark Proteome

Published

November 22, 2024

Downloads


Dataset Description

The dataset contains 32.674 rows totalling 3.3 MB from the Arabidopsis PeptideAtlas build (http://www.peptideatlas.org/builds/arabidopsis/) we have extracted all the “canonical” proteins, which have been observed with at least 2 uniquely mapping peptides of length 9+AA and providing at least 18AA of coverage. We have also extracted “not observed” proteins that have no peptide detections (that pass PeptideAtlas’s stringent thresholds) at all. Physicochemical properties and RNA-seq-based properties are also computed and provided in the dataset.

Attributes

  • title: Arabidopsis PeptideAtlas Light and Dark Proteome
  • dataset tag: detectability/ArabidopsisLightDarkProteome
  • data publication: Plant Cell
  • machine learning publication: None
  • data source identifier: 52 PXDs as listed at PeptideAtlas
  • data type: protein detectability
  • format: TSV
  • columns: protein_identifier, gene_symbol, chromosome, number_of_observations, molecular_weight, gravy_score, isoelectric_point, rna_detected_percent, highest_tpm, protein_description
  • instrument: various
  • organism: Arabidopsis thaliana (arabidopsis)
  • fixed modifications: various
  • variable modification: various
  • dissociation method: CID and HCD
  • collision energy: various
  • mass analyzer type: various

Sample Protocol

No sample protocol is known for the dataset

Data analysis protocol

52 public datasets were downloaded from ProteomeXchange repositories, processed through the PeptideAtlas processing pipeline, and protein categories were computed based on the ensemble data, as described in van Wijk et al. 2021. The number of observations is the number of peptide-spectrum matches in the PeptideAtlas build based on a threshold that aims for a 1% false dicovery rate at the protein level. The molecular weight, gravy score (hydrophobicity), and isoelectric point (pI) are computed in Python via the Pyteomics library. The RNA-seq-based values are computed based on a re-analysis of over 5000 RNA-seq samples as described in Kearly et al. (submitted). The metrics are the percentage of RNA-seq samples with a positive detection of transcripts corresponding to the protein (a measure of how pervasive the transcripts are), and the highest RNA abundance in transcripts per million (TPM) in the highest sample (a measure of the highest possibly abundance at least under some conditions).

Comments

  • None