Raw file processing with PROSIT style annotation

Author
Affiliations

Tobias Greisager Rehfeldt

University of Southern Denmark, Odense

Department of Natural Science, Institute for Mathematics and Computer Science

Published

September 21, 2022

This notebook shows the simplest steps for turning raw data into a format that is ready for fragmentation prediction. To make it as easy to copy as possible, the notebook retrieves a ProteomeTools file from PRIDE; note that downloading the files may take some time.

This method uses the MaxQuant file to get the modified sequence, charge, and scan number. It then uses fisher_py to interact with the raw files and retrieve the ms2 scans and the mass analyzer.

The annotation pipeline comes from the TUM annotation GitHub repository (spectrum_fundamentals).
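As a minimal sketch of the MaxQuant side of this method, the snippet below builds a tiny stand-in for msms.txt and keeps only the fields the rest of the notebook relies on. The column names match the ones used later in this notebook; the three rows are made up for illustration.

```python
import io
import pandas as pd

# Hypothetical three-row excerpt in the tab-separated layout of msms.txt.
toy_msms_txt = (
    "Sequence\tModified sequence\tModifications\tCharge\tScan number\n"
    "AAAFYVR\t_AAAFYVR_\tUnmodified\t2\t1001\n"
    "AMLEK\t_AM(ox)LEK_\tOxidation (M)\t2\t1002\n"
    "YYSILEK\t_YYSILEK_\tUnmodified\t2\t1003\n"
)

toy_msms = pd.read_csv(io.StringIO(toy_msms_txt), sep="\t")

# Keep unmodified peptides only, then the three fields needed to locate
# and describe each spectrum in the raw file.
unmodified = toy_msms[toy_msms["Modifications"] == "Unmodified"]
psm_info = unmodified[["Modified sequence", "Charge", "Scan number"]]
print(psm_info)
```

The scan number is what links each row back to a spectrum in the raw file.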

%%capture
# To read Thermo Fisher raw files, we need pythonnet, the Python binding for .NET.
# On UNIX systems this requires conda, so we first install conda in the Colab runtime.
# If you are not running on Colab, skip this code block and install conda in your own environment instead.
!pip install -q condacolab
import condacolab
condacolab.install()
%%capture
!conda install pythonnet==2.5.2
!pip install fisher_py==1.0.10
!pip install fundamentals@git+https://github.com/wilhelm-lab/spectrum_fundamentals@proteomicsml
!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw
!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip
--2022-11-01 10:51:16--  https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw
Resolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 687962554 (656M)
Saving to: ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’

01625b_GA1-TUM_firs 100%[===================>] 656.09M   676KB/s    in 16m 45s 

2022-11-01 11:08:02 (669 KB/s) - ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’ saved [687962554/687962554]

--2022-11-01 11:08:02--  https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip
Resolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15581179 (15M) [application/zip]
Saving to: ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’

TUM_first_pool_1_01 100%[===================>]  14.86M   685KB/s    in 23s     

2022-11-01 11:08:25 (671 KB/s) - ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’ saved [15581179/15581179]
from zipfile import ZipFile
import pandas as pd
with ZipFile('TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip', 'r') as zip_file:
  msms = pd.read_csv(zip_file.open('msms.txt'), sep='\t')
# The current PROSIT pipeline does not accommodate modified peptides, so we remove all oxidized peptides
msms = msms[msms['Modifications'] == 'Unmodified']
from fisher_py import RawFile
raw = RawFile('01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw')
# Get the scan numbers from the msms file and collect each scan's peaks, mass analyzer, and collision energy in lists
from fisher_py.data.business import Scan
import numpy as np
scan_mzs = []
scan_ints = []
scan_mass_analyzers = []
scan_collison_energy = []
for scan in msms['Scan number']:
  raw_scan = Scan.from_file(raw._raw_file_access, scan)
  scan_mzs.append(np.array(raw_scan.preferred_masses))
  scan_ints.append(np.array(raw_scan.preferred_intensities))
  scan_mass_analyzers.append(raw_scan.scan_type.split(' + ')[0])
  frag_infos = [f.split(' ')[0] for f in raw_scan.scan_type.split('@')[1:]]
  splits = [[i for i, g in enumerate(f) if g.isnumeric()][0] for f in frag_infos]
  NCEs = [float(frag[split:]) for split, frag in zip(splits, frag_infos)]
  scan_collison_energy.append(NCEs[0])
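
The loop above depends on the layout of the scan filter string returned by `scan_type`. As a sketch, assuming a filter of the usual Thermo form (the example strings below are made up, not taken from this raw file), the same splitting logic can be tried in isolation. The function name `parse_scan_filter` is hypothetical.

```python
def parse_scan_filter(scan_filter):
    """Return (mass_analyzer, [collision energies]) from a scan filter string."""
    # The mass analyzer is the token before the first ' + '.
    mass_analyzer = scan_filter.split(' + ')[0]
    # Each fragmentation step follows an '@', e.g. '@hcd28.00'.
    frag_infos = [part.split(' ')[0] for part in scan_filter.split('@')[1:]]
    energies = []
    for frag in frag_infos:
        # Skip the activation name ('hcd', 'cid', ...) up to the first digit.
        first_digit = next(i for i, ch in enumerate(frag) if ch.isnumeric())
        energies.append(float(frag[first_digit:]))
    return mass_analyzer, energies


# Made-up filter strings in the assumed layout.
example_filter = "FTMS + c NSI d Full ms2 438.2400@hcd28.00 [100.0000-900.0000]"
analyzer, nces = parse_scan_filter(example_filter)
print(analyzer, nces)
```

If a filter string has no `@` part, the list of energies is empty, so indexing its first element as in the loop above would fail; scans of that kind would need to be skipped.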

We need to create a subset of the MaxQuant DataFrame that we can pass to the annotation pipeline. For this we need six columns with these specific names: MODIFIED_SEQUENCE, PRECURSOR_CHARGE, MASS_ANALYZER, SCAN_NUMBER, MZ, INTENSITIES. The collision energy is stored in an additional COLLISION_ENERGY column for later use.

annotation_df = pd.DataFrame(msms[['Modified sequence', 'Charge', 'Scan number']].values, columns=['MODIFIED_SEQUENCE', 'PRECURSOR_CHARGE', 'SCAN_NUMBER'])
annotation_df['MZ'] = scan_mzs
annotation_df['INTENSITIES'] = scan_ints
annotation_df['MASS_ANALYZER'] = scan_mass_analyzers
annotation_df['COLLISION_ENERGY'] = scan_collison_energy
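
Because the annotation step looks columns up by name, a quick check before calling it can catch misspellings early. The helper below is hypothetical (it is not part of `fundamentals`), and the required names are simply the six columns this notebook builds.

```python
import pandas as pd

REQUIRED_COLUMNS = (
    "MODIFIED_SEQUENCE", "PRECURSOR_CHARGE", "MASS_ANALYZER",
    "SCAN_NUMBER", "MZ", "INTENSITIES",
)


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Return the required column names that are absent from df."""
    return [name for name in required if name not in df.columns]


# Toy frame with one deliberately misspelled column name.
toy_df = pd.DataFrame({
    "MODIFIED_SEQUENCE": ["AAAFYVR"],
    "PERCURSOR_CHARGE": [2],
    "MASS_ANALYZER": ["FTMS"],
    "SCAN_NUMBER": [1001],
    "MZ": [[175.119]],
    "INTENSITIES": [[1000.0]],
})
print(missing_columns(toy_df))
```

On the real data the call would be `missing_columns(annotation_df)`, which should return an empty list.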

from fundamentals.mod_string import maxquant_to_internal
annotation_df['MODIFIED_SEQUENCE'] = maxquant_to_internal(annotation_df['MODIFIED_SEQUENCE'].values)
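
MaxQuant wraps each modified sequence in underscores (for example `_AAAFYVR_`) and writes modifications inline, such as `M(ox)`. The snippet below is a toy stand-in, not the `fundamentals` implementation, that illustrates the kind of rewrite such a conversion performs. The UNIMOD-style target notation is an assumption about the internal format, and the real function may handle more cases.

```python
def toy_maxquant_to_internal(sequence):
    """Illustrative only: strip MaxQuant underscores and rewrite M(ox)."""
    stripped = sequence.strip("_")
    # UNIMOD:35 is the accession for oxidation; using it here is an assumption
    # about the target notation, not a statement about the library.
    return stripped.replace("M(ox)", "M[UNIMOD:35]")


print(toy_maxquant_to_internal("_AAAFYVR_"))
print(toy_maxquant_to_internal("_AM(ox)LEK_"))
```

Since this notebook keeps only unmodified peptides, the underscore handling is the part that matters here.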

from fundamentals.annotation.annotation import annotate_spectra
annotation = annotate_spectra(annotation_df)
2022-11-01 11:49:04,339 - INFO - fundamentals.annotation.annotation::annotate_spectra Removed count    11970.00000
mean         0.00802
std          0.09287
min          0.00000
25%          0.00000
50%          0.00000
75%          0.00000
max          2.00000
Name: removed_peaks, dtype: float64 redundant peaks

The annotation DataFrame contains the annotated intensities and m/z values, along with the theoretical mass and the number of removed peaks.

annotation
INTENSITIES MZ CALCULATED_MASS removed_peaks
0 [0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1.... [175.11929321289062, 0.0, -1.0, 0.0, 0.0, -1.0... 796.423175 0
1 [0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,... [175.25360107421875, 0.0, -1.0, 0.0, 0.0, -1.0... 796.423175 0
2 [0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0... [175.11927795410156, 0.0, -1.0, 0.0, 0.0, -1.0... 796.423175 0
3 [0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1... [175.16168212890625, 0.0, -1.0, 0.0, 0.0, -1.0... 796.423175 0
4 [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481... [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 262.248901367... 1370.559481 0
... ... ... ... ...
11965 [0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1... [147.1424102783203, 0.0, -1.0, 0.0, 0.0, -1.0,... 914.474935 0
11966 [0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1.... [147.11309814453125, 0.0, -1.0, 0.0, 0.0, -1.0... 914.474935 0
11967 [0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1... [147.1204376220703, 0.0, -1.0, 0.0, 0.0, -1.0,... 914.474935 0
11968 [0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1.... [147.11328125, 0.0, -1.0, 0.0, 0.0, -1.0, 276.... 914.474935 0
11969 [0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1.... [147.19485473632812, 0.0, -1.0, 0.0, 0.0, -1.0... 914.474935 0

11970 rows × 4 columns
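
In the output above, each intensity vector mixes three kinds of values: positive numbers for matched fragment ions, 0.0 for ions that were not matched, and -1.0 for positions that appear to be masked out (for instance fragment charge states above the precursor charge). That reading of -1.0 is an assumption based on the Prosit convention, not something stated in this notebook. Under that assumption, a small hypothetical helper can count how many ions were possible and how many were observed:

```python
import numpy as np


def summarize_intensities(intensities):
    """Return (number of possible ions, number of observed ions)."""
    values = np.asarray(intensities, dtype=float)
    possible = values != -1.0   # assumed: -1.0 marks impossible ions
    observed = values > 0.0     # matched peaks have positive intensity
    return int(possible.sum()), int(observed.sum())


# Made-up vector in the same style as the output above.
toy_vector = [0.37, 0.0, -1.0, 0.0, 0.0, -1.0, 0.08, 0.0, -1.0]
n_possible, n_observed = summarize_intensities(toy_vector)
print(n_possible, n_observed)
```

On the real data this could be applied row by row with `annotation['INTENSITIES'].apply(summarize_intensities)`.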

Now we need to combine the necessary information from MaxQuant and the annotation package into a DataFrame mimicking the one used in the “Prosit-style GRU with ProteomeTools data” tutorial (https://www.proteomicsml.org/tutorials/fragmentation/proteometools-prosit.html) for an easy handover.

PROSIT_ALHABET = {
    "A": 1,
    "C": 2,
    "D": 3,
    "E": 4,
    "F": 5,
    "G": 6,
    "H": 7,
    "I": 8,
    "K": 9,
    "L": 10,
    "M": 11,
    "N": 12,
    "P": 13,
    "Q": 14,
    "R": 15,
    "S": 16,
    "T": 17,
    "V": 18,
    "W": 19,
    "Y": 20,
    "M(ox)": 21,
}
sequence_integer = [[PROSIT_ALHABET[AA] for AA in sequence] for sequence in msms['Sequence']]
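# dtype=int keeps the 0/1 encoding shown below; newer pandas versions otherwise return booleans
precursor_charge_onehot = pd.get_dummies(msms['Charge'], dtype=int).values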
collision_energy_aligned_normed = annotation_df['COLLISION_ENERGY']
intensities_raw = annotation['INTENSITIES']
df = pd.DataFrame(list(zip(sequence_integer, precursor_charge_onehot, collision_energy_aligned_normed, intensities_raw)),
                  columns=['sequence_integer', 'precursor_charge_onehot', 'collision_energy', 'intensities_raw'])
df
sequence_integer precursor_charge_onehot collision_energy intensities_raw
0 [1, 1, 1, 5, 20, 18, 15] [0, 1, 0] 28.0 [0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1....
1 [1, 1, 1, 5, 20, 18, 15] [0, 1, 0] 35.0 [0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,...
2 [1, 1, 1, 5, 20, 18, 15] [0, 1, 0] 28.0 [0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0...
3 [1, 1, 1, 5, 20, 18, 15] [0, 1, 0] 35.0 [0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1...
4 [1, 1, 5, 17, 4, 2, 2, 14, 1, 1, 3, 9] [0, 1, 0] 35.0 [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481...
... ... ... ... ...
11965 [20, 20, 16, 8, 10, 4, 9] [0, 1, 0] 35.0 [0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1...
11966 [20, 20, 16, 8, 10, 4, 9] [0, 1, 0] 28.0 [0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1....
11967 [20, 20, 16, 8, 10, 4, 9] [0, 1, 0] 35.0 [0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1...
11968 [20, 20, 16, 8, 10, 4, 9] [0, 1, 0] 28.0 [0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1....
11969 [20, 20, 16, 8, 10, 4, 9] [0, 1, 0] 35.0 [0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1....

11970 rows × 4 columns
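
Two details may matter for the handover. `pd.get_dummies` only creates columns for the charge states that occur in the data (three here), and the integer sequences have different lengths. The sketch below shows one way to produce fixed-width inputs. The choices of six charge classes, a maximum length of 30, zero-padding, and dividing the collision energy by 100 are assumptions modelled on the Prosit convention and should be checked against the target tutorial; the helper names are hypothetical.

```python
import numpy as np

MAX_CHARGE = 6    # assumption: number of charge classes expected downstream
MAX_LENGTH = 30   # assumption: fixed peptide length expected downstream


def one_hot_charge(charge, max_charge=MAX_CHARGE):
    """Encode a precursor charge as a fixed-width 0/1 vector."""
    if not 1 <= charge <= max_charge:
        raise ValueError(f"charge {charge} outside 1..{max_charge}")
    vec = np.zeros(max_charge, dtype=int)
    vec[charge - 1] = 1
    return vec


def pad_sequence(encoded, max_length=MAX_LENGTH):
    """Right-pad an integer-encoded peptide with zeros to max_length."""
    if len(encoded) > max_length:
        raise ValueError(f"sequence longer than {max_length} residues")
    padded = np.zeros(max_length, dtype=int)
    padded[:len(encoded)] = encoded
    return padded


toy_sequence = [1, 1, 1, 5, 20, 18, 15]   # AAAFYVR in the alphabet above
toy_charge = one_hot_charge(2)
toy_padded = pad_sequence(toy_sequence)
toy_energy = 28.0 / 100                   # assumed normalization to the 0-1 range
print(toy_charge, toy_padded[:10], toy_energy)
```

Zero is free to use as the padding value because the alphabet above starts at 1.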