%%capture
# To read Thermo Fisher raw files, we need the Python .NET implementation (pythonnet).
# On UNIX systems this requires conda, so we install conda in the Colab runtime.
# If you are not running on Colab, skip this code block and install conda in your own environment instead.
!pip install -q condacolab
import condacolab
condacolab.install()
Raw file processing with PROSIT style annotation
This notebook contains the simplest steps to turn raw data into a format that is ready for fragmentation prediction. It retrieves a ProteomeTools file from PRIDE so that the workflow is as easy to copy as possible, but downloading the files can take time.
This method uses the MaxQuant file to get the modified sequence, charge, and scan number. It then uses fisher_py to interact with the raw files and retrieve the ms2 scans and the mass analyzer.
The annotation pipeline comes from the TUM annotation GitHub repository.
%%capture
!conda install pythonnet==2.5.2
!pip install fisher_py==1.0.10
!pip install fundamentals@git+https://github.com/wilhelm-lab/spectrum_fundamentals@proteomicsml
!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw
!wget https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip
--2022-11-01 10:51:16-- https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw
Resolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 687962554 (656M)
Saving to: ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’
01625b_GA1-TUM_firs 100%[===================>] 656.09M 676KB/s in 16m 45s
2022-11-01 11:08:02 (669 KB/s) - ‘01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw’ saved [687962554/687962554]
--2022-11-01 11:08:02-- https://ftp.pride.ebi.ac.uk/pride/data/archive/2017/02/PXD004732/TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip
Resolving ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)... 193.62.193.138
Connecting to ftp.pride.ebi.ac.uk (ftp.pride.ebi.ac.uk)|193.62.193.138|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 15581179 (15M) [application/zip]
Saving to: ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’
TUM_first_pool_1_01 100%[===================>] 14.86M 685KB/s in 23s
2022-11-01 11:08:25 (671 KB/s) - ‘TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip’ saved [15581179/15581179]
from zipfile import ZipFile
import pandas as pd

# Read the MaxQuant msms.txt table directly from the downloaded archive
with ZipFile('TUM_first_pool_1_01_01_DDA-1h-R2-tryptic.zip', 'r') as zip_file:
    msms = pd.read_csv(zip_file.open('msms.txt'), sep='\t')
# Current PROSIT pipeline does not accommodate modified peptides, so we remove all of the oxidized peptides
msms = msms[msms['Modifications'] == 'Unmodified']
from fisher_py import RawFile
raw = RawFile('01625b_GA1-TUM_first_pool_1_01_01-DDA-1h-R2.raw')
# Get the scan numbers from the msms file and save the scan + info in lists
from fisher_py.data.business import Scan
import numpy as np
scan_mzs = []
scan_ints = []
scan_mass_analyzers = []
scan_collison_energy = []
for scan in msms['Scan number']:
    raw_scan = Scan.from_file(raw._raw_file_access, scan)
    scan_mzs.append(np.array(raw_scan.preferred_masses))
    scan_ints.append(np.array(raw_scan.preferred_intensities))
    # The mass analyzer is the part of the scan type before ' + '
    scan_mass_analyzers.append(raw_scan.scan_type.split(' + ')[0])
    # The activation tag follows each '@'; the energy starts at its first digit
    frag_infos = [f.split(' ')[0] for f in raw_scan.scan_type.split('@')[1:]]
    splits = [[i for i, g in enumerate(f) if g.isnumeric()][0] for f in frag_infos]
    NCEs = [float(frag[split:]) for split, frag in zip(splits, frag_infos)]
    scan_collison_energy.append(NCEs[0])
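To make the string handling in the loop easier to follow, here is a minimal sketch that applies the same parsing steps to a single scan type string. The helper name `parse_scan_filter` and the example string are assumptions made for illustration; the string is not read from the raw file, and real scan type strings may differ in detail.

```python
def parse_scan_filter(scan_type):
    """Return (mass_analyzer, first_collision_energy) from a scan type string."""
    # The mass analyzer is the token before the first ' + '.
    mass_analyzer = scan_type.split(' + ')[0]
    # Each '@' is followed by an activation tag such as 'hcd28.00'.
    frag_infos = [f.split(' ')[0] for f in scan_type.split('@')[1:]]
    # The index of the first digit in each tag marks where the energy starts.
    splits = [[i for i, g in enumerate(f) if g.isnumeric()][0] for f in frag_infos]
    nces = [float(frag[split:]) for split, frag in zip(splits, frag_infos)]
    return mass_analyzer, nces[0]


# Illustrative scan type string (an assumption, not taken from the file).
example_filter = "FTMS + c NSI d Full ms2 421.76@hcd28.00 [100.00-855.00]"
analyzer, nce = parse_scan_filter(example_filter)
print(analyzer, nce)
```

Only the first energy is kept, which mirrors `NCEs[0]` in the loop; a scan with several activation steps would carry additional values that this sketch ignores.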
We need to create a subset of the MaxQuant dataframe that we can pass to the annotation pipeline. It needs six columns with specific names: MODIFIED_SEQUENCE, PRECURSOR_CHARGE, MASS_ANALYZER, SCAN_NUMBER, MZ, and INTENSITIES. We also store COLLISION_ENERGY, which is used later when building the final DataFrame.
annotation_df = pd.DataFrame(msms[['Modified sequence', 'Charge', 'Scan number']].values, columns=['MODIFIED_SEQUENCE', 'PRECURSOR_CHARGE', 'SCAN_NUMBER'])
annotation_df['MZ'] = scan_mzs
annotation_df['INTENSITIES'] = scan_ints
annotation_df['MASS_ANALYZER'] = scan_mass_analyzers
annotation_df['COLLISION_ENERGY'] = scan_collison_energy

from fundamentals.mod_string import maxquant_to_internal
annotation_df['MODIFIED_SEQUENCE'] = maxquant_to_internal(annotation_df['MODIFIED_SEQUENCE'].values)

from fundamentals.annotation.annotation import annotate_spectra
annotation = annotate_spectra(annotation_df)
2022-11-01 11:49:04,339 - INFO - fundamentals.annotation.annotation::annotate_spectra Removed count 11970.00000
mean 0.00802
std 0.09287
min 0.00000
25% 0.00000
50% 0.00000
75% 0.00000
max 2.00000
Name: removed_peaks, dtype: float64 redundant peaks
The annotation element contains the annotated intensities and m/z values, along with the theoretical mass and the number of removed peaks.
annotation
 | INTENSITIES | MZ | CALCULATED_MASS | removed_peaks |
---|---|---|---|---|
0 | [0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1.... | [175.11929321289062, 0.0, -1.0, 0.0, 0.0, -1.0... | 796.423175 | 0 |
1 | [0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,... | [175.25360107421875, 0.0, -1.0, 0.0, 0.0, -1.0... | 796.423175 | 0 |
2 | [0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0... | [175.11927795410156, 0.0, -1.0, 0.0, 0.0, -1.0... | 796.423175 | 0 |
3 | [0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1... | [175.16168212890625, 0.0, -1.0, 0.0, 0.0, -1.0... | 796.423175 | 0 |
4 | [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481... | [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 262.248901367... | 1370.559481 | 0 |
... | ... | ... | ... | ... |
11965 | [0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1... | [147.1424102783203, 0.0, -1.0, 0.0, 0.0, -1.0,... | 914.474935 | 0 |
11966 | [0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1.... | [147.11309814453125, 0.0, -1.0, 0.0, 0.0, -1.0... | 914.474935 | 0 |
11967 | [0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1... | [147.1204376220703, 0.0, -1.0, 0.0, 0.0, -1.0,... | 914.474935 | 0 |
11968 | [0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1.... | [147.11328125, 0.0, -1.0, 0.0, 0.0, -1.0, 276.... | 914.474935 | 0 |
11969 | [0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1.... | [147.19485473632812, 0.0, -1.0, 0.0, 0.0, -1.0... | 914.474935 | 0 |
11970 rows × 4 columns
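The annotated vectors above mix positive values, 0.0, and -1.0. The sketch below counts each kind for one vector. It assumes the usual Prosit convention, in which -1 marks fragment ions that cannot exist for the peptide (for example, a fragment charge above the precursor charge) and 0 marks ions that are possible but were not detected. That reading is an assumption and is not stated in this notebook. The helper name `summarize_intensities` and the toy vector are made up for illustration.

```python
def summarize_intensities(intensities):
    """Count masked (-1), unobserved (0), and observed (>0) entries."""
    values = [float(v) for v in intensities]
    return {
        "masked": sum(1 for v in values if v == -1.0),
        "unobserved": sum(1 for v in values if v == 0.0),
        "observed": sum(1 for v in values if v > 0.0),
    }


# Toy vector shaped like the start of an annotated row (values are made up).
toy = [0.369, 0.0, -1.0, 0.0, 0.0, -1.0, 0.076, 0.0, -1.0]
summary = summarize_intensities(toy)
print(summary)
```

Applied to real rows, for example with `annotation['INTENSITIES'].apply(summarize_intensities)`, this gives a quick check that the number of masked entries changes with peptide length and charge as expected.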
Now we need to combine the necessary information from MaxQuant and the annotation package into a DataFrame that mimics the one used in the “Prosit-style GRU with ProteomeTools data” tutorial (https://www.proteomicsml.org/tutorials/fragmentation/proteometools-prosit.html) for an easy handover.
PROSIT_ALHABET = {
    "A": 1,
    "C": 2,
    "D": 3,
    "E": 4,
    "F": 5,
    "G": 6,
    "H": 7,
    "I": 8,
    "K": 9,
    "L": 10,
    "M": 11,
    "N": 12,
    "P": 13,
    "Q": 14,
    "R": 15,
    "S": 16,
    "T": 17,
    "V": 18,
    "W": 19,
    "Y": 20,
    "M(ox)": 21,
}
sequence_integer = [[PROSIT_ALHABET[AA] for AA in sequence] for sequence in msms['Sequence']]
precursor_charge_onehot = pd.get_dummies(msms['Charge']).values
collision_energy_aligned_normed = annotation_df['COLLISION_ENERGY']
intensities_raw = annotation['INTENSITIES']
df = pd.DataFrame(list(zip(sequence_integer, precursor_charge_onehot, collision_energy_aligned_normed, intensities_raw)),
                  columns=['sequence_integer', 'precursor_charge_onehot', 'collision_energy', 'intensities_raw'])
df
 | sequence_integer | precursor_charge_onehot | collision_energy | intensities_raw |
---|---|---|---|---|
0 | [1, 1, 1, 5, 20, 18, 15] | [0, 1, 0] | 28.0 | [0.36918813165578857, 0.0, -1.0, 0.0, 0.0, -1.... |
1 | [1, 1, 1, 5, 20, 18, 15] | [0, 1, 0] | 35.0 | [0.028514689782729, 0.0, -1.0, 0.0, 0.0, -1.0,... |
2 | [1, 1, 1, 5, 20, 18, 15] | [0, 1, 0] | 28.0 | [0.3452339640378655, 0.0, -1.0, 0.0, 0.0, -1.0... |
3 | [1, 1, 1, 5, 20, 18, 15] | [0, 1, 0] | 35.0 | [0.030064791591335877, 0.0, -1.0, 0.0, 0.0, -1... |
4 | [1, 1, 5, 17, 4, 2, 2, 14, 1, 1, 3, 9] | [0, 1, 0] | 35.0 | [0.0, 0.0, -1.0, 0.0, 0.0, -1.0, 0.07584115481... |
... | ... | ... | ... | ... |
11965 | [20, 20, 16, 8, 10, 4, 9] | [0, 1, 0] | 35.0 | [0.009784486409648692, 0.0, -1.0, 0.0, 0.0, -1... |
11966 | [20, 20, 16, 8, 10, 4, 9] | [0, 1, 0] | 28.0 | [0.23857646569260368, 0.0, -1.0, 0.0, 0.0, -1.... |
11967 | [20, 20, 16, 8, 10, 4, 9] | [0, 1, 0] | 35.0 | [0.012048242613237779, 0.0, -1.0, 0.0, 0.0, -1... |
11968 | [20, 20, 16, 8, 10, 4, 9] | [0, 1, 0] | 28.0 | [0.39071905153057307, 0.0, -1.0, 0.0, 0.0, -1.... |
11969 | [20, 20, 16, 8, 10, 4, 9] | [0, 1, 0] | 35.0 | [0.02029996314040732, 0.0, -1.0, 0.0, 0.0, -1.... |
11970 rows × 4 columns
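The integer lists in `sequence_integer` have different lengths, while many models expect a fixed-size input. The sketch below reproduces the encoding on one toy peptide and zero-pads it to a fixed length. It also shows a plain-Python stand-in for the one-hot charge encoding. The names `encode_and_pad`, `one_hot`, and `MAX_LEN` are hypothetical, and the length of 30 is an assumption that should be matched to whatever the downstream model requires.

```python
# Same amino acid numbering as the alphabet above (without "M(ox)").
ALPHABET = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY", start=1)}
MAX_LEN = 30  # assumed fixed input length; adjust to the model


def encode_and_pad(sequence, max_len=MAX_LEN):
    """Integer-encode a peptide and right-pad it with zeros to max_len."""
    if len(sequence) > max_len:
        raise ValueError(f"sequence longer than {max_len} residues")
    encoded = [ALPHABET[aa] for aa in sequence]
    return encoded + [0] * (max_len - len(encoded))


def one_hot(values):
    """One-hot encode values with one column per sorted unique value."""
    categories = sorted(set(values))
    return [[int(v == c) for c in categories] for v in values]


padded = encode_and_pad("AAAFYVR")
print(padded[:8])

charges = [2, 3, 2, 1]
onehot = one_hot(charges)
print(onehot)
```

The toy peptide AAAFYVR encodes to the same integers as the first row of the table above. Note that one-hot columns derived from the data only cover the charge states that actually occur, so a file without a given charge state yields fewer columns.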