NIST (part 2): Traditional ML: Gradient boosting

Author: Ralf Gabriels

Affiliations: VIB-UGent Center for Medical Biotechnology, VIB, Belgium; Department of Biomolecular Medicine, Ghent University, Belgium

Published: October 5, 2022

Modified: November 22, 2024

1 Introduction

This is the second part of a three-part tutorial. We recommend you start with the first part, where the NIST spectral library is parsed and prepared for use in the second and third parts.

  1. Preparing a spectral library for ML
  2. Traditional ML: Gradient boosting
  3. Deep learning: BiLSTM

In this tutorial, you will learn how to build a fragmentation intensity predictor similar to MS²PIP v3 (Gabriels, Martens, and Degroeve 2019) with traditional machine learning (ML) feature engineering and Gradient Boosting (Friedman 2002).

# Installing required python packages
! pip install rich~=12.5 numpy~=1.21 pandas~=1.3 matplotlib~=3.5 seaborn~=0.11 scikit-learn~=1.0 pyarrow~=15.0 hyperopt~=0.2 --quiet

2 Data preparation

We will use the spectral library that was already parsed in part 1 of this tutorial series.

import pandas as pd

train_val_spectra = pd.read_feather("http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-trainval.feather")
test_spectra = pd.read_feather("http://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-test.feather")

2.1 Feature engineering

In traditional ML, the input features for the algorithm usually require some engineering, and fragmentation intensity prediction is no different. Following the MS²PIP methods, we will calculate the distributions of several amino acid properties across the peptide and fragment ion sequences.

Using the distributions of these properties instead of the per-amino-acid values allows MS²PIP to build a fixed-length feature matrix for input peptides of varying lengths.

import numpy as np
import pandas as pd
from rich import progress
amino_acids = list("ACDEFGHIKLMNPQRSTVWY")
properties = np.array([
    [37,35,59,129,94,0,210,81,191,81,106,101,117,115,343,49,90,60,134,104],  # basicity
    [68,23,33,29,70,58,41,73,32,73,66,38,0,40,39,44,53,71,51,55],  # helicity
    [51,75,25,35,100,16,3,94,0,94,82,12,0,22,22,21,39,80,98,70],  # hydrophobicity
    [32,23,0,4,27,32,48,32,69,32,29,26,35,28,79,29,28,31,31,28],  # pI
])

pd.DataFrame(properties, columns=amino_acids, index=["basicity", "helicity", "hydrophobicity", "pI"])
A C D E F G H I K L M N P Q R S T V W Y
basicity 37 35 59 129 94 0 210 81 191 81 106 101 117 115 343 49 90 60 134 104
helicity 68 23 33 29 70 58 41 73 32 73 66 38 0 40 39 44 53 71 51 55
hydrophobicity 51 75 25 35 100 16 3 94 0 94 82 12 0 22 22 21 39 80 98 70
pI 32 23 0 4 27 32 48 32 69 32 29 26 35 28 79 29 28 31 31 28
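
Before writing the full encoding function, here is a minimal illustration of why quantile summaries yield fixed-length features. The values are taken from the basicity row of the table above:

quantiles = [0, 0.25, 0.5, 0.75, 1]

# Basicity values for "ACD" (3 residues) and "ACDEFGH" (7 residues):
# five quantiles summarize the distribution regardless of peptide length
print(np.quantile([37, 35, 59], quantiles))                   # 5 values
print(np.quantile([37, 35, 59, 129, 94, 0, 210], quantiles))  # also 5 values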
def encode_peptide(sequence, charge):
    # 4 properties * 5 quantiles * 3 ion types + 4 properties * 4 sites + 2 global
    n_features = 78
    quantiles = [0, 0.25, 0.5, 0.75, 1]
    n_ions = len(sequence) - 1

    # Encode amino acids as integers to index amino acid properties for peptide sequence
    aa_indices = {aa: i for i, aa in enumerate(amino_acids)}
    aa_to_index = np.vectorize(lambda aa: aa_indices[aa])
    peptide_indexed = aa_to_index(np.array(list(sequence)))
    peptide_properties = properties[:, peptide_indexed]

    # Empty peptide_features array
    peptide_features = np.full((n_ions, n_features), np.nan)

    # Peptide-level quantiles and global features are identical for all ions,
    # so they only need to be computed once, outside of the loop
    peptide_quantiles = np.hstack(
        np.quantile(peptide_properties, quantiles, axis=1).transpose()
    )
    global_features = np.array([len(sequence), int(charge)])

    for b_ion_number in range(1, n_ions + 1):
        # Calculate quantiles of properties across the b- and y-ion sequences
        b_ion_quantiles = np.hstack(
            np.quantile(peptide_properties[:, :b_ion_number], quantiles, axis=1).transpose()
        )
        y_ion_quantiles = np.hstack(
            np.quantile(peptide_properties[:, b_ion_number:], quantiles, axis=1).transpose()
        )

        # Properties on specific sites: nterm, frag-1, frag+1, cterm
        specific_site_indexes = np.array([0, b_ion_number - 1, b_ion_number, -1])
        specific_site_properties = np.hstack(peptide_properties[:, specific_site_indexes].transpose())

        # Assign to peptide_features array
        peptide_features[b_ion_number - 1, 0:20] = peptide_quantiles
        peptide_features[b_ion_number - 1, 20:40] = b_ion_quantiles
        peptide_features[b_ion_number - 1, 40:60] = y_ion_quantiles
        peptide_features[b_ion_number - 1, 60:76] = specific_site_properties
        peptide_features[b_ion_number - 1, 76:78] = global_features

    return peptide_features


def generate_feature_names():
    feature_names = []
    for level in ["peptide", "b", "y"]:
        for aa_property in ["basicity", "helicity", "hydrophobicity", "pi"]:
            for quantile in ["min", "q1", "q2", "q3", "max"]:
                feature_names.append("_".join([level, aa_property, quantile]))
    for site in ["nterm", "fragmin1", "fragplus1", "cterm"]:
        for aa_property in ["basicity", "helicity", "hydrophobicity", "pi"]:
            feature_names.append("_".join([site, aa_property]))

    feature_names.extend(["length", "charge"])
    return feature_names
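
As a quick sanity check, the number of generated feature names should match n_features in encode_peptide (3 levels × 4 properties × 5 quantiles + 4 sites × 4 properties + 2 global features):

assert len(generate_feature_names()) == 78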

Let’s test it with a single peptide. Feel free to use your own name as a “peptide”, as long as it does not contain any non-amino acid characters.

peptide_features = pd.DataFrame(encode_peptide("RALFGARIELS", 2), columns=generate_feature_names())
peptide_features
peptide_basicity_min peptide_basicity_q1 peptide_basicity_q2 peptide_basicity_q3 peptide_basicity_max peptide_helicity_min peptide_helicity_q1 peptide_helicity_q2 peptide_helicity_q3 peptide_helicity_max ... fragplus1_basicity fragplus1_helicity fragplus1_hydrophobicity fragplus1_pi cterm_basicity cterm_helicity cterm_hydrophobicity cterm_pi length charge
0 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 37.0 68.0 51.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
1 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 81.0 73.0 94.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
2 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 94.0 70.0 100.0 27.0 49.0 44.0 21.0 29.0 11.0 2.0
3 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 0.0 58.0 16.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
4 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 37.0 68.0 51.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
5 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 343.0 39.0 22.0 79.0 49.0 44.0 21.0 29.0 11.0 2.0
6 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 81.0 73.0 94.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
7 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 129.0 29.0 35.0 4.0 49.0 44.0 21.0 29.0 11.0 2.0
8 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 81.0 73.0 94.0 32.0 49.0 44.0 21.0 29.0 11.0 2.0
9 0.0 43.0 81.0 111.5 343.0 29.0 41.5 68.0 71.5 73.0 ... 49.0 44.0 21.0 29.0 49.0 44.0 21.0 29.0 11.0 2.0

10 rows × 78 columns

2.2 Getting the target intensities

The target intensities are the observed intensities that the model will learn to predict. Let’s first try with a single spectrum.

test_spectrum = train_val_spectra.iloc[4]

peptide_targets =  pd.DataFrame({
    "b_target": test_spectrum["parsed_intensity"]["b"],
    "y_target": test_spectrum["parsed_intensity"]["y"],
})
peptide_targets
b_target y_target
0 0.000000 0.118507
1 0.229717 0.079770
2 0.294631 0.088712
3 0.234662 0.145900
4 0.185732 0.205005
5 0.134395 0.261630
6 0.081856 0.305119
7 0.043793 0.296351
8 0.000000 0.205703
9 0.000000 0.155991
10 0.000000 0.000000

These are the intensities for the b- and y-ions, each ordered from 1 to 11. In MS²PIP, however, a clever trick is applied to reuse the computed features for each fragment ion pair. Doing so makes perfect sense, as both ions in such a pair originate from the same fragmentation event. For this 12-residue peptide, the fragment ion pairs are b1-y11, b2-y10, b3-y9, etc. To match all of the pairs, we simply have to reverse the y-ion series intensities:

peptide_targets =  pd.DataFrame({
    "b_target": test_spectrum["parsed_intensity"]["b"],
    "y_target": test_spectrum["parsed_intensity"]["y"][::-1],
})
peptide_targets
b_target y_target
0 0.000000 0.000000
1 0.229717 0.155991
2 0.294631 0.205703
3 0.234662 0.296351
4 0.185732 0.305119
5 0.134395 0.261630
6 0.081856 0.205005
7 0.043793 0.145900
8 0.000000 0.088712
9 0.000000 0.079770
10 0.000000 0.118507

2.3 Bringing it all together

features = encode_peptide(test_spectrum["sequence"], test_spectrum["charge"])
targets = np.stack([test_spectrum["parsed_intensity"]["b"], test_spectrum["parsed_intensity"]["y"][::-1]], axis=1)
spectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=test_spectrum["index"])  # Repeat id for all ions
pd.DataFrame(np.hstack([spectrum_id, features, targets]), columns=["spectrum_id"] + generate_feature_names() + ["b_target",  "y_target"])
spectrum_id peptide_basicity_min peptide_basicity_q1 peptide_basicity_q2 peptide_basicity_q3 peptide_basicity_max peptide_helicity_min peptide_helicity_q1 peptide_helicity_q2 peptide_helicity_q3 ... fragplus1_hydrophobicity fragplus1_pi cterm_basicity cterm_helicity cterm_hydrophobicity cterm_pi length charge b_target y_target
0 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.000000 0.000000
1 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.229717 0.155991
2 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.294631 0.205703
3 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.234662 0.296351
4 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.185732 0.305119
5 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.134395 0.261630
6 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.081856 0.205005
7 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 51.0 32.0 343.0 39.0 22.0 79.0 12.0 2.0 0.043793 0.145900
8 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 80.0 31.0 343.0 39.0 22.0 79.0 12.0 2.0 0.000000 0.088712
9 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 21.0 29.0 343.0 39.0 22.0 79.0 12.0 2.0 0.000000 0.079770
10 5.0 37.0 37.0 37.0 40.0 343.0 39.0 68.0 68.0 68.0 ... 22.0 79.0 343.0 39.0 22.0 79.0 12.0 2.0 0.000000 0.118507

11 rows × 81 columns

The following function applies these steps over a collection of spectra and returns the full feature/target table:

def generate_ml_input(spectra):
    tables = []
    for spectrum in progress.track(spectra.to_dict(orient="records")):
        features = encode_peptide(spectrum["sequence"], spectrum["charge"])
        targets = np.stack([spectrum["parsed_intensity"]["b"], spectrum["parsed_intensity"]["y"][::-1]], axis=1)
        spectrum_id = np.full(shape=(targets.shape[0], 1), fill_value=spectrum["index"])  # Repeat id for all ions
        table = np.hstack([spectrum_id, features, targets])
        tables.append(table)

    full_table = np.vstack(tables)
    spectra_encoded = pd.DataFrame(full_table, columns=["spectrum_id"] + generate_feature_names() + ["b_target",  "y_target"])
    return spectra_encoded

Note that this might take some time, sometimes up to 30 minutes. To skip this step, simply download the files with pre-encoded features and targets, and load them two cells below.

train_val_encoded = generate_ml_input(train_val_spectra)
train_val_encoded.to_feather("fragmentation-nist-humanhcd20160503-parsed-trainval-encoded.feather")

test_encoded = generate_ml_input(test_spectra)
test_encoded.to_feather("fragmentation-nist-humanhcd20160503-parsed-test-encoded.feather")




# Uncomment this step to load pre-encoded features from a file:

#train_val_encoded = pd.read_feather("ftp://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-trainval-encoded.feather")
#test_encoded = pd.read_feather("ftp://ftp.pride.ebi.ac.uk/pub/databases/pride/resources/proteomicsml/fragmentation/nist-humanhcd20160503-parsed-test-encoded.feather")
train_val_encoded
spectrum_id peptide_basicity_min peptide_basicity_q1 peptide_basicity_q2 peptide_basicity_q3 peptide_basicity_max peptide_helicity_min peptide_helicity_q1 peptide_helicity_q2 peptide_helicity_q3 ... fragplus1_hydrophobicity fragplus1_pi cterm_basicity cterm_helicity cterm_hydrophobicity cterm_pi length charge b_target y_target
0 0.0 0.0 37.0 37.0 37.0 191.0 32.0 68.0 68.0 68.0 ... 51.0 32.0 191.0 32.0 0.0 69.0 22.0 2.0 0.000000 0.000000
1 0.0 0.0 37.0 37.0 37.0 191.0 32.0 68.0 68.0 68.0 ... 51.0 32.0 191.0 32.0 0.0 69.0 22.0 2.0 0.094060 0.000000
2 0.0 0.0 37.0 37.0 37.0 191.0 32.0 68.0 68.0 68.0 ... 51.0 32.0 191.0 32.0 0.0 69.0 22.0 2.0 0.180642 0.000000
3 0.0 0.0 37.0 37.0 37.0 191.0 32.0 68.0 68.0 68.0 ... 51.0 32.0 191.0 32.0 0.0 69.0 22.0 2.0 0.204203 0.050476
4 0.0 0.0 37.0 37.0 37.0 191.0 32.0 68.0 68.0 68.0 ... 51.0 32.0 191.0 32.0 0.0 69.0 22.0 2.0 0.233472 0.094835
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
3321136 398372.0 81.0 104.0 104.0 153.0 343.0 39.0 48.5 55.0 55.0 ... 70.0 28.0 343.0 39.0 22.0 79.0 8.0 3.0 0.000000 0.180938
3321137 398372.0 81.0 104.0 104.0 153.0 343.0 39.0 48.5 55.0 55.0 ... 98.0 31.0 343.0 39.0 22.0 79.0 8.0 3.0 0.000000 0.203977
3321138 398372.0 81.0 104.0 104.0 153.0 343.0 39.0 48.5 55.0 55.0 ... 3.0 48.0 343.0 39.0 22.0 79.0 8.0 3.0 0.000000 0.169803
3321139 398372.0 81.0 104.0 104.0 153.0 343.0 39.0 48.5 55.0 55.0 ... 94.0 32.0 343.0 39.0 22.0 79.0 8.0 3.0 0.000000 0.120565
3321140 398372.0 81.0 104.0 104.0 153.0 343.0 39.0 48.5 55.0 55.0 ... 22.0 79.0 343.0 39.0 22.0 79.0 8.0 3.0 0.000000 0.169962

3321141 rows × 81 columns

This is the data we will use for training. Note that each spectrum comprises multiple rows: one row per b/y-ion pair.
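
As a quick consistency check (a sketch assuming the sequence column from part 1), each peptide of length n should contribute n - 1 rows:

# Each spectrum contributes one row per b/y-ion pair, i.e. len(sequence) - 1 rows
expected_rows = (train_val_spectra["sequence"].str.len() - 1).sum()
assert len(train_val_encoded) == expected_rows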

3 Training the model

from sklearn.ensemble import GradientBoostingRegressor

Let’s first try to train a simple model on the train set and evaluate its performance on the test set.

reg =  GradientBoostingRegressor()

X_train = train_val_encoded.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_train = train_val_encoded["y_target"]
X_test = test_encoded.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_test = test_encoded["y_target"]

reg.fit(X_train, y_train)
GradientBoostingRegressor()
y_test_pred = reg.predict(X_test)
np.corrcoef(y_test, y_test_pred)[0][1]
0.7545700864493669

Not terrible. Let’s see if we can do better after hyperparameter optimization. For this, we can use the hyperopt package. Note that, strictly speaking, hyperparameters should be tuned on a validation split of the training data rather than on the test set; we evaluate on the test set here to keep the tutorial simple.

from hyperopt import fmin, hp, tpe, STATUS_OK
def objective(n_estimators):
    # Define algorithm
    reg =  GradientBoostingRegressor(n_estimators=n_estimators)

    # Fit model
    reg.fit(X_train, y_train)

    # Test model
    y_test_pred = reg.predict(X_test)
    correlation = np.corrcoef(y_test, y_test_pred)[0][1]

    return {'loss': -correlation, 'status': STATUS_OK}
best_params = fmin(
  fn=objective,
  space=10 + hp.randint('n_estimators', 980),
  algo=tpe.suggest,
  max_evals=10,
)
100%|██████████| 10/10 [27:32:34<00:00, 9915.41s/trial, best loss: -0.8238390276421177]   
best_params
{'n_estimators': 912}

Initially, the default value of 100 estimators was used. Note that fmin returns the raw value of the hp.randint label, so the reported 912 corresponds to 10 + 912 = 922 estimators actually passed to the objective function. Either way, this hyperopt run shows that a model with many more estimators than the default is more performant.
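
To recover the value that was actually evaluated, hyperopt provides the space_eval helper; a minimal sketch:

from hyperopt import space_eval

space = 10 + hp.randint('n_estimators', 980)
space_eval(space, best_params)  # 922: the n_estimators value passed to the objective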

Now we can train the model again with a higher value for this hyperparameter:

reg =  GradientBoostingRegressor(n_estimators=946)

X_train = train_val_encoded.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_train = train_val_encoded["y_target"]
X_test = test_encoded.drop(columns=["spectrum_id", "b_target",  "y_target"])
y_test = test_encoded["y_target"]

reg.fit(X_train, y_train)
GradientBoostingRegressor(n_estimators=946)
y_test_pred = reg.predict(X_test)
np.corrcoef(y_test, y_test_pred)[0][1]
0.8245284609614351

Much better already. To get a more accurate view of the model performance, we should calculate the correlation per spectrum, instead of across the full dataset:

prediction_df_y = pd.DataFrame({
    "spectrum_id": test_encoded["spectrum_id"],
    "target_y": y_test,
    "prediction_y": y_test_pred,
})
prediction_df_y
spectrum_id target_y prediction_y
0 9.0 0.000000 -0.001765
1 9.0 0.000000 -0.001524
2 9.0 0.000000 -0.000848
3 9.0 0.000000 0.000110
4 9.0 0.000000 0.004115
... ... ... ...
367683 398369.0 0.000000 0.157859
367684 398369.0 0.224074 0.195466
367685 398369.0 0.283664 0.160815
367686 398369.0 0.185094 0.138235
367687 398369.0 0.192657 0.223232

367688 rows × 3 columns

# groupby().corr() returns a 2x2 correlation matrix per spectrum;
# taking every other row selects the target-prediction correlation
corr_y = prediction_df_y.groupby("spectrum_id").corr().iloc[::2]['prediction_y']
corr_y.index = corr_y.index.droplevel(1)
corr_y = corr_y.reset_index().rename(columns={"prediction_y": "correlation"})
corr_y
spectrum_id correlation
0 9.0 0.839347
1 16.0 0.806933
2 39.0 0.113225
3 95.0 -0.078798
4 140.0 0.744942
... ... ...
27031 398328.0 0.856521
27032 398341.0 0.358528
27033 398342.0 0.839326
27034 398368.0 0.452030
27035 398369.0 0.204123

27036 rows × 2 columns

Median correlation:

corr_y["correlation"].median()
0.8328406567933057

Correlation distribution:

import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("whitegrid")
sns.catplot(
    data=corr_y, x="correlation",
    fliersize=1,
    kind="box", aspect=4, height=2
)
plt.show()

Not bad! With more thorough hyperparameter optimization (optimizing only the number of trees is a bit crude), further performance gains could be made. Take a look at the scikit-learn documentation to learn more about the various hyperparameters of the GradientBoostingRegressor. Alternatively, you could switch to the XGBoost algorithm, which is currently used by MS²PIP.
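
For instance, a broader hyperopt search space could cover several hyperparameters at once. The parameter names and ranges below are illustrative choices, not taken from MS²PIP:

# Sketch: multi-parameter search space (ranges are illustrative)
space = {
    "n_estimators": 10 + hp.randint("n_estimators", 980),
    "learning_rate": hp.loguniform("learning_rate", np.log(0.01), np.log(0.3)),
    "max_depth": 2 + hp.randint("max_depth", 8),
}

def objective(params):
    reg = GradientBoostingRegressor(
        n_estimators=int(params["n_estimators"]),
        learning_rate=params["learning_rate"],
        max_depth=int(params["max_depth"]),
    )
    reg.fit(X_train, y_train)
    correlation = np.corrcoef(y_test, reg.predict(X_test))[0][1]
    return {"loss": -correlation, "status": STATUS_OK}

best_params = fmin(fn=objective, space=space, algo=tpe.suggest, max_evals=10)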

And of course, this model can only predict y-ion intensities. You can repeat the training and optimization steps to train a model for b-ion intensities.
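
For example, a minimal sketch reusing the same feature matrix with the b-ion intensities as target:

# Same pipeline, but with b_target as the prediction target
reg_b = GradientBoostingRegressor(n_estimators=946)
reg_b.fit(X_train, train_val_encoded["b_target"])
b_test_pred = reg_b.predict(X_test)
np.corrcoef(test_encoded["b_target"], b_test_pred)[0][1]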

Good luck!

References

Friedman, Jerome H. 2002. “Stochastic Gradient Boosting.” Computational Statistics & Data Analysis 38 (4): 367–78. https://doi.org/10.1016/s0167-9473(01)00065-2.
Gabriels, Ralf, Lennart Martens, and Sven Degroeve. 2019. “Updated MS²PIP Web Server Delivers Fast and Accurate MS² Peak Intensity Prediction for Multiple Fragmentation Methods, Instruments and Labeling Techniques.” Nucleic Acids Research 47 (W1): W295–99. https://doi.org/10.1093/nar/gkz299.