Predicting CCS values for TIMS data

Author
Affiliations

Robbin Bouwmeester

VIB-UGent Center for Medical Biotechnology, VIB, Belgium

Department of Biomolecular Medicine, Ghent University, Belgium

Published

September 21, 2022

Introduction

Ion mobility is a technique to separate ionized analytes based on their size, shape, and physicochemical properties. The first ion mobility instruments propelled ions with an electric field through a cell filled with an inert gas. The ions collide with the inert gas without fragmenting. Separation is achieved because ions are propelled faster or slower by the electric field (i.e., based on their charge) while being slowed down by collisions with the gas (i.e., based on their shape and size). Trapped ion mobility (TIMS) reverses this operation by holding the ions in place with an electric field while the flowing gas pushes them forward. From any of these ion mobility techniques the collisional cross section (CCS), expressed in squared angstrom, can be derived. In this notebook you can follow a short tutorial on how to train a machine learning model for the prediction of these CCS values.
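
For reference, the instrument measures an ion's mobility, and the Mason-Schamp relation is what links that mobility to a CCS value. The snippet below is a minimal, purely illustrative sketch of that conversion; the nitrogen gas mass, temperature, and unit handling are assumptions on our part, and the data set used in this notebook already contains CCS values, so this function is not needed for the rest of the tutorial.

import math

def mobility_to_ccs(k0, charge, m_ion, m_gas=28.0134, temp=305.0):
    """Illustrative Mason-Schamp conversion: reduced mobility K0 (cm^2/(V*s)) -> CCS (squared angstrom)."""
    e = 1.602176634e-19       # elementary charge (C)
    kb = 1.380649e-23         # Boltzmann constant (J/K)
    n0 = 2.6867801e25         # gas number density at standard conditions (m^-3)
    da = 1.66053906660e-27    # dalton in kg

    mu = (m_ion * m_gas) / (m_ion + m_gas) * da                 # reduced mass (kg)
    k0_si = k0 * 1e-4                                           # cm^2/(V*s) -> m^2/(V*s)
    ccs = (3 / 16) * math.sqrt(2 * math.pi / (mu * kb * temp)) * charge * e / (n0 * k0_si)
    return ccs * 1e20                                           # m^2 -> squared angstrom

# Example: a doubly charged peptide of ~1000 Da with K0 ~0.9 gives a CCS of several hundred squared angstrom
print(round(mobility_to_ccs(0.9, 2, 1000.0), 1))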

import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter
from scipy.stats import pearsonr

vol_dict = {"A" : 88.6,
            "B" : 0.0,
            "O" : 0.0,
            "X" : 0.0,
            "J" : 0.0,
            "R" : 173.4,
            "N" : 114.1,
            "D" : 111.1,
            "C" : 108.5,
            "Q" : 143.8,
            "E" : 138.4,
            "G" : 60.1,
            "H" : 153.2,
            "I" : 166.7,
            "L" : 166.7,
            "K" : 168.6,
            "M" : 162.9,
            "F" : 189.9,
            "P" : 112.7,
            "S" : 89.0,
            "T" : 116.1,
            "W" : 227.8,
            "Y" : 193.6,
            "V" : 140}

aa_to_pos = dict(zip(vol_dict.keys(),range(len(vol_dict.keys()))))

Data reading and preparation

Read the training data from Meier et al.

ccs_df = pd.read_csv("https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/ionmobility/Meier_IM_CCS/combined_sm.zip?raw=true", compression="zip", index_col=0)

Alternatively, execute the cell below to read a smaller data set from Van Puyvelde et al. by removing all the "#" characters. On Colab, for example, it is recommended to load this smaller data set. Please do note that the descriptions in this notebook are based on the larger data set, and it is expected that more complex models (e.g., the deep learning network) do not benefit at the same rate from the smaller data set. This traveling wave IM data was analyzed by Hans Vissers from Waters:

#ccs_df = pd.read_csv(
#    "https://github.com/ProteomicsML/ProteomicsML/raw/main/datasets/ionmobility/VanPuyvelde_TWIMS_CCS/TWIMSpeptideCCS.tsv.gz",
#    compression="gzip",
#    low_memory=False,
#    sep="\t"
#)

A short summary of the data that was just read:

ccs_df.describe()
        Charge          Mass            Intensity      Retention time   CCS
count   718917.000000   718917.000000   7.189170e+05   718917.000000    718917.000000
mean    2.376747        1829.771049     6.716163e+05   300.215311       475.545205
std     0.582843        606.256496      2.139819e+06   940.711797       109.083740
min     2.000000        696.428259      2.790800e+02   0.004795         275.418854
25%     2.000000        1361.766700     5.405700e+04   28.260000        392.076630
50%     2.000000        1729.834520     1.655000e+05   50.624000        454.656281
75%     3.000000        2189.009920     5.357000e+05   84.241000        534.702698
max     4.000000        4599.284130     2.481000e+08   6897.700000      1118.786133
ccs_df
Modified sequence Charge Mass Intensity Retention time CCS PT
0 _(ac)AAAAAAAAAAGAAGGR_ 2 1239.63200 149810.0 70.140 409.092529 False
1 _(ac)AAAAAAAAEQQSSNGPVKK_ 2 1810.91734 21349.0 19.645 481.229248 True
2 _(ac)AAAAAAAGAAGSAAPAAAAGAPGSGGAPSGSQGVLIGDR_ 3 3144.55482 194000.0 3947.700 772.098083 False
3 _(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_ 2 2634.18340 6416400.0 94.079 573.213196 False
4 _(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_ 3 2634.18340 5400600.0 94.841 635.000549 False
... ... ... ... ... ... ... ...
718912 _YYYVCQYCPAGNNM(ox)NR_ 2 2087.82880 131230.0 21.753 461.667145 True
718913 _YYYVCQYCPAGNWANR_ 2 2083.86690 84261.0 28.752 459.721191 True
718914 _YYYVPADFVEYEK_ 2 1684.76609 382810.0 92.273 436.103699 False
718915 _YYYVQNVYTPVDEHVYPDHR_ 3 2556.17099 30113.0 26.381 580.297058 True
718916 _YYYVQNVYTPVDEHVYPDHR_ 4 2556.17099 33682.0 26.381 691.901123 True

718917 rows × 7 columns

Prepare the sequences so they no longer contain the "_" characters or the modification annotations between "()" or "[]":

# Strip "_" from sequence
ccs_df["sequence"] = ccs_df["Modified sequence"].str.strip("_")

# Strip everything between "()" and "[]" from sequence
ccs_df["sequence"] = ccs_df["sequence"].str.replace(r"[\(\[].*?[\)\]]", "", regex=True)

Count the occurrence of each amino acid; amino acids that are not present in a sequence are filled with 0:

# Apply counter to each sequence, fill NA with 0.0, make matrix from counts
X_matrix_count = pd.DataFrame(ccs_df["sequence"].apply(Counter).to_dict()).fillna(0.0).T
X_matrix_count
A G R E Q S N P V K L I D W F M T C Y H
0 12.0 3.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
1 8.0 1.0 0.0 1.0 2.0 2.0 1.0 1.0 1.0 2.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
2 17.0 9.0 1.0 0.0 1.0 4.0 0.0 3.0 1.0 0.0 1.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0
3 9.0 1.0 1.0 1.0 0.0 3.0 0.0 1.0 2.0 1.0 0.0 0.0 5.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
4 9.0 1.0 1.0 1.0 0.0 3.0 0.0 1.0 2.0 1.0 0.0 0.0 5.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
718912 1.0 1.0 1.0 0.0 1.0 0.0 3.0 1.0 1.0 0.0 0.0 0.0 0.0 0.0 0.0 1.0 0.0 2.0 4.0 0.0
718913 2.0 1.0 1.0 0.0 1.0 0.0 2.0 1.0 1.0 0.0 0.0 0.0 0.0 1.0 0.0 0.0 0.0 2.0 4.0 0.0
718914 1.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 2.0 1.0 0.0 0.0 1.0 0.0 1.0 0.0 0.0 0.0 4.0 0.0
718915 0.0 0.0 1.0 1.0 1.0 0.0 1.0 2.0 4.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0 5.0 2.0
718916 0.0 0.0 1.0 1.0 1.0 0.0 1.0 2.0 4.0 0.0 0.0 0.0 2.0 0.0 0.0 0.0 1.0 0.0 5.0 2.0

718917 rows × 20 columns

A fairly rudimentary technique is to use the volume of each amino acid and sum these volumes:

def to_predicted_ccs(row):
    vol_sum = sum([vol_dict[k]*v for k,v in row.to_dict().items()])
    return vol_sum

ccs_df["predicted_CCS_vol_based"] = X_matrix_count.apply(to_predicted_ccs,axis=1)

Let's look at the results:

if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

for c in range(2,5):
    plt.scatter(
        ccs_df.loc[ccs_df["Charge"]==c,"CCS"],
        ccs_df.loc[ccs_df["Charge"]==c,"predicted_CCS_vol_based"],
        alpha=set_alpha,
        s=set_size,
        label="Z="+str(c)
    )

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

plt.xlabel("Observed CCS (Å^2)")
plt.ylabel("Predicted CCS (Å^2)")

plt.show()
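
To put a number on the visual correlation, we can compute the Pearson correlation between the summed volumes and the observed CCS per charge state (a quick optional check using the columns created above):

for c in range(2,5):
    sub_df = ccs_df[ccs_df["Charge"]==c]
    # Pearson correlation between the volume-based sum and the observed CCS for this charge state
    print(f"Z={c} PCC:", round(pearsonr(sub_df["predicted_CCS_vol_based"],sub_df["CCS"])[0],3))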

There is a clear correlation, but it seems we need a different intercept for each curve and separate predictions for each peptide charge state. In addition, the higher charge states show larger errors. This is likely influenced to a large extent by the relation between higher charge states and longer peptides: longer peptides can deviate more from each other in terms of structure (and thus CCS). Instead of spending more time on this, let's have a look at a more ML-based approach.

Training a linear regression model for CCS prediction

from sklearn.linear_model import LinearRegression
import numpy as np
import random

In this section we will fit a linear regression model. This model is only able to fit a linear function between the features (sequence) and target (CCS). This linear model can be expressed as the following equation:

$Y = \beta_0 + \beta_1 X$

Where \(Y\) is a vector (/list) of all CCS values and \(X\) a matrix (/2-dimensional list) of all the amino acid counts. The intercept \(\beta_0\) and the weight \(\beta_1\) of each feature are learned so that the predicted values (\(\hat{Y}\)) are close to the observed outcomes (\(Y\)). What is considered close, and how this closeness between predictions and observations is minimized, is not discussed further here. However, there is a rich amount of information available on the internet (e.g., https://www.coursera.org/learn/machine-learning).
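
For the curious, a minimal sketch of how such a fit can be obtained directly with NumPy is shown below; scikit-learn, used in the next cells, does the equivalent with a more robust solver, and the variable names here are just for illustration.

# Illustrative only: ordinary least squares solved directly with NumPy
X_ols = np.hstack([np.ones((len(X_matrix_count), 1)), X_matrix_count.to_numpy()])  # prepend a column of ones for the intercept
y_ols = ccs_df["CCS"].to_numpy()

beta, *_ = np.linalg.lstsq(X_ols, y_ols, rcond=None)  # minimizes the squared differences between X*beta and y
print("Intercept (beta_0):", round(beta[0], 2))
print("First five amino acid weights (beta_1):", np.round(beta[1:6], 2))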

First, we will split the matrix into 90% training peptides and 10% testing peptides. These testing peptides are very valuable in estimating model performance. Since the model has not seen these sequences before it cannot overfit on these particular examples.

# Get all the index identifiers
all_idx = list(X_matrix_count.index)
random.seed(42)

# Shuffle the index identifiers so we can randomly split them in a testing and training set
random.shuffle(all_idx)

# Select 90 % for training and the remaining 10 % for testing
train_idx = all_idx[0:int(len(all_idx)*0.9)]
test_idx = all_idx[int(len(all_idx)*0.9):]

# Get the train and test indices and point to new variables
ccs_df_train = ccs_df.loc[train_idx,:]
ccs_df_test = ccs_df.loc[test_idx,:]

# Also for the feature matrix get the train and test indices
X_matrix_count_train = X_matrix_count.loc[train_idx,:]
X_matrix_count_test = X_matrix_count.loc[test_idx,:]

Now let's start training the models. Although we could encode the charge as a feature, here we train a separate model per charge state to counter any charge-to-composition-specific patterns.

# Initialize a model object
linear_model_z2 = LinearRegression()

# Fit the initialized model object to our training data (only charge 2)
linear_model_z2.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==2,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==2,"CCS"]
)

# Repeat for the other two charge states
linear_model_z3 = LinearRegression()
linear_model_z3.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==3,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==3,"CCS"]
)

linear_model_z4 = LinearRegression()
linear_model_z4.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==4,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==4,"CCS"]
)
LinearRegression()

Now we can have a look at the learned coefficients \(\beta_1\). These should be highly correlated with the experimentally determined volumes of each amino acid:

# Scatter plot the coefficients of each amino acid against their experimentally determined volumes
plt.scatter(
    linear_model_z2.coef_,
    [vol_dict[v] for v in X_matrix_count.columns]
)

# Plot a diagonal line we expect the points to be on
plt.plot(
    [6.0,26.0],
    [60.0,260],
    c="grey",
    zorder=0
)

# Annotate each point with their respective amino acids
for v,x,y in zip(X_matrix_count.columns,
                 linear_model_z2.coef_,
                 [vol_dict[v] for v in X_matrix_count.columns]):

    plt.annotate(v,(x+0.1,y+5.0))

plt.show()

The two are indeed very similar. The remaining differences could be caused by a multitude of reasons; for example, the effective volumes in the CCS cell may differ from the reference volumetric measurements, or being part of a polypeptide chain may change the volume of an amino acid.
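
The similarity can also be quantified with the Pearson correlation between the learned coefficients and the experimentally determined volumes (a small optional check):

# Correlation between learned coefficients and experimentally determined volumes
print(round(pearsonr(
    linear_model_z2.coef_,
    [vol_dict[v] for v in X_matrix_count.columns]
)[0],3))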

Next we will plot the predictions for the test set and compare them with the observed values. Note that we apply each charge model separately.

if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set (z=2)
plt.scatter(
    linear_model_z2.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==2,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

# Scatter plot the observations on the test set against the predictions on the same set (z=3)
plt.scatter(
    linear_model_z3.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==3,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

# Scatter plot the observations on the test set against the predictions on the same set (z=4)
plt.scatter(
    linear_model_z4.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==4,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

# Add a legend for the charge states
legend = plt.legend()

# Make sure the legend labels are visible and big enough
for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = linear_model_z3.predict(X_matrix_count_test.loc[ccs_df_test["Charge"]==3,:])
mare = round(sum((abs(predictions-ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])/ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])/ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])*100,95)*2,2)

plt.title(f"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}% (z3 model for z3 observations)")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Predicted CCS (Å^2)")
plt.ylabel("Observed CCS (Å^2)")

plt.show()

The predictions and observations clearly fall on the diagonal, which means they are very similar. However, there are still some differences between observations and predictions.

In the previous example we trained a model for each charge state separately. This is slightly inconvenient, and other charge states might still provide useful training examples, as long as the model corrects for the right charge state of course. In the next example we add the charge state to the feature matrix. The linear model should be (partially…) able to account for the charge states of peptides.

# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Fit the linear model, but this time with the charge as a feature
linear_model = LinearRegression()

linear_model.fit(
    X=X_matrix_count_charge_train,
    y=ccs_df_train.loc[:,"CCS"]
)
LinearRegression()
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = linear_model.predict(X=X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Predicted CCS (Å^2)")
plt.ylabel("Observed CCS (Å^2)")

plt.show()

With this model we are able to predict CCS values for all three charge states (maybe more, but be careful with extrapolation). However, it also shows that both z=3 and z=4 are not optimally predicted. Especially for z=4 we could probably draw a line by hand that outperforms the current model. The inability of the model to correctly predict some of these values is largely due to the linear algorithm: it can only fit "simple" linear relations, while more complex relations are not modeled correctly. In the next section we will fit a non-linear model that is able to capture these complex relations better. However, keep in mind that more complex models are usually also better at overfitting the data, which can result in poorer generalization performance.
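
To back this up with numbers, the error of the combined linear model can be computed per charge state (a small optional check; the overall metrics in the plot title hide these differences):

for c in range(2,5):
    mask = ccs_df_test["Charge"]==c
    preds_c = linear_model.predict(X_matrix_count_charge_test.loc[mask,:])
    obs_c = ccs_df_test.loc[mask,"CCS"]
    # Mean absolute relative error for this charge state
    print(f"Z={c} MARE:", round((abs(preds_c-obs_c)/obs_c*100).mean(),3), "%")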

Training an RF (non-linear) regression model for CCS prediction

In this section we will fit a random forest (RF) regression model, hoping to capture some of the non-linear relations present in the data. The RF algorithm fits multiple decision trees, and what makes these trees different from each other is the random selection of instances (peptides) and/or features (amino acid counts). The predictions of this forest of trees are averaged to obtain a single prediction per peptide (instead of multiple predictions for the same peptide). Later we will see that this algorithm might actually not be the most suitable for fitting this type of data.

from sklearn.ensemble import RandomForestRegressor
# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Initialize a RF object, note the hyperparameters that the model will follow
rf_model = RandomForestRegressor(
                max_depth=20,
                n_estimators=50,
                n_jobs=-1
)

# Fit the RF model
rf_model.fit(
    X=X_matrix_count_charge_train,
    y=ccs_df_train.loc[:,"CCS"]
)
RandomForestRegressor(max_depth=20, n_estimators=50, n_jobs=-1)
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = rf_model.predict(X=X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"RF - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Predicted CCS (Å^2)")
plt.ylabel("Observed CCS (Å^2)")

plt.show()

As can be observed, the problem of z=4 splitting up is gone, probably thanks to the ability of RF to fit non-linear relations. However, we see quite a large deviation from the diagonal. One of the major causes of this problem is the exclusion of amino acid counts (features) in the individual decision trees. Although this is fundamental to the inner workings of RF, it means that the excluded amino acids cannot be taken into account; their contribution is effectively replaced by the average expected volume of the other (non-excluded) amino acids. RF performs very well when features correlate and predictions do not fully depend on the inclusion of all features. Next we will look at a decision tree algorithm (XGBoost) that does not rely on the exclusion of features.
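
As a side note, we can inspect which features the fitted forest actually relies on through its impurity-based feature importances (these are relative contributions, not volumes; a quick optional check):

# Relative feature importances of the fitted random forest
importances = pd.Series(
    rf_model.feature_importances_,
    index=X_matrix_count_charge_train.columns
).sort_values(ascending=False)
print(importances.head(10))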

PS: note that you might be able to fit a much better model by using a much larger number of trees, but the problem largely remains; it is better to choose an algorithm that respects/fits your data best.

Training a XGBoost (non-linear) regression model for CCS prediction

In this section we will fit an XGBoost regression model. This algorithm trains a sequence of underfitted models, where each model in the sequence fits the errors that remain after the previous decision trees. This combination of trees allows the model to fit the data well without greatly overfitting it.
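
The idea of fitting the errors of previous trees can be illustrated with two shallow trees from scikit-learn. This is a conceptual sketch only; XGBoost additionally uses a learning rate, regularization, and gradient information.

from sklearn.tree import DecisionTreeRegressor

# First shallow (underfitted) tree fits the CCS values directly
tree_1 = DecisionTreeRegressor(max_depth=3).fit(X_matrix_count_charge_train, ccs_df_train["CCS"])
residuals = ccs_df_train["CCS"] - tree_1.predict(X_matrix_count_charge_train)

# Second shallow tree fits what the first tree got wrong
tree_2 = DecisionTreeRegressor(max_depth=3).fit(X_matrix_count_charge_train, residuals)

# The combined prediction is the sum of both trees; the error decreases with each added tree
combined = tree_1.predict(X_matrix_count_charge_train) + tree_2.predict(X_matrix_count_charge_train)
print("MAE single tree:", round(np.mean(np.abs(ccs_df_train["CCS"] - tree_1.predict(X_matrix_count_charge_train))),2))
print("MAE two trees  :", round(np.mean(np.abs(ccs_df_train["CCS"] - combined)),2))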

from xgboost import XGBRegressor
# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Initialize the XGB object
xgb_model = XGBRegressor(
                max_depth=12,
                n_estimators=250
)

# Fit the XGB model
xgb_model.fit(
    X_matrix_count_charge_train,
    ccs_df_train.loc[:,"CCS"]
)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
             colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
             early_stopping_rounds=None, enable_categorical=False,
             eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
             importance_type=None, interaction_constraints='',
             learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
             max_delta_step=0, max_depth=12, max_leaves=0, min_child_weight=1,
             missing=nan, monotone_constraints='()', n_estimators=250, n_jobs=0,
             num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
             reg_lambda=1, ...)
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = xgb_model.predict(X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"XGBoost - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Predicted CCS (Å^2)")
plt.ylabel("Observed CCS (Å^2)")

plt.show()

Training a deep learning LSTM model for CCS prediction

The deviation from the diagonal has decreased significantly. Still, a decision-tree-based algorithm is usually not the best choice for a regression model: since the target is continuous, a model that respects this structure is likely to perform better. Furthermore, up to now we simply counted amino acids, but structure is important, so to get the most out of the data we need to use the exact positions of the amino acids.

Also, we have a lot of data, so it makes sense to use deep learning (DL). DL models are usually capable of learning more complex relations than traditional algorithms. Furthermore, traditional ML algorithms usually require engineered features, while DL can work directly from raw data and construct its own features.

from tensorflow.keras.layers import Dense, concatenate, Input, Bidirectional, LSTM
from tensorflow.keras.models import Model
import tensorflow as tf

As mentioned before, we want to use features that can also tell us something about the potential structure of the peptide. This means we need to take the sequence of the peptide into account and not just the amino acid counts. For this we will use a 'one-hot encoding': each position in the peptide is a column (the number of columns equals the (padded) length of the peptide) and each amino acid has its own row (20 rows for the standard amino acids). So we create a matrix of the number of unique amino acid symbols in the whole data set by the length of the peptide. For each position we indicate presence with a '1' and absence with a '0'. As a result, the sum of each non-padding column is '1' and the sum of the whole matrix equals the length of the peptide.

def aa_seq_to_one_hot(seq,padding_length=60):
    # Although padding is not needed for an LSTM, we might need it if we for example apply a CNN
    # Calculate how much padding is needed
    seq_len = len(seq)
    if seq_len > padding_length:
        seq = seq[0:padding_length]
        seq_len = len(seq)

    # Add padding for peptides that are too short
    padding = "".join(["X"] * (padding_length - len(seq)))
    seq = seq + padding

    # Initialize all feature matrix
    matrix_hc = np.zeros(
        (len(aa_to_pos.keys()), len(seq)), dtype=np.int8)

    # Fill the one-hot matrix, when we encounter an 'X' it should be the end of the sequence
    for idx,aa in enumerate(seq):
        if aa == "X":
            break
        matrix_hc[aa_to_pos[aa],idx] = 1

    return matrix_hc
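
A quick sanity check of the encoding on a short, hypothetical peptide: the number of rows equals the number of amino acid symbols in the dictionary defined above, the number of columns equals the padding length, and the matrix sum equals the peptide length:

example_matrix = aa_seq_to_one_hot("ACDEK")
print(example_matrix.shape)   # (number of amino acid symbols, padding length)
print(example_matrix.sum())   # 5, the length of the peptide
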
# Calculate the one-hot matrices and stack them
# Result is a 3D matrix where the first dimension is each peptide, and then the last two dims are the one-hot matrix
one_hot_encoded_train = np.stack(ccs_df_train["sequence"].apply(aa_seq_to_one_hot).values)
one_hot_encoded_test = np.stack(ccs_df_test["sequence"].apply(aa_seq_to_one_hot).values)
if len(ccs_df.index) < 1e4:
    epochs = 100
    num_lstm = 12
    batch_size = 128
else:
    batch_size = 1024
    epochs = 10
    num_lstm = 64

batch_size = 128
v_split    = 0.1
optimizer  = "adam"
loss       = "mean_squared_error"

# The architecture chosen consists of two inputs: (1) the one-hot matrix and (2) the charge
# The first part is a bidirectional LSTM (a); in parallel we have dense layers containing the charge (b)
# Both a and b are concatenated to go through several dense layers (c)
input_a = Input(shape=(None, one_hot_encoded_train.shape[2]))
a = Bidirectional(LSTM(num_lstm,return_sequences=True))(input_a)
a = Bidirectional(LSTM(num_lstm))(a)
a = Model(inputs=input_a, outputs=a)

input_b = Input(shape=(1,))
b = Dense(5, activation="relu")(input_b)
b = Model(inputs=input_b, outputs=b)

c = concatenate([a.output, b.output],axis=-1)

c = Dense(64, activation="relu")(c)
c = Dense(32, activation="relu")(c)
c = Dense(1, activation="relu")(c)

# Create the model with specified inputs and outputs
model = Model(inputs=[a.input, b.input], outputs=c)

model.compile(optimizer=optimizer, loss=loss)

# Fit the model on the training data
history = model.fit(
            (one_hot_encoded_train,ccs_df_train.loc[:,"Charge"]),
            ccs_df_train.loc[:,"CCS"],
            epochs=epochs,
            batch_size=batch_size,
            validation_split=v_split
)
Epoch 1/10
4550/4550 [==============================] - 108s 23ms/step - loss: 5536.7256 - val_loss: 467.1984
Epoch 2/10
4550/4550 [==============================] - 102s 22ms/step - loss: 436.4136 - val_loss: 403.9187
Epoch 3/10
4550/4550 [==============================] - 100s 22ms/step - loss: 395.7551 - val_loss: 384.5470
Epoch 4/10
4550/4550 [==============================] - 103s 23ms/step - loss: 376.4315 - val_loss: 382.0145
Epoch 5/10
4550/4550 [==============================] - 102s 23ms/step - loss: 364.8819 - val_loss: 395.8338
Epoch 6/10
4550/4550 [==============================] - 106s 23ms/step - loss: 357.4092 - val_loss: 355.8185
Epoch 7/10
4550/4550 [==============================] - 104s 23ms/step - loss: 342.5216 - val_loss: 312.9571
Epoch 8/10
4550/4550 [==============================] - 104s 23ms/step - loss: 275.4517 - val_loss: 274.8963
Epoch 9/10
4550/4550 [==============================] - 110s 24ms/step - loss: 253.2955 - val_loss: 252.0118
Epoch 10/10
4550/4550 [==============================] - 105s 23ms/step - loss: 240.6064 - val_loss: 260.1613
# Predict CCS values test set
ccs_df_test["LSTM_predictions"] = model.predict((one_hot_encoded_test,ccs_df_test.loc[:,"Charge"]))
2247/2247 [==============================] - 20s 8ms/step
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = ccs_df_test["LSTM_predictions"]
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"LSTM - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Å^2)")
plt.ylabel("Predicted CCS (Å^2)")

plt.show()

It is clear that the performance of this model is much better. But performance can be improved a lot more by, for example, tuning hyperparameters like the network architecture or the number of epochs.
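
For example, a cheap improvement is usually to train for more epochs but stop once the validation loss no longer improves. A sketch of such a setup with a Keras callback is shown below; the patience value is an arbitrary choice for illustration, and the training call is commented out so it is not rerun by accident.

from tensorflow.keras.callbacks import EarlyStopping

# Stop training when the validation loss stops improving and keep the best weights
early_stopping = EarlyStopping(
    monitor="val_loss",
    patience=3,
    restore_best_weights=True
)

# history = model.fit(
#     (one_hot_encoded_train, ccs_df_train.loc[:,"Charge"]),
#     ccs_df_train.loc[:,"CCS"],
#     epochs=50,
#     batch_size=batch_size,
#     validation_split=v_split,
#     callbacks=[early_stopping]
# )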

Hope you enjoyed this tutorial! Feel free to edit it and make a pull request!