import pandas as pd
from matplotlib import pyplot as plt
from collections import Counter
from scipy.stats import pearsonr
# Use 0.0 as the volume for non-amino acids that may still be present in the data and might otherwise cause problems
vol_dict = {"A" : 88.6,
"B" : 0.0,
"O" : 0.0,
"X" : 0.0,
"J" : 0.0,
"R" : 173.4,
"N" : 114.1,
"D" : 111.1,
"C" : 108.5,
"Q" : 143.8,
"E" : 138.4,
"G" : 60.1,
"H" : 153.2,
"I" : 166.7,
"L" : 166.7,
"K" : 168.6,
"M" : 162.9,
"F" : 189.9,
"P" : 112.7,
"S" : 89.0,
"T" : 116.1,
"W" : 227.8,
"Y" : 193.6,
"V" : 140}
aa_to_pos = dict(zip(vol_dict.keys(), range(len(vol_dict.keys()))))
Predicting CCS values for TIMS data
Introduction
Ion mobility is a technique to separate ionized analytes based on their size, shape, and physicochemical properties. Early ion mobility setups propelled the ions with an electric field through a cell filled with an inert gas; the ions collide with this gas without fragmenting. Separation is achieved because the electric field propels the ions faster or slower (i.e., based on their charge) while collisions with the gas slow them down (i.e., based on shape and size). Trapped ion mobility (TIMS) reverses this operation by trapping the ions in an electric field and pushing them forward through collisions with the gas. From any of these ion mobility techniques the collisional cross section (CCS), expressed in squared angstrom, can be derived. In this notebook you can follow a short tutorial on how to train a machine learning model for the prediction of these CCS values.
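For reference, ion mobility measurements are commonly converted to a CCS value (\(\Omega\)) through the Mason-Schamp equation (shown here only as background; it is not used in this notebook), where \(z\) is the ion charge, \(e\) the elementary charge, \(N\) the buffer gas number density, \(\mu\) the reduced mass of ion and gas, \(k_B\) the Boltzmann constant, \(T\) the temperature, and \(K\) the measured ion mobility:

$\Omega = \frac{3ze}{16N} \sqrt{\frac{2\pi}{\mu k_B T}} \frac{1}{K}$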
Data reading and preparation
Read the training data from Meier et al.
ccs_df = pd.read_csv(
    "https://github.com/ProteomicsML/ProteomicsML/blob/main/datasets/ionmobility/Meier_IM_CCS/combined_sm.zip?raw=true",
    compression="zip",
    index_col=0
)
Execute the cell below to read a smaller data set from Van Puyvelde et al. instead: remove all the "#" characters to uncomment the code. On, for example, Colab it is recommended to load this smaller data set. Please do note that the descriptions in this notebook are based on the larger data set, and that more complex models (e.g., the deep learning network) are not expected to benefit at the same rate from the smaller data set. This travelling wave IM data was analyzed by Hans Vissers from Waters:
#ccs_df = pd.read_csv(
# "https://github.com/ProteomicsML/ProteomicsML/raw/main/datasets/ionmobility/VanPuyvelde_TWIMS_CCS/TWIMSpeptideCCS.tsv.gz",
# compression="gzip",
# low_memory=False,
# sep="\t"
#)
A small summary of the data that was just read:
ccs_df.describe()
 | Charge | Mass | Intensity | Retention time | CCS
---|---|---|---|---|---
count | 718917.000000 | 718917.000000 | 7.189170e+05 | 718917.000000 | 718917.000000 |
mean | 2.376747 | 1829.771049 | 6.716163e+05 | 300.215311 | 475.545205 |
std | 0.582843 | 606.256496 | 2.139819e+06 | 940.711797 | 109.083740 |
min | 2.000000 | 696.428259 | 2.790800e+02 | 0.004795 | 275.418854 |
25% | 2.000000 | 1361.766700 | 5.405700e+04 | 28.260000 | 392.076630 |
50% | 2.000000 | 1729.834520 | 1.655000e+05 | 50.624000 | 454.656281 |
75% | 3.000000 | 2189.009920 | 5.357000e+05 | 84.241000 | 534.702698 |
max | 4.000000 | 4599.284130 | 2.481000e+08 | 6897.700000 | 1118.786133 |
ccs_df
 | Modified sequence | Charge | Mass | Intensity | Retention time | CCS | PT
---|---|---|---|---|---|---|---
0 | _(ac)AAAAAAAAAAGAAGGR_ | 2 | 1239.63200 | 149810.0 | 70.140 | 409.092529 | False |
1 | _(ac)AAAAAAAAEQQSSNGPVKK_ | 2 | 1810.91734 | 21349.0 | 19.645 | 481.229248 | True |
2 | _(ac)AAAAAAAGAAGSAAPAAAAGAPGSGGAPSGSQGVLIGDR_ | 3 | 3144.55482 | 194000.0 | 3947.700 | 772.098083 | False |
3 | _(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_ | 2 | 2634.18340 | 6416400.0 | 94.079 | 573.213196 | False |
4 | _(ac)AAAAAAAGDSDSWDADAFSVEDPVRK_ | 3 | 2634.18340 | 5400600.0 | 94.841 | 635.000549 | False |
... | ... | ... | ... | ... | ... | ... | ... |
718912 | _YYYVCQYCPAGNNM(ox)NR_ | 2 | 2087.82880 | 131230.0 | 21.753 | 461.667145 | True |
718913 | _YYYVCQYCPAGNWANR_ | 2 | 2083.86690 | 84261.0 | 28.752 | 459.721191 | True |
718914 | _YYYVPADFVEYEK_ | 2 | 1684.76609 | 382810.0 | 92.273 | 436.103699 | False |
718915 | _YYYVQNVYTPVDEHVYPDHR_ | 3 | 2556.17099 | 30113.0 | 26.381 | 580.297058 | True |
718916 | _YYYVQNVYTPVDEHVYPDHR_ | 4 | 2556.17099 | 33682.0 | 26.381 | 691.901123 | True |
718917 rows × 7 columns
Prepare the data so the sequences do not contain any "_" characters or modification annotations between "(" ")" or "[" "]":
# Strip "_" from sequence
ccs_df["sequence"] = ccs_df["Modified sequence"].str.strip("_")

# Strip everything between "()" and "[]" from sequence
ccs_df["sequence"] = ccs_df["sequence"].str.replace(r"[\(\[].*?[\)\]]", "", regex=True)
Count the occurrence of each amino acid; amino acids that do not occur in a sequence are filled in with 0:
# Apply counter to each sequence, fill NA with 0.0, make matrix from counts
X_matrix_count = pd.DataFrame(ccs_df["sequence"].apply(Counter).to_dict()).fillna(0.0).T
X_matrix_count
 | A | G | R | E | Q | S | N | P | V | K | L | I | D | W | F | M | T | C | Y | H
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
0 | 12.0 | 3.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
1 | 8.0 | 1.0 | 0.0 | 1.0 | 2.0 | 2.0 | 1.0 | 1.0 | 1.0 | 2.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
2 | 17.0 | 9.0 | 1.0 | 0.0 | 1.0 | 4.0 | 0.0 | 3.0 | 1.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
3 | 9.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
4 | 9.0 | 1.0 | 1.0 | 1.0 | 0.0 | 3.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 5.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
718912 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 3.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 2.0 | 4.0 | 0.0 |
718913 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 2.0 | 4.0 | 0.0 |
718914 | 1.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 4.0 | 0.0 |
718915 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 2.0 |
718916 | 0.0 | 0.0 | 1.0 | 1.0 | 1.0 | 0.0 | 1.0 | 2.0 | 4.0 | 0.0 | 0.0 | 0.0 | 2.0 | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 5.0 | 2.0 |
718917 rows × 20 columns
A fairly rudimentary technique is to use the volume of each amino acid and sum these volumes:
def to_predicted_ccs(row):
    vol_sum = sum([vol_dict[k]*v for k,v in row.to_dict().items()])
    return vol_sum

ccs_df["predicted_CCS_vol_based"] = X_matrix_count.apply(to_predicted_ccs, axis=1)
Let's see the results:
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

for c in range(2,5):
    plt.scatter(
        ccs_df.loc[ccs_df["Charge"]==c,"CCS"],
        ccs_df.loc[ccs_df["Charge"]==c,"predicted_CCS_vol_based"],
        alpha=set_alpha,
        s=set_size,
        label="Z="+str(c)
    )

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
Clear correlation, but it seems we need to change the intercept of each curve and make separate predictions for each peptide charge state. In addition to these observations, higher charge states seem to have higher errors. This is likely influenced in large part by the relation between higher charge states and longer peptides; these longer peptides can deviate more from each other in terms of structure (and CCS). Instead of spending more time on this, let's have a look at a more ML-based approach.
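As a quick illustration of the per-charge intercept idea, a simple least-squares line could be fitted between the volume-based prediction and the observed CCS for each charge state separately (a minimal sketch using numpy's polyfit; the variable names are illustrative and this correction is not used further in the notebook):

import numpy as np

# Fit observed CCS as a linear function of the summed volume, separately per charge state
for c in range(2, 5):
    sub = ccs_df[ccs_df["Charge"] == c]
    slope, intercept = np.polyfit(sub["predicted_CCS_vol_based"], sub["CCS"], 1)
    print(f"Z={c}: CCS ~ {slope:.3f} * volume_sum + {intercept:.1f}")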
Training a linear regression model for CCS prediction
from sklearn.linear_model import LinearRegression
import numpy as np
import random
In this section we will fit a linear regression model. This model is only able to fit a linear function between the features (sequence) and target (CCS). This linear model can be expressed as the following equation:
$Y = \beta_0 + \beta_1 X$
Where \(Y\) is a vector (/list) of all CCS values and \(X\) a matrix (/2-dimensional list) of all the amino acid counts. The intercept and the weight of each feature are learned so that the predicted value (\(\hat{Y}\)) is close to the observed outcome (\(Y\)). What is considered close, and how this closeness between predictions and observations is minimized, is not further discussed here. However, there is a rich amount of information available on the internet (e.g., https://www.coursera.org/learn/machine-learning, and for scikit-learn in particular https://scikit-learn.org/stable/).
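For reference, scikit-learn's LinearRegression uses ordinary least squares, i.e., the intercept and weights are chosen to minimize the sum of squared differences between observations and predictions:

$\hat{\beta_0}, \hat{\beta_1} = \underset{\beta_0, \beta_1}{\arg\min} \sum_{i} \left( y_i - \beta_0 - \beta_1 x_i \right)^2$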
First, we will split the matrix into 90% training peptides and 10% testing peptides. These testing peptides are very valuable in estimating model performance. Since the model has not seen these sequences before it cannot overfit on these particular examples.
# Get all the index identifiers
all_idx = list(X_matrix_count.index)
random.seed(42)

# Shuffle the index identifiers so we can randomly split them in a testing and training set
random.shuffle(all_idx)

# Select 90 % for training and the remaining 10 % for testing
train_idx = all_idx[0:int(len(all_idx)*0.9)]
test_idx = all_idx[int(len(all_idx)*0.9):]

# Get the train and test indices and point to new variables
ccs_df_train = ccs_df.loc[train_idx,:]
ccs_df_test = ccs_df.loc[test_idx,:]

# Also for the feature matrix get the train and test indices
X_matrix_count_train = X_matrix_count.loc[train_idx,:]
X_matrix_count_test = X_matrix_count.loc[test_idx,:]
Now let’s start training the models. Although we could encode the charge as a feature, here we train a separate model per charge state to account for any charge-specific relations between composition and CCS.
# Initialize a model object
linear_model_z2 = LinearRegression()

# Fit the initialized model object to our training data (only charge 2)
linear_model_z2.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==2,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==2,"CCS"]
)

# Repeat for the other two charge states
linear_model_z3 = LinearRegression()

linear_model_z3.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==3,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==3,"CCS"]
)

linear_model_z4 = LinearRegression()

linear_model_z4.fit(
    X=X_matrix_count_train.loc[ccs_df_train["Charge"]==4,:],
    y=ccs_df_train.loc[ccs_df_train["Charge"]==4,"CCS"]
)
LinearRegression()
Now we can have a look at the learned coefficients \(\beta_1\). These should be highly correlated with the experimentally determined amino acid volumes used earlier:
# Scatter plot the coefficients of each amino acid against their experimentally determined volumes
plt.scatter(
    linear_model_z2.coef_,
    [vol_dict[v] for v in X_matrix_count.columns]
)

# Plot a diagonal line we expect the points to be on
plt.plot(
    [6.0,26.0],
    [60.0,260],
    c="grey",
    zorder=0
)

# Annotate each point with their respective amino acids
for v,x,y in zip(X_matrix_count.columns,
                 linear_model_z2.coef_,
                 [vol_dict[v] for v in X_matrix_count.columns]):
    plt.annotate(v,(x+0.1,y+5.0))

plt.show()
The learned coefficients and the experimental volumes are indeed very similar. The remaining differences could be caused by a multitude of reasons: for example, the conditions under which the volumes were experimentally determined differ from those in the CCS cell, or being part of a polypeptide chain changes the effective volume of an amino acid.
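This similarity can also be quantified directly with the already imported pearsonr (a small optional check; the variable name is illustrative):

# Correlation between the learned z2 coefficients and the experimentally determined volumes
volumes = [vol_dict[v] for v in X_matrix_count.columns]
print(pearsonr(linear_model_z2.coef_, volumes)[0])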
Next we will plot the predictions of the test set and compare them with observational data. Note that we apply each charge model separately.
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set (z=2)
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    linear_model_z2.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==2,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

# Scatter plot the observations on the test set against the predictions on the same set (z=3)
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    linear_model_z3.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==3,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

# Scatter plot the observations on the test set against the predictions on the same set (z=4)
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    linear_model_z4.predict(X=X_matrix_count_test.loc[ccs_df_test["Charge"]==4,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

# Add a legend for the charge states
legend = plt.legend()

# Make sure the legend labels are visible and big enough
for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions of the z3 model on the z3 test peptides and calculate performance metrics
predictions = linear_model_z3.predict(X_matrix_count_test.loc[ccs_df_test["Charge"]==3,:])
mare = round(sum((abs(predictions-ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])/ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])/ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"])*100,95)*2,2)

plt.title(f"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}% (z3 model for z3 observations)")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
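Since the same three metrics (PCC, MARE, and twice the 95th percentile of the relative error) come back for every model below, they could be wrapped in a small helper function (an optional convenience sketch with an illustrative name; the notebook keeps computing them inline):

def ccs_performance(predictions, observations):
    # Relative error in percent
    rel_err = (abs(predictions - observations) / observations) * 100
    mare = round(sum(rel_err) / len(predictions), 3)
    pcc = round(pearsonr(predictions, observations)[0], 3)
    perc_95 = round(np.percentile(rel_err, 95) * 2, 2)
    return pcc, mare, perc_95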
It is clear that the predictions and observations lie close to the diagonal, meaning they are very similar. However, there are still some differences between observations and predictions.
In the previous example we trained a model per charge state. This is slightly inconvenient, and other charge states might still provide useful training examples, as long as the model corrects for the right charge state of course. In the next example we add the charge state to the feature matrix. The linear model should be (at least partially) able to account for the charge states of peptides.
# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Fit the linear model, but this time with the charge as a feature
linear_model = LinearRegression()

linear_model.fit(
    X=X_matrix_count_charge_train,
    y=ccs_df_train.loc[:,"CCS"]
)
LinearRegression()
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    linear_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = linear_model.predict(X=X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"Linear model - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
With this model we are capable of predicting CCS values for all three charge states (maybe more, but be careful with extrapolation). However, it also shows that both z3 and z4 are not optimally predicted; especially for z4 we could probably draw a line by hand that performs better than the current model. The incapability of the model to correctly predict some of these values is largely due to the linear algorithm: it can only fit “simple” linear relations, while more complex relations are not modeled correctly. In the next section we will fit a non-linear model that is able to capture these complex relations better. However, keep in mind that more complex models are usually also better at overfitting the data, which can result in poorer generalization performance.
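One way to give a linear model a bit more flexibility, while keeping the algorithm itself linear, is to expand the feature matrix with interaction terms between charge and composition. The sketch below uses scikit-learn's PolynomialFeatures for this (illustrative only, with an illustrative model name; it is not evaluated further in this notebook):

from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline

# Degree-2 interaction terms allow charge-dependent weights per amino acid count
poly_linear_model = make_pipeline(
    PolynomialFeatures(degree=2, interaction_only=True, include_bias=False),
    LinearRegression()
)
poly_linear_model.fit(X_matrix_count_charge_train, ccs_df_train.loc[:,"CCS"])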
Training an RF (non-linear) regression model for CCS prediction
In this section we will fit a random forest (RF) regression model, with which we hope to fit some of the non-linear relations present in the data. The RF algorithm fits multiple decision trees, and what makes these trees different from each other is the random selection of instances (peptides) and/or features (amino acid counts). The predictions of the trees in the forest are averaged to obtain a single prediction per peptide (instead of multiple predictions for the same peptide). Later we will see that this algorithm might actually not be the most suitable for fitting this type of data.
from sklearn.ensemble import RandomForestRegressor
# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Initialize a RF object, note the hyperparameters that the model will follow
rf_model = RandomForestRegressor(
    max_depth=20,
    n_estimators=50,
    n_jobs=-1
)

# Fit the RF model
rf_model.fit(
    X=X_matrix_count_charge_train,
    y=ccs_df_train.loc[:,"CCS"]
)
RandomForestRegressor(max_depth=20, n_estimators=50, n_jobs=-1)
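To make the averaging over trees concrete, the individual trees of the fitted forest can be queried and their predictions averaged manually; this should match rf_model.predict (a small optional check, with illustrative variable names):

# Average the per-tree predictions for the first few test peptides
per_tree = np.stack([
    tree.predict(X_matrix_count_charge_test.head().values)
    for tree in rf_model.estimators_
])
print(per_tree.mean(axis=0))
print(rf_model.predict(X_matrix_count_charge_test.head()))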
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    rf_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = rf_model.predict(X=X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"RF - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
As can be observed, the problem of z=4 splitting up is gone, probably due to the capability of RF to fit non-linear relations. However, we still see quite a large deviation around the diagonal. One of the major causes is the exclusion of amino acid counts when building the individual decision trees. Although this randomization is fundamental to the inner workings of RF, it means that the excluded amino acids cannot be taken into account, and their contribution is effectively replaced by the average expected volume of the other (non-excluded) amino acids. RF performs very well when features correlate and predictions do not fully depend on the inclusion of all features. Next we will look at a decision tree algorithm (XGBoost) that does not rely on the exclusion of features.
PS: note that you might be able to fit a much better model by using a much larger number of trees, but the problem largely remains, and it is better to choose an algorithm that respects/fits your data best.
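To get a feeling for what the forest relies on, the relative (impurity-based) importance of each feature can be inspected; charge is expected to dominate (an optional sketch with an illustrative variable name):

# Rank the features by their importance in the fitted forest
importances = pd.Series(
    rf_model.feature_importances_,
    index=X_matrix_count_charge_train.columns
).sort_values(ascending=False)
print(importances.head(10))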
Training an XGBoost (non-linear) regression model for CCS prediction
In this section we will fit an XGBoost regression model. This algorithm works by training a sequence of underfitted models, where each model in the sequence tries to correct the errors of the previous decision trees. This combination of trees allows the data to be fitted well without greatly overfitting it.
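In gradient boosting terms, this sequential correction can be written as an additive model, where each new tree \(h_m\) is fitted to the remaining errors of the ensemble built so far and added with a learning rate \(\eta\):

$F_m(X) = F_{m-1}(X) + \eta \, h_m(X)$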
from xgboost import XGBRegressor
# Make a new copy of feature matrix and add charge as a feature
X_matrix_count_charge_train = X_matrix_count_train.copy()
X_matrix_count_charge_train["charge"] = ccs_df_train["Charge"]

X_matrix_count_charge_test = X_matrix_count_test.copy()
X_matrix_count_charge_test["charge"] = ccs_df_test["Charge"]
# Initialize the XGB object
xgb_model = XGBRegressor(
    max_depth=12,
    n_estimators=250
)

# Fit the XGB model
xgb_model.fit(
    X_matrix_count_charge_train,
    ccs_df_train.loc[:,"CCS"]
)
XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None, colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise', importance_type=None, interaction_constraints='', learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4, max_delta_step=0, max_depth=12, max_leaves=0, min_child_weight=1, missing=nan, monotone_constraints='()', n_estimators=250, n_jobs=0, num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1, ...)
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==2,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==3,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    xgb_model.predict(X=X_matrix_count_charge_test.loc[ccs_df_test["Charge"]==4,:]),
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = xgb_model.predict(X_matrix_count_charge_test)
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"XGBoost - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
Training a deep learning LSTM model for CCS prediction
The deviation around the diagonal has decreased significantly. But… a decision tree based algorithm is usually not the best choice for a regression model: since the target data is continuous, a model that can respect this structure is likely to perform better. Furthermore, up till now we simply counted amino acids, but structure is important, so to get the most out of the data we need to use the exact positions of the amino acids.
Also… we have a lot of data, so it makes sense to use deep learning (DL). DL models are usually capable of learning more complex relations than traditional algorithms. Furthermore, traditional ML algorithms usually require engineered features, while DL can usually work directly from raw data and construct its own features.
from tensorflow.keras.layers import Dense, concatenate, Input, Bidirectional, LSTM
from tensorflow.keras.models import Model
import tensorflow as tf
As mentioned before, we want to use features that can also tell us something about the potential structure of the peptide. This means we need to take the sequence of the peptide into account and not just the amino acid counts. For this we will use a ‘one-hot encoding’: each position in the peptide is a column (the number of columns equals the length of the peptide) and each amino acid has its own row (for the standard amino acids this is 20). So as a result we create a matrix with one row per unique amino acid in the whole data set and one column per position in the peptide. For each position we indicate the presence of an amino acid with a ‘1’ and its absence with a ‘0’. As a result the sum of each column is ‘1’ and the sum of the whole matrix equals the length of the peptide.
def aa_seq_to_one_hot(seq, padding_length=60):
    # Although padding is not needed for an LSTM, we might need it if we for example apply a CNN
    # Calculate how much padding is needed
    seq_len = len(seq)
    if seq_len > padding_length:
        seq = seq[0:padding_length]
        seq_len = len(seq)

    # Add padding for peptides that are too short
    padding = "".join(["X"] * (padding_length - len(seq)))
    seq = seq + padding

    # Initialize the feature matrix
    matrix_hc = np.zeros(
        (len(aa_to_pos.keys()), len(seq)), dtype=np.int8)

    # Fill the one-hot matrix, when we encounter an 'X' it should be the end of the sequence
    for idx, aa in enumerate(seq):
        if aa == "X":
            break
        matrix_hc[aa_to_pos[aa], idx] = 1

    return matrix_hc
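A quick sanity check on a short, made-up example sequence illustrates the shape and the column/matrix sums described above (optional):

example = aa_seq_to_one_hot("ACDK", padding_length=10)
print(example.shape)        # (number of amino acid symbols, padding_length)
print(example.sum(axis=0))  # 1 for each filled position, 0 for the padding
print(example.sum())        # equals the length of the peptide (4)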
# Calculate the one-hot matrices and stack them
# Result is a 3D matrix where the first dimension is each peptide, and then the last two dims are the one-hot matrix
one_hot_encoded_train = np.stack(ccs_df_train["sequence"].apply(aa_seq_to_one_hot).values)
one_hot_encoded_test = np.stack(ccs_df_test["sequence"].apply(aa_seq_to_one_hot).values)
if len(ccs_df.index) < 1e4:
    epochs = 100
    num_lstm = 12
    batch_size = 128
else:
    batch_size = 1024
    epochs = 10
    num_lstm = 64

batch_size = 128
v_split = 0.1
optimizer = "adam"
loss = "mean_squared_error"
# The architecture chosen consists of two inputs: (1) the one-hot matrix and (2) the charge
# The first part is a bidirectional LSTM (a), in parallel we have dense layers containing the charge (b)
# Both a and b are concatenated to go through several dense layers (c)
input_a = Input(shape=(None, one_hot_encoded_train.shape[2]))
a = Bidirectional(LSTM(num_lstm, return_sequences=True))(input_a)
a = Bidirectional(LSTM(num_lstm))(a)
a = Model(inputs=input_a, outputs=a)

input_b = Input(shape=(1,))
b = Dense(5, activation="relu")(input_b)
b = Model(inputs=input_b, outputs=b)

c = concatenate([a.output, b.output], axis=-1)

c = Dense(64, activation="relu")(c)
c = Dense(32, activation="relu")(c)
c = Dense(1, activation="relu")(c)

# Create the model with specified inputs and outputs
model = Model(inputs=[a.input, b.input], outputs=c)

model.compile(optimizer=optimizer, loss=loss)
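Optionally, the resulting architecture can be inspected with Keras’ built-in summary before training:

# Print the layers, output shapes, and parameter counts of the compiled model
model.summary()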
# Fit the model on the training data
history = model.fit(
    (one_hot_encoded_train, ccs_df_train.loc[:,"Charge"]),
    ccs_df_train.loc[:,"CCS"],
    epochs=epochs,
    batch_size=batch_size,
    validation_split=v_split
)
Epoch 1/10
4550/4550 [==============================] - 108s 23ms/step - loss: 5536.7256 - val_loss: 467.1984
Epoch 2/10
4550/4550 [==============================] - 102s 22ms/step - loss: 436.4136 - val_loss: 403.9187
Epoch 3/10
4550/4550 [==============================] - 100s 22ms/step - loss: 395.7551 - val_loss: 384.5470
Epoch 4/10
4550/4550 [==============================] - 103s 23ms/step - loss: 376.4315 - val_loss: 382.0145
Epoch 5/10
4550/4550 [==============================] - 102s 23ms/step - loss: 364.8819 - val_loss: 395.8338
Epoch 6/10
4550/4550 [==============================] - 106s 23ms/step - loss: 357.4092 - val_loss: 355.8185
Epoch 7/10
4550/4550 [==============================] - 104s 23ms/step - loss: 342.5216 - val_loss: 312.9571
Epoch 8/10
4550/4550 [==============================] - 104s 23ms/step - loss: 275.4517 - val_loss: 274.8963
Epoch 9/10
4550/4550 [==============================] - 110s 24ms/step - loss: 253.2955 - val_loss: 252.0118
Epoch 10/10
4550/4550 [==============================] - 105s 23ms/step - loss: 240.6064 - val_loss: 260.1613
# Predict CCS values for the test set
ccs_df_test["LSTM_predictions"] = model.predict((one_hot_encoded_test, ccs_df_test.loc[:,"Charge"]))
2247/2247 [==============================] - 20s 8ms/step
if len(ccs_df.index) < 1e4:
    set_alpha = 0.2
    set_size = 3
else:
    set_alpha = 0.05
    set_size = 1

# Scatter plot the observations on the test set against the predictions on the same set
plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==2,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=2"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==3,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=3"
)

plt.scatter(
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"CCS"],
    ccs_df_test.loc[ccs_df_test["Charge"]==4,"LSTM_predictions"],
    alpha=set_alpha,
    s=set_size,
    label="Z=4"
)

# Plot a diagonal line the points should be on
plt.plot([300,1100],[300,1100],c="grey")

legend = plt.legend()

for lh in legend.legendHandles:
    lh.set_sizes([25])
    lh.set_alpha(1)

# Get the predictions and calculate performance metrics
predictions = ccs_df_test["LSTM_predictions"]
mare = round(sum((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100)/len(predictions),3)
pcc = round(pearsonr(predictions,ccs_df_test.loc[:,"CCS"])[0],3)
perc_95 = round(np.percentile((abs(predictions-ccs_df_test.loc[:,"CCS"])/ccs_df_test.loc[:,"CCS"])*100,95)*2,2)

plt.title(f"LSTM - PCC: {pcc} - MARE: {mare}% - 95th percentile: {perc_95}%")

ax = plt.gca()
ax.set_aspect('equal')

plt.xlabel("Observed CCS (Angstrom^2)")
plt.ylabel("Predicted CCS (Angstrom^2)")

plt.show()
It is clear that the performance of this model is much better. But… performance can be improved a lot more, for example by tuning hyperparameters like the network architecture or the number of epochs.
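As one example of such tuning, the model could be trained for more epochs while stopping automatically once the validation loss no longer improves, using a Keras EarlyStopping callback (a sketch only; the number of epochs and the patience value are illustrative):

from tensorflow.keras.callbacks import EarlyStopping

# Stop when the validation loss has not improved for 5 epochs and keep the best weights
early_stop = EarlyStopping(monitor="val_loss", patience=5, restore_best_weights=True)

history = model.fit(
    (one_hot_encoded_train, ccs_df_train.loc[:,"Charge"]),
    ccs_df_train.loc[:,"CCS"],
    epochs=100,
    batch_size=batch_size,
    validation_split=v_split,
    callbacks=[early_stop]
)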
Hope you enjoyed this tutorial! Feel free to edit it and make a pull request!