Tutorial 7: Assembling custom pipelines
[1]:
import homelette as hm
import matplotlib.pyplot as plt
import seaborn as sns
Introduction
Welcome to the final tutorial on homelette. This tutorial is about combining what we learnt in the previous tutorials about building blocks for model generation and model evaluation.
The strength of homelette lies in its ability to (A) be almost freely extended by the user (see Tutorial 4) and (B) let pre-defined or custom-made building blocks for model generation and evaluation be assembled into custom pipelines with ease. This tutorial showcases (B).
For our target sequence, ARAF, we will identify templates and generate alignments with the AlignmentGenerator_pdb building block [1,2,3,4]. We will select two templates, BRAF (3NY5) and RAF1 (4G0N). We will build models for ARAF from the different templates with two different routines, Routine_automodel_default and Routine_automodel_slow [5,6]. The generated models will be evaluated by SOAP scores and MolProbity, and a combined score will be calculated using Borda Count [7,8,9,10].
Alignment
Consistent with the other tutorials, we will be modelling the protein ARAF. For this tutorial, we will use the AlignmentGenerator_pdb to search for templates, create an alignment, and process both the templates and the alignment:
[2]:
gen = hm.alignment.AlignmentGenerator_pdb.from_fasta('data/alignments/ARAF.fa')
[3]:
# search for templates and generate first alignment
gen.get_suggestion()
gen.show_suggestion()
Querying PDB...
Query successful, 16 found!
Retrieving sequences...
Sequences successfully retrieved!
Generating alignment...
Alignment generated!
[3]:
| template | coverage | identity |
---|---|---|---
0 | 6XI7_2 | 100.0 | 60.27 |
1 | 1C1Y_2 | 100.0 | 60.27 |
2 | 1GUA_2 | 100.0 | 60.27 |
3 | 4G0N_2 | 100.0 | 60.27 |
4 | 4G3X_2 | 100.0 | 60.27 |
5 | 6VJJ_2 | 100.0 | 60.27 |
6 | 6XGU_2 | 100.0 | 60.27 |
7 | 6XGV_2 | 100.0 | 60.27 |
8 | 6XHA_2 | 100.0 | 60.27 |
9 | 6XHB_2 | 100.0 | 60.27 |
10 | 7JHP_2 | 100.0 | 60.27 |
11 | 3KUC_2 | 100.0 | 58.90 |
12 | 3KUD_2 | 100.0 | 58.90 |
13 | 3NY5_1 | 100.0 | 58.90 |
14 | 6NTD_2 | 100.0 | 53.42 |
15 | 6NTC_2 | 100.0 | 52.05 |
For this example, we will choose one template from BRAF (3NY5) and one from RAF1 (4G0N):
[4]:
# select templates and show alignment
gen.select_templates(['3NY5_1', '4G0N_2'])
gen.alignment.print_clustal(70)
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
3NY5_1 MGHHHHHHSHMQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKALMMRGLIPECCAVYRIQ---DGEKKPI
4G0N_2 -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
ARAF AWDTAIAPLDGEELIVEVL----------
3NY5_1 GWDTDISWLTGEELHVEVLENVPLTTHNF
4G0N_2 DWNTDAASLIGEELQVDFL----------
Next, we download the template structures and process both the alignment and the structures:
[5]:
# download structures, process alignment and structures
gen.get_pdbs()
gen.show_suggestion()
Guessing template naming format...
Template naming format guessed: polymer_entity!
Checking template dir...
Template dir not found...
New template dir created at
"/home/homelette/workdir/templates"!
Processing templates:
3NY5 downloading from PDB...
3NY5 downloaded!
3NY5_A: Chain extracted!
3NY5_A: Alignment updated!
3NY5_A: PDB processed!
3NY5_B: Chain extracted!
3NY5_B: Alignment updated!
3NY5_B: PDB processed!
3NY5_C: Chain extracted!
3NY5_C: Alignment updated!
3NY5_C: PDB processed!
3NY5_D: Chain extracted!
3NY5_D: Alignment updated!
3NY5_D: PDB processed!
4G0N downloading from PDB...
4G0N downloaded!
4G0N_B: Chain extracted!
4G0N_B: Alignment updated!
4G0N_B: PDB processed!
Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"/home/homelette/workdir/templates".
[5]:
| template | coverage | identity |
---|---|---|---
0 | 4G0N_B | 100.00 | 60.27 |
1 | 3NY5_B | 94.52 | 57.53 |
2 | 3NY5_A | 93.15 | 57.53 |
3 | 3NY5_C | 93.15 | 57.53 |
4 | 3NY5_D | 91.78 | 57.53 |
We can see that there are multiple chains of 3NY5 that fit our alignment. One of the chains has fewer missing residues than the others, so we choose this one:
[6]:
# select templates
gen.select_templates(['4G0N_B', '3NY5_B'])
gen.alignment.print_clustal(70)
gen.show_suggestion()
ARAF ----GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVTAWDTAIAPL
4G0N_B --TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASL
3NY5_B SHQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ-----EKKPIGWDTDISWL
ARAF DGEELIVEVL--------
4G0N_B IGEELQVDFL--------
3NY5_B TGEELHVEVLENVPLTTH
[6]:
| template | coverage | identity |
---|---|---|---
0 | 4G0N_B | 100.00 | 60.27 |
1 | 3NY5_B | 94.52 | 57.53 |
Now that we have our templates prepared and aligned, we can define a custom Task object to assemble homelette building blocks into a pipeline.
Custom pipeline
The easiest way to formulate custom pipelines from the homelette building blocks for model building and evaluation is to construct custom Task objects:
[7]:
class CustomPipeline(hm.Task):
'''
    Example for a custom pipeline
'''
def model_generation(self, templates):
# model generation with automodel_default
self.execute_routine(tag='automodel_def_' + '-'.join(templates),
routine = hm.routines.Routine_automodel_default,
templates = templates,
template_location = './templates/',
n_models = 20,
n_threads = 5)
        # model generation with automodel_slow
self.execute_routine(tag='autmodel_slow_' + '-'.join(templates),
routine = hm.routines.Routine_automodel_slow,
templates = templates,
template_location = './templates/',
n_models = 20,
n_threads = 5)
def model_evaluation(self):
# perform evaluation
self.evaluate_models(hm.evaluation.Evaluation_mol_probity,
n_threads=5)
self.evaluate_models(hm.evaluation.Evaluation_soap_protein,
n_threads=5)
self.evaluate_models(hm.evaluation.Evaluation_qmean4,
n_threads=5)
ev = self.get_evaluation()
# borda count for best models
ev['points_soap'] = ev.shape[0] - ev['soap_protein'].rank()
ev['points_mol_probity'] = ev.shape[0] - ev['mp_score'].rank()
ev['borda_score'] = ev['points_soap'] + ev['points_mol_probity']
ev['borda_rank'] = ev['borda_score'].rank(ascending=False)
ev = ev.drop(labels=['points_soap', 'points_mol_probity'], axis=1)
return ev
We have constructed a custom Task object (more specifically, a custom class that inherits all methods and attributes from Task) and added two more methods: model_generation and model_evaluation.
In CustomPipeline.model_generation we use two routines (Routine_automodel_default and Routine_automodel_slow) to generate 20 models each. In CustomPipeline.model_evaluation we evaluate the models using Evaluation_mol_probity and Evaluation_soap_protein and then rank the generated models based on both evaluation metrics using Borda Count. (Evaluation_qmean4 scores are computed as well, but they are not included in the combined score.)
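The Borda Count step is easy to verify in isolation. The sketch below reproduces the ranking logic from CustomPipeline.model_evaluation on a toy table with made-up scores (lower is better for both SOAP and MolProbity); the model names and values are purely illustrative:

```python
import pandas as pd

# Toy evaluation table: lower scores are better for both metrics
ev = pd.DataFrame({
    'model': ['m1', 'm2', 'm3', 'm4'],
    'soap_protein': [-45000.0, -44000.0, -46000.0, -43000.0],
    'mp_score': [2.2, 2.5, 2.4, 2.1],
})

# Each model earns (n_models - rank) points per metric,
# so the best-ranked model gets the most points
n = ev.shape[0]
ev['points_soap'] = n - ev['soap_protein'].rank()
ev['points_mp'] = n - ev['mp_score'].rank()

# Sum the points and rank by the combined score
ev['borda_score'] = ev['points_soap'] + ev['points_mp']
ev['borda_rank'] = ev['borda_score'].rank(ascending=False)
print(ev[['model', 'borda_score', 'borda_rank']])
```

A model that is mediocre on both metrics can still outrank a model that is excellent on one metric but poor on the other, which is the point of combining ranks rather than raw scores.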
After defining our custom Task object, we can initialize it from the AlignmentGenerator and execute the pipeline with two different templates as an example:
[8]:
# initialize task from alignment generator
t = gen.initialize_task(
task_name = 'Tutorial7',
overwrite = True,
task_class = CustomPipeline)
[9]:
# execute pipeline for different templates
t.model_generation(['3NY5_B'])
t.model_generation(['4G0N_B'])
df_eval = t.model_evaluation()
We have successfully generated and evaluated 80 models.
[10]:
# get template from tag
df_eval['template'] = df_eval['tag'].str.contains('3NY5').map({True: '3NY5', False: '4G0N'})
[11]:
df_eval.sort_values(by = 'borda_rank').head(10)
[11]:
| model | tag | routine | mp_score | soap_protein | qmean4 | qmean4_z_score | borda_score | borda_rank | template |
---|---|---|---|---|---|---|---|---|---|---
64 | autmodel_slow_4G0N_B_5.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.21 | -45545.746094 | 0.814469 | 0.255860 | 149.5 | 1.0 | 4G0N |
77 | autmodel_slow_4G0N_B_18.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.17 | -45043.023438 | 0.775498 | -0.340560 | 143.0 | 2.0 | 4G0N |
38 | autmodel_slow_3NY5_B_19.pdb | autmodel_slow_3NY5_B | automodel_slow | 2.42 | -48817.878906 | 0.769190 | -0.437096 | 141.0 | 3.0 | 3NY5 |
69 | autmodel_slow_4G0N_B_10.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.30 | -45205.257812 | 0.805243 | 0.114666 | 138.0 | 4.0 | 4G0N |
63 | autmodel_slow_4G0N_B_4.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.26 | -44921.707031 | 0.771055 | -0.408556 | 134.0 | 5.0 | 4G0N |
79 | autmodel_slow_4G0N_B_20.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.24 | -44596.234375 | 0.787342 | -0.159296 | 131.5 | 6.0 | 4G0N |
72 | autmodel_slow_4G0N_B_13.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.21 | -44206.707031 | 0.796167 | -0.024243 | 128.5 | 7.0 | 4G0N |
73 | autmodel_slow_4G0N_B_14.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.39 | -44924.730469 | 0.754554 | -0.661071 | 126.0 | 8.0 | 4G0N |
49 | automodel_def_4G0N_B_10.pdb | automodel_def_4G0N_B | automodel_default | 2.47 | -45311.910156 | 0.767716 | -0.459645 | 125.0 | 9.0 | 4G0N |
34 | autmodel_slow_3NY5_B_15.pdb | autmodel_slow_3NY5_B | automodel_slow | 2.33 | -44530.144531 | 0.720679 | -1.179500 | 124.0 | 10.0 | 3NY5 |
We can see that most of the 10 best models were generated with the slower routine Routine_automodel_slow. This is to be expected, as this routine spends more time on model refinement and should therefore produce “better” models.
Next, we visualize the results of our evaluation with seaborn.
Visualization
[12]:
# visualize combined score with seaborn
%matplotlib inline
# set font size
plt.rcParams.update({'font.size': 16})
plot = sns.boxplot(x = 'routine', y = 'borda_score', hue='template', data=df_eval,
palette='viridis')
plot.set(xlabel = 'Routine')
plot.set(ylabel = 'Combined Score')
plot.figure.set_size_inches(10, 10)
plot.legend(title = 'Template', loc = 'lower right', ncol = 2, fancybox = True)
#plot.figure.savefig('tutorial7.png', dpi=300)
[12]:
<matplotlib.legend.Legend at 0x7f23799902e0>
As expected, the routine that spends more time on model refinement (Routine_automodel_slow) produces better results on average. There are also interesting differences between the two templates.
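Besides the boxplot, the per-routine and per-template differences can be summarized numerically with a groupby. The sketch below shows the pattern on a hypothetical miniature of df_eval (the scores are made up; in the notebook you would call this on the real df_eval):

```python
import pandas as pd

# Hypothetical miniature of df_eval, just to show the summary pattern
df_eval = pd.DataFrame({
    'routine': ['automodel_default', 'automodel_default',
                'automodel_slow', 'automodel_slow'],
    'template': ['3NY5', '4G0N', '3NY5', '4G0N'],
    'borda_score': [60.0, 80.0, 100.0, 140.0],
})

# Mean combined score per routine/template combination,
# pivoted so that templates become columns
summary = (df_eval
           .groupby(['routine', 'template'])['borda_score']
           .mean()
           .unstack('template'))
print(summary)
```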
Further Reading
Congratulations on finishing the final tutorial about homelette. You might also be interested in the other tutorials:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., Chen, L., Rose, A. S., Bittrich, S., Burley, S. K., & Westbrook, J. D. (2021). RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. Journal of Molecular Biology, 433(11), 166704. https://doi.org/10.1016/J.JMB.2020.11.003
[2] Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 2017 35:11, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988
[3] Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., & Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. https://doi.org/10.1038/MSB.2011.75
[4] Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1), 135–145. https://doi.org/10.1002/PRO.3290
[5] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[6] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
[7] Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B., Sali, A., & Tramontano, A. (2013). Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics, 29(24), 3158–3166. https://doi.org/10.1093/bioinformatics/btt560
[8] Davis, I. W., Leaver-Fay, A., Chen, V. B., Block, J. N., Kapral, G. J., Wang, X., Murray, L. W., Arendall, W. B., Snoeyink, J., Richardson, J. S., & Richardson, D. C. (2007). MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research, 35(suppl_2), W375–W383. https://doi.org/10.1093/NAR/GKM216
[9] Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., & Richardson, D. C. (2010). MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D: Biological Crystallography, 66(1), 12–21. https://doi.org/10.1107/S0907444909042073
[10] Williams, C. J., Headd, J. J., Moriarty, N. W., Prisant, M. G., Videau, L. L., Deis, L. N., Verma, V., Keedy, D. A., Hintze, B. J., Chen, V. B., Jain, S., Lewis, S. M., Arendall, W. B., Snoeyink, J., Adams, P. D., Lovell, S. C., Richardson, J. S., & Richardson, D. C. (2018). MolProbity: More and better reference data for improved all-atom structure validation. Protein Science, 27(1), 293–315. https://doi.org/10.1002/pro.3330
Session Info
[13]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
matplotlib 3.1.2
pandas 1.5.3
seaborn 0.12.2
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib_inline 0.1.6
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
scipy 1.10.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:50