Tutorial 7: Assembling custom pipelines

[1]:
import homelette as hm

import matplotlib as plt
import seaborn as sns

Introduction

Welcome to the final tutorial on homelette. This tutorial is about combining what we learnt in the previous tutorials about model generating and model evaluating building blocks.

The strength of homelette lies in its ability to A) be almost freely extendable by the user (see Tutorial 4) and B) in the ease with which pre-defined or custom-made building blocks for model generation and evaluation can be assembled into custom pipelines. This tutorial showcases B).

For our target sequence, ARAF, we will identify templates and generate alignments with the AlignmentGenerator_pdb building block [1,2,3,4]. We will select two templates, BRAF (3NY5) and RAF1 (4G0N). We will build models for ARAF with two different routines, Routine_automodel_default and Routine_automodel_slow [5,6], and from the different templates. The generated models will be evaluated by SOAP scores and MolProbity and a combined score will be calculated using Borda Count [7,8,9,10].

Alignment

Consistent with the other tutorials, we will be modelling the protein ARAF. For this tutorial, we will use the AlignmentGenerator_pdb in order to search for templates, create an alignment, and process both the templates as well as the alignment:

[2]:
gen = hm.alignment.AlignmentGenerator_pdb.from_fasta('data/alignments/ARAF.fa')
[3]:
# search for templates and generate first alignment
gen.get_suggestion()
gen.show_suggestion()
Querying PDB...
Query successful, 16 found!

Retrieving sequences...
Sequences succefully retrieved!

Generating alignment...
Alignment generated!

[3]:
template coverage identity
0 6XI7_2 100.0 60.27
1 1C1Y_2 100.0 60.27
2 1GUA_2 100.0 60.27
3 4G0N_2 100.0 60.27
4 4G3X_2 100.0 60.27
5 6VJJ_2 100.0 60.27
6 6XGU_2 100.0 60.27
7 6XGV_2 100.0 60.27
8 6XHA_2 100.0 60.27
9 6XHB_2 100.0 60.27
10 7JHP_2 100.0 60.27
11 3KUC_2 100.0 58.90
12 3KUD_2 100.0 58.90
13 3NY5_1 100.0 58.90
14 6NTD_2 100.0 53.42
15 6NTC_2 100.0 52.05

For this example, we will choose one template of BRAF (3NY5) and one template from RAF1 (4G0N):

[4]:
# select templates and show alignment
gen.select_templates(['3NY5_1', '4G0N_2'])
gen.alignment.print_clustal(70)
ARAF        -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
3NY5_1      MGHHHHHHSHMQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKALMMRGLIPECCAVYRIQ---DGEKKPI
4G0N_2      -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL


ARAF        AWDTAIAPLDGEELIVEVL----------
3NY5_1      GWDTDISWLTGEELHVEVLENVPLTTHNF
4G0N_2      DWNTDAASLIGEELQVDFL----------


Next, we download the template structures and process both the alignment and the structures:

[5]:
# download structures, process alignment and structures
gen.get_pdbs()
gen.show_suggestion()
Guessing template naming format...
Template naming format guessed: polymer_entity!

Checking template dir...
Template dir not found...
New template dir created at
"/home/homelette/workdir/templates"!

Processing templates:

3NY5 downloading from PDB...
3NY5 downloaded!
3NY5_A: Chain extracted!
3NY5_A: Alignment updated!
3NY5_A: PDB processed!
3NY5_B: Chain extracted!
3NY5_B: Alignment updated!
3NY5_B: PDB processed!
3NY5_C: Chain extracted!
3NY5_C: Alignment updated!
3NY5_C: PDB processed!
3NY5_D: Chain extracted!
3NY5_D: Alignment updated!
3NY5_D: PDB processed!
4G0N downloading from PDB...
4G0N downloaded!
4G0N_B: Chain extracted!
4G0N_B: Alignment updated!
4G0N_B: PDB processed!

Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"/home/homelette/workdir/templates".
[5]:
template coverage identity
0 4G0N_B 100.00 60.27
1 3NY5_B 94.52 57.53
2 3NY5_A 93.15 57.53
3 3NY5_C 93.15 57.53
4 3NY5_D 91.78 57.53

We can see that there are multiple chains of 3NY5 that fit our alignment. One of the chains has less missing residues than the other ones, so we are choosing this one:

[6]:
# select templates
gen.select_templates(['4G0N_B', '3NY5_B'])
gen.alignment.print_clustal(70)
gen.show_suggestion()
ARAF        ----GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVTAWDTAIAPL
4G0N_B      --TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASL
3NY5_B      SHQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ-----EKKPIGWDTDISWL


ARAF        DGEELIVEVL--------
4G0N_B      IGEELQVDFL--------
3NY5_B      TGEELHVEVLENVPLTTH


[6]:
template coverage identity
0 4G0N_B 100.00 60.27
1 3NY5_B 94.52 57.53

Now that we have our templates prepared and aligned, we can now define a custom Task object in order to assemble homelette building blocks into a pipeline:

Custom pipeline

The easiest way to formulate custom pipelines by assembling the homelette building blocks of model building and evaluation is to construct custom Task objects:

[7]:
class CustomPipeline(hm.Task):
    '''
    Example for a cumstom pipeline
    '''
    def model_generation(self, templates):
        # model generation with automodel_default
        self.execute_routine(tag='automodel_def_' + '-'.join(templates),
                             routine = hm.routines.Routine_automodel_default,
                             templates = templates,
                             template_location = './templates/',
                             n_models = 20,
                             n_threads = 5)
        # model generation with autmodel_slow
        self.execute_routine(tag='autmodel_slow_' + '-'.join(templates),
                             routine = hm.routines.Routine_automodel_slow,
                             templates = templates,
                             template_location = './templates/',
                             n_models = 20,
                             n_threads = 5)

    def model_evaluation(self):
        # perform evaluation
        self.evaluate_models(hm.evaluation.Evaluation_mol_probity,
                             n_threads=5)
        self.evaluate_models(hm.evaluation.Evaluation_soap_protein,
                             n_threads=5)
        self.evaluate_models(hm.evaluation.Evaluation_qmean4,
                            n_threads=5)
        ev = self.get_evaluation()
        # borda count for best models
        ev['points_soap'] = ev.shape[0] - ev['soap_protein'].rank()
        ev['points_mol_probity'] = ev.shape[0] - ev['mp_score'].rank()
        ev['borda_score'] = ev['points_soap'] + ev['points_mol_probity']
        ev['borda_rank'] = ev['borda_score'].rank(ascending=False)
        ev = ev.drop(labels=['points_soap', 'points_mol_probity'], axis=1)
        return ev

We have constructed a custom Task object (more specifically, a custom objects that inherits all methods and attributes from Task) and added two more functions: model_generation and model_evaluation.

In CustomPipeline.model_generation we are using two routines (Routine_automodel_default and Routine_automodel_slow) to generate 20 models each. In CustomPipeline.model_generation we evaluate the models using Evaluation_mol_probity and Evaluation_soap_protein and then rank the generated models based on both evaluation metrics using Borda Count.

After constructing our pipeline, let’s execute it with two different templates as an example:

After having a custom Task object defined, we can initialize it from the AlignmentGenerator in order to do the modelling and evaluation:

[8]:
# initialize task from alignment generator
t = gen.initialize_task(
    task_name = 'Tutorial7',
    overwrite = True,
    task_class = CustomPipeline)
[9]:
# execute pipeline for different templates
t.model_generation(['3NY5_B'])
t.model_generation(['4G0N_B'])

df_eval = t.model_evaluation()

We have successfully generated and evaluated 80 models.

[10]:
# get template from tag
df_eval['template'] = df_eval['tag'].str.contains('3NY5').map({True: '3NY5', False: '4G0N'})
[11]:
df_eval.sort_values(by = 'borda_rank').head(10)
[11]:
model tag routine mp_score soap_protein qmean4 qmean4_z_score borda_score borda_rank template
64 autmodel_slow_4G0N_B_5.pdb autmodel_slow_4G0N_B automodel_slow 2.21 -45545.746094 0.814469 0.255860 149.5 1.0 4G0N
77 autmodel_slow_4G0N_B_18.pdb autmodel_slow_4G0N_B automodel_slow 2.17 -45043.023438 0.775498 -0.340560 143.0 2.0 4G0N
38 autmodel_slow_3NY5_B_19.pdb autmodel_slow_3NY5_B automodel_slow 2.42 -48817.878906 0.769190 -0.437096 141.0 3.0 3NY5
69 autmodel_slow_4G0N_B_10.pdb autmodel_slow_4G0N_B automodel_slow 2.30 -45205.257812 0.805243 0.114666 138.0 4.0 4G0N
63 autmodel_slow_4G0N_B_4.pdb autmodel_slow_4G0N_B automodel_slow 2.26 -44921.707031 0.771055 -0.408556 134.0 5.0 4G0N
79 autmodel_slow_4G0N_B_20.pdb autmodel_slow_4G0N_B automodel_slow 2.24 -44596.234375 0.787342 -0.159296 131.5 6.0 4G0N
72 autmodel_slow_4G0N_B_13.pdb autmodel_slow_4G0N_B automodel_slow 2.21 -44206.707031 0.796167 -0.024243 128.5 7.0 4G0N
73 autmodel_slow_4G0N_B_14.pdb autmodel_slow_4G0N_B automodel_slow 2.39 -44924.730469 0.754554 -0.661071 126.0 8.0 4G0N
49 automodel_def_4G0N_B_10.pdb automodel_def_4G0N_B automodel_default 2.47 -45311.910156 0.767716 -0.459645 125.0 9.0 4G0N
34 autmodel_slow_3NY5_B_15.pdb autmodel_slow_3NY5_B automodel_slow 2.33 -44530.144531 0.720679 -1.179500 124.0 10.0 3NY5

We can see that most of the best 10 models were generated with the slower routine Routine_autmodel_slow. This is to be expected, as this routine spends more time on model refinement and should therefore produce “better” models.

Next, we visualize the results of our evaluation with seaborn.

Visualization

[12]:
# visualize combined score with seaborn
%matplotlib inline

# set font size
plt.rcParams.update({'font.size': 16})

plot = sns.boxplot(x = 'routine', y = 'borda_score', hue='template', data=df_eval,
                   palette='viridis')
plot.set(xlabel = 'Routine')
plot.set(ylabel = 'Combined Score')
plot.figure.set_size_inches(10, 10)
plot.legend(title = 'Template', loc = 'lower right', ncol = 2, fancybox = True)
#plot.figure.savefig('tutorial7.png', dpi=300)
[12]:
<matplotlib.legend.Legend at 0x7f23799902e0>
_images/Tutorial7_AssemblingPipelines_25_1.png

As expected, the routine which spends more time on model refinement (Routine_automodel_slow) produces on average better results. Also, there are interesting differences between the templates used.

Further Reading

Congratulations on finishing the final tutorial about homelette. You might also be interested in the other tutorials:

  • Tutorial 1: Learn about the basics of homelette.

  • Tutorial 2: Learn more about already implemented routines for homology modelling.

  • Tutorial 3: Learn about the evaluation metrics available with homelette.

  • Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.

  • Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.

  • Tutorial 6: Learn about modelling protein complexes.

  • Tutorial 8: Learn about automated template identification, alignment generation and template processing.

References

[1] Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., Chen, L., Rose, A. S., Bittrich, S., Burley, S. K., & Westbrook, J. D. (2021). RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. Journal of Molecular Biology, 433(11), 166704. https://doi.org/10.1016/J.JMB.2020.11.003

[2] Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 2017 35:11, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988

[3] Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., & Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. https://doi.org/10.1038/MSB.2011.75

[4] Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1), 135–145. https://doi.org/10.1002/PRO.3290

[5] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626

[6] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3

[7] Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B., Sali, A., & Tramontano, A. (2013). Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics, 29(24), 3158–3166. https://doi.org/10.1093/bioinformatics/btt560

[8] Davis, I. W., Leaver-Fay, A., Chen, V. B., Block, J. N., Kapral, G. J., Wang, X., Murray, L. W., Arendall, W. B., Snoeyink, J., Richardson, J. S., & Richardson, D. C. (2007). MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research, 35(suppl_2), W375–W383. https://doi.org/10.1093/NAR/GKM216

[9] Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., & Richardson, D. C. (2010). MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D: Biological Crystallography, 66(1), 12–21. https://doi.org/10.1107/S0907444909042073

[10] Williams, C. J., Headd, J. J., Moriarty, N. W., Prisant, M. G., Videau, L. L., Deis, L. N., Verma, V., Keedy, D. A., Hintze, B. J., Chen, V. B., Jain, S., Lewis, S. M., Arendall, W. B., Snoeyink, J., Adams, P. D., Lovell, S. C., Richardson, J. S., & Richardson, D. C. (2018). MolProbity: More and better reference data for improved all-atom structure validation. Protein Science, 27(1), 293–315. https://doi.org/10.1002/pro.3330

Session Info

[13]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette           1.4
matplotlib          3.1.2
pandas              1.5.3
seaborn             0.12.2
session_info        1.0.0
-----
PIL                 7.0.0
altmod              NA
anyio               NA
asttokens           NA
attr                19.3.0
babel               2.12.1
backcall            0.2.0
certifi             2022.12.07
chardet             3.0.4
charset_normalizer  3.1.0
comm                0.1.2
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.6
decorator           4.4.2
executing           1.2.0
fastjsonschema      NA
idna                3.4
importlib_metadata  NA
importlib_resources NA
ipykernel           6.21.3
ipython_genutils    0.2.0
jedi                0.18.2
jinja2              3.1.2
json5               NA
jsonschema          4.17.3
jupyter_events      0.6.3
jupyter_server      2.4.0
jupyterlab_server   2.20.0
kiwisolver          1.0.1
markupsafe          2.1.2
matplotlib_inline   0.1.6
modeller            10.4
more_itertools      NA
mpl_toolkits        NA
nbformat            5.7.3
numexpr             2.8.4
numpy               1.24.2
ost                 2.3.1
packaging           20.3
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        3.1.1
prometheus_client   NA
promod3             3.2.1
prompt_toolkit      3.0.38
psutil              5.5.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.14.0
pyparsing           2.4.6
pyrsistent          NA
pythonjsonlogger    NA
pytz                2022.7.1
qmean               NA
requests            2.28.2
rfc3339_validator   0.1.4
rfc3986_validator   0.1.1
scipy               1.10.1
send2trash          NA
sitecustomize       NA
six                 1.12.0
sniffio             1.3.0
stack_data          0.6.2
swig_runtime_data4  NA
tornado             6.2
traitlets           5.9.0
urllib3             1.26.15
wcwidth             NA
websocket           1.5.1
yaml                6.0
zipp                NA
zmq                 25.0.1
-----
IPython             8.11.0
jupyter_client      8.0.3
jupyter_core        5.2.0
jupyterlab          3.6.1
notebook            6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:50