Welcome to homelette’s documentation!
homelette is a Python package offering a unified interface to different software for generating and evaluating homology models. This enables users to easily assemble custom homology modelling pipelines. homelette is extensively documented, lightweight and easily extendable.

If you use homelette in your research, please cite the following article:
Philipp Junk, Christina Kiel, HOMELETTE: a unified interface to homology modelling software, Bioinformatics, 2021, btab866, https://doi.org/10.1093/bioinformatics/btab866
Setting up homelette
This section explains how to set homelette up on your system. homelette is available on GitHub and PyPI. The easiest option to work with homelette is to use a docker container that has all dependencies already installed.
Installation
While installing the homelette base package is easy, some of its dependencies are quite complicated to install. If you just want to try out homelette, we would encourage you to start with our Docker image which has all these dependencies already installed.
homelette
homelette is easily available through our GitHub page (GitHub homelette) or through PyPI.
python3 -m pip install homelette
Please be aware that homelette requires Python 3.6.12 or newer.
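To quickly verify the installation, you can query the installed version from Python. This is a minimal sketch; importlib.metadata requires Python 3.8 or newer (on older interpreters, the importlib_metadata backport provides the same function).
# minimal installation check
import homelette  # raises ImportError if the installation failed
from importlib.metadata import version
print(version('homelette'))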
Modelling and Evaluation Software
homelette doesn’t have model generating or model evaluating capabilities on its own. Instead, it provides a unified interface to other software with these capabilities.
None of the tools and packages listed here are “hard” dependencies in the sense that homelette would stop working without them. You can still use homelette without any of these packages; however, none of the pre-implemented building blocks will work that way. To get the most out of homelette, we therefore strongly recommend installing as many of these tools and packages as possible.
Again, we want to mention that we have prepared a Docker image that contains all of these dependencies, and we strongly recommend that you start there if you want to find out if homelette is useful for you.
MODELLER
Installation instructions for MODELLER can be found here: Installation MODELLER. Requires a license key (freely available for academic research) which can be requested here: License MODELLER.
altMOD
altMOD can be installed from here: GitHub altMOD. Please make sure that the altMOD directory is in your Python path.
ProMod3
ProMod3 has to be compiled from source, instructions can be found here: Installation ProMod3. Main dependencies are OpenMM (available through conda or from source) and OpenStructure (available here: Installation OpenStructure).
QMEAN
QMEAN has to be compiled from source; instructions can be found here: GitLab QMEAN. It has the same dependencies as ProMod3.
SOAP potential
While the code for evaluation with SOAP is part of MODELLER, some files for SOAP are not included in the standard release and have to be downloaded separately. The files are available here: Download SOAP.
Specifically, you need to have soap_protein_od.hdf5 available in your modlib directory. The modlib directory is located at /usr/lib/modellerXX_XX/modlib/ if MODELLER was installed with dpkg, or at anaconda/envs/yourenv/lib/modellerXX-XX/modlib/ if it was installed with conda. These paths might be different on your system.
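If you are unsure whether the file is in place, a short Python snippet can search the default locations mentioned above. This is only a convenience sketch; adjust the glob patterns to match your system.
# check the default install locations for the SOAP potential file
import glob
import os
patterns = [
    '/usr/lib/modeller*/modlib/soap_protein_od.hdf5',
    os.path.expanduser('~/anaconda*/envs/*/lib/modeller*/modlib/soap_protein_od.hdf5'),
]
hits = [path for pattern in patterns for path in glob.glob(pattern)]
print(hits or 'soap_protein_od.hdf5 not found in the default locations')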
MolProbity
Installation instructions for MolProbity are available here: Github MolProbity. Please make sure that after installation, phenix.molprobity is in your path.
Alignment Software
homelette is able, given a query sequence, to automatically search for potential templates and generate sequence alignments. This requires additional software.
Clustal Omega
Clustal Omega is a light and powerful multiple sequence alignment tool. It can be obtained as source code or precompiled from here: Clustal Omega webpage. Please make sure that after installation, clustalo is in your path.
HHSuite3
Installation instructions for HHSuite3 are available here: Github HHSuite. Please make sure that after installation, hhblits is in your path.
Databases for HHSuite3
Information on how to obtain the databases is available here: Github HHSuite. The PDB70 database (~25 GB download, ~65 GB extracted) is required for using HHSuite in homelette, while the UniRef30 database (~50 GB download, ~170 GB extracted) is optional. Please make sure that, after downloading and extracting the databases, they are in one folder and named pdb70_* and UniRef30_*, respectively.
Docker
One of the best ways to share software and software environments reproducibly is using Docker. We have prepared a way to set up a docker image containing homelette and all its dependencies.
Because MODELLER licenses need to be acquired by each individual user, setting up the docker image is a two-step process:
1. The template for the docker image, which contains everything except a MODELLER license key, is pulled from DockerHub.
2. With a valid MODELLER license key, a local image with all dependencies working is generated.
Note
Due to the numerous dependencies installed in the Docker image, please be aware that the image is quite big (~10 GB).
Note
The databases required for using HHSuite3 are not included in the docker container due to their size.
The following sections will explain how to set up and use the docker image.
Setting up the docker image
A bash script (construct_homelette_image.sh, found in homelette/docker/) has been provided which automatically pulls the latest version of the homelette_template image from DockerHub and then attempts to construct the local homelette image with the given MODELLER license key. After downloading the script from Github, run
./construct_homelette_image.sh "YOUR MODELLERKEY HERE"
Warning
The local image created by this contains your MODELLER license key. Just as you would not send your license key to others, please do not share this image with other people, including on DockerHub.
The script will fail and no local image will be constructed if the license key is not accepted by the MODELLER version in the container.
Accessing the docker image
After constructing the local homelette docker image, you can access it like any other docker image:
docker run --rm -it homelette
However, to make access a bit simpler, we have written a bash script (homelette.sh, found in homelette/docker/) that provides different options and modes to access the docker image. There are four different modes available:
./homelette.sh -m tutorial: Opens an interactive Jupyter Lab version of the tutorials.
./homelette.sh -m jupyterlab: Opens an interactive Jupyter Lab session with access to homelette and all dependencies.
./homelette.sh -m interactive: Opens an interactive Python interpreter session with access to homelette and all dependencies.
./homelette.sh -m script: Allows the user to execute a Python script in the Docker container.
In addition, the script can make a number of directories from the host machine available to the container. Please check out ./homelette.sh -h for more details. All containers generated by this script will be removed after termination.
Tutorials
We have prepared a series of 8 tutorials which will teach the interested user everything about using the homelette package. This is a great place to get started with homelette.
For a more interactive experience, all tutorials are available as Jupyter Notebooks through our Docker container.
Tutorial 1: Basics
[1]:
import homelette as hm
Introduction
Welcome to the first tutorial on how to use the homelette package. In this example, we will generate homology models using both modeller [1,2] and ProMod3 [3,4] and then evaluate them using the DOPE score [5].
homelette is a Python package that delivers a unified interface to various homology modelling and model evaluation software. It is also easily customizable and extendable. Through a series of 8 tutorials, you will learn how to work with homelette as well as how to extend and adapt it to your specific needs.
In tutorial 1, you will learn how to:
Import an alignment.
Generate homology models using a predefined routine with modeller.
Generate homology models using a predefined routine with ProMod3.
Evaluate these models.
In this example, we will generate a protein structure for the RBD domain of ARAF. ARAF is a RAF kinase important in MAPK signalling. As a template, we will choose a close relative of ARAF called BRAF, specifically the structure with the PDB code 3NY5.
All files necessary for running this tutorial are already prepared and deposited in the following directory: homelette/example/data/. If you execute this tutorial from homelette/example/, you don’t have to adapt any of the paths.
homelette comes with an extensive documentation. You can either check out our online documentation, compile a local version of the documentation in homelette/docs/ or use the help() function in Python.
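For example, the built-in help can be queried directly from an interactive Python session:
# query the built-in documentation
import homelette as hm
help(hm.Task)        # documentation of the Task class
help(hm.routines)    # overview of the available modelling routines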
Alignment
The basis for a good homology model is a good alignment between your target and your template(s). There are many ways to generate alignments. Depending on the scope of your project, you might want to generate extensive, high-quality multiple sequence alignments from annotated sequence libraries of your sequences of interest using specific software such as t-coffee [6,7], or let a web service such as HH-Pred [8,9] search for potential templates and align them.
For this example, we have already provided an alignment for you.
homelette has its own Alignment class which is used to work with alignments. You can import alignments from different file types, write alignments to different file types, select a subset of sequences, calculate sequence identity and print the alignment to screen. For more information, please check out the documentation.
[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')
# print to screen to check alignment
aln.print_clustal(line_wrap=70)
ARAF ---GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLIKGRKTVTAWDTAIAPLDGEE
3NY5 HQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ---KKPIGWDTDISWLTGEE
ARAF LIVEVL------
3NY5 LHVEVLENVPLT
The template aligns nicely to our target. We can also check how much sequence identity these two sequences share:
[3]:
# calculate identity
aln.calc_identity_target('ARAF')
[3]:
sequence_1 | sequence_2 | identity | |
---|---|---|---|
0 | ARAF | 3NY5 | 57.53 |
The two sequences share a high amount of sequence identity, which is a good sign that our homology model might be reliable.
modeller expects the sequences handed to it to be annotated to a minimal degree. It is usually a good idea to annotate any template given to modeller, in addition to the required PDB identifier, with beginning and end residues and chains. This can be done as such:
[4]:
# annotate the alignment
aln.get_sequence('ARAF').annotate(seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A',
end_res = '81',
end_chain = 'A')
For more information on the sequence annotation, please check the documentation.
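A few further Alignment operations will appear again in later tutorials (for example in Tutorial 4, where the alignment is prepared for modeller). The following is a minimal sketch; the output file name is arbitrary and chosen for this example.
# restrict the alignment to target and template, remove redundant gap
# columns and write the result in PIR format (the format modeller expects)
aln.select_sequences(['ARAF', '3NY5'])
aln.remove_redundant_gaps()
aln.write_pir('aln_1.pir')  # output file name chosen for this example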
Template Structures
For the sake of consistency, we recommend adjusting the residue count to start with residue 1 for each model and to ignore missing residues. A good tool for handling PDB structures is pdb-tools (available here) [10].
Model Generation
After importing our alignment, checking it manually, calculating sequence identities and annotating the sequences, as well as talking about the templates we are using, we are now able to proceed with the model generation.
Before starting modelling and evaluation, we need to set up a Task object. The purpose of Task objects is to simplify the interface to modelling and evaluation methods. Task objects are alignment-specific and target-specific.
[5]:
# set up task object
t = hm.Task(
task_name = 'Tutorial1',
target = 'ARAF',
alignment = aln,
overwrite = True)
Upon initialization, the task object will check if there is a folder in the current working directory that corresponds to the given task_name. If no such folder is available, a new one will be created.
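As a quick check, the newly created task directory can be inspected from Python. This is a small sketch and only relies on the default behaviour described above.
# the Task object has created a folder named after task_name in the
# current working directory
import os
print(os.path.isdir('Tutorial1'))
print(os.listdir('Tutorial1'))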
After initialization of the Task object, we can start with homology modelling. For this, we use the execute_routine function of the task object, which applies the chosen homology modelling method with the chosen target, alignment and template(s).
[6]:
# generate models with modeller
t.execute_routine(
tag = 'example_modeller',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single')
It is possible to use the same Task object to create models from multiple different routine-template combinations.
[7]:
# generate models with promod3
t.execute_routine(
tag = 'example_promod3',
routine = hm.routines.Routine_promod3,
templates = ['3NY5'],
template_location = './data/single')
Model Evaluation
Similarly to modelling, model evaluation is performed through the evaluate_models function of the Task object. This function is an easy interface to perform one or more evaluation methods on the models deposited in the task object.
[8]:
# perform evaluation
t.evaluate_models(hm.evaluation.Evaluation_dope)
The Task.get_evaluation function retrieves the evaluation for all models in the Task object as a pandas data frame.
[9]:
t.get_evaluation()
[9]:
model | tag | routine | dope | dope_z_score | |
---|---|---|---|---|---|
0 | example_modeller_1.pdb | example_modeller | automodel_default | -7274.457520 | -1.576995 |
1 | example_promod3_1.pdb | example_promod3 | promod3 | -7642.868652 | -1.934412 |
For more details on the available evaluation methods, please check out the documentation and Tutorial 3.
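Since Task.get_evaluation returns a regular pandas data frame, standard pandas operations can be used to compare models, for example ranking them by DOPE score (lower values are better):
# sort models by DOPE score (lower is better) and show the best one
df = t.get_evaluation()
print(df.sort_values(by='dope').head(1))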
Further Reading
Congratulations, you are now familiar with the basic functionality of homelette. You can now load an alignment, are familiar with the Task object and can perform homology modelling and evaluate your models.
Please note that there are other, more advanced tutorials, which will teach you more about how to use homelette:
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
[3] Biasini, M., Schmidt, T., Bienert, S., Mariani, V., Studer, G., Haas, J., Johner, N., Schenk, A. D., Philippsen, A., & Schwede, T. (2013). OpenStructure: An integrated software framework for computational structural biology. Acta Crystallographica Section D: Biological Crystallography, 69(5), 701–709. https://doi.org/10.1107/S0907444913007051
[4] Studer, G., Tauriello, G., Bienert, S., Biasini, M., Johner, N., & Schwede, T. (2021). ProMod3—A versatile homology modelling toolbox. PLOS Computational Biology, 17(1), e1008667. https://doi.org/10.1371/JOURNAL.PCBI.1008667
[5] Shen, M., & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science, 15(11), 2507–2524. https://doi.org/10.1110/ps.062416606
[6] Notredame, C., Higgins, D. G., & Heringa, J. (2000). T-coffee: A novel method for fast and accurate multiple sequence alignment. Journal of Molecular Biology, 302(1), 205–217. https://doi.org/10.1006/jmbi.2000.4042
[7] Wallace, I. M., O’Sullivan, O., Higgins, D. G., & Notredame, C. (2006). M-Coffee: Combining multiple sequence alignment methods with T-Coffee. Nucleic Acids Research, 34(6), 1692–1699. https://doi.org/10.1093/nar/gkl091
[8] Söding, J., Biegert, A., & Lupas, A. N. (2005). The HHpred interactive server for protein homology detection and structure prediction. Nucleic Acids Research, 33(suppl_2), W244–W248. https://doi.org/10.1093/NAR/GKI408
[9] Zimmermann, L., Stephens, A., Nam, S. Z., Rau, D., Kübler, J., Lozajic, M., Gabler, F., Söding, J., Lupas, A. N., & Alva, V. (2018). A Completely Reimplemented MPI Bioinformatics Toolkit with a New HHpred Server at its Core. Journal of Molecular Biology, 430(15), 2237–2243. https://doi.org/10.1016/J.JMB.2017.12.007
[10] Rodrigues, J. P. G. L. M., Teixeira, J. M. C., Trellet, M., & Bonvin, A. M. J. J. (2018). pdb-tools: a swiss army knife for molecular structures. F1000Research 2018 7:1961, 7, 1961. https://doi.org/10.12688/f1000research.17456.1
Session Info
[10]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
pandas 1.5.3
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:34
Tutorial 2: Modelling
[1]:
import os
import homelette as hm
Introduction
Welcome to the second tutorial for homelette. In this tutorial, we will further explore the already implemented methods to generate homology models.
Currently, the following software packages for generating homology models have been integrated in the homelette homology modelling interface:
modeller: A robust package for homology modelling with a long history which is widely used [1,2]
altmod: A modification to the standard modeller modelling procedure that has been reported to increase the quality of models [3]
ProMod3: The modelling engine behind the popular SwissModel web platform [4,5]
Specifically, the following routines are implemented in homelette. For more details on the individual routines, please check the documentation or their respective docstrings.
routines.Routine_automodel_default
routines.Routine_automodel_slow
routines.Routine_altmod_default
routines.Routine_altmod_slow
routines.Routine_promod3
In this example, we will generate models for the RBD domain of ARAF. ARAF is a RAF kinase important in MAPK signalling. As a template, we will choose a close relative of ARAF called BRAF, specifically the structure with the PDB code 3NY5.
All files necessary for running this tutorial are already prepared and deposited in the following directory: homelette/example/data/. If you execute this tutorial from homelette/example/, you don’t have to adapt any of the paths.
homelette comes with an extensive documentation. You can either check out our online documentation, compile a local version of the documentation in homelette/docs/ with sphinx or use the help() function in Python.
Alignment
For this tutorial, we will use the same alignment and template as for Tutorial 1.
[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')
# print to screen to check alignment
aln.print_clustal(line_wrap=70)
ARAF ---GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLIKGRKTVTAWDTAIAPLDGEE
3NY5 HQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ---KKPIGWDTDISWLTGEE
ARAF LIVEVL------
3NY5 LHVEVLENVPLT
[3]:
# annotate the alignment
aln.get_sequence('ARAF').annotate(seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A',
end_res = '81',
end_chain = 'A')
Model Generation using routines
The building blocks in homelette that take care of model generation are called Routines. There are a number of pre-defined routines, and it is also possible to construct custom routines (see Tutorial 4). Every routine in homelette expects a number of identical arguments, while some can have a few optional ones as well.
[4]:
?hm.routines.Routine_automodel_default
Init signature:
hm.routines.Routine_automodel_default(
alignment: Type[ForwardRef('Alignment')],
target: str,
templates: Iterable,
tag: str,
n_threads: int = 1,
n_models: int = 1,
) -> None
Docstring:
Class for performing homology modelling using the automodel class from
modeller with a default parameter set.
Parameters
----------
alignment : Alignment
The alignment object that will be used for modelling
target : str
The identifier of the protein to model
templates : Iterable
The iterable containing the identifier(s) of the template(s) used
for the modelling
tag : str
The identifier associated with a specific execution of the routine
n_threads : int
Number of threads used in model generation (default 1)
n_models : int
Number of models generated (default 1)
Attributes
----------
alignment : Alignment
The alignment object that will be used for modelling
target : str
The identifier of the protein to model
templates : Iterable
The iterable containing the identifier(s) of the template(s) used for
the modelling
tag : str
The identifier associated with a specific execution of the routine
n_threads : int
Number of threads used for model generation
n_models : int
Number of models generated
routine : str
The identifier associated with a specific routine
models : list
List of models generated by the execution of this routine
Raises
------
ImportError
Unable to import dependencies
Notes
-----
The following modelling parameters can be set when initializing this
Routine object:
* n_models
* n_threads
The following modelling parameters are set for this class:
+-----------------------+---------------------------------------+
| modelling | value |
| parameter | |
+=======================+=======================================+
| model_class | modeller.automodel.automodel |
+-----------------------+---------------------------------------+
| library_schedule | modeller.automodel.autosched.normal |
+-----------------------+---------------------------------------+
| md_level | modeller.automodel.refine.very_fast |
+-----------------------+---------------------------------------+
| max_var_iterations | 200 |
+-----------------------+---------------------------------------+
| repeat_optmization | 1 |
+-----------------------+---------------------------------------+
File: /usr/local/src/homelette-1.4/homelette/routines.py
Type: type
Subclasses:
The following arguments are required for all pre-defined routines:
alignment: The alignment object used for modelling.
target: The identifier of the target sequence in the alignment object.
templates: An iterable containing the identifier(s) of the templates for this modelling routine. homelette expects that templates are uniquely identified by their identifier in the alignment and in the template PDB file(s). Routines based on modeller work with one or multiple templates, whereas Routine_promod3 only accepts a single template per run.
tag: Each executed routine is given a tag which will be used to name the generated models.
In addition, pre-defined routines expect the template PDBs to be present in the current working directory.
The routine Routine_automodel_default has two optional arguments:
n_models: The number of models that should be produced in this run, as routines based on modeller are able to produce an arbitrary number of models.
n_threads: Enables multi-threading for the execution of this routine. For more information on parallelization in homelette, please check out Tutorial 5.
While it is generally recommended to execute routines using Task objects (see next section), it is also possible to execute them directly. To do this, since the template file has to be in the current working directory, we quickly change the working directory to a prepared directory where we can execute the routine (this code assumes that your working directory is homelette/example/).
[5]:
# change directory
os.chdir('data/single')
# print content of directory to screen
print('Files before modelling:\n' + ' '.join(os.listdir()) + '\n\n')
# perform modelling
routine = hm.routines.Routine_automodel_default(
alignment=aln,
target='ARAF',
templates=['3NY5'],
tag='model')
routine.generate_models()
print('Files after modelling:\n' + ' '.join(os.listdir()) + '\n')
# remove model
os.remove('model_1.pdb')
# change back to tutorial directory
os.chdir('../..')
Files before modelling:
3NY5.pdb aln_1.fasta_aln 4G0N.pdb
Files after modelling:
model_1.pdb 3NY5.pdb aln_1.fasta_aln 4G0N.pdb
Model Generation using Task and routines
homelette has Task objects that allow for easier use of Routines and Evaluations (see also Tutorial 3). Task objects help to direct and organize modelling pipelines. It is strongly recommended to use Task objects to execute routines and evaluations.
For more information on Task objects, please check out the documentation or Tutorial 1.
[6]:
# set up task object
t = hm.Task(
task_name = 'Tutorial2',
target = 'ARAF',
alignment = aln,
overwrite = True)
Using the Task object, we can now begin to generate our models with different routines using the Task.execute_routine method.
[7]:
?hm.Task.execute_routine
Signature:
hm.Task.execute_routine(
self,
tag: str,
routine: Type[ForwardRef('routines.Routine')],
templates: Iterable,
template_location: str = '.',
**kwargs,
) -> None
Docstring:
Generates homology models using a specified modelling routine
Parameters
----------
tag : str
The identifier associated with this combination of routine and
template(s). Has to be unique between all routines executed by the
same task object
routine : Routine
The routine object used to generate the models
templates : list
The iterable containing the identifier(s) of the template(s) used
for model generation
template_location : str, optional
The location of the template PDB files. They should be named
according to their identifiers in the alignment (i.e. for a
sequence named "1WXN" to be used as a template, it is expected that
there will be a PDB file named "1WXN.pdb" in the specified template
location (default is current working directory)
**kwargs
Named parameters passed directly on to the Routine object when the
modelling is performed. Please check the documentation in order to
make sure that the parameters passed on are available with the
Routine object you intend to use
Returns
-------
None
File: /usr/local/src/homelette-1.4/homelette/organization.py
Type: function
As we can see, Task.execute_routine expects a number of arguments from the user:
tag: Each executed routine is given a tag which will be used to name the generated models. This is useful for differentiating between different routines executed by the same Task, for example if different templates are used.
routine: Here the user can set which routine will be used for generating the homology model(s), arguably the most important setting.
templates: An iterable containing the identifier(s) of the templates for this modelling routine. homelette expects that templates are uniquely identified by their identifier(s) in the alignment and in the template location.
template_location: The folder where the PDB file(s) used as template(s) are found.
We are generating some models with the pre-defined routines of homelette:
[8]:
# model generation with modeller
t.execute_routine(
tag = 'example_modeller',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single')
# model generation with altmod
t.execute_routine(
tag = 'example_altmod',
routine = hm.routines.Routine_altmod_default,
templates = ['3NY5'],
template_location = './data/single')
# model generation with promod3
t.execute_routine(
tag = 'example_promod3',
routine = hm.routines.Routine_promod3,
templates = ['3NY5'],
template_location = './data/single')
As mentioned before, some modelling routines have optional arguments, such as n_models for Routine_automodel_default. We can pass these optional arguments to Task.execute_routine, which passes them on to the routine selected:
[9]:
# multiple model generation with altmod
t.execute_routine(
tag = 'example_modeller_more_models',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single',
n_models = 10)
Models generated using Task objects are stored as Model objects in the Task:
[10]:
t.models
[10]:
[<homelette.organization.Model at 0x7f421f7f9280>,
<homelette.organization.Model at 0x7f421f7cf7f0>,
<homelette.organization.Model at 0x7f421f8f4370>,
<homelette.organization.Model at 0x7f421f8dfca0>,
<homelette.organization.Model at 0x7f421f8df2e0>,
<homelette.organization.Model at 0x7f421f8da2b0>,
<homelette.organization.Model at 0x7f421f8da400>,
<homelette.organization.Model at 0x7f421f8da370>,
<homelette.organization.Model at 0x7f421f806220>,
<homelette.organization.Model at 0x7f421f806cd0>,
<homelette.organization.Model at 0x7f421f806a00>,
<homelette.organization.Model at 0x7f421f806f10>,
<homelette.organization.Model at 0x7f421f806280>]
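Each entry is a Model object. Even before any evaluation has been run, its evaluation dictionary carries basic metadata (file name, tag, routine), which can be inspected directly:
# inspect the metadata collected for the first few models
for model in t.models[:3]:
    print(model.evaluation)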
In conclusion, we have learned how to use a single Task object to generate models with different modelling routines. We have also learned how to pass optional arguments on to the executed routines.
In this example, the target, the alignment and the templates were kept identical. Varying the templates would be straightforward, under the condition that other templates are included in the alignment. For varying alignments and targets, new Task objects would need to be created. This is a design choice that is meant to encourage users to try out different routines or templates/template combinations. When using different routines or multiple templates, it is recommended to indicate this using the tag argument of Task.execute_routine (i.e. tag='automodel_3NY5'). Similarly, using a single Task object for multiple targets or alignments is discouraged, and we recommend utilizing multiple Task objects for these modelling approaches.
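The tagging pattern described above can be sketched as a simple loop over routines and templates. This is only an illustration: every template identifier used this way must be present in the alignment and available as a PDB file in the template location (here, only 3NY5 is).
# one Task, several routine/template combinations with descriptive tags
templates = ['3NY5']  # extend with further template identifiers if available
for template in templates:
    for routine in (hm.routines.Routine_automodel_default,
                    hm.routines.Routine_altmod_default):
        t.execute_routine(
            tag='{}_{}'.format(
                routine.__name__.replace('Routine_', ''), template),
            routine=routine,
            templates=[template],
            template_location='./data/single')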
Further Reading
You are now familiar with model generation in homelette.
Please note that there are other tutorials, which will teach you more about how to use homelette:
Tutorial 1: Learn about the basics of homelette.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
[3] Janson, G., Grottesi, A., Pietrosanto, M., Ausiello, G., Guarguaglini, G., & Paiardini, A. (2019). Revisiting the “satisfaction of spatial restraints” approach of MODELLER for protein homology modeling. PLoS Computational Biology, 15(12), e1007219. https://doi.org/10.1371/journal.pcbi.1007219
[4] Biasini, M., Schmidt, T., Bienert, S., Mariani, V., Studer, G., Haas, J., Johner, N., Schenk, A. D., Philippsen, A., & Schwede, T. (2013). OpenStructure: An integrated software framework for computational structural biology. Acta Crystallographica Section D: Biological Crystallography, 69(5), 701–709. https://doi.org/10.1107/S0907444913007051
[5] Studer, G., Tauriello, G., Bienert, S., Biasini, M., Johner, N., & Schwede, T. (2021). ProMod3—A versatile homology modelling toolbox. PLOS Computational Biology, 17(1), e1008667. https://doi.org/10.1371/JOURNAL.PCBI.1008667
Session Info
[11]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:35
Tutorial 3: Evaluation
[1]:
import homelette as hm
Introduction
Welcome to the third tutorial for homelette. In this tutorial, we will explore which evaluation metrics are implemented in homelette and how to use them.
Model evaluation is an important step in any homology modelling procedure. In most practical scenarios, you will end up with more than one possible model and have to decide which one is “best”. Obtaining multiple models can be the result of trying out different templates or combinations of templates, different algorithms generating the models, or due to using an algorithm which can generate multiple models.
The following evaluation metrics are implemented in homelette:
evaluation.Evaluation_dope: DOPE score from modeller [1]
evaluation.Evaluation_soap_protein: SOAP score from modeller for the evaluation of single proteins [2]
evaluation.Evaluation_soap_pp: SOAP score from modeller for the evaluation of protein complexes [2]
evaluation.Evaluation_qmean4: QMEAN4 score [3,4]
evaluation.Evaluation_qmean6: QMEAN6 score [3,4]
evaluation.Evaluation_qmeandisco: QMEAN DisCo score [3,4,5]
evaluation.Evaluation_mol_probity: MolProbity score for the structural evaluation of proteins [6,7,8]
All files necessary for running this tutorial are already prepared and deposited in the following directory: homelette/example/data/. If you execute this tutorial from homelette/example/, you don’t have to adapt any of the paths.
homelette comes with an extensive documentation. You can either check out our online documentation, compile a local version of the documentation in homelette/docs/ with sphinx or use the help() function in Python.
Model Generation
In order to have a few models to evaluate, we will briefly generate some models of ARAF as we have done in previous tutorials (please check Tutorial 1 and Tutorial 2 for more information on this part).
[2]:
# get alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')
# annotate the alignment
aln.get_sequence('ARAF').annotate(
seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A',
end_res = '81',
end_chain = 'A')
# initialize task object
t = hm.Task(
task_name = 'Tutorial3',
target = 'ARAF',
alignment = aln,
overwrite = True)
# generate models with modeller
t.execute_routine(
tag = 'modeller',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single',
n_models = 5)
# generate models with altmod
t.execute_routine(
tag = 'altmod',
routine = hm.routines.Routine_altmod_default,
templates = ['3NY5'],
template_location = './data/single',
n_models = 5)
We have now generated 10 models: 5 with modeller and another 5 with altmod.
Model Evaluation using evaluation
Similar to routines, evaluations can be executed on their own, although it is recommended to use the interface through the Task object (see next section). To showcase how an evaluation can be executed on its own, we will take one of the previously generated models as an example:
[3]:
# example model
model = t.models[0]
model
[3]:
<homelette.organization.Model at 0x7f7f681ae250>
Every Model object has a Model.evaluation attribute where information about the model and its evaluations is collected:
[4]:
model.evaluation
[4]:
{'model': 'modeller_1.pdb', 'tag': 'modeller', 'routine': 'automodel_default'}
After performing an evaluation, this dictionary will be updated with the results of the evaluation:
[5]:
hm.evaluation.Evaluation_dope(model, quiet=True)
model.evaluation
[5]:
{'model': 'modeller_1.pdb',
'tag': 'modeller',
'routine': 'automodel_default',
'dope': -7216.8564453125,
'dope_z_score': -1.5211129532811163}
The interface to evaluations is relatively simple:
[6]:
?hm.evaluation.Evaluation_dope
Init signature:
hm.evaluation.Evaluation_dope(
model: Type[ForwardRef('Model')],
quiet: bool = False,
) -> None
Docstring:
Class for evaluating a model with DOPE score.
Will dump the following entries to the model.evaluation dictionary:
* dope
* dope_z_score
Parameters
----------
model : Model
The model object to evaluate
quiet : bool
If True, will perform evaluation with suppressing stdout (default
False). Needs to be False for running it asynchronously, as done
when running Task.evaluate_models with multple cores
Attributes
----------
model : Model
The model object to evaluate
output : dict
Dictionary that all outputs will be dumped into
Raises
------
ImportError
Unable to import dependencies
Notes
-----
DOPE is a staticial potential for the evaluation of homology models [1]_.
For further information, please check the modeller documentation or the
associated publication.
References
----------
.. [1] Shen, M., & Sali, A. (2006). Statistical potential for assessment
and prediction of protein structures. Protein Science, 15(11),
2507–2524. https://doi.org/10.1110/ps.062416606
File: /usr/local/src/homelette-1.4/homelette/evaluation.py
Type: type
Subclasses:
Evaluations take only two arguments:
model: A Model object.
quiet: A boolean value determining whether any output to the console should be suppressed.
Unlike routines, evaluations are executed as soon as the object is initialized.
Model Evaluation using Task and evaluation
Using the interface to evaluations that is implemented in Task objects has several advantages: it is possible to evaluate multiple models with multiple evaluation metrics in one command. In addition, multi-threading can be enabled (see Tutorial 5 for more details). The method to run evaluations with a Task object is called evaluate_models.
[7]:
?hm.Task.evaluate_models
Signature:
hm.Task.evaluate_models(
self,
*args: Type[ForwardRef('evaluation.Evaluation')],
n_threads: int = 1,
) -> None
Docstring:
Evaluates models using one or multiple evaluation metrics
Parameters
----------
*args: Evaluation
Evaluation objects that will be applied to the models
n_threads : int, optional
Number of threads used for model evaluation (default is 1, which
deactivates parallelization)
Returns
-------
None
File: /usr/local/src/homelette-1.4/homelette/organization.py
Type: function
[8]:
# running dope and soap at the same time
t.evaluate_models(hm.evaluation.Evaluation_dope,
hm.evaluation.Evaluation_soap_protein)
After running evaluations, the output of all Model.evaluation dictionaries can be compiled into a pandas data frame as such:
[9]:
t.get_evaluation()
[9]:
model | tag | routine | dope | dope_z_score | soap_protein | |
---|---|---|---|---|---|---|
0 | modeller_1.pdb | modeller | automodel_default | -7216.856445 | -1.521113 | -44167.968750 |
1 | modeller_2.pdb | modeller | automodel_default | -7274.457520 | -1.576995 | -45681.269531 |
2 | modeller_3.pdb | modeller | automodel_default | -7126.735352 | -1.433681 | -43398.992188 |
3 | modeller_4.pdb | modeller | automodel_default | -7225.522461 | -1.529520 | -42942.808594 |
4 | modeller_5.pdb | modeller | automodel_default | -7128.661621 | -1.435550 | -41418.894531 |
5 | altmod_1.pdb | altmod | altmod_default | -8148.456055 | -2.424912 | -53440.839844 |
6 | altmod_2.pdb | altmod | altmod_default | -8187.364258 | -2.462659 | -49991.304688 |
7 | altmod_3.pdb | altmod | altmod_default | -8202.568359 | -2.477409 | -53909.824219 |
8 | altmod_4.pdb | altmod | altmod_default | -8170.016602 | -2.445829 | -52208.402344 |
9 | altmod_5.pdb | altmod | altmod_default | -8145.944336 | -2.422475 | -50776.855469 |
On the combination of different evaluation metrics
Oftentimes it is useful to use different metrics to evaluate models. However, that produces the problem of having multiple metrics to base a decision on. There are multiple solutions to this problem, all of them with their own advantages and disadvantages. We want to mention the combination of z-scores of the different metrics and the combination of metrics by borda count.
In the following, we show how to combine multiple scores into one borda score. In short, borda count is an agglomeration of the ranks in the different individual metrics into one score.
Note
Be careful because, for some metrics, lower values are better (DOPE, SOAP, MolProbity), but for others higher values are better (QMEAN).
[10]:
df = t.get_evaluation()
df = df.drop(labels=['routine', 'tag'], axis=1)
# rank by dope and soap
df['rank_dope'] = df['dope'].rank()
df['rank_soap'] = df['soap_protein'].rank()
# calculate points based on rank
n = df.shape[0]
df['points_dope'] = n - df['rank_dope']
df['points_soap'] = n - df['rank_soap']
df
[10]:
model | dope | dope_z_score | soap_protein | rank_dope | rank_soap | points_dope | points_soap | |
---|---|---|---|---|---|---|---|---|
0 | modeller_1.pdb | -7216.856445 | -1.521113 | -44167.968750 | 8.0 | 7.0 | 2.0 | 3.0 |
1 | modeller_2.pdb | -7274.457520 | -1.576995 | -45681.269531 | 6.0 | 6.0 | 4.0 | 4.0 |
2 | modeller_3.pdb | -7126.735352 | -1.433681 | -43398.992188 | 10.0 | 8.0 | 0.0 | 2.0 |
3 | modeller_4.pdb | -7225.522461 | -1.529520 | -42942.808594 | 7.0 | 9.0 | 3.0 | 1.0 |
4 | modeller_5.pdb | -7128.661621 | -1.435550 | -41418.894531 | 9.0 | 10.0 | 1.0 | 0.0 |
5 | altmod_1.pdb | -8148.456055 | -2.424912 | -53440.839844 | 4.0 | 2.0 | 6.0 | 8.0 |
6 | altmod_2.pdb | -8187.364258 | -2.462659 | -49991.304688 | 2.0 | 5.0 | 8.0 | 5.0 |
7 | altmod_3.pdb | -8202.568359 | -2.477409 | -53909.824219 | 1.0 | 1.0 | 9.0 | 9.0 |
8 | altmod_4.pdb | -8170.016602 | -2.445829 | -52208.402344 | 3.0 | 3.0 | 7.0 | 7.0 |
9 | altmod_5.pdb | -8145.944336 | -2.422475 | -50776.855469 | 5.0 | 4.0 | 5.0 | 6.0 |
[11]:
# calculate borda score and borda rank
df['borda_score'] = df['points_dope'] + df['points_soap']
df['borda_rank'] = df['borda_score'].rank(ascending=False)
df = df.drop(labels=['rank_dope', 'rank_soap', 'points_dope', 'points_soap'], axis=1)
df.sort_values(by='borda_rank')
[11]:
model | dope | dope_z_score | soap_protein | borda_score | borda_rank | |
---|---|---|---|---|---|---|
7 | altmod_3.pdb | -8202.568359 | -2.477409 | -53909.824219 | 18.0 | 1.0 |
5 | altmod_1.pdb | -8148.456055 | -2.424912 | -53440.839844 | 14.0 | 2.5 |
8 | altmod_4.pdb | -8170.016602 | -2.445829 | -52208.402344 | 14.0 | 2.5 |
6 | altmod_2.pdb | -8187.364258 | -2.462659 | -49991.304688 | 13.0 | 4.0 |
9 | altmod_5.pdb | -8145.944336 | -2.422475 | -50776.855469 | 11.0 | 5.0 |
1 | modeller_2.pdb | -7274.457520 | -1.576995 | -45681.269531 | 8.0 | 6.0 |
0 | modeller_1.pdb | -7216.856445 | -1.521113 | -44167.968750 | 5.0 | 7.0 |
3 | modeller_4.pdb | -7225.522461 | -1.529520 | -42942.808594 | 4.0 | 8.0 |
2 | modeller_3.pdb | -7126.735352 | -1.433681 | -43398.992188 | 2.0 | 9.0 |
4 | modeller_5.pdb | -7128.661621 | -1.435550 | -41418.894531 | 1.0 | 10.0 |
The model with the highest borda score, or equivalently the lowest borda rank, is the best model according to the combination of DOPE and SOAP scores.
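The z-score combination mentioned above can be sketched in a similar way: standardize each metric, flip the sign of metrics where lower values are better, and average the resulting z-scores. Column names follow the evaluation output shown earlier; the equal weighting of the two metrics is an assumption of this example.
# combine DOPE and SOAP via z-scores (for both metrics, lower is better)
df = t.get_evaluation()
for col in ('dope', 'soap_protein'):
    z = (df[col] - df[col].mean()) / df[col].std()
    df['z_' + col] = -z  # invert so that higher combined scores are better
df['combined_z'] = df[['z_dope', 'z_soap_protein']].mean(axis=1)
df.sort_values(by='combined_z', ascending=False).head()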
Further reading
You are now familiar with using the implemented evaluation features of homelette. For further reading, please consider checking out the other tutorials:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Shen, M., & Sali, A. (2006). Statistical potential for assessment and prediction of protein structures. Protein Science, 15(11), 2507–2524. https://doi.org/10.1110/ps.062416606
[2] Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B., Sali, A., & Tramontano, A. (2013). Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics, 29(24), 3158–3166. https://doi.org/10.1093/bioinformatics/btt560
[3] Benkert, P., Tosatto, S. C. E., & Schomburg, D. (2008). QMEAN: A comprehensive scoring function for model quality assessment. Proteins: Structure, Function and Genetics, 71(1), 261–277. https://doi.org/10.1002/prot.21715
[4] Benkert, P., Biasini, M., & Schwede, T. (2011). Toward the estimation of the absolute quality of individual protein structure models. Bioinformatics, 27(3), 343–350. https://doi.org/10.1093/bioinformatics/btq662
[5] Studer, G., Rempfer, C., Waterhouse, A. M., Gumienny, R., Haas, J., & Schwede, T. (2020). QMEANDisCo-distance constraints applied on model quality estimation. Bioinformatics, 36(6), 1765–1771. https://doi.org/10.1093/bioinformatics/btz828
[6] Davis, I. W., Leaver-Fay, A., Chen, V. B., Block, J. N., Kapral, G. J., Wang, X., Murray, L. W., Arendall, W. B., Snoeyink, J., Richardson, J. S., & Richardson, D. C. (2007). MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research, 35(suppl_2), W375–W383. https://doi.org/10.1093/NAR/GKM216
[7] Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., & Richardson, D. C. (2010). MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D: Biological Crystallography, 66(1), 12–21. https://doi.org/10.1107/S0907444909042073
[8] Williams, C. J., Headd, J. J., Moriarty, N. W., Prisant, M. G., Videau, L. L., Deis, L. N., Verma, V., Keedy, D. A., Hintze, B. J., Chen, V. B., Jain, S., Lewis, S. M., Arendall, W. B., Snoeyink, J., Adams, P. D., Lovell, S. C., Richardson, J. S., & Richardson, D. C. (2018). MolProbity: More and better reference data for improved all-atom structure validation. Protein Science, 27(1), 293–315. https://doi.org/10.1002/pro.3330
Session Info
[12]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
pandas 1.5.3
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:37
Tutorial 4: Extending homelette
[1]:
import homelette as hm
import contextlib
import glob
import os.path
import sys
from modeller import environ, Selection
from modeller.automodel import LoopModel
Introduction
Welcome to the fourth tutorial on homelette. In this tutorial, we will discuss how to implement custom building blocks, either for generating or for evaluating models. These custom building blocks can be integrated in homology modelling pipelines.
This is probably the most important tutorial in the series. After this tutorial, you will be able to implement your own routines into the homelette framework, which gives you complete control over the homology modelling pipelines you want to establish!
Please note that we encourage users to share custom routines and evaluation metrics if they think they might be useful for the community. In our online documentation, there is a dedicated section for these contributions. If you are interested, please contact us on GitHub or via email.
Alignment
For this tutorial, we are using the same alignment as in Tutorial 1. Identical to Tutorial 1, the alignment is imported and annotated and a Task object is created.
[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')
# annotate the alignment
aln.get_sequence('ARAF').annotate(
seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A',
end_res = '81',
end_chain = 'A')
# initialize task object
t = hm.Task(
task_name = 'Tutorial4',
target = 'ARAF',
alignment = aln,
overwrite = True)
Defining custom routines
As an example for a custom routine, we will implement a LoopModel class from modeller [1,2], loosely following this tutorial on the modeller web page (in the section Loop Refining).
[3]:
class Routine_loopmodel(hm.routines.Routine):  # (1)
    '''
    Custom routine for modeller loop modelling.
    '''
    def __init__(self, alignment, target, templates, tag, n_models=1, n_loop_models=1):  # (2)
        hm.routines.Routine.__init__(self, alignment, target, templates, tag)
        self.routine = 'loopmodel'  # string identifier of routine
        self.n_models = n_models
        self.n_loop_models = n_loop_models

    def generate_models(self):  # (3)
        # (4) process alignment
        self.alignment.select_sequences([self.target] + self.templates)
        self.alignment.remove_redundant_gaps()
        # write alignment to temporary file
        self.alignment.write_pir('.tmp.pir')

        # (5) define custom loop model class
        class MyLoop(LoopModel):
            # set residues that will be refined by loop modelling
            def select_loop_atoms(self):
                return Selection(self.residue_range('18:A', '22:A'))

        with contextlib.redirect_stdout(None):  # (6) suppress modeller output to stdout
            # (7) set up modeller environment
            env = environ()
            env.io.hetatm = True
            # initialize model
            m = MyLoop(env,
                       alnfile='.tmp.pir',
                       knowns=self.templates,
                       sequence=self.target)
            # set modelling parameters
            m.blank_single_chain = False
            m.starting_model = 1
            m.ending_model = self.n_models
            m.loop.starting_model = 1
            m.loop.ending_model = self.n_loop_models
            # make models
            m.make()

        # (8) capture output
        for pdb in glob.glob('{}.BL*.pdb'.format(self.target)):
            self.models.append(
                hm.Model(os.path.realpath(os.path.expanduser(pdb)),
                         self.tag, self.routine))

        # (9) rename files with method from hm.routines.Routine
        self._rename_models()

        # (10) clean up
        self._remove_files(
            '{}.B99*.pdb'.format(self.target),
            '{}.D00*'.format(self.target),
            '{}.DL*'.format(self.target),
            '{}.IL*'.format(self.target),
            '{}.ini'.format(self.target),
            '{}.lrsr'.format(self.target),
            '{}.rsr'.format(self.target),
            '{}.sch'.format(self.target),
            '.tmp*')
The lines of code in the definition of the custom routine above that are marked with numbers get special comments here:
1. Our custom routine in this example inherits from a parent class Routine defined in homelette. This is not strictly necessary; however, the parent class has a few useful functions already implemented that we will make use of (see steps 2, 9, 10).
2. Every routine needs to accept these arguments: alignment, target, templates, tag. In our case, we just hand them through to the parent method Routine.__init__, which saves them as attributes and introduces the attribute self.models where models will be deposited after generation.
3. Every routine needs a generate_models method. Usually, functionality for, you guessed it, model generation is packed in there.
4. modeller requires the alignment as a file in PIR format. The following few lines of code format the alignment and then produce the required file.
5. The following lines follow closely the modeller tutorial for loop modelling. This part implements a custom LoopModel class that defines a specific set of residues to be considered for loop modelling.
6. modeller writes a lot of output to stdout, and using contextlib is a way to suppress this output. If you want to see all the output from modeller, either delete the with statement or write with contextlib.redirect_stdout(sys.stdout): instead.
7. The following lines follow closely the modeller tutorial for loop modelling. This part initializes the model and generates the requested models.
8. The final models generated will be called ARAF.BL00010001.pdb and so on. These lines of code find these PDB files and add them to the Routine_loopmodel.models list as Model objects. After execution by a Task object, the Model objects in this list will be added to the Task.models list.
9. Generated models will be renamed according to the given tag using the parent class method Routine._rename_models.
10. Temporary files from modeller as well as the temporary alignment file are removed from the folder using the parent class method Routine._remove_files.
Now, after implementing the routine, let’s try it out in practice. As explained in Tutorial 2, we will be using the Task.execute_routine interface for that:
[4]:
# perform modelling
t.execute_routine(
tag = 'custom_loop',
routine = Routine_loopmodel,
templates = ['3NY5'],
template_location = './data/single',
n_models = 2,
n_loop_models = 2)
[5]:
# check generated models
t.models
[5]:
[<homelette.organization.Model at 0x7f211ff3fa30>,
<homelette.organization.Model at 0x7f211ff54a30>,
<homelette.organization.Model at 0x7f211ff54d90>,
<homelette.organization.Model at 0x7f211ff54e20>]
In practice, a valid routine only needs to adhere to a small number of formal criteria to fit in the homelette framework:
It needs to be an object.
It needs to have an __init__ method that can handle the named arguments alignment, target, templates and tag.
It needs a generate_models method.
It needs an attribute models in which generated models are stored as Model objects in a list.
Any object that satisfies these criteria can be used in the framework.
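As a minimal sketch of these criteria (not part of homelette; it assumes that template structures are available as <template>.pdb files in template_location, which, as in the examples above, is forwarded by Task.execute_routine to the routine's __init__), a toy routine could simply copy the first template and register it as a "model":
import os
import shutil
import homelette as hm

class Routine_copy_template():
    '''
    Toy routine that only illustrates the formal criteria: it "generates"
    a model by copying the first template structure.
    '''
    def __init__(self, alignment, target, templates, tag,
                 template_location='.'):
        self.alignment = alignment
        self.target = target
        self.templates = templates
        self.tag = tag
        self.template_location = template_location
        self.routine = 'copy_template'
        self.models = list()  # generated models are collected here

    def generate_models(self):
        # "generate" a model by copying the first template structure
        source = '{}/{}.pdb'.format(self.template_location, self.templates[0])
        destination = '{}_1.pdb'.format(self.tag)
        shutil.copyfile(source, destination)
        # register the new file as a Model object
        self.models.append(
            hm.Model(os.path.realpath(destination), self.tag, self.routine))
Such a class could then be run through Task.execute_routine just like the loop modelling routine above.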
Defining custom evaluations
As an example for a custom evaluation, we will implement a sample evaluation that counts the number of residues in the models.
[6]:
class Evaluation_countresidues():
'''
Custom evaluation: counting CA atoms
'''
def __init__(self, model, quiet=True): # (1)
self.model = model
self.output = dict()
# (2) perform evaluation
self.evaluate()
# (3) update model.evaluation
self.model.evaluation.update(self.output)
def evaluate(self): # (4)
# (5) parse model pdb
pdb = self.model.parse_pdb()
# count number of CA atoms in PDB
n_residues = pdb['name'].eq('CA').sum()
# append to output
self.output['n_residues'] = n_residues
The lines of code marked with numbers in the definition of the custom evaluation deserve some additional comments:
(1) The __init__ method takes exactly two arguments: model and quiet. quiet is a boolean value indicating whether output to stdout should be suppressed (not applicable in this case).
(2) All evaluation metrics are executed upon initialization.
(3) The custom_evaluation.output dictionary is merged with the Model.evaluation dictionary to make the output of our evaluation metric available to the model.
(4) Here we define the method in which the actual evaluation takes place.
(5) For the actual evaluation, we make use of the Model.parse_pdb method, which parses the PDB file associated with a specific model object into a pandas data frame. This can be useful for a number of evaluations (access to residues, coordinates, etc.).
Note
If more arguments are required for a custom evaluation, we recommend storing them as attributes of the Model objects and then accessing these attributes while running the evaluation, as sketched below.
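As a minimal sketch of this pattern (assuming t is the Task object with the models generated above; the attribute name atom_name is just an example and not part of homelette), an extra parameter can be attached to each Model before the evaluation is run:
# store the extra parameter as an attribute on every Model object
for model in t.models:
    model.atom_name = 'CB'

class Evaluation_countatoms():
    '''
    Custom evaluation that reads an extra parameter from the Model object.
    '''
    def __init__(self, model, quiet=True):
        self.model = model
        self.output = dict()
        self.evaluate()
        self.model.evaluation.update(self.output)

    def evaluate(self):
        # read the extra argument back from the Model object
        atom_name = getattr(self.model, 'atom_name', 'CA')
        pdb = self.model.parse_pdb()
        # count atoms with the requested name
        self.output['n_' + atom_name.lower()] = int(pdb['name'].eq(atom_name).sum())

# it would be run as usual:
# t.evaluate_models(Evaluation_countatoms)
Below, we continue with the Evaluation_countresidues example defined above.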
Now we apply our custom evaluation to our previously generated models using the Task.evaluate_models
interface (for more details, see Tutorial 3):
[7]:
t.evaluate_models(Evaluation_countresidues)
t.get_evaluation()
[7]:
| | model | tag | routine | n_residues |
|---|---|---|---|---|
0 | custom_loop_1.pdb | custom_loop | loopmodel | 73 |
1 | custom_loop_2.pdb | custom_loop | loopmodel | 73 |
2 | custom_loop_3.pdb | custom_loop | loopmodel | 73 |
3 | custom_loop_4.pdb | custom_loop | loopmodel | 73 |
In practice, the formal requirements for a custom evaluation are the following:
It has to be an object.
__init__ has the two arguments model and quiet. More arguments would work in conjunction with Task.evaluate_models only if defaults are set and used. We recommend storing additional arguments as attributes in the Model object and then accessing them during the evaluation.
It executes the evaluation on initialization.
On finishing the evaluation, it updates the Model.evaluation dictionary with the results of the evaluation.
Further reading
Congratulations on finishing the tutorial on extending homelette.
Please note again that our online documentation has a page collecting user-submitted custom routines and evaluation metrics. Users are encouraged to share anything they have implemented that could be useful for the community. If you are interested, please contact us on GitHub or via email.
There are more tutorials which might interest you:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
Session Info
[8]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
modeller 10.4
pandas 1.5.3
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:36
Tutorial 5: Parallelization
[1]:
import homelette as hm
import time
Introduction
Welcome to the fifth tutorial on homelette. This tutorial is about parallelization in homelette. When modelling hundreds or thousands of models, some processes can be sped up significantly by distributing the workload over multiple processes in parallel (given appropriate hardware).
Both model generation and model evaluation can be parallelized in homelette.
Alignment and Task setup
For this tutorial, we are using the same alignment as in Tutorial 1. As in the previous tutorials, the alignment is imported and annotated, and a Task object is set up.
[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')
# annotate the alignment
aln.get_sequence('ARAF').annotate(
seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A',
end_res = '81',
end_chain = 'A')
# initialize task object
t = hm.Task(
task_name = 'Tutorial5',
target = 'ARAF',
alignment = aln,
overwrite = True)
Parallel model generation
When parallelizing model generation, homelette makes use of the parallelization methods implemented in the packages it interfaces with, if they are available. Model generation with modeller can be parallelized and is available in homelette through a simple handler [1,2].
All modeller-based, pre-implemented routines have the argument n_threads, which can be used to enable parallelization. The default is n_threads = 1, which does not activate parallelization; any number > 1 will distribute the workload over the requested number of threads using the modeller.parallel submodule.
[3]:
# use only 1 thread to generate 20 models
start = time.perf_counter()
t.execute_routine(
tag = '1_thread',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single/',
n_models = 20)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 47.84
[4]:
# use 4 threads to generate 20 models faster
start = time.perf_counter()
t.execute_routine(
tag = '4_threads',
routine = hm.routines.Routine_automodel_default,
templates = ['3NY5'],
template_location = './data/single/',
n_models = 20,
n_threads = 4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.44
Using multiple threads can significantly speed up model generation, especially if a large number of models is generated.
Note
Please be aware that the modeller.parallel submodule uses the Python module pickle, which requires that the objects to be pickled are defined in a separate file from which they can be imported. In practical terms, if you want to use a custom object (i.e. a custom-defined routine, see Tutorial 4) with parallelization in modeller, you cannot make use of parallelization unless you have imported it from a separate file. We therefore recommend saving custom routines and evaluations in a separate file and importing them from there.
The following code block shows how custom building blocks could be put in an external file (data/extension.py
) and then imported for modelling and analysis.
[5]:
# import from custom file
from data.extension import Custom_Routine, Custom_Evaluation
?Custom_Routine
Init signature: Custom_Routine()
Docstring: Custom routine waiting to be implemented.
File: ~/workdir/data/extension.py
Type: type
Subclasses:
[6]:
!cat data/extension.py
'''
Examples of custom objects for homelette in a external file.
'''
class Custom_Routine():
'''
Custom routine waiting to be implemented.
'''
def __init__(self):
print('TODO: implement this')
class Custom_Evaluation():
'''
Custom evaluation waiting to be implemented.
'''
def __init__(self):
print('TODO: implement this')
Alternatively, you could use the /homelette/extension/
folder in which extensions are stored. See our comments on extensions in our documentation for more details.
Parallel model evaluation
homelette can also use parallelization to speed up model evaluation. This is internally achieved by using concurrent.futures.ThreadPoolExecutor.
In order to use parallelization when performing evaluations, use the n_threads argument of Task.evaluate_models.
[7]:
# use 1 thread for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 468.37
[8]:
# use 4 threads for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 128.37
For some evaluation schemes, using parallelization can lead to a significant speedup.
Note
Please be advised that for some (very fast) evaluation methods, the time spent spawning new child processes might outweigh the speedup gained by parallelization. Test your use case on your system in a small setting and use at your own discretion.
[9]:
# use 1 thread for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 10.34
[10]:
# use 4 threads for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.95
Note
When creating and using custom evaluation metrics, please make sure to avoid race conditions. Task.evaluate_models is implemented with a protection against race conditions, but this is not bulletproof. Also, if you need to create temporary files, make sure to use model-specific file names (i.e. by including the model name in the file name), as sketched below. Defining custom evaluations in a separate file is not necessary, as parallelization of evaluation methods does not rely on pickle.
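As a small sketch of the temporary-file advice above (not homelette code; model_file is a placeholder for whatever attribute holds the path of the model PDB in your Model objects), a custom evaluation could derive its temporary file name from the model like this:
import os

class Evaluation_uses_tempfile():
    '''
    Sketch: write per-model temporary files so that parallel workers
    do not overwrite each other's intermediate results.
    '''
    def __init__(self, model, quiet=True):
        self.model = model
        self.output = dict()
        self.evaluate()
        self.model.evaluation.update(self.output)

    def evaluate(self):
        # model-specific temporary file name ("model_file" is a placeholder
        # attribute, use whatever holds the PDB path in your Model objects)
        basename = os.path.basename(self.model.model_file)
        tmp_file = '.tmp_{}.txt'.format(basename)
        try:
            with open(tmp_file, 'w') as handle:
                handle.write('intermediate data for {}\n'.format(basename))
            # ... run an external tool on tmp_file and parse its results ...
            self.output['dummy_score'] = 0.0
        finally:
            # always remove the temporary file again
            if os.path.exists(tmp_file):
                os.remove(tmp_file)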
Note
In case some custom evaluation metrics are very memory-demanding, running them in parallel can easily overwhelm the system. Again, we encourage you to test your use case on your system in a small setting.
Further reading
Congratulations on completing Tutorial 5 about parallelization in homelette. Please note that there are other tutorials, which will teach you more about how to use homelette:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette's functionality by defining your own modelling routines and evaluation metrics.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
Session Info
[11]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
data NA
homelette 1.4
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:56
Tutorial 6: Complex Modelling
[1]:
import homelette as hm
Introduction
Welcome to the 6th tutorial on homelette, about homology modelling of complex structures.
There are several aspects of modelling protein complexes that make it a separate topic from the homology modelling of single structures:
Usually, a complex structure is required as a template.
Not all modelling programs can perform complex modelling.
Not all evaluation metrics developed for homology modelling are applicable to complex structures.
You need multiple alignments.
homelette is able to use modeller-based modelling routines for complex modelling [1,2], and has some specific classes in place that make complex modelling easier for the user: a function to assemble appropriate complex alignments, special modelling classes for complex modelling, and special evaluation metrics for complex modelling.
For this tutorial, we will build models for ARAF in complex with HRAS. As templates, we will use the structures 4G0N (https://www.rcsb.org/structure/4G0N) (RAF1 in complex with HRAS) and 3NY5 (BRAF).
Alignment
Since all current modelling routines for protein complexes are modeller
based, an alignment according to the modeller
specification has to be constructed. homelette
has the helper function assemble_complex_aln
in the homelette.alignment
submodule that is able to do that:
[2]:
?hm.alignment.assemble_complex_aln
Signature:
hm.alignment.assemble_complex_aln(
*args: Type[ForwardRef('Alignment')],
names: dict,
) -> Type[ForwardRef('Alignment')]
Docstring:
Assemble complex alignments compatible with MODELLER from individual
alignments.
Parameters
----------
*args : Alignment
The input alignments
names : dict
Dictionary instructing how sequences in the different alignment objects
are supposed to be arranged in the complex alignment. The keys are the
names of the sequences in the output alignments. The values are
iterables of the sequence names from the input alignments in the order
they are supposed to appaer in the output alignment. Any value that can
not be found in the alignment signals that this position in the complex
alignment should be filled with gaps.
Returns
-------
Alignment
Assembled complex alignment
Examples
--------
>>> aln1 = hm.Alignment(None)
>>> aln1.sequences = {
... 'seq1_1': hm.alignment.Sequence('seq1_1', 'HELLO'),
... 'seq2_1': hm.alignment.Sequence('seq2_1', 'H---I'),
... 'seq3_1': hm.alignment.Sequence('seq3_1', '-HI--')
... }
>>> aln2 = hm.Alignment(None)
>>> aln2.sequences = {
... 'seq2_2': hm.alignment.Sequence('seq2_2', 'KITTY'),
... 'seq1_2': hm.alignment.Sequence('seq1_2', 'WORLD')
... }
>>> names = {'seq1': ('seq1_1', 'seq1_2'),
... 'seq2': ('seq2_1', 'seq2_2'),
... 'seq3': ('seq3_1', 'gaps')
... }
>>> aln_assembled = hm.alignment.assemble_complex_aln(
... aln1, aln2, names=names)
>>> aln_assembled.print_clustal()
seq1 HELLO/WORLD
seq2 H---I/KITTY
seq3 -HI--/-----
File: /usr/local/src/homelette-1.4/homelette/alignment.py
Type: function
In our case, we assemble an alignment from two different alignments, aln_1
which contains ARAF, RAF1 (4G0N) and BRAF (3NY5) and aln_2
which contains an HRAS sequence and the HRAS sequence from 4G0N.
[3]:
# import single alignments
aln1_file = 'data/complex/aln_eff.fasta_aln'
aln2_file = 'data/complex/aln_ras.fasta_aln'
aln_1 = hm.Alignment(aln1_file)
aln_2 = hm.Alignment(aln2_file)
# build dictionary that indicates how sequences should be assembled
names = {
'ARAF': ('ARAF', 'HRAS'),
'4G0N': ('4G0N', '4G0N'),
'3NY5': ('3NY5', ''),
}
# assemble alignment
aln = hm.alignment.assemble_complex_aln(aln_1, aln_2, names=names)
aln.remove_redundant_gaps()
aln.print_clustal(line_wrap=70)
ARAF ---GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVTAWDTAIAPLD
4G0N -TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASLI
3NY5 HQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ------KKPIGWDTDISWLT
ARAF GEELIVEVL------/MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLD
4G0N GEELQVDFL------/MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLD
3NY5 GEELHVEVLENVPLT/------------------------------------------------------
ARAF ILDTAGQEEYSAMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAART
4G0N ILDTAGQEE--AMRDQYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAART
3NY5 ----------------------------------------------------------------------
ARAF VESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQ-
4G0N VESRQAQDLARSYGIPYIETSAKTRQGVEDAFYTLVREIRQH
3NY5 ------------------------------------------
After assembling the complex alignment, we annotate it as usual:
[4]:
# annotate alignment
aln.get_sequence('ARAF').annotate(seq_type='sequence')
aln.get_sequence('4G0N').annotate(seq_type = 'structure',
pdb_code = '4G0N',
begin_res = '1',
begin_chain = 'A')
aln.get_sequence('3NY5').annotate(seq_type = 'structure',
pdb_code = '3NY5',
begin_res = '1',
begin_chain = 'A')
Modelling
There are 4 routines available specifically for complex modelling based on modeller
[1,2] and altmod
[3]. They run with the same parameters as their counterparts for single structure modelling, except that they handle naming of new chains and residue numbers a bit differently.
The following routines are available for complex modelling:
Routine_complex_automodel_default
Routine_complex_automodel_slow
Routine_complex_altmod_default
Routine_complex_altmod_slow
[5]:
# initialize task object
t = hm.Task(task_name='Tutorial6',
alignment=aln,
target='ARAF',
overwrite=True)
Modelling can be performed with Task.execute_routine
as usual.
[6]:
# generate models based on a complex template
t.execute_routine(tag='automodel_' + '4G0N',
routine=hm.routines.Routine_complex_automodel_default,
templates = ['4G0N'],
template_location='data/complex/',
n_models=20,
n_threads=5)
Not all templates have to be complex templates; it is perfectly fine to mix complex templates and single templates. However, at least one complex template should be used in order to convey information about the relative orientation of the proteins to each other.
[7]:
# generate models based on a complex and a single template
t.execute_routine(tag='automodel_' + '_'.join(['4G0N', '3NY5']),
routine=hm.routines.Routine_complex_automodel_default,
templates = ['4G0N', '3NY5'],
template_location='data/complex',
n_models=20,
n_threads=5)
Evaluation
Not all evaluation metrics are designed to evaluate complex structures. For example, the SOAP score has different statistical potentials for single proteins (Evaluation_soap_protein
) and for protein complexes (Evaluation_soap_pp
) [4].
[8]:
# perform evaluation
t.evaluate_models(hm.evaluation.Evaluation_mol_probity,
hm.evaluation.Evaluation_soap_pp,
n_threads=5)
[9]:
# show a bit of the evaluation
t.get_evaluation().sort_values(by='soap_pp_all').head()
[9]:
| | model | tag | routine | mp_score | soap_pp_all | soap_pp_atom | soap_pp_pair |
|---|---|---|---|---|---|---|---|
32 | automodel_4G0N_3NY5_13.pdb | automodel_4G0N_3NY5 | complex_automodel_default | 2.25 | -9502.636719 | -7770.577637 | -1732.059326 |
39 | automodel_4G0N_3NY5_20.pdb | automodel_4G0N_3NY5 | complex_automodel_default | 2.15 | -9486.243164 | -7656.946777 | -1829.296143 |
28 | automodel_4G0N_3NY5_9.pdb | automodel_4G0N_3NY5 | complex_automodel_default | 2.46 | -9475.368164 | -7769.337891 | -1706.030396 |
29 | automodel_4G0N_3NY5_10.pdb | automodel_4G0N_3NY5 | complex_automodel_default | 2.72 | -9458.609375 | -7647.797852 | -1810.811646 |
9 | automodel_4G0N_10.pdb | automodel_4G0N | complex_automodel_default | 2.39 | -9405.662109 | -7718.845215 | -1686.817139 |
Further reading
Congratulations on finishing the tutorial about complex modelling in homelette. The following tutorials might also be of interest to you:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette's functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
[3] Janson, G., Grottesi, A., Pietrosanto, M., Ausiello, G., Guarguaglini, G., & Paiardini, A. (2019). Revisiting the “satisfaction of spatial restraints” approach of MODELLER for protein homology modeling. PLoS Computational Biology, 15(12), e1007219. https://doi.org/10.1371/journal.pcbi.1007219
[4] Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B., Sali, A., & Tramontano, A. (2013). Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics, 29(24), 3158–3166. https://doi.org/10.1093/bioinformatics/btt560
Session Info
[10]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
pandas 1.5.3
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:40
Tutorial 7: Assembling custom pipelines
[1]:
import homelette as hm
import matplotlib as plt
import seaborn as sns
Introduction
Welcome to the final tutorial on homelette. This tutorial is about combining what we learnt in the previous tutorials about building blocks for model generation and model evaluation.
The strength of homelette lies in (A) being almost freely extendable by the user (see Tutorial 4) and (B) the ease with which pre-defined or custom-made building blocks for model generation and evaluation can be assembled into custom pipelines. This tutorial showcases (B).
For our target sequence, ARAF, we will identify templates and generate alignments with the AlignmentGenerator_pdb
building block [1,2,3,4]. We will select two templates, BRAF (3NY5) and RAF1 (4G0N). We will build models for ARAF with two different routines, Routine_automodel_default
and Routine_automodel_slow
[5,6], and from the different templates. The generated models will be evaluated by SOAP
scores and MolProbity and a combined score will be calculated using Borda Count [7,8,9,10].
Alignment
Consistent with the other tutorials, we will be modelling the protein ARAF. For this tutorial, we will use the AlignmentGenerator_pdb in order to search for templates, create an alignment, and process both the templates and the alignment:
[2]:
gen = hm.alignment.AlignmentGenerator_pdb.from_fasta('data/alignments/ARAF.fa')
[3]:
# search for templates and generate first alignment
gen.get_suggestion()
gen.show_suggestion()
Querying PDB...
Query successful, 16 found!
Retrieving sequences...
Sequences succefully retrieved!
Generating alignment...
Alignment generated!
[3]:
| | template | coverage | identity |
|---|---|---|---|
0 | 6XI7_2 | 100.0 | 60.27 |
1 | 1C1Y_2 | 100.0 | 60.27 |
2 | 1GUA_2 | 100.0 | 60.27 |
3 | 4G0N_2 | 100.0 | 60.27 |
4 | 4G3X_2 | 100.0 | 60.27 |
5 | 6VJJ_2 | 100.0 | 60.27 |
6 | 6XGU_2 | 100.0 | 60.27 |
7 | 6XGV_2 | 100.0 | 60.27 |
8 | 6XHA_2 | 100.0 | 60.27 |
9 | 6XHB_2 | 100.0 | 60.27 |
10 | 7JHP_2 | 100.0 | 60.27 |
11 | 3KUC_2 | 100.0 | 58.90 |
12 | 3KUD_2 | 100.0 | 58.90 |
13 | 3NY5_1 | 100.0 | 58.90 |
14 | 6NTD_2 | 100.0 | 53.42 |
15 | 6NTC_2 | 100.0 | 52.05 |
For this example, we will choose one template from BRAF (3NY5) and one from RAF1 (4G0N):
[4]:
# select templates and show alignment
gen.select_templates(['3NY5_1', '4G0N_2'])
gen.alignment.print_clustal(70)
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
3NY5_1 MGHHHHHHSHMQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKALMMRGLIPECCAVYRIQ---DGEKKPI
4G0N_2 -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
ARAF AWDTAIAPLDGEELIVEVL----------
3NY5_1 GWDTDISWLTGEELHVEVLENVPLTTHNF
4G0N_2 DWNTDAASLIGEELQVDFL----------
Next, we download the template structures and process both the alignment and the structures:
[5]:
# download structures, process alignment and structures
gen.get_pdbs()
gen.show_suggestion()
Guessing template naming format...
Template naming format guessed: polymer_entity!
Checking template dir...
Template dir not found...
New template dir created at
"/home/homelette/workdir/templates"!
Processing templates:
3NY5 downloading from PDB...
3NY5 downloaded!
3NY5_A: Chain extracted!
3NY5_A: Alignment updated!
3NY5_A: PDB processed!
3NY5_B: Chain extracted!
3NY5_B: Alignment updated!
3NY5_B: PDB processed!
3NY5_C: Chain extracted!
3NY5_C: Alignment updated!
3NY5_C: PDB processed!
3NY5_D: Chain extracted!
3NY5_D: Alignment updated!
3NY5_D: PDB processed!
4G0N downloading from PDB...
4G0N downloaded!
4G0N_B: Chain extracted!
4G0N_B: Alignment updated!
4G0N_B: PDB processed!
Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"/home/homelette/workdir/templates".
[5]:
| | template | coverage | identity |
|---|---|---|---|
0 | 4G0N_B | 100.00 | 60.27 |
1 | 3NY5_B | 94.52 | 57.53 |
2 | 3NY5_A | 93.15 | 57.53 |
3 | 3NY5_C | 93.15 | 57.53 |
4 | 3NY5_D | 91.78 | 57.53 |
We can see that there are multiple chains of 3NY5 that fit our alignment. One of the chains has fewer missing residues than the others, so we choose that one:
[6]:
# select templates
gen.select_templates(['4G0N_B', '3NY5_B'])
gen.alignment.print_clustal(70)
gen.show_suggestion()
ARAF ----GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVTAWDTAIAPL
4G0N_B --TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARLDWNTDAASL
3NY5_B SHQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ-----EKKPIGWDTDISWL
ARAF DGEELIVEVL--------
4G0N_B IGEELQVDFL--------
3NY5_B TGEELHVEVLENVPLTTH
[6]:
| | template | coverage | identity |
|---|---|---|---|
0 | 4G0N_B | 100.00 | 60.27 |
1 | 3NY5_B | 94.52 | 57.53 |
Now that we have our templates prepared and aligned, we can define a custom Task object in order to assemble homelette building blocks into a pipeline.
Custom pipeline
The easiest way to formulate custom pipelines from homelette's building blocks for model generation and evaluation is to construct a custom Task object:
[7]:
class CustomPipeline(hm.Task):
'''
Example for a custom pipeline
'''
def model_generation(self, templates):
# model generation with automodel_default
self.execute_routine(tag='automodel_def_' + '-'.join(templates),
routine = hm.routines.Routine_automodel_default,
templates = templates,
template_location = './templates/',
n_models = 20,
n_threads = 5)
# model generation with automodel_slow
self.execute_routine(tag='autmodel_slow_' + '-'.join(templates),
routine = hm.routines.Routine_automodel_slow,
templates = templates,
template_location = './templates/',
n_models = 20,
n_threads = 5)
def model_evaluation(self):
# perform evaluation
self.evaluate_models(hm.evaluation.Evaluation_mol_probity,
n_threads=5)
self.evaluate_models(hm.evaluation.Evaluation_soap_protein,
n_threads=5)
self.evaluate_models(hm.evaluation.Evaluation_qmean4,
n_threads=5)
ev = self.get_evaluation()
# borda count for best models
ev['points_soap'] = ev.shape[0] - ev['soap_protein'].rank()
ev['points_mol_probity'] = ev.shape[0] - ev['mp_score'].rank()
ev['borda_score'] = ev['points_soap'] + ev['points_mol_probity']
ev['borda_rank'] = ev['borda_score'].rank(ascending=False)
ev = ev.drop(labels=['points_soap', 'points_mol_probity'], axis=1)
return ev
We have constructed a custom Task object (more specifically, a custom object that inherits all methods and attributes from Task) and added two more methods: model_generation and model_evaluation.
In CustomPipeline.model_generation, we use two routines (Routine_automodel_default and Routine_automodel_slow) to generate 20 models each. In CustomPipeline.model_evaluation, we evaluate the models using Evaluation_mol_probity, Evaluation_soap_protein and Evaluation_qmean4, and then rank the generated models based on the MolProbity and SOAP scores using Borda count.
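To make the combined score more tangible, here is a small toy illustration of the Borda count computed in model_evaluation above, on a hand-made table with made-up values (lower SOAP and MolProbity scores are better):
import pandas as pd

# toy evaluation table with made-up values
ev = pd.DataFrame({
    'model': ['a.pdb', 'b.pdb', 'c.pdb'],
    'soap_protein': [-45000.0, -44000.0, -46000.0],
    'mp_score': [2.5, 2.1, 2.3],
})
# every model receives (number of models - rank) points per metric ...
ev['points_soap'] = ev.shape[0] - ev['soap_protein'].rank()
ev['points_mol_probity'] = ev.shape[0] - ev['mp_score'].rank()
# ... and the Borda score is the sum of points over all metrics
ev['borda_score'] = ev['points_soap'] + ev['points_mol_probity']
ev['borda_rank'] = ev['borda_score'].rank(ascending=False)
print(ev[['model', 'borda_score', 'borda_rank']])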
After constructing our pipeline, let's try it out in practice with two different templates. Since we have defined a custom Task object, we can initialize it directly from the AlignmentGenerator in order to do the modelling and evaluation:
[8]:
# initialize task from alignment generator
t = gen.initialize_task(
task_name = 'Tutorial7',
overwrite = True,
task_class = CustomPipeline)
[9]:
# execute pipeline for different templates
t.model_generation(['3NY5_B'])
t.model_generation(['4G0N_B'])
df_eval = t.model_evaluation()
We have successfully generated and evaluated 80 models.
[10]:
# get template from tag
df_eval['template'] = df_eval['tag'].str.contains('3NY5').map({True: '3NY5', False: '4G0N'})
[11]:
df_eval.sort_values(by = 'borda_rank').head(10)
[11]:
| | model | tag | routine | mp_score | soap_protein | qmean4 | qmean4_z_score | borda_score | borda_rank | template |
|---|---|---|---|---|---|---|---|---|---|---|
64 | autmodel_slow_4G0N_B_5.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.21 | -45545.746094 | 0.814469 | 0.255860 | 149.5 | 1.0 | 4G0N |
77 | autmodel_slow_4G0N_B_18.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.17 | -45043.023438 | 0.775498 | -0.340560 | 143.0 | 2.0 | 4G0N |
38 | autmodel_slow_3NY5_B_19.pdb | autmodel_slow_3NY5_B | automodel_slow | 2.42 | -48817.878906 | 0.769190 | -0.437096 | 141.0 | 3.0 | 3NY5 |
69 | autmodel_slow_4G0N_B_10.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.30 | -45205.257812 | 0.805243 | 0.114666 | 138.0 | 4.0 | 4G0N |
63 | autmodel_slow_4G0N_B_4.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.26 | -44921.707031 | 0.771055 | -0.408556 | 134.0 | 5.0 | 4G0N |
79 | autmodel_slow_4G0N_B_20.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.24 | -44596.234375 | 0.787342 | -0.159296 | 131.5 | 6.0 | 4G0N |
72 | autmodel_slow_4G0N_B_13.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.21 | -44206.707031 | 0.796167 | -0.024243 | 128.5 | 7.0 | 4G0N |
73 | autmodel_slow_4G0N_B_14.pdb | autmodel_slow_4G0N_B | automodel_slow | 2.39 | -44924.730469 | 0.754554 | -0.661071 | 126.0 | 8.0 | 4G0N |
49 | automodel_def_4G0N_B_10.pdb | automodel_def_4G0N_B | automodel_default | 2.47 | -45311.910156 | 0.767716 | -0.459645 | 125.0 | 9.0 | 4G0N |
34 | autmodel_slow_3NY5_B_15.pdb | autmodel_slow_3NY5_B | automodel_slow | 2.33 | -44530.144531 | 0.720679 | -1.179500 | 124.0 | 10.0 | 3NY5 |
We can see that most of the 10 best models were generated with the slower routine Routine_automodel_slow. This is to be expected, as this routine spends more time on model refinement and should therefore produce “better” models.
Next, we visualize the results of our evaluation with seaborn.
Visualization
[12]:
# visualize combined score with seaborn
%matplotlib inline
# set font size
plt.rcParams.update({'font.size': 16})
plot = sns.boxplot(x = 'routine', y = 'borda_score', hue='template', data=df_eval,
palette='viridis')
plot.set(xlabel = 'Routine')
plot.set(ylabel = 'Combined Score')
plot.figure.set_size_inches(10, 10)
plot.legend(title = 'Template', loc = 'lower right', ncol = 2, fancybox = True)
#plot.figure.savefig('tutorial7.png', dpi=300)
[12]:
<matplotlib.legend.Legend at 0x7f23799902e0>

As expected, the routine which spends more time on model refinement (Routine_automodel_slow
) produces on average better results. Also, there are interesting differences between the templates used.
Further Reading
Congratulations on finishing the final tutorial about homelette. You might also be interested in the other tutorials:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette's functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn about how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., Chen, L., Rose, A. S., Bittrich, S., Burley, S. K., & Westbrook, J. D. (2021). RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. Journal of Molecular Biology, 433(11), 166704. https://doi.org/10.1016/J.JMB.2020.11.003
[2] Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 2017 35:11, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988
[3] Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., & Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. https://doi.org/10.1038/MSB.2011.75
[4] Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1), 135–145. https://doi.org/10.1002/PRO.3290
[5] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[6] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
[7] Dong, G. Q., Fan, H., Schneidman-Duhovny, D., Webb, B., Sali, A., & Tramontano, A. (2013). Optimized atomic statistical potentials: Assessment of protein interfaces and loops. Bioinformatics, 29(24), 3158–3166. https://doi.org/10.1093/bioinformatics/btt560
[8] Davis, I. W., Leaver-Fay, A., Chen, V. B., Block, J. N., Kapral, G. J., Wang, X., Murray, L. W., Arendall, W. B., Snoeyink, J., Richardson, J. S., & Richardson, D. C. (2007). MolProbity: all-atom contacts and structure validation for proteins and nucleic acids. Nucleic Acids Research, 35(suppl_2), W375–W383. https://doi.org/10.1093/NAR/GKM216
[9] Chen, V. B., Arendall, W. B., Headd, J. J., Keedy, D. A., Immormino, R. M., Kapral, G. J., Murray, L. W., Richardson, J. S., & Richardson, D. C. (2010). MolProbity: All-atom structure validation for macromolecular crystallography. Acta Crystallographica Section D: Biological Crystallography, 66(1), 12–21. https://doi.org/10.1107/S0907444909042073
[10] Williams, C. J., Headd, J. J., Moriarty, N. W., Prisant, M. G., Videau, L. L., Deis, L. N., Verma, V., Keedy, D. A., Hintze, B. J., Chen, V. B., Jain, S., Lewis, S. M., Arendall, W. B., Snoeyink, J., Adams, P. D., Lovell, S. C., Richardson, J. S., & Richardson, D. C. (2018). MolProbity: More and better reference data for improved all-atom structure validation. Protein Science, 27(1), 293–315. https://doi.org/10.1002/pro.3330
Session Info
[13]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
matplotlib 3.1.2
pandas 1.5.3
seaborn 0.12.2
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib_inline 0.1.6
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
scipy 1.10.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:50
Tutorial 8: Automatic Alignment Generation
[1]:
import homelette as hm
Introduction
Welcome to the eighth tutorial for homelette, in which we will explore homelette's tools for automated alignment generation.
The alignment is a central step in homology modelling, and the quality of the alignment used for modelling has a large influence on the final models. In general, the difficulty of creating solid sequence alignments depends mainly on how closely related the target and the template are. If they share a high sequence identity, the alignments are easy to construct and the modelling process will most likely be successful.
Note
As a rule of thumb, it is said that everything above 50-60% sequence identity is well approachable, while everything below 30% sequence identity is very challenging to model.
homelette
has methods that can automatically generate an alignment given a query sequence. However, these methods hide some of the complexity of generating good alignments. Use them at your own discretion, especially for target sequences with low sequence identity to any template.
Note
Be careful with automatically generated alignments if your protein of interest has no closely related templates.
After these words of caution, let’s look at the implemented methods:
alignment.AlignmentGenerator_pdb: Query the PDB and align locally with Clustal Omega.
alignment.AlignmentGenerator_hhblits: Local database search against the PDB70 database.
alignment.AlignmentGenerator_from_aln: For when you already have an alignment ready, but want to make use of homelette's processing of templates and alignments.
Method 1: Querying RCSB and Realignment of template sequences with Clustal Omega
This class performs a three-step process:
1. Template identification: the RCSB is queried using a sequence (internally, RCSB uses MMseqs2) [1, 2] (get_suggestion).
2. The sequences of the identified templates are aligned locally using Clustal Omega [3, 4] (get_suggestion).
3. Finally, the template structures are downloaded and processed together with the alignment (get_pdbs).
Afterwards, the templates should be ready for performing homology modelling.
For a practical demonstration, let’s find some templates for ARAF:
[2]:
gen = hm.alignment.AlignmentGenerator_pdb.from_fasta('data/alignments/ARAF.fa')
# gen = hm.alignment.AlignmentGenerator_pdb(
# sequence = 'GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLIKGRKTVTAWDTAIAPLDGEELIVEVL',
# target = 'ARAF')
There are two ways to initialize an AlignmentGenerator: either directly with a sequence, or from a FASTA file. Both ways are shown above (the sequence-based initialization is commented out).
In the next step, we use this sequence to generate an initial alignment:
[3]:
gen.get_suggestion()
Querying PDB...
Query successful, 16 found!
Retrieving sequences...
Sequences succefully retrieved!
Generating alignment...
Alignment generated!
As we can see from the output, we are querying the PDB and extracting potential templates. Then, an alignment is generated.
We can have a first look at the suggested templates:
[4]:
gen.show_suggestion()
[4]:
| | template | coverage | identity |
|---|---|---|---|
0 | 1C1Y_2 | 100.0 | 60.27 |
1 | 1GUA_2 | 100.0 | 60.27 |
2 | 4G0N_2 | 100.0 | 60.27 |
3 | 4G3X_2 | 100.0 | 60.27 |
4 | 6VJJ_2 | 100.0 | 60.27 |
5 | 6XGU_2 | 100.0 | 60.27 |
6 | 6XGV_2 | 100.0 | 60.27 |
7 | 6XHA_2 | 100.0 | 60.27 |
8 | 6XHB_2 | 100.0 | 60.27 |
9 | 6XI7_2 | 100.0 | 60.27 |
10 | 7JHP_2 | 100.0 | 60.27 |
11 | 3KUC_2 | 100.0 | 58.90 |
12 | 3KUD_2 | 100.0 | 58.90 |
13 | 3NY5_1 | 100.0 | 58.90 |
14 | 6NTD_2 | 100.0 | 53.42 |
15 | 6NTC_2 | 100.0 | 52.05 |
[5]:
gen.alignment.print_clustal(70)
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
1C1Y_2 ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
1GUA_2 --------PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
3KUC_2 --------PSKTSNTIRVFLPNKQRTVVRVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARL
3KUD_2 --------PSKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARL
3NY5_1 MGHHHHHHSHMQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKALMMRGLIPECCAVYRIQ---DGEKKPI
4G0N_2 -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
4G3X_2 ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6NTC_2 --------GAMDSNTIRVLLPNQEWTVVKVRNGMSLHDSLMKALKRHGLQPESSAVFRLLHEHKGKKARL
6NTD_2 --------GAMDSNTIRVLLPNHERTVVKVRNGMSLHDSLMKALKRHGLQPESSAVFRLLHEHKGKKARL
6VJJ_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XGU_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XGV_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XHA_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XHB_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XI7_2 ---------SKTSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
7JHP_2 ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
ARAF AWDTAIAPLDGEELIVEVL---------------------------------------------------
1C1Y_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
1GUA_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
3KUC_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
3KUD_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
3NY5_1 GWDTDISWLTGEELHVEVLENVPLTTHNF-----------------------------------------
4G0N_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
4G3X_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
6NTC_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
6NTD_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
6VJJ_2 DWNTDAASLIGEELQVDFL---------------------------------------------------
6XGU_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XGV_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XHA_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XHB_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XI7_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
7JHP_2 DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
ARAF ------
1C1Y_2 ------
1GUA_2 ------
3KUC_2 ------
3KUD_2 ------
3NY5_1 ------
4G0N_2 ------
4G3X_2 ------
6NTC_2 ------
6NTD_2 ------
6VJJ_2 ------
6XGU_2 MCVDWS
6XGV_2 MCVDWS
6XHA_2 MCVDWS
6XHB_2 MCVDWS
6XI7_2 MCVDWS
7JHP_2 MCVDW-
After potentially filtering out some sequences, we can proceed with the next step: downloading the structures for our templates, comparing the sequences of the templates with the residues present in the template structures, and making adjustments to both the structures and the alignment if necessary.
[6]:
gen.get_pdbs()
Guessing template naming format...
Template naming format guessed: polymer_entity!
Checking template dir...
Template dir found!
Processing templates:
1C1Y downloading from PDB...
1C1Y downloaded!
1C1Y_B: Chain extracted!
1C1Y_B: Alignment updated!
1C1Y_B: PDB processed!
1GUA downloading from PDB...
1GUA downloaded!
1GUA_B: Chain extracted!
1GUA_B: Alignment updated!
1GUA_B: PDB processed!
3KUC downloading from PDB...
3KUC downloaded!
3KUC_B: Chain extracted!
3KUC_B: Alignment updated!
3KUC_B: PDB processed!
3KUD downloading from PDB...
3KUD downloaded!
3KUD_B: Chain extracted!
3KUD_B: Alignment updated!
3KUD_B: PDB processed!
3NY5 downloading from PDB...
3NY5 downloaded!
3NY5_A: Chain extracted!
3NY5_A: Alignment updated!
3NY5_A: PDB processed!
3NY5_B: Chain extracted!
3NY5_B: Alignment updated!
3NY5_B: PDB processed!
3NY5_C: Chain extracted!
3NY5_C: Alignment updated!
3NY5_C: PDB processed!
3NY5_D: Chain extracted!
3NY5_D: Alignment updated!
3NY5_D: PDB processed!
4G0N downloading from PDB...
4G0N downloaded!
4G0N_B: Chain extracted!
4G0N_B: Alignment updated!
4G0N_B: PDB processed!
4G3X downloading from PDB...
4G3X downloaded!
4G3X_B: Chain extracted!
4G3X_B: Alignment updated!
4G3X_B: PDB processed!
6NTC downloading from PDB...
6NTC downloaded!
6NTC_B: Chain extracted!
6NTC_B: Alignment updated!
6NTC_B: PDB processed!
6NTD downloading from PDB...
6NTD downloaded!
6NTD_B: Chain extracted!
6NTD_B: Alignment updated!
6NTD_B: PDB processed!
6VJJ downloading from PDB...
6VJJ downloaded!
6VJJ_B: Chain extracted!
6VJJ_B: Alignment updated!
6VJJ_B: PDB processed!
6XGU downloading from PDB...
6XGU downloaded!
6XGU_B: Chain extracted!
6XGU_B: Alignment updated!
6XGU_B: PDB processed!
6XGV downloading from PDB...
6XGV downloaded!
6XGV_B: Chain extracted!
6XGV_B: Alignment updated!
6XGV_B: PDB processed!
6XHA downloading from PDB...
6XHA downloaded!
6XHA_B: Chain extracted!
6XHA_B: Alignment updated!
6XHA_B: PDB processed!
6XHB downloading from PDB...
6XHB downloaded!
6XHB_B: Chain extracted!
6XHB_B: Alignment updated!
6XHB_B: PDB processed!
6XI7 downloading from PDB...
6XI7 downloaded!
6XI7_B: Chain extracted!
6XI7_B: Alignment updated!
6XI7_B: PDB processed!
7JHP downloading from PDB...
7JHP downloaded!
7JHP_C: Chain extracted!
7JHP_C: Alignment updated!
7JHP_C: PDB processed!
Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"/home/homelette/workdir/templates".
get_pdbs
will check all chains of a template and download those with the correct sequence.
[7]:
gen.alignment.print_clustal(70)
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
1C1Y_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
1GUA_B -------------NTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
3KUC_B -------------NTIRVFLPNKQRTVVRVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARL
3KUD_B -------------NTIRVFLPNKQRTVVNVRNGMSLHDCLMKKLKVRGLQPECCAVFRLLHEHKGKKARL
3NY5_A ---------H-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ------KKPI
3NY5_B --------SH-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ-----EKKPI
3NY5_C -----------QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ------KKPI
3NY5_D ---------H-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRI-------KKPI
4G0N_B -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
4G3X_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6NTC_B -------------NTIRVLLPNQEWTVVKV---MSLHDSLMKALKRHGLQPESSAVF---------KARL
6NTD_B ------------SNTIRVLLPNHERTVVKVRNGMSLHDSLMKALKRHGLQPESSAVF-----------RL
6VJJ_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
6XGU_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPE-CAVFRLLHEHKGKKARL
6XGV_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPE-CAVFRLLHEHKGKKARL
6XHA_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPE-CAVFRLLHEHKGKKARL
6XHB_B ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPE-CAVFRLLHEHKGKKARL
6XI7_B -------------NTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLH----KKARL
7JHP_C ------------SNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLL-----KKARL
ARAF AWDTAIAPLDGEELIVEVL---------------------------------------------------
1C1Y_B DWNTDAASLIGEELQVDFL---------------------------------------------------
1GUA_B DWNTDAASLIGEELQVDFL---------------------------------------------------
3KUC_B DWNTDAASLIGEELQVDFL---------------------------------------------------
3KUD_B DWNTDAASLIGEELQVDFL---------------------------------------------------
3NY5_A GWDTDISWLTGEELHVEVLENVPLT---------------------------------------------
3NY5_B GWDTDISWLTGEELHVEVLENVPLTTH-------------------------------------------
3NY5_C GWDTDISWLTGEELHVEVLENVPLTTH-------------------------------------------
3NY5_D GWDTDISWLTGEELHVEVLENVPL----------------------------------------------
4G0N_B DWNTDAASLIGEELQVDFL---------------------------------------------------
4G3X_B DWNTDAASLIGEELQVDFL---------------------------------------------------
6NTC_B DWNTDAASLIGEELQVDF----------------------------------------------------
6NTD_B DWNTDAASLIGEELQVD-----------------------------------------------------
6VJJ_B DWNTDAASLIGEELQVDFL---------------------------------------------------
6XGU_B DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XGV_B DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XHA_B DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XHB_B DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
6XI7_B DWNTDAASLIGEELQVDFLDHVPLTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
7JHP_C DWNTDAASLIGEELQVDFLDH--LTTHNFARKTFLKLAFCDICQKFLLNGFRCQTCGYKFHEHCSTKVPT
ARAF ------
1C1Y_B ------
1GUA_B ------
3KUC_B ------
3KUD_B ------
3NY5_A ------
3NY5_B ------
3NY5_C ------
3NY5_D ------
4G0N_B ------
4G3X_B ------
6NTC_B ------
6NTD_B ------
6VJJ_B ------
6XGU_B MCVDWS
6XGV_B MCVDWS
6XHA_B MCVDWS
6XHB_B MCVDWS
6XI7_B MCV---
7JHP_C MCVDW-
Now we can directly use these templates for homology modelling:
[8]:
# initialize task
t = gen.initialize_task(task_name = 'Tutorial8', overwrite = True)
# create a model per template
templates = [temp for temp in t.alignment.sequences.keys() if temp != 'ARAF']
for template in templates:
t.execute_routine(
tag = f'test_{template}',
routine = hm.routines.Routine_automodel_default,
templates = [template],
template_location = './templates/'
)
[9]:
# inspect models
t.models
[9]:
[<homelette.organization.Model at 0x7f22492f4340>,
<homelette.organization.Model at 0x7f22492f45b0>,
<homelette.organization.Model at 0x7f229829a610>,
<homelette.organization.Model at 0x7f2273b6afa0>,
<homelette.organization.Model at 0x7f2273b38ee0>,
<homelette.organization.Model at 0x7f22491c0e50>,
<homelette.organization.Model at 0x7f22491bf070>,
<homelette.organization.Model at 0x7f22491bf880>,
<homelette.organization.Model at 0x7f22491c5760>,
<homelette.organization.Model at 0x7f22491c5a00>,
<homelette.organization.Model at 0x7f22491c8310>,
<homelette.organization.Model at 0x7f22491c8820>,
<homelette.organization.Model at 0x7f22491b0f10>,
<homelette.organization.Model at 0x7f22491c96a0>,
<homelette.organization.Model at 0x7f22491c9b80>,
<homelette.organization.Model at 0x7f22491c8af0>,
<homelette.organization.Model at 0x7f22492f49d0>,
<homelette.organization.Model at 0x7f22491bfbe0>,
<homelette.organization.Model at 0x7f2273b38040>]
Method 2: HHSuite
This class is built on the hhblits query function of HHSuite3 [5].
It has the same interface as AlignmentGenerator_pdb, except for some different settings for the alignment generation and get_pdbs.
It should also be noted that, technically, this approach does not generate a multiple sequence alignment, but rather a combined alignment of many pairwise alignments of the query to each template. These pairwise alignments are combined on the common sequence to which they are all aligned (a toy illustration follows below).
(This code is commented out since it requires big databases to run, which are not part of the docker container.)
[10]:
# gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta('data/alignments/ARAF.fa')
# gen.get_suggestion(database_dir='/home/philipp/Downloads/hhsuite_dbs/')
# gen.get_pdbs()
# gen.show_suggestion()
# t = gen.initialize_task()
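Since the database-dependent example above cannot be run here, the following toy illustration (made-up sequences, not homelette code) shows what combining pairwise alignments on the common query sequence means, in the simple case where no template introduces insertions relative to the query:
# two pairwise alignments of the same query, one per template
query      = 'GTVKVYLPNK'
pairwise_1 = ('GTVKVYLPNK',   # query row of pairwise alignment 1
              'NTIRVFLPNK')   # template 1
pairwise_2 = ('GTVKVYLPNK',   # query row of pairwise alignment 2
              'PIVRV-LPNK')   # template 2, with a gap

# both alignments share the identical query row, so the template rows
# can simply be stacked on the common query coordinates
assert pairwise_1[0] == pairwise_2[0] == query
combined = {'query': query,
            'template_1': pairwise_1[1],
            'template_2': pairwise_2[1]}
for name, row in combined.items():
    print('{:<12}{}'.format(name, row))
If a template did contain insertions relative to the query, additional gap columns would have to be introduced into all other rows, which is the harder part of this combination step.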
Method 3: Using pre-computed alignments
If you already have an alignment computed, but want to make use of get_pdbs
in order to download the templates and process the alignment and the template structures, there is also the possibility to load your alignment into an AlignmentGenerator
object:
[11]:
# initialize an alignment generator from a pre-computed alignment
gen = hm.alignment.AlignmentGenerator_from_aln(
alignment_file = 'data/alignments/unprocessed.fasta_aln',
target = 'ARAF')
gen.show_suggestion()
gen.alignment.print_clustal(70)
gen.get_pdbs()
gen.alignment.print_clustal(70)
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
3NY5 MGHHHHHHSHMQKPIVRVFLPNKQRTVVPARCGVTVRDSLKKALMMRGLIPECCAVYRIQ---DGEKKPI
4G0N -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
ARAF AWDTAIAPLDGEELIVEVL----------
3NY5 GWDTDISWLTGEELHVEVLENVPLTTHNF
4G0N DWNTDAASLIGEELQVDFL----------
Guessing template naming format...
Template naming format guessed: entry!
Checking template dir...
Template dir found!
Processing templates:
3NY5 downloading from PDB...
3NY5 downloaded!
3NY5_A: Chain extracted!
3NY5_A: Alignment updated!
3NY5_A: PDB processed!
3NY5_B: Chain extracted!
3NY5_B: Alignment updated!
3NY5_B: PDB processed!
3NY5_C: Chain extracted!
3NY5_C: Alignment updated!
3NY5_C: PDB processed!
3NY5_D: Chain extracted!
3NY5_D: Alignment updated!
3NY5_D: PDB processed!
4G0N downloading from PDB...
4G0N downloaded!
4G0N_B: Chain extracted!
4G0N_B: Alignment updated!
4G0N_B: PDB processed!
Finishing... All templates successfully
downloaded and processed!
Templates can be found in
"./templates/".
ARAF -------------GTVKVYLPNKQRTVVTVRDGMSVYDSLDKALKVRGLNQDCCVVYRLI---KGRKTVT
3NY5_A ---------H-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ------KKPI
3NY5_B --------SH-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ-----EKKPI
3NY5_C -----------QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRIQ------KKPI
3NY5_D ---------H-QKPIVRVFLPNKQRTVVPARCGVTVRDSLKKAL--RGLIPECCAVYRI-------KKPI
4G0N_B -----------TSNTIRVFLPNKQRTVVNVRNGMSLHDCLMKALKVRGLQPECCAVFRLLHEHKGKKARL
ARAF AWDTAIAPLDGEELIVEVL----------
3NY5_A GWDTDISWLTGEELHVEVLENVPLT----
3NY5_B GWDTDISWLTGEELHVEVLENVPLTTH--
3NY5_C GWDTDISWLTGEELHVEVLENVPLTTH--
3NY5_D GWDTDISWLTGEELHVEVLENVPL-----
4G0N_B DWNTDAASLIGEELQVDFL----------
Again, for every template structure, homelette identifies which chains match the aligned sequence and then extracts all of them.
Of course, if your alignment and template(s) are already processed, it is perfectly fine to use the Alignment class directly, as we have done in the previous tutorials.
Implementing own methods
While not discussed in Tutorial 4, AlignmentGenerator objects are also building blocks in the homelette framework, and custom versions can be implemented. All AlignmentGenerator child classes implemented so far inherit from the AlignmentGenerator abstract base class, which provides some useful functionality for writing your own alignment generators, in particular the get_pdbs function.
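The following minimal sketch illustrates the idea. It only relies on what is documented in the API reference below (the base class stores target, target_seq and an Alignment object, and child classes implement get_suggestion); the hard-coded template name and sequence are placeholders, and a real implementation would also have to update the internal state bookkeeping of the base class.
import homelette as hm

class AlignmentGenerator_custom(hm.alignment.AlignmentGenerator):
    '''
    Sketch of a custom generator that "suggests" a single, hand-picked,
    already aligned template instead of querying a database.
    '''
    def get_suggestion(self):
        # placeholder: in a real implementation, retrieve and align the
        # template sequence here instead of reusing the target sequence
        aligned_template = self.target_seq
        self.alignment = hm.Alignment(None)
        self.alignment.sequences = {
            self.target: hm.alignment.Sequence(self.target, self.target_seq),
            '4G0N': hm.alignment.Sequence('4G0N', aligned_template),
        }
        # note: self.state would also need to be updated here so that
        # get_pdbs() and initialize_task() recognise that an alignment
        # is available (the exact keys of state are not documented in
        # this excerpt and are therefore not shown)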
Further Reading
Congratulations on finishing the tutorial about alignment generation in homelette.
Please note that there are other tutorials which will teach you more about how to use homelette:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 5: Learn how to use parallelization in order to generate and evaluate models more efficiently.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
References
[1] Rose, Y., Duarte, J. M., Lowe, R., Segura, J., Bi, C., Bhikadiya, C., Chen, L., Rose, A. S., Bittrich, S., Burley, S. K., & Westbrook, J. D. (2021). RCSB Protein Data Bank: Architectural Advances Towards Integrated Searching and Efficient Access to Macromolecular Structure Data from the PDB Archive. Journal of Molecular Biology, 433(11), 166704. https://doi.org/10.1016/J.JMB.2020.11.003
[2] Steinegger, M., & Söding, J. (2017). MMseqs2 enables sensitive protein sequence searching for the analysis of massive data sets. Nature Biotechnology 2017 35:11, 35(11), 1026–1028. https://doi.org/10.1038/nbt.3988
[3] Sievers, F., Wilm, A., Dineen, D., Gibson, T. J., Karplus, K., Li, W., Lopez, R., McWilliam, H., Remmert, M., Söding, J., Thompson, J. D., & Higgins, D. G. (2011). Fast, scalable generation of high-quality protein multiple sequence alignments using Clustal Omega. Molecular Systems Biology, 7(1), 539. https://doi.org/10.1038/MSB.2011.75
[4] Sievers, F., & Higgins, D. G. (2018). Clustal Omega for making accurate alignments of many protein sequences. Protein Science, 27(1), 135–145. https://doi.org/10.1002/PRO.3290
[5] Steinegger, M., Meier, M., Mirdita, M., Vöhringer, H., Haunsberger, S. J., & Söding, J. (2019). HH-suite3 for fast remote homology detection and deep protein annotation. BMC Bioinformatics, 20(1), 1–15. https://doi.org/10.1186/S12859-019-3019-7/FIGURES/7
Session Info
[12]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
homelette 1.4
pandas 1.5.3
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:40
API Documentation
This is the documentation for all classes, methods and functions in homelette.
homelette.organization
The homelette.organization submodule contains classes for organizing workflows.
Task is an object orchestrating model generation and evaluation.
Model is an object used for storing information about generated models.
Tutorials
For an introduction to homelette’s workflow, Tutorial 1 is useful. Assembling custom pipelines is discussed in Tutorial 7.
Classes
The following classes are part of this submodule:
- class homelette.organization.Task(task_name: str, target: str, alignment: Type[Alignment], task_directory: str = None, overwrite: bool = False)
Class for directing modelling and evaluation.
It is designed for the modelling of one target sequence from one or multiple templates.
If an already existing folder with models is specified, the Task object will load those models in automatically. In this case, it can also be used exclusively for evaluation purposes.
- Parameters:
task_name (str) – The name of the task
target (str) – The identifier of the protein to model
alignment (Alignment) – The alignment object that will be used for modelling
task_directory (str, optional) – The directory that will be used for this modelling task (default is creating a new one based on the task_name)
overwrite (bool, optional) – Determines whether an already existing task_directory should be overwritten. If a directory already exists for the given task_name or task_directory, it will either be overwritten together with all its contents (True) or the contained models will be imported (False) (default is False)
- Variables:
task_name (str) – The name of the task
task_directory (str) – The directory that will be used for this modelling task (default is to use the task_name)
target (str) – The identifier of the protein to model
alignment (Alignment) – The alignment object that will be used for modelling
models (list) – List of models generated or imported by this task
routines (list) – List of modelling routines executed by this task
- Return type:
None
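As a usage sketch (not part of the original docstring), a Task can be constructed directly from an Alignment object. The file name and the target name 'ARAF' below are placeholders taken from the tutorial above, and the package-level shortcuts hm.Alignment and hm.Task are used as in the tutorials.
import homelette as hm

# read an alignment and hand it to a new Task (paths and names are placeholders)
aln = hm.Alignment('data/alignments/unprocessed.fasta_aln')
t = hm.Task(
    task_name = 'example_task',   # models are collected in ./example_task/
    target = 'ARAF',              # name of the target sequence in the alignment
    alignment = aln,
    overwrite = True,
)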
- execute_routine(tag: str, routine: Type[routines.Routine], templates: Iterable, template_location: str = '.', **kwargs) None
Generates homology models using a specified modelling routine
- Parameters:
tag (str) – The identifier associated with this combination of routine and template(s). Has to be unique between all routines executed by the same task object
routine (Routine) – The routine object used to generate the models
templates (list) – The iterable containing the identifier(s) of the template(s) used for model generation
template_location (str, optional) – The location of the template PDB files. They should be named according to their identifiers in the alignment (e.g. for a sequence named “1WXN” to be used as a template, a PDB file named “1WXN.pdb” is expected in the specified template location; default is the current working directory)
**kwargs – Named parameters passed directly on to the Routine object when the modelling is performed. Please check the documentation in order to make sure that the parameters passed on are available with the Routine object you intend to use
- Return type:
None
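Any additional keyword arguments are passed straight on to the routine. As a sketch (continuing from the Task constructed above), n_models is such a forwarded parameter of the automodel routines documented in homelette.routines; the tag and template identifier are placeholders.
# generate 5 models from one template with the default automodel routine;
# n_models is forwarded to Routine_automodel_default
t.execute_routine(
    tag = 'automodel_4G0N',
    routine = hm.routines.Routine_automodel_default,
    templates = ['4G0N_B'],            # template identifier as named in the alignment
    template_location = './templates/',
    n_models = 5,                      # forwarded keyword argument
)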
- evaluate_models(*args: Type[evaluation.Evaluation], n_threads: int = 1) None
Evaluates models using one or multiple evaluation metrics
- Parameters:
*args (Evaluation) – Evaluation objects that will be applied to the models
n_threads (int, optional) – Number of threads used for model evaluation (default is 1, which deactivates parallelization)
- Return type:
None
- get_evaluation() pandas.DataFrame
Return evaluation for all models as pandas dataframe.
- Returns:
Dataframe containing all model evaluation
- Return type:
pd.DataFrame
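A short sketch of the evaluation step, continuing from the Task above. The evaluation classes live in the homelette.evaluation submodule, which is not shown in this excerpt; Evaluation_dope is used here as an assumed example of such a class.
# evaluate all models of the task with two threads and collect the results;
# Evaluation_dope is assumed to be available in homelette.evaluation
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads = 2)
evaluation = t.get_evaluation()
print(evaluation.head())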
- class homelette.organization.Model(model_file: str, tag: str, routine: str)
Interface used to interact with created protein structure models.
- Parameters:
model_file (str) – The file location of the PDB file for this model
tag (str) – The tag that was used when generating this model (see Task.execute_routine for more details)
routine (str) – The name of the routine that was used to generate this model
- Variables:
model_file (str) – The file location of the PDB file for this model
tag (str) – The tag that was used when generating this model (see Task.execute_routine for more details)
routine (str) – The name of the routine that was used to generate this model
info (dict) – Dictionary that can be used to store metadata about the model (i.e. for some evaluation metrics)
- Return type:
None
- parse_pdb() pandas.DataFrame
Parses the ATOM and HETATM records of the PDB file into a pandas dataframe. Useful for giving some evaluation methods access to data from the PDB file.
- Return type:
pd.DataFrame
Notes
Information is extracted according to the PDB file specification (version 3.30) and columns are named accordingly. See https://www.wwpdb.org/documentation/file-format for more information.
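For example, the returned dataframe can be inspected directly. This sketch only relies on the return value being a pandas dataframe and makes no assumption about specific column names; the Task object t is taken from the examples above.
model = t.models[0]          # any Model object, here the first model of a Task
atoms = model.parse_pdb()
print(atoms.shape)           # number of parsed ATOM/HETATM records and columns
print(list(atoms.columns))   # column names follow the PDB file specification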
- get_sequence() str
Retrieve the 1-letter amino acid sequence of the PDB file associated with the Model object.
- Returns:
Amino acid sequence
- Return type:
str
- rename(new_name: str) None
Rename the PDB file associated with the Model object.
- Parameters:
new_name (str) – New name of PDB file
- Return type:
None
homelette.alignment
The homelette.alignment submodule contains a selection of tools for handling sequences and alignments, as well as for the automatic generation of alignments from a target sequence.
Tutorials
Basic handling of alignments with homelette is demonstrated in Tutorial 1. The assembling of alignments for complex modelling is discussed in Tutorial 6. The automatic generation of alignments is shown in Tutorial 8.
Functions and classes
Functions and classes present in homelette.alignment are listed below:
- class homelette.alignment.Alignment(file_name: Optional[str] = None, file_format: str = 'fasta')
Bases:
object
Class for managing sequence alignments.
- Parameters:
file_name (str, optional) – The file to read the alignment from. If no file name is given, an empty alignment object will be created (default None)
file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’)
- Variables:
sequences (dict) – Collection of sequences. Sequence names are the dictionary keys, Sequence objects the values
- Raises:
ValueError – File_format specified is not ‘fasta’ or ‘pir’
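A brief usage sketch (file and sequence names are placeholders):
import homelette as hm

# read a FASTA alignment, inspect it, and write it out again in PIR format
aln = hm.Alignment('alignment.fasta', file_format = 'fasta')
aln.print_clustal(line_wrap = 70)
identities = aln.calc_identity_target('target')   # identity of all sequences to 'target'
aln.write_pir('alignment.pir')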
- get_sequence(sequence_name: str) Type[Sequence]
Retrieve sequence object by sequence name.
- Parameters:
sequence_name (str) – Name of sequence to retrieve
- Return type:
Sequence
- select_sequences(sequence_names: Iterable) None
Select sequences to remain in the alignment by sequence name
- Parameters:
sequence_names (iterable) – Iterable of sequence names
- Return type:
None
- Raises:
KeyError – Sequence name not found in alignment
- remove_sequence(sequence_name: str) None
Remove a sequence from the alignment by sequence name.
- Parameters:
sequence_name (str) – Sequence name to remove from alignment
- Return type:
None
- rename_sequence(old_name: str, new_name: str) None
Rename sequence in the alignment
- Parameters:
old_name (str) – Old name of sequence
new_name (str) – New name of sequence
- Return type:
None
- write_pir(file_name: str, line_wrap: int = 50) None
Write alignment to file in the PIR file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 50)
- Return type:
None
- write_fasta(file_name: str, line_wrap: int = 80) None
Write alignment to file in the FASTA alignment file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 80)
- Return type:
None
- print_clustal(line_wrap: int = 80) None
Print alignment to console in the clustal file format.
- Parameters:
line_wrap (int) – Characters per line (default 80)
- Return type:
None
- write_clustal(file_name: str, line_wrap: int = 50) None
Write alignment to file in the clustal file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 50)
- Return type:
None
- remove_redundant_gaps() None
Remove gaps in the alignment that are present in every column.
- Return type:
None
- replace_sequence(seq_name: str, new_sequence: str) None
Targeted replacement of sequence in alignment.
- Parameters:
seq_name (str) – The identifier of the sequence that will be replaced.
new_sequence (str) – The new sequence.
Notes
This replacement is designed to introduce missing residues from template structures into the alignment and therefore has very strict requirements. The new and old sequence have to be identical, except that the new sequence might contain unmodelled residues. These are indicated by the letter ‘X’ in the new sequence, and will result in a gap ‘-’ in the alignment after replacement. It is important that all unmodelled residues, even at the start or end of the template sequence, are correctly labeled as ‘X’.
Examples
>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> replacement_seq1 = 'AAAAXXXXXDDD'
>>> replacement_seq3 = 'AAXXXXDD'
>>> aln.replace_sequence('seq1', replacement_seq1)
>>> aln.print_clustal()
seq1        AAAA-----DDD
seq2        AAAAEEEEDDDD
seq3        AAAA----DDDD
>>> aln.replace_sequence('seq3', replacement_seq3)
>>> aln.print_clustal()
seq1        AAAA-----DDD
seq2        AAAAEEEEDDDD
seq3        AA--------DD
- calc_identity(sequence_name_1: str, sequence_name_2: str) float
Calculate sequence identity between two sequences in the alignment.
- Parameters:
sequence_name_1 (str) – Sequence pair to calculate identity for
sequence_name_2 (str) – Sequence pair to calculate identity for
- Returns:
identity – Sequence identity between the two sequences
- Return type:
float
Notes
There are multiple ways of calculating sequence identity, which can be useful in different situations. Implemented here is one that makes a lot of sense for evaluating templates for homology modelling: the sequence identity is calculated by dividing the number of matches by the length of sequence 1 (mismatches and gaps are handled identically, no gap compression).
\[\text{seqid} = \frac{\text{matches}}{\text{length}(\text{sequence1})}\]
Examples
Gaps and mismatches are treated equally.
>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_identity('seq1', 'seq2')
66.67
>>> aln.calc_identity('seq1', 'seq3')
66.67
Normalization happens for the length of sequence 1, so the order of sequences matters.
>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_identity('seq1', 'seq2')
66.67
>>> aln.calc_identity('seq2', 'seq1')
100.0
- calc_pairwise_identity_all() Type[pandas.DataFrame]
Calculate identity between all sequences in the alignment.
- Returns:
identities – Dataframe with pairwise sequence identites
- Return type:
pd.DataFrame
Notes
Calculates sequence identity as described for calc_identity:
\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
- calc_identity_target(sequence_name: str) Type[pandas.DataFrame]
Calculate identity of all sequences in the alignment to specified target sequence.
- Parameters:
sequence_name (str) – Target sequence
- Returns:
identities – Dataframe with pairwise sequence identities
- Return type:
pd.DataFrame
Notes
Calculates sequence identity as described for calc_identity:
\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
- calc_coverage(sequence_name_1: str, sequence_name_2: str) float
Calculation of coverage of sequence 2 to sequence 1 in the alignment.
- Parameters:
sequence_name_1 (str) – Sequence pair to calculate coverage for
sequence_name_2 (str) – Sequence pair to calculate coverage for
- Returns:
coverage – Coverage of sequence 2 to sequence 1
- Return type:
float
Notes
Coverage in this context means how many of the residues in sequence 1 are aligned to a residue in sequence 2. This is useful for evaluating potential templates, because a low sequence identity (as implemented in homelette) could be caused either by many residues not being aligned at all, or by many residues being aligned, but not with identical residues.
\[\text{coverage} = \frac{\text{aligned residues}}{\text{length}(\text{sequence1})}\]
Examples
Gaps and mismatches are not treated equally.
>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_coverage('seq1', 'seq2')
100.0
>>> aln.calc_coverage('seq1', 'seq3')
66.67
Normalization happens for the length of sequence 1, so the order of sequences matters.
>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_coverage('seq1', 'seq2')
66.67
>>> aln.calc_coverage('seq2', 'seq1')
100.0
- calc_coverage_target(sequence_name: str) Type[pandas.DataFrame]
Calculate coverage of all sequences in the alignment to specified target sequence.
- Parameters:
sequence_name (str) – Target sequence
- Returns:
coverages – Dataframe with pairwise coverage
- Return type:
pd.DataFrame
Notes
Calculates coverage as described for calc_coverage:
\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
- calc_pairwise_coverage_all() Type[pandas.DataFrame]
Calculate coverage between all sequences in the alignment.
- Returns:
coverages – Dataframe with pairwise coverage
- Return type:
pd.DataFrame
Notes
Calculates coverage as described for calc_coverage:
\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
- class homelette.alignment.Sequence(name: str, sequence: str, **kwargs)
Bases:
object
Class that contains individual sequences and miscellaneous information about them.
- Parameters:
name (str) – Identifier of the sequence
sequence (str) – Sequence in 1 letter amino acid code
**kwargs – Annotations, for acceptable keys see
Sequence.annotate()
- Variables:
name (str) – Identifier of the sequence
sequence (str) – Sequence in 1 letter amino acid code
annotation (dict) – Collection of annotation for this sequence
Notes
See Sequence.annotate() for more information on the annotation of sequences.
- annotate(**kwargs: str)
Change annotation for sequence object.
Keywords not specified in the Notes section will be ignored.
- Parameters:
kwargs (str) – Annotations. For acceptable values, see Notes.
- Return type:
None
Notes
Annotations are important for MODELLER in order to properly process alignment in PIR format. The following annotations are supported and can be modified.
seq_type: Specification whether the sequence should be treated as a template (set to ‘structure’) or as a target (set to ‘sequence’)
pdb_code: PDB code corresponding to the sequence (if available)
begin_res: Residue number of the first residue of the sequence in the corresponding PDB file
begin_chain: Chain identifier of the first residue of the sequence in the corresponding PDB file
end_res: Residue number of the last residue of the sequence in the corresponding PDB file
end_chain: Chain identifier of the last residue of the sequence in the corresponding PDB file
prot_name: Protein name (optional)
prot_source: Protein source (optional)
resolution: Resolution of the PDB structure (optional)
R_factor: R-factor of the PDB structure (optional)
Different types of annotations are required depending on whether a target or a template is annotated. For targets, it is sufficient to set seq_type to ‘sequence’. For templates, MODELLER requires that seq_type and pdb_code are annotated; begin_res, begin_chain, end_res and end_chain are recommended. The rest can be left unannotated.
Examples
Annotation for a target sequence.
>>> target = hm.alignment.Sequence(name = 'target', sequence =
...     'TARGET')
>>> target.annotation
{'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
>>> target.annotate(seq_type = 'sequence')
>>> target.annotation
{'seq_type': 'sequence', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
Annotation for a template structure.
>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE')
>>> template.annotation
{'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
>>> template.annotate(seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.annotation
{'seq_type': 'structure', 'pdb_code': 'TMPL', 'begin_res': '1', 'begin_chain': 'A', 'end_res': '8', 'end_chain': 'A', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
- get_annotation_pir() str
Return annotation in the colon-separated format expected from the PIR alignment format used by MODELLER.
- Returns:
Annotation in PIR format
- Return type:
str
Examples
>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.get_annotation_pir()
'structure:TMPL:1:A:8:A::::'
- get_annotation_print() None
Print annotation to console
- Return type:
None
Examples
>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.get_annotation_print()
Sequence Type    structure
PDB ID           TMPL
Start Residue    1
Start Chain      A
End Residue      8
End Chain        A
Protein Name
Protein Source
Resolution
R-Factor
- get_gaps() tuple
Find gap positions in sequence
- Returns:
Positions of gaps in sequence
- Return type:
tuple
Examples
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.get_gaps()
(3, 7, 8)
- remove_gaps(remove_all: bool = False, positions: Optional[Iterable[int]] = None) None
Remove gaps from the sequence.
Gaps in the alignment are symbolized by ‘-’. Removal can either happen at specific or all positions. Indexing for specific positions is zero-based and checked before removal (raises Warning if the attempted removal of a non-gap position is detected)
- Parameters:
remove_all (bool) – Remove all gaps (default False)
positions (iterable) – Positions to remove (zero-based indexing)
- Return type:
None
- Warns:
UserWarning – Specified position is not a gap
Examples
Example 1: remove all
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.remove_gaps(remove_all = True)
>>> seq.sequence
'SEQUENCE'
Example 2: selective removal
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.remove_gaps(positions = (7, 8))
>>> seq.sequence
'SEQ-UENCE'
- class homelette.alignment.AlignmentGenerator(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
ABC
Parent class for the auto-generation of alignments and template selection based on sequence input.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object
- Return type:
None
- abstract get_suggestion()
Generate suggestion for templates and alignment
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignmentGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
AlignmentGenerator
- Raises:
ValueError – Fasta file not properly formatted
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (X-ray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connection.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an iterable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only by their PDB identifier (e.g. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (e.g. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (e.g. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During template processing, all HETATM records will be removed from the template, as well as all other chains. The extracted chain will be renamed to “A” and the residue numbering will be set to start at 1. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
Task
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- class homelette.alignment.AlignmentGenerator_pdb(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
AlignmentGenerator
Identification of templates using the RCSB search API, generation of alignment using Clustal Omega and download and processing of template structures.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object
- Return type:
None
Notes
The AlignmentGenerator uses the RCSB Search API [1] to identify potential template structures for the given target sequence using MMseqs2 [2]. The sequences of the potential templates are then downloaded and aligned locally using Clustal Omega [3] [4].
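A typical workflow with this class might look like the following sketch; the fasta file, cutoff and template identifiers are placeholders, and Tutorial 8 above shows a worked example.
import homelette as hm

# query the RCSB for templates, narrow down the suggestion, process the
# templates and hand everything over to a Task object
gen = hm.alignment.AlignmentGenerator_pdb.from_fasta('ARAF.fa')   # placeholder file
gen.get_suggestion(seq_id_cutoff = 0.7, xray_only = True)
print(gen.show_suggestion(get_metadata = True))
gen.select_templates(['4G0N_1', '3NY5_1'])   # placeholder identifiers
gen.get_pdbs()
t = gen.initialize_task(task_name = 'models_ARAF', overwrite = True)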
- get_suggestion(seq_id_cutoff: float = 0.5, min_length: int = 30, max_results: int = 50, xray_only: bool = True, verbose: bool = True) None
Identifies potential templates, retrieves their sequences and aligns them locally using Clustal Omega.
- Parameters:
seq_id_cutoff (float) – The sequence identity cutoff for the identification of template structures. Templates below this threshold will be ignored (default 0.5).
min_length (int) – The minimum length of template sequence to be included in the results (default 30 amino acids).
max_results (int) – The number of results returned (default 50).
xray_only (bool) – Only consider template structures determined by X-ray crystallography (default True).
verbose (bool) – Explain what is done (default True).
- Return type:
None
- Raises:
RuntimeError – Alignment already generated.
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignmentGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
AlignmentGenerator
- Raises:
ValueError – Fasta file not properly formatted
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only by their PDB identifier (e.g. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (e.g. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (e.g. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During template processing, all HETATM records will be removed from the template, as well as all other chains. The extracted chain will be renamed to “A” and the residue numbering will be set to start at 1. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
Task
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an iterable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (X-ray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connection.
- class homelette.alignment.AlignmentGenerator_hhblits(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
AlignmentGenerator
Identification of templates using hhblits to search a local PDB database, generation of alignment by combining pairwise alignments of target and template together.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object.
- Return type:
None
Notes
HHblits from the HH-suite [5] is used to query the databases. The resulting pairwise sequence alignments of template to target are combined using the target sequence as the master sequence. The resulting alignment is therefore, strictly speaking, not a proper multiple sequence alignment. However, all information from the pairwise alignments is preserved, and for homology modelling, the alignments of the templates among each other have no influence.
- get_suggestion(database_dir: str = './databases/', use_uniref: bool = False, evalue_cutoff: float = 0.001, iterations: int = 2, n_threads: int = 2, neffmax: float = 10.0, verbose: bool = True) None
Use HHblits to identify template structures and create a multiple sequence alignment by combination of pairwise alignments on target sequence.
- Parameters:
database_dir (str) – The directory where the pdb70 (and the UniRef30) database are stored (default ./databases/).
use_uniref (bool) – Use UniRef30 to create an MSA before querying the pdb70 database (default False). This leads to better results; however, it takes longer and requires the UniRef30 database on your system.
evalue_cutoff (float) – E-value cutoff for inclusion in the result alignment (default 0.001)
iterations (int) – Number of iterations when querying the pdb70 database (default 2).
n_threads (int) – Number of threads when querying the pdb70 (or UniRef30) database (default 2).
neffmax (float) – The neffmax value used when querying the pdb70 database (default 10.0).
verbose (bool) – Explain which operations are performed (default True).
- Return type:
None
- Raises:
RuntimeError – Alignment has already been generated.
Notes
This function expects “hhblits” to be installed and in the path. In addition, the pdb70 database needs to be downloaded and extracted in the database_dir. The files need to be called “pdb70_*” for hhblits to correctly find the database. If UniRef30 is used to create a pre-alignment for better results, the UniRef30 database needs to be downloaded and extracted in the database_dir. The files need to be called “UniRef30_*”.
For more information on neffmax, please check the hhblits documentation.
If UniRef30 is used to generate a prealignment, then hhblits will be called for one iteration with standard parameters.
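For orientation, a call might look like the following sketch, mirroring the commented-out cell in Tutorial 8 above; the fasta file is a placeholder and the database directory has to point to your local pdb70 installation.
import homelette as hm

# identify templates with a local hhblits search and process them
gen = hm.alignment.AlignmentGenerator_hhblits.from_fasta('ARAF.fa')   # placeholder file
gen.get_suggestion(database_dir = './databases/', use_uniref = False, n_threads = 4)
gen.show_suggestion()
gen.get_pdbs()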
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignmentGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
AlignmentGenerator
- Raises:
ValueError – Fasta file not properly formatted
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only by their PDB identifier (e.g. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (e.g. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (e.g. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During template processing, all HETATM records will be removed from the template, as well as all other chains. The extracted chain will be renamed to “A” and the residue numbering will be set to start at 1. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
Task
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an iterable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (X-ray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connection.
- class homelette.alignment.AlignmentGenerator_from_aln(alignment_file: str, target: str, template_location: str = './templates/', file_format: str = 'fasta')
Bases:
AlignmentGenerator
Reads an alignment from file into the AlignmentGenerator workflow.
- Parameters:
alignment_file (str) – The file to read the alignment from.
target (str) – The name of the target sequence in the alignment.
template_location (str) – Directory where processed templates will be stored (default ‘./templates/’).
file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state (dict) – Dictionary describing the state of the AlignmentGenerator object.
- Return type:
None
Notes
Useful for making use of the PDB download and processing functions that come with the AlignmentGenerator classes.
- get_suggestion()
Not implemented, since alignment is read from file on initialization.
- Raises:
NotImplementedError –
- from_fasta(*args, **kwargs)
Not implemented, since alignment is read from file on initialization.
- Raises:
NotImplementedError –
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only by their PDB identifier (e.g. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (e.g. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (e.g. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During template processing, all HETATM records will be removed from the template, as well as all other chains. The extracted chain will be renamed to “A” and the residue numbering will be set to start at 1. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
Task
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an iterable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (X-ray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connection.
- homelette.alignment.assemble_complex_aln(*args: Type[Alignment], names: dict) Type[Alignment]
Assemble complex alignments compatible with MODELLER from individual alignments.
- Parameters:
*args (Alignment) – The input alignments
names (dict) – Dictionary instructing how sequences in the different alignment objects are supposed to be arranged in the complex alignment. The keys are the names of the sequences in the output alignment. The values are iterables of the sequence names from the input alignments in the order they are supposed to appear in the output alignment. Any value that cannot be found in the alignment signals that this position in the complex alignment should be filled with gaps.
- Returns:
Assembled complex alignment
- Return type:
Alignment
Examples
>>> aln1 = hm.Alignment(None)
>>> aln1.sequences = {
...     'seq1_1': hm.alignment.Sequence('seq1_1', 'HELLO'),
...     'seq2_1': hm.alignment.Sequence('seq2_1', 'H---I'),
...     'seq3_1': hm.alignment.Sequence('seq3_1', '-HI--')
...     }
>>> aln2 = hm.Alignment(None)
>>> aln2.sequences = {
...     'seq2_2': hm.alignment.Sequence('seq2_2', 'KITTY'),
...     'seq1_2': hm.alignment.Sequence('seq1_2', 'WORLD')
...     }
>>> names = {'seq1': ('seq1_1', 'seq1_2'),
...          'seq2': ('seq2_1', 'seq2_2'),
...          'seq3': ('seq3_1', 'gaps')
...          }
>>> aln_assembled = hm.alignment.assemble_complex_aln(
...     aln1, aln2, names=names)
>>> aln_assembled.print_clustal()
seq1        HELLO/WORLD
seq2        H---I/KITTY
seq3        -HI--/-----
homelette.routines
The homelette.routines submodule contains classes for model generation.
Routines are the building blocks that are used to generate homology models.
Currently, a number of pre-implemented routines based on MODELLER, altMOD and ProMod3 are available. It is possible to implement custom routines for model generation and use them in the homelette framework.
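As a rough sketch (not taken from the homelette source), a custom routine subclasses homelette.routines.Routine, implements generate_models, and registers the resulting files as Model objects. The constructor signature and the use of self.models mirror the pre-implemented routines documented below, but should be treated as assumptions and checked against Tutorial 4.
import shutil
import homelette as hm
from homelette.organization import Model

class Routine_copy_template(hm.routines.Routine):
    '''
    Toy sketch of a custom routine: "models" the target by simply copying
    the first template structure. For illustration only.
    '''
    def __init__(self, alignment, target, templates, tag):
        # assumption: the base class accepts these four arguments,
        # as the pre-implemented routines do
        super().__init__(alignment, target, templates, tag)
        self.routine = 'copy_template'

    def generate_models(self):
        # assumption: template PDB files are available in the current
        # working directory when the routine is executed by a Task, and
        # generated models are collected in self.models as Model objects
        model_file = f'{self.tag}_1.pdb'
        shutil.copyfile(f'{list(self.templates)[0]}.pdb', model_file)
        self.models.append(Model(model_file, self.tag, self.routine))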
Tutorials
The basics of generating homology models with pre-implemented modelling routines are presented in Tutorial 2. Complex modelling with homelette is introduced in Tutorial 6. Implementing custom modelling routines is discussed in Tutorial 4. Assembling custom pipelines is discussed in Tutorial 7.
Classes
The following standard modelling routines are implemented:
Modelling routines for loop modelling:
Specifically for the modelling of complex structures, the following routines are implemented:
- class homelette.routines.Routine_automodel_default(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling using the automodel class from modeller with a default parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation (default 1)
n_models (int) – Number of models generated (default 1)
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class: modeller.automodel.automodel
library_schedule: modeller.automodel.autosched.normal
md_level: modeller.automodel.refine.very_fast
max_var_iterations: 200
repeat_optimization: 1
- generate_models() None
Generate models with the parameter set automodel_default.
- Return type:
None
- class homelette.routines.Routine_automodel_slow(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling using the automodel class from modeller with a slow parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation
n_models (int) – Number of models generated
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class: modeller.automodel.automodel
library_schedule: modeller.automodel.autosched.slow
md_level: modeller.automodel.refine.very_slow
max_var_iterations: 400
repeat_optimization: 3
- generate_models() None
Generate models with the parameter set automodel_slow.
- Return type:
None
- class homelette.routines.Routine_altmod_default(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling using the Automodel_statistical_potential class from altmod with a default parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation
n_models (int) – Number of models generated
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (list) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class: altmod.Automodel_statistical_potential
library_schedule: modeller.automodel.autosched.normal
md_level: modeller.automodel.refine.very_fast
max_var_iterations: 200
repeat_optimization: 1
Automodel_statistical_potential uses the DOPE potential for model refinement.
- generate_models() None
Generate models with the parameter set altmod_default.
- Return type:
None
- class homelette.routines.Routine_altmod_slow(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling using the Automodel_statistical_potential class from altmod with a slow parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation
n_models (int) – Number of models generated
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (list) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class: altmod.Automodel_statistical_potential
library_schedule: modeller.automodel.autosched.slow
md_level: modeller.automodel.refine.very_slow
max_var_iterations: 400
repeat_optimization: 3
Autmodel_statistical_potential uses the DOPE potential for model refinement.
- generate_models() None
Generate models with the parameter set altmod_slow.
- Return type:
None
- class homelette.routines.Routine_promod3(alignment: Type[Alignment], target: str, templates: Iterable, tag: str)
Class for performing homology modelling using the ProMod3 engine with default parameters.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier of the template used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier of the template used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
routine (str) – The identifier associated with this specific routine: promod3
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
ValueError – Number of given templates is not 1
- generate_models() None
Generate models with the ProMod3 engine with default parameters.
- Return type:
None
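A comparable sketch for ProMod3, using the same placeholder names as above; note that exactly one template identifier is expected, otherwise a ValueError is raised.
from homelette.routines import Routine_promod3
routine = Routine_promod3(
    alignment=aln,
    target='target',
    templates=['template_1'],      # exactly one template
    tag='promod3_1')
routine.generate_models()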
- class homelette.routines.Routine_loopmodel_default(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, loop_selections: Iterable, n_models: int = 1, n_loop_models: int = 1)
Class for performing homology loop modelling using the loopmodel class from modeller with a default parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
loop_selections (Iterable) – Selection(s) which should be refined with loop modelling, in modeller format (example: [['18:A', '22:A'], ['29:A', '33:A']])
n_models (int) – Number of models generated (default 1)
n_loop_models (int) – Number of loop models generated for each model (default 1)
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
loop_selections (Iterable) – Selection(s) which should be refined with loop modelling
n_models (int) – Number of models generated
n_loop_models (int) – Number of loop models generated for each model
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
loop_selections
n_models
n_loop_models
The following modelling parameters are set for this class:
model_class – modeller.automodel.LoopModel
library_schedule – modeller.automodel.autosched.normal
md_level – modeller.automodel.refine.very_fast
max_var_iterations – 200
repeat_optimization – 1
loop_library_schedule – modeller.automodel.autosched.loop
loop_md_level – modeller.automodel.refine.slow
loop_max_var_iterations – 200
n_threads – 1
- generate_models() None
Generate models with the parameter set loopmodel_default.
- Return type:
None
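A hedged sketch illustrating the loop_selections format with the example regions from above (residues 18–22 and 29–33 of chain A); all other identifiers are placeholders.
from homelette.routines import Routine_loopmodel_default
routine = Routine_loopmodel_default(
    alignment=aln,
    target='target',
    templates=['template_1'],
    tag='loopmodel_1',
    loop_selections=[['18:A', '22:A'], ['29:A', '33:A']],  # two loop regions in chain A
    n_models=2,          # 2 base models
    n_loop_models=3)     # 3 loop models per base model
routine.generate_models()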
- class homelette.routines.Routine_loopmodel_slow(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, loop_selections: Iterable, n_models: int = 1, n_loop_models: int = 1)
Class for performing homology loop modelling using the loopmodel class from modeller with a slow parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
loop_selections (Iterable) – Selection(s) which should be refined with loop modelling, in modeller format (example: [['18:A', '22:A'], ['29:A', '33:A']])
n_models (int) – Number of models generated (default 1)
n_loop_models (int) – Number of loop models generated for each model (default 1)
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
loop_selections (Iterable) – Selection(s) which should be refined with loop modelling
n_models (int) – Number of models generated
n_loop_models (int) – Number of loop models generated for each model
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
loop_selections
n_models
n_loop_models
The following modelling parameters are set for this class:
model_class – modeller.automodel.LoopModel
library_schedule – modeller.automodel.autosched.slow
md_level – modeller.automodel.refine.very_slow
max_var_iterations – 400
repeat_optimization – 3
loop_library_schedule – modeller.automodel.autosched.slow
loop_md_level – modeller.automodel.refine.very_slow
loop_max_var_iterations – 400
n_threads – 1
- generate_models() None
Generate models with the parameter set loopmodel_slow.
- Return type:
None
- class homelette.routines.Routine_complex_automodel_default(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling of complexes using the automodel class from modeller with a default parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation (default 1)
n_models (int) – Number of models generated (default 1)
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class – modeller.automodel.automodel
library_schedule – modeller.automodel.autosched.normal
md_level – modeller.automodel.refine.very_fast
max_var_iterations – 200
repeat_optimization – 1
- generate_models() None
Generate complex models with the parameter set automodel_default.
- Return type:
None
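A minimal sketch for complex modelling: aln_complex is assumed to be an Alignment covering all chains of the assembly, and the identifiers are placeholders.
from homelette.routines import Routine_complex_automodel_default
routine = Routine_complex_automodel_default(
    alignment=aln_complex,
    target='target_complex',
    templates=['template_complex'],
    tag='complex_automodel_1',
    n_threads=4,
    n_models=20)
routine.generate_models()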
- class homelette.routines.Routine_complex_automodel_slow(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling of complexes using the automodel class from modeller with a slow parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation (default 1)
n_models (int) – Number of models generated (default 1)
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (Iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class – modeller.automodel.automodel
library_schedule – modeller.automodel.autosched.slow
md_level – modeller.automodel.refine.very_slow
max_var_iterations – 400
repeat_optimization – 3
- generate_models() None
Generate complex models with the parameter set automodel_slow.
- Return type:
None
- class homelette.routines.Routine_complex_altmod_default(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling of complexes using the Automodel_statistical_potential class from altmod with a default parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation
n_models (int) – Number of models generated
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (list) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class – altmod.Automodel_statistical_potential
library_schedule – modeller.automodel.autosched.normal
md_level – modeller.automodel.refine.very_fast
max_var_iterations – 200
repeat_optimization – 1
Automodel_statistical_potential uses the DOPE potential for model refinement.
- generate_models() None
Generate complex models with the parameter set altmod_default.
- Return type:
None
- class homelette.routines.Routine_complex_altmod_slow(alignment: Type[Alignment], target: str, templates: Iterable, tag: str, n_threads: int = 1, n_models: int = 1)
Class for performing homology modelling of complexes using the Automodel_statistical_potential class from altmod with a slow parameter set.
- Parameters:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (iterable) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used in model generation
n_models (int) – Number of models generated
- Variables:
alignment (Alignment) – The alignment object that will be used for modelling
target (str) – The identifier of the protein to model
templates (list) – The iterable containing the identifier(s) of the template(s) used for the modelling
tag (str) – The identifier associated with a specific execution of the routine
n_threads (int) – Number of threads used for model generation
n_models (int) – Number of models generated
routine (str) – The identifier associated with a specific routine
models (list) – List of models generated by the execution of this routine
- Raises:
ImportError – Unable to import dependencies
Notes
The following modelling parameters can be set when initializing this Routine object:
n_models
n_threads
The following modelling parameters are set for this class:
model_class – altmod.Automodel_statistical_potential
library_schedule – modeller.automodel.autosched.slow
md_level – modeller.automodel.refine.very_slow
max_var_iterations – 400
repeat_optimization – 3
Automodel_statistical_potential uses the DOPE potential for model refinement.
- generate_models() None
Generate complex models with the parameter set altmod_slow.
- Return type:
None
homelette.evaluation
The homelette.evaluation submodule contains different classes for evaluating homology models.
It is possible to implement custom Evaluation building blocks and use them in the homelette framework.
Tutorials
Working with model evaluations in homelette is discussed in detail in Tutorial 3. Implementing custom evaluation metrics is discussed in Tutorial 4. Assembling custom pipelines is discussed in Tutorial 7.
Classes
The following evaluation metrics are implemented:
- class homelette.evaluation.Evaluation_dope(model: Type[Model], quiet: bool = False)
Class for evaluating a model with DOPE score.
Will dump the following entries to the model.evaluation dictionary:
dope
dope_z_score
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
Notes
DOPE is a statistical potential for the evaluation of homology models [1]. For further information, please check the modeller documentation or the associated publication.
References
- evaluate() None
Run DOPE evaluation. Automatically called on object initialization
- Return type:
None
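Since evaluations run on initialization, usage reduces to constructing the object. A minimal sketch, assuming model is a homelette Model object (for example taken from routine.models):
from homelette.evaluation import Evaluation_dope
Evaluation_dope(model, quiet=True)   # evaluation runs during initialization
print(model.evaluation['dope'], model.evaluation['dope_z_score'])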
- class homelette.evaluation.Evaluation_soap_protein(model: Type[Model], quiet: bool = False)
Class for evaluating a model with the SOAP protein potential.
Will dump the following entries to the model.evaluation dictionary:
soap_protein
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
Notes
SOAP is a statistical potential for evaluating homology models [2]. For more information, please check the modeller and SOAP documentation or the associated publication.
References
- evaluate() None
Run SOAP protein evaluation. Automatically called on object initialization
- Return type:
None
- class homelette.evaluation.Evaluation_soap_pp(model: Type[Model], quiet: bool = False)
Class for evaluating a model with SOAP interaction potentials. This is used for the evaluation of models of protein complexes.
Will dump the following entries to the model.evaluation dictionary:
soap_pp_all
soap_pp_atom
soap_pp_pair
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
Notes
SOAP is a statistical potential for evaluating homology models [3]. For more information, please check the modeller and SOAP documentation or the associated publication.
References
- evaluate() None
Run SOAP interaction evaluation. Automatically called on object initialization
- Return type:
None
- class homelette.evaluation.Evaluation_qmean4(model: Type[Model], quiet: bool = False)
Class for evaluating a model with the QMEAN4 potential.
Will dump the following entries to the model.evaluation dictionary:
qmean4
qmean4_z_score
- Parameters:
model (Model) – The model object to evaluate.
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
See also
Notes
QMEAN is a statistical potential for evaluating homology models [4] [5].
Briefly, QMEAN is a combination of different components. Four components (interaction, cbeta, packing and torsion) form the qmean4 score.
For more information, please check the QMEAN documentation or the associated publications.
References
- evaluate() None
Run QMEAN4 protein evaluation. Automatically called on object initialization
- Return type:
None
- class homelette.evaluation.Evaluation_qmean6(model: Type[Model], quiet: bool = False)
Class for evaluating a model with the QMEAN6 potential.
Will dump the following entries to the model.evaluation dictionary:
qmean6
qmean6_disco
Requires the following valid entries in the model.info dictionary:
accpro_file (.acc file)
psipred_file (.horiz file)
- Parameters:
model (Model) – The model object to evaluate.
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
See also
Notes
QMEAN is a statistical potential for evaluating homology models [6] [7].
QMEAN6 is a combination of six different components (interaction, cbeta, packing, torsion, ss_agreement, acc_agreement). It extends the QMEAN4 score by additionally evaluating the agreement of the model with secondary structure predictions from PSIPRED [8] and solvent accessibility predictions from ACCpro [9].
For more information, please check the QMEAN documentation or the associated publications.
References
- evaluate() None
Run QMEAN6 protein evaluation. Automatically called on object initialization
- Return type:
None
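A minimal sketch; the prediction file paths are hypothetical and have to point to an existing PSIPRED .horiz file and ACCpro .acc file for the target.
from homelette.evaluation import Evaluation_qmean6
model.info['psipred_file'] = 'predictions/target.horiz'   # hypothetical path
model.info['accpro_file'] = 'predictions/target.acc'      # hypothetical path
Evaluation_qmean6(model)
print(model.evaluation['qmean6'], model.evaluation['qmean6_disco'])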
- class homelette.evaluation.Evaluation_qmeandisco(model: Type[Model], quiet: bool = False)
Class for evaluating a model with the QMEAN DisCo potential.
Will dump the following entries to the model.evaluation dictionary:
qmean6
qmean6_z_score
qmean_local_scores_avg
qmean_local_scores_err
Requires the following valid entries in the model.info dictionary:
accpro_file (.acc file)
psipred_file (.horiz file)
disco_file (generated by qmean.DisCoContainer.Save)
- Parameters:
model (Model) – The model object to evaluate.
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- Raises:
ImportError – Unable to import dependencies
See also
Notes
QMEAN is a statistical potential for evaluating homology models [10] [11].
QMEAN DisCo is an extension of QMEAN that includes homology-derived DIStance COnstraints [12]. These distance constraints do not influence the six components of the QMEAN6 score (interaction, cbeta, packing, torsion, ss_agreement, acc_agreement), but only the local scores.
The distance constraints for the target have to be generated beforehand and saved to a file.
For more information, please check the QMEAN documentation or the associated publications.
References
- evaluate() None
Run QMEAN DisCo protein evaluation. Automatically called on object initialization
- Return type:
None
- class homelette.evaluation.Evaluation_mol_probity(model: Type[Model], quiet: bool = False)
Class for evaluating a model with the MolProbity validation service.
Will dump the following entries to the model.evaluation dictionary:
mp_score
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
Notes
Molprobity is a program that evaluates the quality of 3D structures of proteins based on structural features [13] [14] [15]. For more information, please check the MolProbity webpage or the associated publications.
References
- evaluate() None
Run MolProbity evaluation. Automatically called on object initialization
- Return type:
None
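As a sketch of combining several of the metrics above, the evaluation classes can simply be applied in a loop over the models produced by a routine; every class writes its results into the respective model.evaluation dictionary. Here, routine is assumed to be a previously executed Routine object.
from homelette.evaluation import Evaluation_dope, Evaluation_soap_protein, Evaluation_mol_probity
for model in routine.models:   # models generated by a previously run routine
    for Evaluation in (Evaluation_dope, Evaluation_soap_protein, Evaluation_mol_probity):
        Evaluation(model, quiet=True)
    print(model.evaluation)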
homelette.pdb_io
The homelette.pdb_io submodule contains an object for parsing and manipulating PDB files. There are several constructor functions that can read PDB files or download them from the internet.
Functions and classes
Functions and classes present in homelette.pdb_io are listed below:
- homelette.pdb_io.read_pdb(file_name: str) PdbObject
Reads PDB from file.
- Parameters:
file_name (str) – PDB file name
- Return type:
PdbObject
Notes
If a PDB file with multiple MODELs is read, only the first model will be conserved.
- homelette.pdb_io.download_pdb(pdbid: str) PdbObject
Download PDB from the RCSB.
- Parameters:
pdbid (str) – PDB identifier
- Return type:
PdbObject
Notes
If a PDB file with multiple MODELs is read, only the first model will be conserved.
- class homelette.pdb_io.PdbObject(lines: Iterable)
Object encapsulating functionality regarding the processing of PDB files
- Parameters:
lines (Iterable) – The lines of the PDB
- Variables:
lines – The lines of the PDB, filtered for ATOM and HETATM records
- Return type:
None
See also
Notes
Please construct instances of PdbObject using the constructor functions.
If a PDB file with multiple MODELs is read, only the first model will be conserved.
- write_pdb(file_name) None
Write PDB to file.
- Parameters:
file_name (str) – The name of the file to write the PDB to.
- Return type:
None
- parse_to_pd() pandas.DataFrame
Parses PDB to pandas dataframe.
- Return type:
pd.DataFrame
Notes
Information is extracted according to the PDB file specification (version 3.30) and columns are named accordingly. See https://www.wwpdb.org/documentation/file-format for more information.
- get_sequence(ignore_missing: bool = True) str
Retrieve the 1-letter amino acid sequence of the PDB, grouped by chain.
- Parameters:
ignore_missing (bool) – Changes behaviour with regard to unmodelled residues. If True, they will be ignored when generating the sequence (default). If False, they will be represented in the sequence by the character X.
- Returns:
Amino acid sequence
- Return type:
str
- get_chains() list
Extract all chains present in the PDB.
- Return type:
list
- transform_extract_chain(chain) PdbObject
Extract chain from PDB.
- Parameters:
chain (str) – The chain ID to be extracted.
- Return type:
PdbObject
- transform_renumber_residues(starting_res: int = 1) PdbObject
Renumber residues in PDB.
- Parameters:
starting_res (int) – Residue number to start renumbering at (default 1)
- Return type:
PdbObject
Notes
Missing residues in the PDB (i.e. unmodelled) will not be considered in the renumbering. If multiple chains are present in the PDB, numbering will be continued from one chain to the next one.
- transform_change_chain_id(new_chain_id) PdbObject
Replace chain ID for every entry in PDB.
- Parameters:
new_chain_id (str) – New chain ID.
- Return type:
PdbObject
- transform_filter_res_name(selection: Iterable, mode: str = 'out') PdbObject
Filter PDB by residue name.
- Parameters:
selection (Iterable) – For which residue names to filter
mode (str) – Filtering mode. If mode = “out”, the selection will be filtered out (default). If mode = “in”, everything except the selection will be filtered out.
- Return type:
PdbObject
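A minimal sketch of the constructor functions and transformations described above; the PDB identifier and file names are placeholders.
import homelette.pdb_io as pdb_io
pdb = pdb_io.download_pdb('1abc')                  # placeholder PDB identifier
print(pdb.get_chains())                            # chains present in the structure
print(pdb.get_sequence())                          # 1-letter sequence, grouped by chain
chain_a = pdb.transform_extract_chain('A').transform_renumber_residues(starting_res=1)
chain_a.write_pdb('1abc_A.pdb')                    # write the processed chain to disk
df = pdb_io.read_pdb('1abc_A.pdb').parse_to_pd()   # ATOM/HETATM records as a DataFrame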
Extensions
homelette can be extended by new building blocks. This section introduces how extensions work, and where to find them.
homelette Extensions
Extensions are homology modelling building blocks (model generation, model evaluation) that are developed by users and expand the homelette interface. homelette can and should be extended by custom Routines and Evaluations. We strongly encourage users to share extensions they have found useful with the community.
Using Extensions
Extensions are placed in the extension folder in the homelette package. The extension folder on your device can be found in the following way:
import homelette.extension as ext
print(ext.__file__)
After an extension has been placed in the extension folder, it can be used as follows:
import homelette.extension.your_extension as ext_1
Submitting Extensions
Please contact us with a Pull Request on GitHub or via Email (philipp.junk@ucdconnect.ie) if you want to share your extension! Please make sure your extension is sufficiently annotated for others to use, in particular mentioning dependencies or other requirements.
Existing Extensions
The following extensions have already been implemented. They should already be included in the latest version of homelette. If not, they are available from our GitHub page.
FoldX extension to homelette
Philipp Junk, 2021
This extension contains evaluation metrics based on FoldX, a force field for energy calculation and protein design (https://foldxsuite.crg.eu/) [1] [2].
Usage
import homelette.extension.extension_foldx as extension_foldx
help(extension_foldx.Evaluation_foldx_stability)
This extension expects FoldX to be installed and in your path.
Functions and classes
- Currently contains the following items:
Evaluation_foldx_repairmodels
Evaluation_foldx_interaction
Evaluation_foldx_stability
Evaluation_foldx_alascan_buildmodels
Evaluation_foldx_alascan_interaction
- class homelette.extension.extension_foldx.Evaluation_foldx_repairmodels(model, quiet=False)
Creates a modified version of the PDB and runs RepairPDB on it
Will not dump an entry to the model.evaluation dictionary
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
Notes
Most PDBs work fine with FoldX. For a specific use case in which I was working with GTP heteroatoms, I had to rename a few atoms to make the PDB compliant with FoldX.
- evaluate()
Repairs models with FoldX. Automatically called on object initialization
- Return type:
None
- class homelette.extension.extension_foldx.Evaluation_foldx_interaction(model, quiet=False)
Calculates interaction energy with FoldX
Requires a protein-protein complex. Expects Evaluation_foldx_repairmodels to have been performed beforehand.
Will dump the following entries to the model.evaluation dictionary:
foldx_interaction
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- evaluate()
Calculates protein interaction energy with FoldX. Automatically called on object initialization.
- Return type:
None
- class homelette.extension.extension_foldx.Evaluation_foldx_stability(model, quiet=False)
Calculate protein stability with FoldX
Expects Evaluation_foldx_repairmodels to have been performed beforehand.
Will dump the following entries to the model.evaluation dictionary:
foldx_stability
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
- evaluate()
Calculates protein stability with FoldX. Automatically called on object initialization.
- Return type:
None
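A sketch of how these building blocks might be chained on a single model, respecting the expectation that Evaluation_foldx_repairmodels runs first; model is assumed to be a homelette Model of a protein-protein complex.
import homelette.extension.extension_foldx as extension_foldx
extension_foldx.Evaluation_foldx_repairmodels(model)   # repair the PDB first
extension_foldx.Evaluation_foldx_stability(model)
extension_foldx.Evaluation_foldx_interaction(model)    # requires a protein-protein complex
print(model.evaluation['foldx_stability'], model.evaluation['foldx_interaction'])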
- class homelette.extension.extension_foldx.Evaluation_foldx_alascan_buildmodels(model, quiet=False)
Generates alanine point mutations for all positions in the given model using FoldX.
Expects Evaluation_foldx_repairmodels to have been performed beforehand.
Will not dump an entry to the model.evaluation dictionary.
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
See also
Notes
This Evaluation is very RAM intensive, so expect to run only 1 or 2 threads in parallel.
- evaluate()
Generates alanine point mutations for all positions in the given model. Automatically called on object initialization.
- Return type:
None
- class homelette.extension.extension_foldx.Evaluation_foldx_alascan_interaction(model, quiet=False)
Calculates protein interaction energy with FoldX for all alanine point mutations generated by Evaluation_foldx_alascan_buildmodels.
Expects Evaluation_foldx_alascan_buildmodels to have been run beforehand.
Will dump the following entry to the model.evaluation dictionary:
foldx_alascan: Dictionary of interaction energies for all alanine scan mutations
- Parameters:
model (Model) – The model object to evaluate
quiet (bool) – If True, will perform evaluation while suppressing stdout (default False). Needs to be False for running it asynchronously, as done when running Task.evaluate_models with multiple cores
- Variables:
model (Model) – The model object to evaluate
output (dict) – Dictionary that all outputs will be dumped into
See also
- evaluate()
Calculates protein interaction energy with FoldX for all alanine point mutations generated by Evaluation_foldx_alascan_buildmodels.
- Return type:
None