Tutorial 5: Parallelization
[1]:
import homelette as hm
import time
Introduction
Welcome to the fifth tutorial on homelette. This tutorial is about parallelization in homelette. When generating hundreds or thousands of models, some processes can be significantly sped up by distributing the workload across multiple processes in parallel (given appropriate hardware). In homelette, both model generation and model evaluation can be parallelized.
Alignment and Task setup
For this tutorial, we use the same alignment as in Tutorial 1. As in previous tutorials, the alignment is imported and annotated, and a Task object is set up.
[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')

# annotate the alignment
aln.get_sequence('ARAF').annotate(
    seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
    seq_type = 'structure',
    pdb_code = '3NY5',
    begin_res = '1',
    begin_chain = 'A',
    end_res = '81',
    end_chain = 'A')

# initialize task object
t = hm.Task(
    task_name = 'Tutorial5',
    target = 'ARAF',
    alignment = aln,
    overwrite = True)
Parallel model generation
For parallel model generation, homelette relies on the parallelization methods implemented in the underlying packages, where available. Model generation with modeller can be parallelized, and homelette exposes this through a simple handler [1,2].
All pre-implemented, modeller-based routines accept the argument n_threads. The default, n_threads = 1, disables parallelization; any value greater than 1 distributes the workload over the requested number of threads using the modeller.parallel submodule.
[3]:
# use only 1 thread to generate 20 models
start = time.perf_counter()
t.execute_routine(
    tag = '1_thread',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 47.84
[4]:
# use 4 threads to generate 20 models faster
start = time.perf_counter()
t.execute_routine(
    tag = '4_threads',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20,
    n_threads = 4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.44
Using multiple threads can significantly speed up model generation, especially if a large number of models is generated.
Note
Please be aware that the modeller.parallel submodule uses the Python module pickle, which requires that the objects to be pickled are importable from a file. In practical terms, if you want to use a custom object (i.e. a custom defined routine, see Tutorial 4) with parallelization in modeller, you have to import it from a separate file. We therefore recommend saving custom routines and evaluations in a separate file and importing them from there.
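To see why this matters: pickle serializes functions and classes by reference (module plus qualified name), so an object without an importable name cannot be pickled at all. A minimal standalone illustration using a lambda (which has no importable name):

```python
import pickle

# pickle stores functions and classes by reference (module + qualified
# name), so only objects importable from a file can be serialized.
# A lambda defined on the fly has no importable name and fails:
try:
    pickle.dumps(lambda x: x)
    picklable = True
except (pickle.PicklingError, AttributeError):
    picklable = False

print(picklable)  # False
```

The same lookup-by-name mechanism is why a routine defined interactively in a notebook cannot be shipped to modeller's worker processes, while the identical class imported from a module can.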
The following code block shows how custom building blocks could be put in an external file (data/extension.py) and then imported for modelling and analysis.
[5]:
# import from custom file
from data.extension import Custom_Routine, Custom_Evaluation
?Custom_Routine
Init signature: Custom_Routine()
Docstring: Custom routine waiting to be implemented.
File: ~/workdir/data/extension.py
Type: type
Subclasses:
[6]:
!cat data/extension.py
'''
Examples of custom objects for homelette in an external file.
'''

class Custom_Routine():
    '''
    Custom routine waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')


class Custom_Evaluation():
    '''
    Custom evaluation waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')
Alternatively, you could use the /homelette/extension/ folder in which extensions are stored. See our comments on extensions in our documentation for more details.
Parallel model evaluation
homelette can also use parallelization to speed up model evaluation. Internally, this is achieved with concurrent.futures.ThreadPoolExecutor.
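Conceptually, this corresponds to mapping the evaluation function over all models with a pool of worker threads. A minimal standalone sketch of that pattern (evaluate and the model names here are hypothetical stand-ins, not homelette's API):

```python
from concurrent.futures import ThreadPoolExecutor

def evaluate(model_name):
    # hypothetical stand-in for an evaluation such as Evaluation_dope
    return {'model': model_name, 'score': len(model_name)}

models = [f'model_{i}' for i in range(8)]

# distribute the evaluation calls over 4 worker threads,
# analogous to n_threads = 4 in Task.evaluate_models
with ThreadPoolExecutor(max_workers=4) as executor:
    results = list(executor.map(evaluate, models))

print(len(results))  # 8
```

executor.map preserves the input order of the models in the results, regardless of which thread finishes first.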
To use parallelization when performing evaluations, pass the n_threads argument to Task.evaluate_models.
[7]:
# use 1 thread for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 468.37
[8]:
# use 4 threads for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 128.37
For some evaluation schemes, using parallelization can lead to a significant speedup.
Note
Please be advised that for some (very fast) evaluation methods, the overhead of spawning new worker threads can outweigh the speedup gained by parallelization. Test your use case on your system in a small setting and use at your own discretion.
[9]:
# use 1 thread for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 10.34
[10]:
# use 4 threads for model evaluation
start = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.95
Note
When creating and using custom evaluation metrics, please make sure to avoid race conditions. Task.evaluate_models has some protection against race conditions, but it is not bulletproof. Also, if you need to create temporary files, use model-specific file names (i.e. include the model name in the file name). Defining custom evaluations in a separate file is not necessary here, as parallelization of evaluation methods does not rely on pickle.
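A simple way to keep temporary files from colliding between parallel workers is to build each file name from the model name. A sketch of that pattern (evaluate_model and the .tmp suffix are illustrative, not part of homelette's API):

```python
import os
import tempfile

def evaluate_model(model_name):
    # include the model name in the temporary file name so that
    # parallel workers never write to the same file
    tmp_path = os.path.join(
        tempfile.gettempdir(), f'eval_{model_name}.tmp')
    with open(tmp_path, 'w') as f:
        f.write(model_name)
    # ... run the actual evaluation on tmp_path here ...
    with open(tmp_path) as f:
        result = f.read()
    os.remove(tmp_path)
    return result

print(evaluate_model('model_1'))  # model_1
```

If two threads instead shared one fixed file name, one worker could read or delete the file while another is still using it, producing silently wrong scores.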
Note
In case some custom evaluation metrics are very memory-demanding, running them in parallel can easily overwhelm the system. Again, we encourage you to test your use case on your system in a small setting.
Further reading
Congratulations on completing Tutorial 5 about parallelization in homelette. Please note that there are other tutorials, which will teach you more about how to use homelette:
Tutorial 1: Learn about the basics of homelette.
Tutorial 2: Learn more about already implemented routines for homology modelling.
Tutorial 3: Learn about the evaluation metrics available with homelette.
Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.
Tutorial 6: Learn about modelling protein complexes.
Tutorial 7: Learn about assembling custom pipelines.
Tutorial 8: Learn about automated template identification, alignment generation and template processing.
References
[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626
[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3
Session Info
[11]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
data NA
homelette 1.4
session_info 1.0.0
-----
PIL 7.0.0
altmod NA
anyio NA
asttokens NA
attr 19.3.0
babel 2.12.1
backcall 0.2.0
certifi 2022.12.07
chardet 3.0.4
charset_normalizer 3.1.0
comm 0.1.2
cycler 0.10.0
cython_runtime NA
dateutil 2.8.2
debugpy 1.6.6
decorator 4.4.2
executing 1.2.0
fastjsonschema NA
idna 3.4
importlib_metadata NA
importlib_resources NA
ipykernel 6.21.3
ipython_genutils 0.2.0
jedi 0.18.2
jinja2 3.1.2
json5 NA
jsonschema 4.17.3
jupyter_events 0.6.3
jupyter_server 2.4.0
jupyterlab_server 2.20.0
kiwisolver 1.0.1
markupsafe 2.1.2
matplotlib 3.1.2
modeller 10.4
more_itertools NA
mpl_toolkits NA
nbformat 5.7.3
numexpr 2.8.4
numpy 1.24.2
ost 2.3.1
packaging 20.3
pandas 1.5.3
parso 0.8.3
pexpect 4.8.0
pickleshare 0.7.5
pkg_resources NA
platformdirs 3.1.1
prometheus_client NA
promod3 3.2.1
prompt_toolkit 3.0.38
psutil 5.5.1
ptyprocess 0.7.0
pure_eval 0.2.2
pydev_ipython NA
pydevconsole NA
pydevd 2.9.5
pydevd_file_utils NA
pydevd_plugins NA
pydevd_tracing NA
pygments 2.14.0
pyparsing 2.4.6
pyrsistent NA
pythonjsonlogger NA
pytz 2022.7.1
qmean NA
requests 2.28.2
rfc3339_validator 0.1.4
rfc3986_validator 0.1.1
send2trash NA
sitecustomize NA
six 1.12.0
sniffio 1.3.0
stack_data 0.6.2
swig_runtime_data4 NA
tornado 6.2
traitlets 5.9.0
urllib3 1.26.15
wcwidth NA
websocket 1.5.1
yaml 6.0
zipp NA
zmq 25.0.1
-----
IPython 8.11.0
jupyter_client 8.0.3
jupyter_core 5.2.0
jupyterlab 3.6.1
notebook 6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:56