Tutorial 5: Parallelization

[1]:
import homelette as hm

import time

Introduction

Welcome to the fifth tutorial on homelette. This tutorial is about parallelization in homelette. When generating hundreds or thousands of models, some processes can be sped up significantly by dividing the workload across multiple processes running in parallel (given appropriate hardware).

In homelette, both model generation and model evaluation can be parallelized.

Alignment and Task setup

For this tutorial, we are using the same alignment as in Tutorial 1. As in previous tutorials, the alignment is imported and annotated, and a Task object is set up.

[2]:
# read in the alignment
aln = hm.Alignment('data/single/aln_1.fasta_aln')

# annotate the alignment
aln.get_sequence('ARAF').annotate(
    seq_type = 'sequence')
aln.get_sequence('3NY5').annotate(
    seq_type = 'structure',
    pdb_code = '3NY5',
    begin_res = '1',
    begin_chain = 'A',
    end_res = '81',
    end_chain = 'A')

# initialize task object
t = hm.Task(
    task_name = 'Tutorial5',
    target = 'ARAF',
    alignment = aln,
    overwrite = True)

Parallel model generation

For parallel model generation, homelette relies on the parallelization methods implemented in the underlying modelling packages, where available. Model generation with modeller can be parallelized, and homelette exposes this through a simple handler [1,2].

All pre-implemented, modeller-based routines accept the argument n_threads, which controls parallelization. The default, n_threads = 1, disables parallelization; any value > 1 distributes the workload over the requested number of threads using the modeller.parallel submodule.

[3]:
# use only 1 thread to generate 20 models
start = time.perf_counter()
t.execute_routine(
    tag = '1_thread',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 47.84
[4]:
# use 4 threads to generate 20 models faster
start  = time.perf_counter()
t.execute_routine(
    tag = '4_threads',
    routine = hm.routines.Routine_automodel_default,
    templates = ['3NY5'],
    template_location = './data/single/',
    n_models = 20,
    n_threads = 4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.44

Using multiple threads can significantly speed up model generation, especially if a large number of models is generated.

Note

Please be aware that the modeller.parallel submodule uses the Python module pickle, which requires any object it serializes to be importable from a separate file. In practical terms, if you want to use parallelization in modeller with a custom object (i.e. a custom-defined routine, see Tutorial 4), this only works if the object has been imported from a separate file. We therefore recommend saving custom routines and evaluations in a separate file and importing them from there.

The following code block shows how custom building blocks could be put in an external file (data/extension.py) and then imported for modelling and analysis.

[5]:
# import from custom file
from data.extension import Custom_Routine, Custom_Evaluation

?Custom_Routine
Init signature: Custom_Routine()
Docstring:      Custom routine waiting to be implemented.
File:           ~/workdir/data/extension.py
Type:           type
Subclasses:
[6]:
!cat data/extension.py
'''
Examples of custom objects for homelette in a external file.
'''


class Custom_Routine():
    '''
    Custom routine waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')


class Custom_Evaluation():
    '''
    Custom evaluation waiting to be implemented.
    '''
    def __init__(self):
        print('TODO: implement this')

Alternatively, you could place custom code in the /homelette/extension/ folder, in which extensions are stored. See the comments on extensions in our documentation for more details.

Parallel model evaluation

homelette can also use parallelization to speed up model evaluation. Internally, this is achieved using concurrent.futures.ThreadPoolExecutor.

To parallelize evaluations, set the n_threads argument in Task.evaluate_models.

[7]:
# use 1 thread for model evaluation
start  = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 468.37
[8]:
# use 4 threads for model evaluation
start  = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_mol_probity, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 128.37

For some evaluation schemes, using parallelization can lead to a significant speedup.

Note

Please be advised that for some (very fast) evaluation methods, the overhead of spawning new child processes may outweigh the speedup gained by parallelization. Test your use case on your system in a small setting and use at your own discretion.

[9]:
# use 1 thread for model evaluation
start  = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=1)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 10.34
[10]:
# use 4 threads for model evaluation
start  = time.perf_counter()
t.evaluate_models(hm.evaluation.Evaluation_dope, n_threads=4)
print(f'Elapsed time: {time.perf_counter() - start:.2f}')
Elapsed time: 15.95

Note

When creating and using custom evaluation metrics, please make sure to avoid race conditions. Task.evaluate_models is implemented with a protection against race conditions, but this is not bulletproof. Also, if you need to create temporary files, make sure to give them model-specific names (e.g. by including the model name in the file name). Defining custom evaluations in a separate file is not necessary, as parallelization of evaluation methods does not rely on pickle.
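As a sketch of the naming advice above (the helper name and path scheme are hypothetical, not part of homelette's API), a custom evaluation could derive its temporary file path from the model name so that parallel evaluations never collide:

```python
import os
import tempfile

def temp_path_for_model(model_name):
    '''
    Build a temporary file path that embeds the model name, so that
    evaluations running in parallel on different models never write
    to the same file. (Hypothetical helper, not part of homelette.)
    '''
    safe_name = os.path.basename(model_name).replace('.', '_')
    return os.path.join(tempfile.gettempdir(), f'eval_{safe_name}.tmp')

# Each model gets its own scratch file:
path_1 = temp_path_for_model('Tutorial5_1_thread_1.pdb')
path_2 = temp_path_for_model('Tutorial5_1_thread_2.pdb')
assert path_1 != path_2
```

A shared, fixed file name (e.g. 'eval.tmp') would instead be overwritten concurrently by several threads, corrupting the results.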

Note

If a custom evaluation metric is very memory-demanding, running it in parallel can easily overwhelm the system. Again, we encourage you to test your use case on your system in a small setting.

Further reading

Congratulations on completing Tutorial 5 about parallelization in homelette. Please note that there are other tutorials that will teach you more about how to use homelette:

  • Tutorial 1: Learn about the basics of homelette.

  • Tutorial 2: Learn more about already implemented routines for homology modelling.

  • Tutorial 3: Learn about the evaluation metrics available with homelette.

  • Tutorial 4: Learn about extending homelette’s functionality by defining your own modelling routines and evaluation metrics.

  • Tutorial 6: Learn about modelling protein complexes.

  • Tutorial 7: Learn about assembling custom pipelines.

  • Tutorial 8: Learn about automated template identification, alignment generation and template processing.

References

[1] Šali, A., & Blundell, T. L. (1993). Comparative protein modelling by satisfaction of spatial restraints. Journal of Molecular Biology, 234(3), 779–815. https://doi.org/10.1006/jmbi.1993.1626

[2] Webb, B., & Sali, A. (2016). Comparative Protein Structure Modeling Using MODELLER. Current Protocols in Bioinformatics, 54(1), 5.6.1-5.6.37. https://doi.org/10.1002/cpbi.3

Session Info

[11]:
# session info
import session_info
session_info.show(html = False, dependencies = True)
-----
data                NA
homelette           1.4
session_info        1.0.0
-----
PIL                 7.0.0
altmod              NA
anyio               NA
asttokens           NA
attr                19.3.0
babel               2.12.1
backcall            0.2.0
certifi             2022.12.07
chardet             3.0.4
charset_normalizer  3.1.0
comm                0.1.2
cycler              0.10.0
cython_runtime      NA
dateutil            2.8.2
debugpy             1.6.6
decorator           4.4.2
executing           1.2.0
fastjsonschema      NA
idna                3.4
importlib_metadata  NA
importlib_resources NA
ipykernel           6.21.3
ipython_genutils    0.2.0
jedi                0.18.2
jinja2              3.1.2
json5               NA
jsonschema          4.17.3
jupyter_events      0.6.3
jupyter_server      2.4.0
jupyterlab_server   2.20.0
kiwisolver          1.0.1
markupsafe          2.1.2
matplotlib          3.1.2
modeller            10.4
more_itertools      NA
mpl_toolkits        NA
nbformat            5.7.3
numexpr             2.8.4
numpy               1.24.2
ost                 2.3.1
packaging           20.3
pandas              1.5.3
parso               0.8.3
pexpect             4.8.0
pickleshare         0.7.5
pkg_resources       NA
platformdirs        3.1.1
prometheus_client   NA
promod3             3.2.1
prompt_toolkit      3.0.38
psutil              5.5.1
ptyprocess          0.7.0
pure_eval           0.2.2
pydev_ipython       NA
pydevconsole        NA
pydevd              2.9.5
pydevd_file_utils   NA
pydevd_plugins      NA
pydevd_tracing      NA
pygments            2.14.0
pyparsing           2.4.6
pyrsistent          NA
pythonjsonlogger    NA
pytz                2022.7.1
qmean               NA
requests            2.28.2
rfc3339_validator   0.1.4
rfc3986_validator   0.1.1
send2trash          NA
sitecustomize       NA
six                 1.12.0
sniffio             1.3.0
stack_data          0.6.2
swig_runtime_data4  NA
tornado             6.2
traitlets           5.9.0
urllib3             1.26.15
wcwidth             NA
websocket           1.5.1
yaml                6.0
zipp                NA
zmq                 25.0.1
-----
IPython             8.11.0
jupyter_client      8.0.3
jupyter_core        5.2.0
jupyterlab          3.6.1
notebook            6.5.3
-----
Python 3.8.10 (default, Nov 14 2022, 12:59:47) [GCC 9.4.0]
Linux-4.15.0-206-generic-x86_64-with-glibc2.29
-----
Session information updated at 2023-03-15 23:56