homelette.alignment

The homelette.alignment submodule contains a selection of tools for handling sequences and alignments, as well as for the automatic generation of sequences from a target sequence.

Tutorials

Basic handing of alignments with homelette is demonstrated in Tutorial 1. The assembling of alignments for complex modelling is discussed in Tutorial 6. The automatic generation of alignments is shown in Tutorial 8.

Functions and classes

Functions and classes present in homelette.alignment are listed below:


class homelette.alignment.Alignment(file_name: Optional[str] = None, file_format: str = 'fasta')

Bases: object

Class for managing sequence alignments.

Parameters:
  • file_name (str, optional) – The file to read the alignment from. If no file name is given, an empty alignment object will be created (default None)

  • file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’)

Variables:

sequences (dict) – Collection of sequences. Sequences names are the dictionary keys, Sequence objects the values

Raises:

ValueError – File_format specified is not ‘fasta’ or ‘pir’

get_sequence(sequence_name: str) Type[Sequence]

Retrieve sequence object by sequence name.

Parameters:

sequence_name (str) – Name of sequence to retrieve

Return type:

Sequence

select_sequences(sequence_names: Iterable) None

Select sequences to remain in the alignment by sequence name

Parameters:

sequence_names (iterable) – Iterable of sequence names

Return type:

None

Raises:

KeyError – Sequence name not found in alignment

remove_sequence(sequence_name: str) None

Remove a sequence from the alignment by sequence name.

Parameters:

sequence_name (str) – Sequence name to remove from alignment

Return type:

None

rename_sequence(old_name: str, new_name: str) None

Rename sequence in the alignment

Parameters:
  • old_name (str) – Old name of sequence

  • new_name (str) – New name of sequence

Return type:

None

write_pir(file_name: str, line_wrap: int = 50) None

Write alignment to file in the PIR file format.

Parameters:
  • file_name (str) – File name to write to

  • line_wrap (int) – Characters per line (default 50)

Return type:

None

write_fasta(file_name: str, line_wrap: int = 80) None

Write alignment to file in the FASTA alignment file format.

Parameters:
  • file_name (str) – File name to write to

  • line_wrap (int) – Characters per line (default 80)

Return type:

None

print_clustal(line_wrap: int = 80) None

Print alignment to console in the clustal file format.

Parameters:

line_wrap (int) – Characters per line (default 80)

Return type:

None

write_clustal(file_name: str, line_wrap: int = 50) None

Write alignment to file in the clustal file format.

Parameters:
  • file_name (str) – File name to write to

  • line_wrap (int) – Characters per line (default 50)

Return type:

None

remove_redundant_gaps() None

Remove gaps in the alignment that are present in every column.

Return type:

None

replace_sequence(seq_name: str, new_sequence: str) None

Targeted replacement of sequence in alignment.

Parameters:
  • seq_name (str) – The identifier of the sequence that will be replaced.

  • new_sequence (str) – The new sequence.

Notes

This replacement is designed to introduce missing residues from template structures into the alignment and therefore has very strict requirements. The new and old sequence have to be identical, except that the new sequence might contain unmodelled residues. These are indicated by the letter ‘X’ in the new sequence, and will result in a gap ‘-’ in the alignment after replacement. It is important that all unmodelled residues, even at the start or beginning of the template sequence are correctly labeled as ‘X’.

Examples

>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> replacement_seq1 = 'AAAAXXXXXDDD'
>>> replacement_seq3 = 'AAXXXXDD'
>>> aln.replace_sequence('seq1', replacement_seq1)
>>> aln.print_clustal()
seq1        AAAA-----DDD
seq2        AAAAEEEEDDDD
seq3        AAAA----DDDD
>>> aln.replace_sequence('seq3', replacement_seq3)
>>> aln.print_clustal()
seq1        AAAA-----DDD
seq2        AAAAEEEEDDDD
seq3        AA--------DD
calc_identity(sequence_name_1: str, sequence_name_2: str) float

Calculate sequence identity between two sequences in the alignment.

Parameters:
  • sequence_name_1 (str) – Sequence pair to calculate identity for

  • sequence_name_2 (str) – Sequence pair to calculate identity for

Returns:

identity – Sequence identity between the two sequences

Return type:

float

Notes

There are mutiple ways of calculating sequence identity, which can be useful in different situations. Here implemented is one way which makes a lot of sence for evaluating templates for homology modelling. The sequence identity is calculated by dividing the number of matches by the length of sequence 1 (mismatches and gaps are handled identically, no gap compression).

\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]

Examples

Gaps and mismatches are treated equally.

>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_identity('seq1', 'seq2')
66.67
>>> aln.calc_identity('seq1', 'seq3')
66.67

Normalization happens for the length of sequence 1, so the order of sequences matters.

>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_identity('seq1', 'seq2')
66.67
>>> aln.calc_identity('seq2', 'seq1')
100.0
calc_pairwise_identity_all() Type[pandas.DataFrame]

Calculate identity between all sequences in the alignment.

Returns:

identities – Dataframe with pairwise sequence identites

Return type:

pd.DataFrame

Notes

Calculates sequence identity as descripted for calc_identity:

\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
calc_identity_target(sequence_name: str) Type[pandas.DataFrame]

Calculate identity of all sequences in the alignment to specified target sequence.

Parameters:

sequence_name (str) – Target sequence

Returns:

identities – Dataframe with pairwise sequence identities

Return type:

pd.DataFrame

Notes

Calculates sequence identity as descripted for calc_identity:

\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
calc_coverage(sequence_name_1: str, sequence_name_2: str) float

Calculation of coverage of sequence 2 to sequence 1 in the alignment.

Parameters:
  • sequence_name_1 (str) – Sequence pair to calculate coverage for

  • sequence_name_2 (str) – Sequence pair to calculate coverage for

Returns:

coverage – Coverage of sequence 2 to sequence 1

Return type:

float

Notes

Coverage in this context means how many of the residues in sequences 1 are assigned a residue in sequence 2. This is useful for evaluating potential templates, because a low sequence identity (as implemented in homelette) could be caused either by a lot of residues not being aligned at all, or a lot of residues being aligned but not with identical residues.

\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]

Examples

Gaps and mismatches are not treated equally.

>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'),
...     'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_coverage('seq1', 'seq2')
100.0
>>> aln.calc_coverage('seq1', 'seq3')
66.67

Normalization happens for the length of sequence 1, so the order of sequences matters.

>>> aln = hm.Alignment(None)
>>> aln.sequences = {
...     'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'),
...     'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD')
...     }
>>> aln.calc_coverage('seq1', 'seq2')
66.67
>>> aln.calc_coverage('seq2', 'seq1')
100.0
calc_coverage_target(sequence_name: str) Type[pandas.DataFrame]

Calculate coverage of all sequences in the alignment to specified target sequence.

Parameters:

sequence_name (str) – Target sequence

Returns:

coverages – Dataframe with pairwise coverage

Return type:

pd.DataFrame

Notes

Calculates coverage as described for calc_coverage:

\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
calc_pairwise_coverage_all() Type[pandas.DataFrame]

Calculate coverage between all sequences in the alignment.

Returns:

coverages – Dataframe with pairwise coverage

Return type:

pd.DataFrame

Notes

Calculates coverage as described for calc_coverage:

\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
class homelette.alignment.Sequence(name: str, sequence: str, **kwargs)

Bases: object

Class that contains individual sequences and miscellaneous information about them.

Parameters:
  • name (str) – Identifier of the sequence

  • sequence (str) – Sequence in 1 letter amino acid code

  • **kwargs – Annotations, for acceptable keys see Sequence.annotate()

Variables:
  • name (str) – Identifier of the sequence

  • sequence (str) – Sequence in 1 letter amino acid code

  • annotation (dict) – Collection of annotation for this sequence

Notes

See Sequence.annotate() for more information on the annotation of sequences.

annotate(**kwargs: str)

Change annotation for sequence object.

Keywords not specified in the Notes section will be ignored.

Parameters:

kwargs (str) – Annotations. For acceptible values, see Notes.

Return type:

None

Notes

Annotations are important for MODELLER in order to properly process alignment in PIR format. The following annotations are supported and can be modified.

annotation

explanation

seq_type

Specification whether sequence should be treated as a template (set to ‘structure’) or as a target (set to ‘sequence’)

pdb_code

PDB code corresponding to sequence (if available)

begin_res

Residue number for the first residue of the sequence in the corresponing PDB file

begin_chain

Chain identifier for the first residue of the sequence in the corresponding PDB file

end_res

Residue number for the last residue of the sequence in the corresponding PDB file

end_chain

Chain identifier for the last residue of the sequence in the corresponding PDB file

prot_name

Protein name, optional

prot_source

Protein source, optional

resolution

Resolution of PDB structure, optional

R_factor

R-factor of PDB structure, optional

Different types of annotations are required, depending whether a target or a template is annotated. For targets, it is sufficient to seq the seq_type to ‘sequence’. For templates, it is required by MODELLER that seq_type and pdb_code are annotated. begin_res, begin_chain, end_res and end_chain are recommended. The rest can be left unannoted.

Examples

Annotation for a target sequence.

>>> target = hm.alignment.Sequence(name = 'target', sequence =
...     'TARGET')
>>> target.annotation
{'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '',
'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '',
'resolution': '', 'r_factor': ''}
>>> target.annotate(seq_type = 'sequence')
>>> target.annotation
{'seq_type': 'sequence', 'pdb_code': '', 'begin_res': '',
'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '',
'prot_source': '', 'resolution': '', 'r_factor': ''}

Annotation for a template structure.

>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE')
>>> template.annotation
{'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '',
'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '',
'resolution': '', 'r_factor': ''}
>>> template.annotate(seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.annotation
{'seq_type': 'structure', 'pdb_code': 'TMPL', 'begin_res': '1',
'begin_chain': 'A', 'end_res': '8', 'end_chain': 'A', 'prot_name': '',
'prot_source': '', 'resolution': '', 'r_factor': ''}
get_annotation_pir() str

Return annotation in the colon-separated format expected from the PIR alignment format used by MODELLER.

Returns:

Annotation in PIR format

Return type:

str

Examples

>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.get_annotation_pir()
'structure:TMPL:1:A:8:A::::'
get_annotation_print() None

Print annotation to console

Return type:

None

Examples

>>> template = hm.alignment.Sequence(name = 'template', sequence =
...     'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL',
...     begin_res = '1', begin_chain = 'A', end_res = '8', end_chain =
...     'A')
>>> template.get_annotation_print()
Sequence Type   structure
PDB ID          TMPL
Start Residue   1
Start Chain     A
End Residue     8
End Chain       A
Protein Name
Protein Source
Resolution
R-Factor
get_gaps() tuple

Find gap positions in sequence

Returns:

Positions of gaps in sequence

Return type:

tuple

Examples

>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.get_gaps()
(3, 7, 8)
remove_gaps(remove_all: bool = False, positions: Optional[Iterable[int]] = None) None

Remove gaps from the sequence.

Gaps in the alignment are symbolized by ‘-’. Removal can either happen at specific or all positions. Indexing for specific positions is zero-based and checked before removal (raises Warning if the attempted removal of a non-gap position is detected)

Parameters:
  • remove_all (bool) – Remove all gaps (default False)

  • positions (iterable) – Positions to remove (zero-based indexing)

Return type:

None

Warns:

UserWarning – Specified position is not a gap

Examples

Example 1: remove all

>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.remove_gaps(remove_all = True)
>>> seq.sequence
'SEQUENCE'

Example 2: selective removal

>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE')
>>> seq.remove_gaps(positions = (7, 8))
>>> seq.sequence
'SEQ-UENCE'
class homelette.alignment.AlignmentGenerator(sequence: str, target: str = 'target', template_location: str = './templates/')

Bases: ABC

Parent class for the auto-generation of alignments and template selection based on sequence input.

Parameters:
  • sequence (str) – Target sequence in 1 letter amino acid code.

  • target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Variables:
  • alignment (Alignment) – The alignment.

  • target_seq (str) – The target sequence.

  • target (str) – The name of the target sequence.

  • template_location (str) – Directory where processed templates will be stored.

  • state – Dictionary describing the state of the AlignmentGenerator object

Return type:

None

abstract get_suggestion()

Generate suggestion for templates and alignment

classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator

Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.

Parameters:
  • fasta_file (str) – Fasta file from which the first sequence will be read.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Return type:

AlignmentGenerator

Raises:

ValueError – Fasta file not properly formatted

show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]

Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).

Parameters:

get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.

Returns:

suggestion – DataFrame with calculated sequence identity and sequence coverage for target

Return type:

pd.DataFrame

Raises:

RuntimeError – Alignment has not been generated yet

Notes

The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.

In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.

select_templates(templates: Iterable) None

Select templates from suggested templates by identifier.

Parameters:

templates (iterable) – The selected templates as an interable.

Return type:

None

Raises:

RuntimeError – Alignment has not been generated yet

get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None

Downloads and processes templates present in alignment.

Parameters:
  • pdb_format (str) – Format of PDB identifiers in alignment (default auto)

  • verbose (bool) – Explain what operations are performed

Raises:
  • RuntimeError – Alignment has not been generated yet

  • ValueError – PDB format could not be guessed

Notes

pdb_format tells the function how to parse the template identifiers in the alignment:

  • auto: Automatic guess for pdb_format

  • entry: Sequences are named only be their PDB identifier (i.e. 4G0N)

  • entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)

  • instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)

Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).

During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.

initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task

Initialize a homelette Task object for model generation and evaluation.

Parameters:
  • task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.

  • overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).

  • task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)

Return type:

Task

Raises:

RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.

class homelette.alignment.AlignmentGenerator_pdb(sequence: str, target: str = 'target', template_location: str = './templates/')

Bases: AlignmentGenerator

Identification of templates using the RCSB search API, generation of alignment using Clustal Omega and download and processing of template structures.

Parameters:
  • sequence (str) – Target sequence in 1 letter amino acid code.

  • target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Variables:
  • alignment (Alignment) – The alignment.

  • target_seq (str) – The target sequence.

  • target (str) – The name of the target sequence.

  • template_location (str) – Directory where processed templates will be stored.

  • state – Dictionary describing the state of the AlignmentGenerator object

Return type:

None

Notes

The AlignmentGenerator uses the RCSB Search API [1] to identify potential template structures given the target sequence using MMseq2 [2]. The sequences of the potentially downloaded and locally aligned using Clustal Omega [3] [4].

References

get_suggestion(seq_id_cutoff: float = 0.5, min_length: int = 30, max_results: int = 50, xray_only: bool = True, verbose: bool = True) None

Identifies potential templates, retrieves their sequences and aligns them locally using Clustal Omega.

Parameters:
  • seq_id_cutoff (float) – The sequence identity cutoff for the identification of template structures. Templates below this threshold will be ignored (default 0.5).

  • min_length (int) – The minimum length of template sequence to be included in the results (default 30 amino acids).

  • max_results (int) – The number of results returned (default 50).

  • xray_only (bool) – Only consider templates structures generated with X-ray crystallography (default True).

  • verbose (bool) – Explain what is done (default True).

Return type:

None

Raises:

RuntimeError – Alignment already generated.

classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator

Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.

Parameters:
  • fasta_file (str) – Fasta file from which the first sequence will be read.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Return type:

AlignmentGenerator

Raises:

ValueError – Fasta file not properly formatted

get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None

Downloads and processes templates present in alignment.

Parameters:
  • pdb_format (str) – Format of PDB identifiers in alignment (default auto)

  • verbose (bool) – Explain what operations are performed

Raises:
  • RuntimeError – Alignment has not been generated yet

  • ValueError – PDB format could not be guessed

Notes

pdb_format tells the function how to parse the template identifiers in the alignment:

  • auto: Automatic guess for pdb_format

  • entry: Sequences are named only be their PDB identifier (i.e. 4G0N)

  • entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)

  • instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)

Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).

During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.

initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task

Initialize a homelette Task object for model generation and evaluation.

Parameters:
  • task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.

  • overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).

  • task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)

Return type:

Task

Raises:

RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.

select_templates(templates: Iterable) None

Select templates from suggested templates by identifier.

Parameters:

templates (iterable) – The selected templates as an interable.

Return type:

None

Raises:

RuntimeError – Alignment has not been generated yet

show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]

Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).

Parameters:

get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.

Returns:

suggestion – DataFrame with calculated sequence identity and sequence coverage for target

Return type:

pd.DataFrame

Raises:

RuntimeError – Alignment has not been generated yet

Notes

The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.

In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.

class homelette.alignment.AlignmentGenerator_hhblits(sequence: str, target: str = 'target', template_location: str = './templates/')

Bases: AlignmentGenerator

Identification of templates using hhblits to search a local PDB database, generation of alignment by combining pairwise alignments of target and template together.

Parameters:
  • sequence (str) – Target sequence in 1 letter amino acid code.

  • target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Variables:
  • alignment (Alignment) – The alignment.

  • target_seq (str) – The target sequence.

  • target (str) – The name of the target sequence.

  • template_location (str) – Directory where processed templates will be stored.

  • state – Dictionary describing the state of the AlignmentGenerator object.

Return type:

None

Notes

HHblits from the HHsuite [5] is used to query the databases. The resulting pairwise sequence alignments of template to target are combined using the target sequence as the master sequence. The resulting alignment is therefore, strictly speaking, not a proper multiple sequence alignment. However, all information from the pairwise alignments is preserved, and for homology modelling, the alignments of templates among each others do not have any influence.

References

get_suggestion(database_dir: str = './databases/', use_uniref: bool = False, evalue_cutoff: float = 0.001, iterations: int = 2, n_threads: int = 2, neffmax: float = 10.0, verbose: bool = True) None

Use HHblits to identify template structures and create a multiple sequence alignment by combination of pairwise alignments on target sequence.

Parameters:
  • database_dir (str) – The directory where the pdb70 (and the UniRef30) database are stored (default ./databases/).

  • use_uniref (bool) – Use UniRef30 to create a MSA before querying the pdb70 database (default False). This leads to better results, however it takes longer and requires the UniRef30 database on your system.

  • evalue_cutoff (float) – E-value cutoff for inclusion in result alignment (default 0.001)

  • iterations (int) – Number of iterations when querying the pdb70 database.

  • n_threads (int) – Number of threads when querying the pdb70 (or UniRef30) database (default 2).

  • neffmax (float) – The neffmax value used when querying the pdb70 database (default 10.0).

  • verbose (bool) – Explain which operations are performed (default True).

Return type:

None

Raises:

RuntimeError – Alignment has already been generated.

Notes

This function expects “hhblits” to be installed and in the path. In addition, the pdb70 database needs to be downloaded and extracted in the database_dir. The files need to be called “pdb70_*” for hhblits to correctly find the database. If UniRef30 is used to create a pre-alignment for better results, the UniRef30 database needs to be downloaded and extracted in the database_dir. The files need to be called “UniRef30_*”.

For more information on neffmax, please check the hhblits documentation.

If UniRef30 is used to generate a prealignment, then hhblits will be called for one iteration with standard parameters.

classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator

Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.

Parameters:
  • fasta_file (str) – Fasta file from which the first sequence will be read.

  • template_location (str) – Directory where processed templates will be stored (default “./templates/”).

Return type:

AlignmentGenerator

Raises:

ValueError – Fasta file not properly formatted

get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None

Downloads and processes templates present in alignment.

Parameters:
  • pdb_format (str) – Format of PDB identifiers in alignment (default auto)

  • verbose (bool) – Explain what operations are performed

Raises:
  • RuntimeError – Alignment has not been generated yet

  • ValueError – PDB format could not be guessed

Notes

pdb_format tells the function how to parse the template identifiers in the alignment:

  • auto: Automatic guess for pdb_format

  • entry: Sequences are named only be their PDB identifier (i.e. 4G0N)

  • entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)

  • instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)

Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).

During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.

initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task

Initialize a homelette Task object for model generation and evaluation.

Parameters:
  • task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.

  • overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).

  • task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)

Return type:

Task

Raises:

RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.

select_templates(templates: Iterable) None

Select templates from suggested templates by identifier.

Parameters:

templates (iterable) – The selected templates as an interable.

Return type:

None

Raises:

RuntimeError – Alignment has not been generated yet

show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]

Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).

Parameters:

get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.

Returns:

suggestion – DataFrame with calculated sequence identity and sequence coverage for target

Return type:

pd.DataFrame

Raises:

RuntimeError – Alignment has not been generated yet

Notes

The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.

In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.

class homelette.alignment.AlignmentGenerator_from_aln(alignment_file: str, target: str, template_location: str = './templates/', file_format: str = 'fasta')

Bases: AlignmentGenerator

Reads an alignment from file into the AlignmentGenerator workflow.

Parameters:
  • alignment_file (str) – The file to read the alignment from.

  • target (str) – The name of the target sequence in the alignment.

  • template_location (str) – Directory where processed templates will be stored (default ‘./templates/’).

  • file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’).

Variables:
  • alignment (Alignment) – The alignment.

  • target_seq (str) – The target sequence.

  • target (str) – The name of the target sequence.

  • template_location (str) – Directory where processed templates will be stored.

  • state (dict) – Dictionary describing the state of the AlignmentGenerator object.

Return type:

None

Notes

Useful for making use of the PDB download and processing functions that come with the AlignmentGenerator classes.

get_suggestion()

Not implemented, since alignment is read from file on initialization.

Raises:

NotImplementedError

from_fasta(*args, **kwargs)

Not implemented, since alignment is read from file on initialization.

Raises:

NotImplementedError

get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None

Downloads and processes templates present in alignment.

Parameters:
  • pdb_format (str) – Format of PDB identifiers in alignment (default auto)

  • verbose (bool) – Explain what operations are performed

Raises:
  • RuntimeError – Alignment has not been generated yet

  • ValueError – PDB format could not be guessed

Notes

pdb_format tells the function how to parse the template identifiers in the alignment:

  • auto: Automatic guess for pdb_format

  • entry: Sequences are named only be their PDB identifier (i.e. 4G0N)

  • entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)

  • instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)

Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).

During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.

initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task

Initialize a homelette Task object for model generation and evaluation.

Parameters:
  • task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.

  • overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).

  • task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)

Return type:

Task

Raises:

RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.

select_templates(templates: Iterable) None

Select templates from suggested templates by identifier.

Parameters:

templates (iterable) – The selected templates as an interable.

Return type:

None

Raises:

RuntimeError – Alignment has not been generated yet

show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]

Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).

Parameters:

get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.

Returns:

suggestion – DataFrame with calculated sequence identity and sequence coverage for target

Return type:

pd.DataFrame

Raises:

RuntimeError – Alignment has not been generated yet

Notes

The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.

In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.

homelette.alignment.assemble_complex_aln(*args: Type[Alignment], names: dict) Type[Alignment]

Assemble complex alignments compatible with MODELLER from individual alignments.

Parameters:
  • *args (Alignment) – The input alignments

  • names (dict) – Dictionary instructing how sequences in the different alignment objects are supposed to be arranged in the complex alignment. The keys are the names of the sequences in the output alignments. The values are iterables of the sequence names from the input alignments in the order they are supposed to appaer in the output alignment. Any value that can not be found in the alignment signals that this position in the complex alignment should be filled with gaps.

Returns:

Assembled complex alignment

Return type:

Alignment

Examples

>>> aln1 = hm.Alignment(None)
>>> aln1.sequences = {
...     'seq1_1': hm.alignment.Sequence('seq1_1', 'HELLO'),
...     'seq2_1': hm.alignment.Sequence('seq2_1', 'H---I'),
...     'seq3_1': hm.alignment.Sequence('seq3_1', '-HI--')
...     }
>>> aln2 = hm.Alignment(None)
>>> aln2.sequences = {
...     'seq2_2': hm.alignment.Sequence('seq2_2', 'KITTY'),
...     'seq1_2': hm.alignment.Sequence('seq1_2', 'WORLD')
...     }
>>> names = {'seq1': ('seq1_1', 'seq1_2'),
...          'seq2': ('seq2_1', 'seq2_2'),
...          'seq3': ('seq3_1', 'gaps')
...     }
>>> aln_assembled = hm.alignment.assemble_complex_aln(
...     aln1, aln2, names=names)
>>> aln_assembled.print_clustal()
seq1        HELLO/WORLD
seq2        H---I/KITTY
seq3        -HI--/-----