homelette.alignment
The homelette.alignment
submodule contains a selection of tools for
handling sequences and alignments, as well as for the automatic generation of
sequences from a target sequence.
Tutorials
Basic handing of alignments with homelette is demonstrated in Tutorial 1. The assembling of alignments for complex modelling is discussed in Tutorial 6. The automatic generation of alignments is shown in Tutorial 8.
Functions and classes
Functions and classes present in homelette.alignment are listed below:
- class homelette.alignment.Alignment(file_name: Optional[str] = None, file_format: str = 'fasta')
Bases:
object
Class for managing sequence alignments.
- Parameters:
file_name (str, optional) – The file to read the alignment from. If no file name is given, an empty alignment object will be created (default None)
file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’)
- Variables:
sequences (dict) – Collection of sequences. Sequences names are the dictionary keys, Sequence objects the values
- Raises:
ValueError – File_format specified is not ‘fasta’ or ‘pir’
- get_sequence(sequence_name: str) Type[Sequence]
Retrieve sequence object by sequence name.
- Parameters:
sequence_name (str) – Name of sequence to retrieve
- Return type:
- select_sequences(sequence_names: Iterable) None
Select sequences to remain in the alignment by sequence name
- Parameters:
sequence_names (iterable) – Iterable of sequence names
- Return type:
None
- Raises:
KeyError – Sequence name not found in alignment
- remove_sequence(sequence_name: str) None
Remove a sequence from the alignment by sequence name.
- Parameters:
sequence_name (str) – Sequence name to remove from alignment
- Return type:
None
- rename_sequence(old_name: str, new_name: str) None
Rename sequence in the alignment
- Parameters:
old_name (str) – Old name of sequence
new_name (str) – New name of sequence
- Return type:
None
- write_pir(file_name: str, line_wrap: int = 50) None
Write alignment to file in the PIR file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 50)
- Return type:
None
- write_fasta(file_name: str, line_wrap: int = 80) None
Write alignment to file in the FASTA alignment file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 80)
- Return type:
None
- print_clustal(line_wrap: int = 80) None
Print alignment to console in the clustal file format.
- Parameters:
line_wrap (int) – Characters per line (default 80)
- Return type:
None
- write_clustal(file_name: str, line_wrap: int = 50) None
Write alignment to file in the clustal file format.
- Parameters:
file_name (str) – File name to write to
line_wrap (int) – Characters per line (default 50)
- Return type:
None
- remove_redundant_gaps() None
Remove gaps in the alignment that are present in every column.
- Return type:
None
- replace_sequence(seq_name: str, new_sequence: str) None
Targeted replacement of sequence in alignment.
- Parameters:
seq_name (str) – The identifier of the sequence that will be replaced.
new_sequence (str) – The new sequence.
Notes
This replacement is designed to introduce missing residues from template structures into the alignment and therefore has very strict requirements. The new and old sequence have to be identical, except that the new sequence might contain unmodelled residues. These are indicated by the letter ‘X’ in the new sequence, and will result in a gap ‘-’ in the alignment after replacement. It is important that all unmodelled residues, even at the start or beginning of the template sequence are correctly labeled as ‘X’.
Examples
>>> aln = hm.Alignment(None) >>> aln.sequences = { ... 'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'), ... 'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'), ... 'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD') ... } >>> replacement_seq1 = 'AAAAXXXXXDDD' >>> replacement_seq3 = 'AAXXXXDD' >>> aln.replace_sequence('seq1', replacement_seq1) >>> aln.print_clustal() seq1 AAAA-----DDD seq2 AAAAEEEEDDDD seq3 AAAA----DDDD >>> aln.replace_sequence('seq3', replacement_seq3) >>> aln.print_clustal() seq1 AAAA-----DDD seq2 AAAAEEEEDDDD seq3 AA--------DD
- calc_identity(sequence_name_1: str, sequence_name_2: str) float
Calculate sequence identity between two sequences in the alignment.
- Parameters:
sequence_name_1 (str) – Sequence pair to calculate identity for
sequence_name_2 (str) – Sequence pair to calculate identity for
- Returns:
identity – Sequence identity between the two sequences
- Return type:
float
Notes
There are mutiple ways of calculating sequence identity, which can be useful in different situations. Here implemented is one way which makes a lot of sence for evaluating templates for homology modelling. The sequence identity is calculated by dividing the number of matches by the length of sequence 1 (mismatches and gaps are handled identically, no gap compression).
\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]Examples
Gaps and mismatches are treated equally.
>>> aln = hm.Alignment(None) >>> aln.sequences = { ... 'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'), ... 'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'), ... 'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD') ... } >>> aln.calc_identity('seq1', 'seq2') 66.67 >>> aln.calc_identity('seq1', 'seq3') 66.67
Normalization happens for the length of sequence 1, so the order of sequences matters.
>>> aln = hm.Alignment(None) >>> aln.sequences = { ... 'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'), ... 'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD') ... } >>> aln.calc_identity('seq1', 'seq2') 66.67 >>> aln.calc_identity('seq2', 'seq1') 100.0
- calc_pairwise_identity_all() Type[pandas.DataFrame]
Calculate identity between all sequences in the alignment.
- Returns:
identities – Dataframe with pairwise sequence identites
- Return type:
pd.DataFrame
See also
Notes
Calculates sequence identity as descripted for calc_identity:
\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
- calc_identity_target(sequence_name: str) Type[pandas.DataFrame]
Calculate identity of all sequences in the alignment to specified target sequence.
- Parameters:
sequence_name (str) – Target sequence
- Returns:
identities – Dataframe with pairwise sequence identities
- Return type:
pd.DataFrame
See also
Notes
Calculates sequence identity as descripted for calc_identity:
\[\text{seqid} = \frac{\text{matches}} {\text{length}(\text{sequence1})}\]
- calc_coverage(sequence_name_1: str, sequence_name_2: str) float
Calculation of coverage of sequence 2 to sequence 1 in the alignment.
- Parameters:
sequence_name_1 (str) – Sequence pair to calculate coverage for
sequence_name_2 (str) – Sequence pair to calculate coverage for
- Returns:
coverage – Coverage of sequence 2 to sequence 1
- Return type:
float
Notes
Coverage in this context means how many of the residues in sequences 1 are assigned a residue in sequence 2. This is useful for evaluating potential templates, because a low sequence identity (as implemented in homelette) could be caused either by a lot of residues not being aligned at all, or a lot of residues being aligned but not with identical residues.
\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]Examples
Gaps and mismatches are not treated equally.
>>> aln = hm.Alignment(None) >>> aln.sequences = { ... 'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'), ... 'seq2': hm.alignment.Sequence('seq2', 'AAAAEEEEDDDD'), ... 'seq3': hm.alignment.Sequence('seq3', 'AAAA----DDDD') ... } >>> aln.calc_coverage('seq1', 'seq2') 100.0 >>> aln.calc_coverage('seq1', 'seq3') 66.67
Normalization happens for the length of sequence 1, so the order of sequences matters.
>>> aln = hm.Alignment(None) >>> aln.sequences = { ... 'seq1': hm.alignment.Sequence('seq1', 'AAAACCCCDDDD'), ... 'seq2': hm.alignment.Sequence('seq3', 'AAAA----DDDD') ... } >>> aln.calc_coverage('seq1', 'seq2') 66.67 >>> aln.calc_coverage('seq2', 'seq1') 100.0
- calc_coverage_target(sequence_name: str) Type[pandas.DataFrame]
Calculate coverage of all sequences in the alignment to specified target sequence.
- Parameters:
sequence_name (str) – Target sequence
- Returns:
coverages – Dataframe with pairwise coverage
- Return type:
pd.DataFrame
See also
Notes
Calculates coverage as described for calc_coverage:
\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
- calc_pairwise_coverage_all() Type[pandas.DataFrame]
Calculate coverage between all sequences in the alignment.
- Returns:
coverages – Dataframe with pairwise coverage
- Return type:
pd.DataFrame
See also
Notes
Calculates coverage as described for calc_coverage:
\[\text{coverage} = \frac{\text{aligned residues}} {\text{length}(\text{sequence1})}\]
- class homelette.alignment.Sequence(name: str, sequence: str, **kwargs)
Bases:
object
Class that contains individual sequences and miscellaneous information about them.
- Parameters:
name (str) – Identifier of the sequence
sequence (str) – Sequence in 1 letter amino acid code
**kwargs – Annotations, for acceptable keys see
Sequence.annotate()
- Variables:
name (str) – Identifier of the sequence
sequence (str) – Sequence in 1 letter amino acid code
annotation (dict) – Collection of annotation for this sequence
Notes
See
Sequence.annotate()
for more information on the annotation of sequences.- annotate(**kwargs: str)
Change annotation for sequence object.
Keywords not specified in the Notes section will be ignored.
- Parameters:
kwargs (str) – Annotations. For acceptible values, see Notes.
- Return type:
None
Notes
Annotations are important for MODELLER in order to properly process alignment in PIR format. The following annotations are supported and can be modified.
annotation
explanation
seq_type
Specification whether sequence should be treated as a template (set to ‘structure’) or as a target (set to ‘sequence’)
pdb_code
PDB code corresponding to sequence (if available)
begin_res
Residue number for the first residue of the sequence in the corresponing PDB file
begin_chain
Chain identifier for the first residue of the sequence in the corresponding PDB file
end_res
Residue number for the last residue of the sequence in the corresponding PDB file
end_chain
Chain identifier for the last residue of the sequence in the corresponding PDB file
prot_name
Protein name, optional
prot_source
Protein source, optional
resolution
Resolution of PDB structure, optional
R_factor
R-factor of PDB structure, optional
Different types of annotations are required, depending whether a target or a template is annotated. For targets, it is sufficient to seq the seq_type to ‘sequence’. For templates, it is required by MODELLER that seq_type and pdb_code are annotated. begin_res, begin_chain, end_res and end_chain are recommended. The rest can be left unannoted.
Examples
Annotation for a target sequence.
>>> target = hm.alignment.Sequence(name = 'target', sequence = ... 'TARGET') >>> target.annotation {'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''} >>> target.annotate(seq_type = 'sequence') >>> target.annotation {'seq_type': 'sequence', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
Annotation for a template structure.
>>> template = hm.alignment.Sequence(name = 'template', sequence = ... 'TEMPLATE') >>> template.annotation {'seq_type': '', 'pdb_code': '', 'begin_res': '', 'begin_chain': '', 'end_res': '', 'end_chain': '', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''} >>> template.annotate(seq_type = 'structure', pdb_code = 'TMPL', ... begin_res = '1', begin_chain = 'A', end_res = '8', end_chain = ... 'A') >>> template.annotation {'seq_type': 'structure', 'pdb_code': 'TMPL', 'begin_res': '1', 'begin_chain': 'A', 'end_res': '8', 'end_chain': 'A', 'prot_name': '', 'prot_source': '', 'resolution': '', 'r_factor': ''}
- get_annotation_pir() str
Return annotation in the colon-separated format expected from the PIR alignment format used by MODELLER.
- Returns:
Annotation in PIR format
- Return type:
str
Examples
>>> template = hm.alignment.Sequence(name = 'template', sequence = ... 'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL', ... begin_res = '1', begin_chain = 'A', end_res = '8', end_chain = ... 'A') >>> template.get_annotation_pir() 'structure:TMPL:1:A:8:A::::'
- get_annotation_print() None
Print annotation to console
- Return type:
None
Examples
>>> template = hm.alignment.Sequence(name = 'template', sequence = ... 'TEMPLATE', seq_type = 'structure', pdb_code = 'TMPL', ... begin_res = '1', begin_chain = 'A', end_res = '8', end_chain = ... 'A') >>> template.get_annotation_print() Sequence Type structure PDB ID TMPL Start Residue 1 Start Chain A End Residue 8 End Chain A Protein Name Protein Source Resolution R-Factor
- get_gaps() tuple
Find gap positions in sequence
- Returns:
Positions of gaps in sequence
- Return type:
tuple
Examples
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE') >>> seq.get_gaps() (3, 7, 8)
- remove_gaps(remove_all: bool = False, positions: Optional[Iterable[int]] = None) None
Remove gaps from the sequence.
Gaps in the alignment are symbolized by ‘-’. Removal can either happen at specific or all positions. Indexing for specific positions is zero-based and checked before removal (raises Warning if the attempted removal of a non-gap position is detected)
- Parameters:
remove_all (bool) – Remove all gaps (default False)
positions (iterable) – Positions to remove (zero-based indexing)
- Return type:
None
- Warns:
UserWarning – Specified position is not a gap
Examples
Example 1: remove all
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE') >>> seq.remove_gaps(remove_all = True) >>> seq.sequence 'SEQUENCE'
Example 2: selective removal
>>> seq = hm.alignment.Sequence(name = 'seq', sequence = 'SEQ-UEN--CE') >>> seq.remove_gaps(positions = (7, 8)) >>> seq.sequence 'SEQ-UENCE'
- class homelette.alignment.AlignmentGenerator(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
ABC
Parent class for the auto-generation of alignments and template selection based on sequence input.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object
- Return type:
None
- abstract get_suggestion()
Generate suggestion for templates and alignment
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
- Raises:
ValueError – Fasta file not properly formatted
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an interable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only be their PDB identifier (i.e. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- class homelette.alignment.AlignmentGenerator_pdb(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
AlignmentGenerator
Identification of templates using the RCSB search API, generation of alignment using Clustal Omega and download and processing of template structures.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object
- Return type:
None
Notes
The AlignmentGenerator uses the RCSB Search API [1] to identify potential template structures given the target sequence using MMseq2 [2]. The sequences of the potentially downloaded and locally aligned using Clustal Omega [3] [4].
References
- get_suggestion(seq_id_cutoff: float = 0.5, min_length: int = 30, max_results: int = 50, xray_only: bool = True, verbose: bool = True) None
Identifies potential templates, retrieves their sequences and aligns them locally using Clustal Omega.
- Parameters:
seq_id_cutoff (float) – The sequence identity cutoff for the identification of template structures. Templates below this threshold will be ignored (default 0.5).
min_length (int) – The minimum length of template sequence to be included in the results (default 30 amino acids).
max_results (int) – The number of results returned (default 50).
xray_only (bool) – Only consider templates structures generated with X-ray crystallography (default True).
verbose (bool) – Explain what is done (default True).
- Return type:
None
- Raises:
RuntimeError – Alignment already generated.
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
- Raises:
ValueError – Fasta file not properly formatted
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only be their PDB identifier (i.e. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an interable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.
- class homelette.alignment.AlignmentGenerator_hhblits(sequence: str, target: str = 'target', template_location: str = './templates/')
Bases:
AlignmentGenerator
Identification of templates using hhblits to search a local PDB database, generation of alignment by combining pairwise alignments of target and template together.
- Parameters:
sequence (str) – Target sequence in 1 letter amino acid code.
target (str) – The name of the target sequence (default “target”). If longer than 14 characters, will be truncated.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state – Dictionary describing the state of the AlignmentGenerator object.
- Return type:
None
Notes
HHblits from the HHsuite [5] is used to query the databases. The resulting pairwise sequence alignments of template to target are combined using the target sequence as the master sequence. The resulting alignment is therefore, strictly speaking, not a proper multiple sequence alignment. However, all information from the pairwise alignments is preserved, and for homology modelling, the alignments of templates among each others do not have any influence.
References
- get_suggestion(database_dir: str = './databases/', use_uniref: bool = False, evalue_cutoff: float = 0.001, iterations: int = 2, n_threads: int = 2, neffmax: float = 10.0, verbose: bool = True) None
Use HHblits to identify template structures and create a multiple sequence alignment by combination of pairwise alignments on target sequence.
- Parameters:
database_dir (str) – The directory where the pdb70 (and the UniRef30) database are stored (default ./databases/).
use_uniref (bool) – Use UniRef30 to create a MSA before querying the pdb70 database (default False). This leads to better results, however it takes longer and requires the UniRef30 database on your system.
evalue_cutoff (float) – E-value cutoff for inclusion in result alignment (default 0.001)
iterations (int) – Number of iterations when querying the pdb70 database.
n_threads (int) – Number of threads when querying the pdb70 (or UniRef30) database (default 2).
neffmax (float) – The neffmax value used when querying the pdb70 database (default 10.0).
verbose (bool) – Explain which operations are performed (default True).
- Return type:
None
- Raises:
RuntimeError – Alignment has already been generated.
Notes
This function expects “hhblits” to be installed and in the path. In addition, the pdb70 database needs to be downloaded and extracted in the database_dir. The files need to be called “pdb70_*” for hhblits to correctly find the database. If UniRef30 is used to create a pre-alignment for better results, the UniRef30 database needs to be downloaded and extracted in the database_dir. The files need to be called “UniRef30_*”.
For more information on neffmax, please check the hhblits documentation.
If UniRef30 is used to generate a prealignment, then hhblits will be called for one iteration with standard parameters.
- classmethod from_fasta(fasta_file: str, template_location: str = './templates/') AlignmentGenerator
Generates an instance of the AlignemntGenerator with the first sequence in the fasta file.
- Parameters:
fasta_file (str) – Fasta file from which the first sequence will be read.
template_location (str) – Directory where processed templates will be stored (default “./templates/”).
- Return type:
- Raises:
ValueError – Fasta file not properly formatted
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only be their PDB identifier (i.e. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an interable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.
- class homelette.alignment.AlignmentGenerator_from_aln(alignment_file: str, target: str, template_location: str = './templates/', file_format: str = 'fasta')
Bases:
AlignmentGenerator
Reads an alignment from file into the AlignmentGenerator workflow.
- Parameters:
alignment_file (str) – The file to read the alignment from.
target (str) – The name of the target sequence in the alignment.
template_location (str) – Directory where processed templates will be stored (default ‘./templates/’).
file_format (str, optional) – The format of the alignment file. Can be ‘fasta’ or ‘pir’ (default ‘fasta’).
- Variables:
alignment (Alignment) – The alignment.
target_seq (str) – The target sequence.
target (str) – The name of the target sequence.
template_location (str) – Directory where processed templates will be stored.
state (dict) – Dictionary describing the state of the AlignmentGenerator object.
- Return type:
None
Notes
Useful for making use of the PDB download and processing functions that come with the AlignmentGenerator classes.
- get_suggestion()
Not implemented, since alignment is read from file on initialization.
- Raises:
NotImplementedError –
- from_fasta(*args, **kwargs)
Not implemented, since alignment is read from file on initialization.
- Raises:
NotImplementedError –
- get_pdbs(pdb_format: str = 'auto', verbose: bool = True) None
Downloads and processes templates present in alignment.
- Parameters:
pdb_format (str) – Format of PDB identifiers in alignment (default auto)
verbose (bool) – Explain what operations are performed
- Raises:
RuntimeError – Alignment has not been generated yet
ValueError – PDB format could not be guessed
Notes
pdb_format tells the function how to parse the template identifiers in the alignment:
auto: Automatic guess for pdb_format
entry: Sequences are named only be their PDB identifier (i.e. 4G0N)
entity: Sequences are named in the format PDBID_ENTITY (i.e. 4G0N_1)
instance: Sequences are named in the format PDBID_CHAIN (i.e. 4G0N_A)
Please make sure that all templates follow one naming convention, and that there are no sequences in the alignment that violate the naming convention (except the target sequence).
During the template processing, all hetatms will be remove from the template, as well as all other chains. All chains will be renamed to “A” and the residue number will be set to 1 on the first residue. The corresponding annotations are automatically made in the alignment object.
- initialize_task(task_name: ~typing.Optional[str] = None, overwrite: bool = False, task_class: ~homelette.organization.Task = <class 'homelette.organization.Task'>) Task
Initialize a homelette Task object for model generation and evaluation.
- Parameters:
task_name (str) – The name of the task to initialize. If None, initialize as models_{target}.
overwrite (bool) – Whether to overwrite the task directory if a directory of the same name already exists (default False).
task_class (Task) – The class to initialize the Task with. This makes it possible to define custom child classes of Task and construct them from this function (default Task)
- Return type:
- Raises:
RuntimeError – Alignment has not been generated or templates have not been downloaded and processed.
- select_templates(templates: Iterable) None
Select templates from suggested templates by identifier.
- Parameters:
templates (iterable) – The selected templates as an interable.
- Return type:
None
- Raises:
RuntimeError – Alignment has not been generated yet
- show_suggestion(get_metadata: bool = False) Type[pandas.DataFrame]
Shows which templates have been suggested by the AlignmentGenerator, as well as some useful statistics (sequence identity, coverage).
- Parameters:
get_metadata (bool) – Retrieve additional metadata (experimental method, resolution, structure title) from the RCSB.
- Returns:
suggestion – DataFrame with calculated sequence identity and sequence coverage for target
- Return type:
pd.DataFrame
- Raises:
RuntimeError – Alignment has not been generated yet
Notes
The standard output lists the templates in the alignment and shows both coverage and sequence identity to the target sequence. The templates are ordered by sequence identity.
In addition, the experimental method (Xray, NMR or Electron Microscopy), the resolution (if applicable) and the title of the template structure can be retrieved from the RCSB. Retrieving metadata from the PDB requires a working internet connecction.
- homelette.alignment.assemble_complex_aln(*args: Type[Alignment], names: dict) Type[Alignment]
Assemble complex alignments compatible with MODELLER from individual alignments.
- Parameters:
*args (Alignment) – The input alignments
names (dict) – Dictionary instructing how sequences in the different alignment objects are supposed to be arranged in the complex alignment. The keys are the names of the sequences in the output alignments. The values are iterables of the sequence names from the input alignments in the order they are supposed to appaer in the output alignment. Any value that can not be found in the alignment signals that this position in the complex alignment should be filled with gaps.
- Returns:
Assembled complex alignment
- Return type:
Examples
>>> aln1 = hm.Alignment(None) >>> aln1.sequences = { ... 'seq1_1': hm.alignment.Sequence('seq1_1', 'HELLO'), ... 'seq2_1': hm.alignment.Sequence('seq2_1', 'H---I'), ... 'seq3_1': hm.alignment.Sequence('seq3_1', '-HI--') ... } >>> aln2 = hm.Alignment(None) >>> aln2.sequences = { ... 'seq2_2': hm.alignment.Sequence('seq2_2', 'KITTY'), ... 'seq1_2': hm.alignment.Sequence('seq1_2', 'WORLD') ... } >>> names = {'seq1': ('seq1_1', 'seq1_2'), ... 'seq2': ('seq2_1', 'seq2_2'), ... 'seq3': ('seq3_1', 'gaps') ... } >>> aln_assembled = hm.alignment.assemble_complex_aln( ... aln1, aln2, names=names) >>> aln_assembled.print_clustal() seq1 HELLO/WORLD seq2 H---I/KITTY seq3 -HI--/-----