proteinnetpy.record

Record class to represent ProteinNet data entries

class ProteinNetRecord(id, primary, evolutionary=None, info_content=None, secondary=None, tertiary=None, mask=None, rama=None, rama_mask=None, chi=None, chi_mask=None, profiles=None)

Bases: object

Record from a ProteinNet dataset.

A record from a ProteinNet database, with support for torsion angles and additional per-position profiles (e.g. precalculated from a language model). The only required attributes are id and primary, although various others are derived from these on initialisation. Many attributes are identified from the ID, based on the ID format used in the main ProteinNet files. This will likely fail if other data is put into the same format, but should not reduce functionality much, as only a few specific functions use this data, which is mainly for information purposes.

id

Record ID

Type:: str

split

The data split the record comes from. Identified from the ID.

Type:: {‘training’, ‘testing’, ‘validation’}

record_class

Split the record comes from, for validation records. Identified from the ID.

Type:: str

source

Source of the record. Identified from the ID.

Type:: str

casp_id

CASP ID of the record, for records in the test set sourced from CASP entries. Identified from the ID.

Type:: str

astral_id

Astral ID of the record, for those sourced from Astral. Identified from the ID.

Type:: str

pdb_id

PDB ID of the record, for those deriving from a PDB entry. Identified from the ID.

Type:: str

pdb_chain_number

Numeric PDB Chain of the record, for those deriving from a PDB entry. Identified from the ID.

Type:: str

pdb_chain

Alphabetical PDB Chain of the record, for those deriving from a PDB entry. Identified from the ID.

Type:: str

evolutionary

Variant frequencies across a multiple sequence alignment for this protein.

Type:: float ndarray (20, N)

info_content

Information content of the MSA at each position.

Type:: float ndarray (N,)

primary

Protein sequence in single letter code form.

Type:: U1 ndarray (N,)

primary_ind

Protein sequence in integer form, based on the index of amino acids in single letter alphabetical order (see record.AMINO_ACIDS and record.AA_HASH).

Type:: int ndarray (N,)

secondary

Protein secondary structure (currently not included in the dataset)

Type:: ndarray

tertiary

Residue atom coordinates. Rows are x, y, z Cartesian coordinates. Each residue occupies 3 columns, for the N, CA and C’ backbone atoms. This means the atoms of the residue at position i start in column 3i, and a matrix of N/CA/C’ atom coordinates can be extracted with indices in the arithmetic series 3x + c, with c = 0 for N, c = 1 for CA or c = 2 for C’.

Type:: float ndarray (3, 3N)
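
The column layout described for tertiary can be illustrated with plain NumPy slicing. This is a minimal sketch on a made-up array standing in for record.tertiary; the variable names are illustrative and not part of the package.

```python
import numpy as np

# Hypothetical stand-in for record.tertiary with N = 4 residues: shape (3, 3N)
n_residues = 4
tertiary = np.arange(3 * 3 * n_residues, dtype=float).reshape(3, 3 * n_residues)

# Columns follow the series 3x + c: c = 0 for N, c = 1 for CA, c = 2 for C'
n_coords = tertiary[:, 0::3]   # backbone N atoms, shape (3, N)
ca_coords = tertiary[:, 1::3]  # C-alpha atoms, shape (3, N)
c_coords = tertiary[:, 2::3]   # backbone C' atoms, shape (3, N)

# All three backbone atoms of residue i sit in columns 3i to 3i + 2
i = 2
residue_atoms = tertiary[:, 3 * i:3 * i + 3]
```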

mask

Mask indicating which residues have structural information. Residues marked with 1 have information present. The mask needs to be tripled to apply to the tertiary structure array. None if tertiary is None.

Type:: int ndarray (N,)
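
One way to triple the mask so that it lines up with the tertiary columns is to repeat each entry three times, which matches the per-residue column layout described for tertiary. The sketch below uses made-up arrays rather than a real record, and assumes this repeat-based interpretation.

```python
import numpy as np

# Hypothetical per-residue mask for N = 4 residues (1 = structure present)
mask = np.array([1, 0, 1, 1])

# Repeat each entry 3 times so every N, CA and C' column gets its residue's flag
atom_mask = np.repeat(mask, 3)

# Stand-in coordinates of shape (3, 3N); keep only columns with structural data
tertiary = np.ones((3, 3 * len(mask)))
resolved = tertiary[:, atom_mask.astype(bool)]
```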

rama

Backbone angles for each residue, calculated from tertiary. The rows (first index) represent omega, phi and psi angles. Either in radians or normalised between -1 and 1.

Type:: float ndarray (3, N)

rama_mask

Mask indicating which residues have backbone angles, with 1 indicating information is present. None if rama is None.

Type:: int ndarray (N,)

chi

Chi1 side chain rotamer conformation. Either in radians or normalised between -1 and 1. This requires more information than tertiary provides so can only be calculated from the full structural model (for example with the add_angles_to_proteinnet script).

Type:: float ndarray (N,)

chi_mask

Mask indicating which residues have rotamer angles. None if chi is None.

Type:: int ndarray (N,)

profiles

Additional profiles for each amino acid position. Can contain any additional information the user requires.

Type:: ndarray (X, N)

_normalised_angles

Torsion angles have been normalised to vary between -1 and 1, rather than -pi to pi.

Type:: bool

calculate_backbone_angles()

Calculate Omega, Phi, and Psi backbone angles and set the rama attribute.

Calculate Omega, Phi, and Psi backbone angles from the tertiary structure included in ProteinNet, and set the results as the rama attribute. The rama_mask attribute is also calculated and set.
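
Each backbone angle is a dihedral defined by four consecutive backbone atoms. The function below is a generic dihedral calculation for illustration only; it is not taken from the package, and how the package assigns omega to a residue or handles chain ends and masked residues is not shown here.

```python
import numpy as np

def dihedral(p0, p1, p2, p3):
    """Signed dihedral angle in radians defined by four points."""
    b0 = p0 - p1
    b1 = p2 - p1
    b2 = p3 - p2
    b1 = b1 / np.linalg.norm(b1)  # unit vector along the central bond
    # Components of the outer bonds perpendicular to the central bond
    v = b0 - np.dot(b0, b1) * b1
    w = b2 - np.dot(b2, b1) * b1
    x = np.dot(v, w)
    y = np.dot(np.cross(b1, v), w)
    return np.arctan2(y, x)

# Toy points: a planar trans arrangement and a right-angle arrangement
p0 = np.array([0.0, 1.0, 0.0])
p1 = np.array([0.0, 0.0, 0.0])
p2 = np.array([1.0, 0.0, 0.0])
trans_angle = dihedral(p0, p1, p2, np.array([1.0, -1.0, 0.0]))
right_angle = dihedral(p0, p1, p2, np.array([1.0, 0.0, 1.0]))
```

Under the standard definition, phi for residue i would use C’ of residue i - 1 followed by N, CA and C’ of residue i, all of which can be read from the tertiary columns.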

distance_matrix()

Calculate the distance matrix between residue C-alpha atoms.

Calculate the distance matrix between residue C-alpha atoms. ProteinNet coordinates, and therefore the distance matrix, are in nanometers.

Returns:: Distance matrix where X[i,j] gives the distance in nanometers between residue i and j.
Return type:: float ndarray (N, N)
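
A pairwise C-alpha distance matrix can be sketched in NumPy as follows. This is an illustrative reimplementation on a toy array, not the package's own code; the function name ca_distance_matrix is hypothetical, and the sketch ignores the mask, so unresolved residues would need separate handling.

```python
import numpy as np

def ca_distance_matrix(tertiary):
    """Pairwise Euclidean distances between C-alpha atoms of a (3, 3N) array."""
    ca = tertiary[:, 1::3].T                 # (N, 3): one row per residue
    diff = ca[:, None, :] - ca[None, :, :]   # (N, N, 3) coordinate differences
    return np.sqrt((diff ** 2).sum(axis=-1))

# Toy structure with two residues whose C-alpha atoms form a 3-4-5 triangle
tertiary = np.zeros((3, 6))
tertiary[:, 4] = [3.0, 4.0, 0.0]  # CA of the second residue
dist = ca_distance_matrix(tertiary)
```

The result is symmetric with zeros on the diagonal, and its units are whatever units the input coordinates use.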

enumerate_sequence(aa_hash={'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4, 'G': 5, 'H': 6, 'I': 7, 'K': 8, 'L': 9, 'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14, 'S': 15, 'T': 16, 'V': 17, 'W': 18, 'Y': 19})

Generate a numeric representation of the sequence with each amino acid represented by an integer index.

Generate a numeric representation of the sequence with each amino acid represented by an integer index. The default indices are in single letter code alphabetical order. This function is used in __init__ to generate primary_ind.

Parameters:: aa_hash (dict, optional) – Dictionary mapping single letter codes to ints. Can be replaced to interface with a model that enumerates amino acids differently. For example Unirep orders amino acids alphabetically by full name.
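
The mapping performed with aa_hash can be sketched without the package. The default dictionary below reproduces the one shown in the signature; the standalone enumerate function and the alternative ordering are hypothetical and only demonstrate swapping the dictionary.

```python
import numpy as np

# Single letter codes in alphabetical order, matching the default aa_hash
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_HASH = {aa: i for i, aa in enumerate(AMINO_ACIDS)}

def enumerate_residues(primary, aa_hash=AA_HASH):
    """Map an array of single letter codes to integer indices."""
    return np.array([aa_hash[aa] for aa in primary], dtype=int)

primary = np.array(list("MKV"))
default_ind = enumerate_residues(primary)

# Hypothetical alternative ordering (reversed alphabet), purely for illustration
custom_hash = {aa: i for i, aa in enumerate(reversed(AMINO_ACIDS))}
custom_ind = enumerate_residues(primary, aa_hash=custom_hash)
```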

get_one_hot_sequence(aa_hash={'A': 0, 'C': 1, 'D': 2, 'E': 3, 'F': 4, 'G': 5, 'H': 6, 'I': 7, 'K': 8, 'L': 9, 'M': 10, 'N': 11, 'P': 12, 'Q': 13, 'R': 14, 'S': 15, 'T': 16, 'V': 17, 'W': 18, 'Y': 19})

Generate a 1-hot encoded matrix of the protein's sequence.

Generate a 1-hot encoded matrix of the protein's sequence. The default indices are in single letter code alphabetical order, which can be altered to feed into models expecting different orders.

Parameters:: aa_hash (dict, optional) – Dictionary mapping single letter codes to ints. Can be replaced to interface with a model that enumerates amino acids differently. For example Unirep orders amino acids alphabetically by full name.
Returns:: One-hot encoded primary sequence for the record. Matrix with 20 rows representing each amino acid. Each column has a single 1 in the row of its amino acid.
Return type:: int ndarray (20, N)
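
The shape and layout of the returned matrix can be illustrated with a small NumPy sketch. The one_hot function here is a hypothetical standalone equivalent, not the package's implementation.

```python
import numpy as np

AA_HASH = {aa: i for i, aa in enumerate("ACDEFGHIKLMNPQRSTVWY")}

def one_hot(primary, aa_hash=AA_HASH):
    """Build a (20, N) matrix with a single 1 per column."""
    rows = [aa_hash[aa] for aa in primary]
    cols = np.arange(len(rows))
    matrix = np.zeros((20, len(rows)), dtype=int)
    matrix[rows, cols] = 1  # row = amino acid index, column = sequence position
    return matrix

encoded = one_hot(np.array(list("MKV")))
```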

normalise_angles(factor=3.141592653589793)

Normalise backbone and chi angles to between -1 and 1.

Normalise backbone and chi angles to between -1 and 1. Also sets a flag indicating that the angles have been normalised, and does nothing if this flag is already set, to prevent normalising twice.

Parameters:: factor (numeric) – Factor to normalise angles by. It will not generally be useful to change this since the package naturally works in radians between -pi and pi, but may be useful if you need to work with angles in other formats.
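
The normalise-once behaviour described above can be sketched with a small stand-in class. AngleHolder is hypothetical and only mimics the rama and _normalised_angles attributes; it assumes normalisation is a division by factor, consistent with mapping -pi..pi onto -1..1.

```python
import numpy as np

class AngleHolder:
    """Hypothetical stand-in holding angles in radians and a normalisation flag."""

    def __init__(self, rama):
        self.rama = np.asarray(rama, dtype=float)
        self._normalised_angles = False

    def normalise_angles(self, factor=np.pi):
        # Do nothing if already normalised, so repeated calls are safe
        if self._normalised_angles:
            return
        self.rama = self.rama / factor
        self._normalised_angles = True

holder = AngleHolder([[np.pi, -np.pi / 2, 0.0]])
holder.normalise_angles()
holder.normalise_angles()  # second call leaves the values unchanged
```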