proteinnetpy.mutation

Module containing functions for mutating ProteinNetRecords and feeding that data into further computations (e.g. TensorFlow). These functions are fairly specific, so they may often be more useful as inspiration for building your own solutions.

class ProteinNetMutator(mutator, per_position=False, include=('wt',), weights=(0, 1, 1), encoding=None, **kwargs)

Bases: LabeledFunction

Map function generating mutated records.

Apply a mutator function to a ProteinNet record and return the mutated sequence. This is a LabeledFunction that can be used to generate a TensorFlow Dataset. The setup is fairly specific to the downstream model design, so it will often be more useful as a base for an alternate implementation.

Returns are in the form:

([wt_seq], mut_seq, [phi, psi, chi1]), label, [weights]

wildtype

Outputs wildtype as well as mutant sequence.

Type:

bool

phi

Outputs Phi backbone angles.

Type:

bool

psi

Outputs Psi backbone angles.

Type:

bool

chi

Outputs rotamer angles.

Type:

bool

mutator

Mutator function taking a ProteinNetRecord and returning the sampled variants and their deleteriousness. The return format depends on per_position. If per_position=False, it must return a tuple of the mutated sequence index array and whether it is deleterious (1/0). If per_position=True, it must return a tuple of mutant_seq, deleterious_inds and neutral_inds arrays.

Type:

function

kwargs

Keyword arguments passed to the mutator function.

Type:

dict

encoding

Encoding mapping alphabetically encoded integer indices to a new scheme.

Type:

dict

weights

List of float weights for WT, Deleterious and Neutral variants when mutating per position.

Type:

list

func

Function applied when the class is called. This is a mutator applied to the whole sequence or per position derived from the initialisation parameters.

Type:

function

output_shapes, output_types

Tuple of output shapes and types (see data.LabeledFunction for details).

Type:

tuple
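The per-position contract and the weights attribute can be illustrated with a small NumPy sketch. This is not the class's implementation: the helper name per_position_targets and the label convention (1 for a deleterious variant, 0 otherwise) are assumptions made for illustration only.

```python
import numpy as np

def per_position_targets(seq_len, deleterious_inds, neutral_inds, weights=(0, 1, 1)):
    """Hypothetical helper: build per-position labels and sample weights.

    Assumed convention: label 1 marks a deleterious variant, 0 anything else.
    Weights are ordered (WT, deleterious, neutral), as in the weights attribute.
    """
    wt_w, del_w, neut_w = weights
    deleterious_inds = np.asarray(deleterious_inds, dtype=int)
    neutral_inds = np.asarray(neutral_inds, dtype=int)

    labels = np.zeros(seq_len, dtype=int)
    sample_weights = np.full(seq_len, wt_w, dtype=float)  # unmutated positions get the WT weight

    labels[deleterious_inds] = 1
    sample_weights[deleterious_inds] = del_w
    sample_weights[neutral_inds] = neut_w
    return labels, sample_weights

# A 6-residue sequence with one deleterious and two neutral variants
labels, sample_weights = per_position_targets(6, deleterious_inds=[1], neutral_inds=[3, 4])
```

With the default weights of (0, 1, 1), unmutated positions receive zero weight, so under this assumed convention only mutated positions would contribute to a weighted loss.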

per_position_mutator(record, max_deleterious=2, max_neutral=4, max_deleterious_freq=0.01, min_neutral_freq=0.1)

Generate mutated sequences from ProteinNetRecords with labels identifying deleterious and neutral mutations.

Generate mutated sequences from ProteinNetRecords with labels identifying where deleterious and neutral mutations have been made. Will always generate at least one variant.

Parameters:
  • record (ProteinNetRecord) – Record to mutate.

  • max_deleterious (int) – Maximum number of deleterious variants to make.

  • max_neutral (int) – Maximum number of neutral variants to make.

  • max_deleterious_freq (float) – Maximum MSA frequency for a variant to be considered deleterious.

  • min_neutral_freq (float) – Minimum MSA frequency for a variant to be considered neutral.

Returns:

Tuple of the format seq, deleterious, neutral. The first entry is the mutated sequence, the second a list of positions with deleterious variants and the third a list of positions with neutral variants.

Return type:

tuple
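The two frequency thresholds can be read as splitting possible substitutions into deleterious candidates, neutral candidates and an ambiguous remainder. The sketch below shows that reading with NumPy. The helper classify_substitutions is hypothetical, and whether the library uses strict or inclusive comparisons is an assumption here.

```python
import numpy as np

def classify_substitutions(freqs, wt_seq, max_deleterious_freq=0.01, min_neutral_freq=0.1):
    """Hypothetical helper: boolean (20, N) masks of candidate substitutions.

    freqs is a (20, N) frequency matrix, wt_seq an (N,) array of row indices.
    Strict comparisons are an assumption, not taken from the library.
    """
    freqs = np.asarray(freqs, dtype=float)
    wt_seq = np.asarray(wt_seq, dtype=int)

    not_wt = np.ones(freqs.shape, dtype=bool)
    not_wt[wt_seq, np.arange(freqs.shape[1])] = False  # a substitution must change the residue

    deleterious = (freqs < max_deleterious_freq) & not_wt
    neutral = (freqs > min_neutral_freq) & not_wt
    return deleterious, neutral

# Illustrative, unnormalised frequencies for a 3-residue protein
freqs = np.full((20, 3), 0.02)
freqs[0, :] = 0.5    # the WT residue (row 0) dominates every column
freqs[1, 0] = 0.12   # a frequent alternative at position 0
freqs[2, 0] = 0.0    # a never-observed alternative at position 0
deleterious, neutral = classify_substitutions(freqs, wt_seq=[0, 0, 0])
```

Substitutions with frequencies between the two thresholds (0.02 in this toy matrix) fall into neither mask, which is why the two thresholds are separate parameters rather than a single cutoff.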

sample_deleterious(num, pssm, wt_seq, max_freq=0.025, mask=None)

Sample deleterious mutations from an MSA frequency matrix.

Randomly choose a selection of deleterious variants from an MSA frequency matrix.

Parameters:
  • num (int) – Number of mutations to make.

  • pssm (float ndarray (20, N)) – MSA frequency matrix to determine neutral and deleterious variants.

  • wt_seq (int ndarray (N,)) – WT sequence of the protein (as int indices corresponding to the MSA matrix rows).

  • max_freq (float) – Maximum frequency considered deleterious.

  • mask (int array_like) – Array of positions not to mutate.

Returns:

NumPy array of position indices chosen and an array of the alternate amino acid in each position (as MSA row indices).

Return type:

tuple
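As a rough sketch of this kind of sampling, the function below picks positions that have at least one low-frequency alternative and then chooses one such alternative per position. It is a hypothetical stand-in written for illustration, not proteinnetpy's sample_deleterious. The name sample_low_frequency, the rng argument, the uniform choice among candidates and the reading of mask as a list of position indices are all assumptions.

```python
import numpy as np

def sample_low_frequency(num, freqs, wt_seq, max_freq=0.025, mask=None, rng=None):
    """Hypothetical stand-in: sample up to num rare substitutions."""
    rng = np.random.default_rng() if rng is None else rng
    freqs = np.asarray(freqs, dtype=float)
    wt_seq = np.asarray(wt_seq, dtype=int)
    n_pos = freqs.shape[1]

    candidates = freqs < max_freq
    candidates[wt_seq, np.arange(n_pos)] = False  # never "mutate" to the WT residue
    if mask is not None:
        candidates[:, np.asarray(mask, dtype=int)] = False  # skip masked positions

    # Positions that still have at least one candidate substitution
    positions = np.flatnonzero(candidates.any(axis=0))
    chosen = rng.choice(positions, size=min(num, positions.size), replace=False)
    chosen.sort()

    # One alternative residue per chosen position, uniform among its candidates
    alts = np.array(
        [rng.choice(np.flatnonzero(candidates[:, p])) for p in chosen], dtype=int
    )
    return chosen, alts

# Row 3 is rare everywhere, but it is the WT residue at position 2
freqs = np.full((20, 5), 0.05)
freqs[3, :] = 0.0
chosen, alts = sample_low_frequency(10, freqs, wt_seq=[0, 0, 3, 0, 0], mask=[4])
```

In this toy input only positions 0, 1 and 3 are eligible: position 2 has no rare non-WT residue and position 4 is masked. Requesting more mutations than there are eligible positions returns fewer than num in this sketch.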

sample_neutral(num, pssm, wt_seq, min_freq=0.025, mask=None)

Sample neutral mutations from an MSA frequency matrix (pssm).

Parameters:
  • num (int) – Number of mutations to make.

  • pssm (float ndarray (20, N)) – MSA frequency matrix to determine neutral and deleterious variants.

  • wt_seq (int ndarray (N,)) – WT sequence of the protein (as int indices corresponding to the MSA matrix rows).

  • min_freq (float) – Minimum frequency considered neutral.

  • mask (int array_like) – Array of positions not to mutate.

Returns:

NumPy array of position indices chosen and an array of the alternate amino acid in each position (as MSA row indices).

Return type:

tuple

sequence_mutator(record, p_deleterious=0.5, max_mutations=3, max_deleterious=0.01, min_neutral=0.1)

Generate mutated sequences from a ProteinNetRecord with a few deleterious or neutral variants.

Generate mutated sequences from a ProteinNetRecord with a few deleterious and/or neutral variants. First randomly choose whether to generate a deleterious or a neutral sequence, then sample some of the corresponding variant types based on the record's MSA frequencies.

Parameters:
  • record (ProteinNetRecord) – Record to mutate.

  • p_deleterious (float) – Probability of returning a deleterious set of variants.

  • max_mutations (int) – Maximum number of mutations to make.

  • max_deleterious (float) – Maximum MSA frequency for a variant to be considered deleterious.

  • min_neutral (float) – Minimum MSA frequency for a variant to be considered neutral.

Returns:

Tuple of the format (seq, deleterious). The first entry is the mutated amino acid sequence, encoded with integer indices, and the second is 1 if the sequence is deleterious and 0 if neutral.

Return type:

tuple
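The two-stage procedure described above (pick a class, then sample variants of that class) can be sketched on plain NumPy arrays. This is an illustrative stand-in rather than the library's sequence_mutator: it takes an integer WT sequence and a (20, N) frequency matrix directly instead of a ProteinNetRecord, and the name toy_sequence_mutator, the rng argument and the uniform draw of the mutation count are assumptions.

```python
import numpy as np

def toy_sequence_mutator(wt_seq, freqs, p_deleterious=0.5, max_mutations=3,
                         max_deleterious=0.01, min_neutral=0.1, rng=None):
    """Hypothetical sketch: return (mutated integer sequence, 1/0 label)."""
    rng = np.random.default_rng() if rng is None else rng
    wt_seq = np.asarray(wt_seq, dtype=int)
    freqs = np.asarray(freqs, dtype=float)

    # Stage 1: decide which class of sequence to generate
    deleterious = int(rng.random() < p_deleterious)

    # Stage 2: candidate substitutions of that class, excluding the WT residue
    candidates = (freqs < max_deleterious) if deleterious else (freqs > min_neutral)
    candidates[wt_seq, np.arange(wt_seq.size)] = False
    positions = np.flatnonzero(candidates.any(axis=0))

    # Assumed: the number of mutations is uniform on 1..max_mutations
    num = min(int(rng.integers(1, max_mutations + 1)), positions.size)
    chosen = rng.choice(positions, size=num, replace=False)

    mut_seq = wt_seq.copy()
    for p in chosen:
        mut_seq[p] = rng.choice(np.flatnonzero(candidates[:, p]))
    # Note: with no candidates the sequence is returned unchanged; a real
    # implementation would need to handle that case explicitly.
    return mut_seq, deleterious

# Row 0 is WT, row 1 a common alternative, row 2 a never-observed one
freqs = np.full((20, 4), 0.05)
freqs[0, :], freqs[1, :], freqs[2, :] = 0.6, 0.2, 0.0
wt = np.zeros(4, dtype=int)
del_seq, del_label = toy_sequence_mutator(wt, freqs, p_deleterious=1.0)
neut_seq, neut_label = toy_sequence_mutator(wt, freqs, p_deleterious=0.0)
```

Setting p_deleterious to 1.0 or 0.0 forces one class, which is a convenient way to check that the two thresholds select the substitutions you expect before training on the mixed output.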