proteinnetpy.data

Module containing methods and classes for working with ProteinNet data.

class LabeledFunction(func, output_types, output_shapes)

Bases: object

Function labeled with output shape and type.

Functions labeled with their output shape and data type, for passing to objects such as tf.data.Dataset that need to know types and shapes to initialise neural networks.

func

Function called.

Type:

Function

output_types

Potentially nested tuple of strings describing the data types output by the function. These should be in the form recognised by tf.as_dtype to use with TensorFlow datasets.

Type:

tuple

output_shapes

Potentially nested tuple of integer/None lists that describe the array/tensor shapes output by the function. These should be in the form recognised by tf.TensorShape to use with TensorFlow datasets.

Type:

tuple
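
As an illustration of the idea, the following is a minimal sketch of a callable that carries its output types and shapes as attributes. LabeledFunctionSketch and sequence_length are hypothetical names, and forwarding calls through __call__ is an assumption; this is not the library's implementation.

```python
class LabeledFunctionSketch:
    """Hypothetical stand-in: a function bundled with output type/shape labels."""

    def __init__(self, func, output_types, output_shapes):
        self.func = func
        self.output_types = output_types
        self.output_shapes = output_shapes

    def __call__(self, *args, **kwargs):
        # Assumption: calling the object forwards to the wrapped function
        return self.func(*args, **kwargs)


def sequence_length(seq):
    # Toy function returning a 1-tuple holding a scalar integer
    return (len(seq),)


labeled = LabeledFunctionSketch(
    sequence_length,
    output_types=("int32",),  # dtype name in string form
    output_shapes=([],),      # an empty list describes a scalar
)
result = labeled("MKV")
```

A consumer can then read labeled.output_types and labeled.output_shapes without calling the function first.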

class ProteinNetDataset(path=None, data=None, filter_func=None, preload=True, **kwargs)

Bases: object

Iterable container for ProteinNet records.

An iterable container for ProteinNet records, allowing looping over entries as record.ProteinNetRecord objects. It supports filtering, len() and indexing. Data can be loaded into memory or streamed during iteration to balance speed and RAM usage.

path

Path to the ProteinNet file.

Type:

str

data

List of record.ProteinNetRecord objects if preload=True, else None, in which case records are loaded during iteration.

Type:

list or None

filter_func

Function returning a truthy value for each record to keep in the dataset.

preload

Data is loaded into memory rather than streamed on iteration.

Type:

bool

parser_args

Dictionary of keyword arguments to pass to the record parser.

Type:

dict

class ProteinNetMap(data, func, filter_errors=True, static=False)

Bases: object

Map a function over a ProteinNetDataset.

Map a function over ProteinNetRecords, set up to interface with neural network training loops expecting a generator. Results can be stored to maximise speed on additional iterations, or the calculation can be repeated each time to minimise memory usage or to generate novel results on each iteration if the mapped function is stochastic.

data

ProteinNetDataset mapped over if _static=False or the calculated result if _static=True.

Type:

ProteinNetDataset

func

Function mapped over the records.

Type:

Function or LabeledFunction

filter_errors

Records raising an error are skipped with a warning, rather than stopping the map.

Type:

bool

_static

The output is the same on each loop over the dataset.

Type:

bool

generate()

Apply func to each record in the dataset.

Apply func to each record in the dataset, yielding the results as a generator. If static=True this simply iterates over the precalculated results. This interface is provided as an entry point for code that expects a generator function to call to access data, for example TensorFlow datasets.

Yields:

Variable – Result of applying self.func to a ProteinNetRecord. See x.func.output_types and x.func.output_shapes to determine the types.
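
The trade-off between stored and recomputed results can be sketched as follows. This is a simplified illustration under stated assumptions, not the library's code: MapSketch and tens_per_residue are hypothetical names, and plain strings stand in for ProteinNetRecord objects.

```python
import warnings


class MapSketch:
    """Hypothetical stand-in for a map with optionally stored results."""

    def __init__(self, data, func, filter_errors=True, static=False):
        self.func = func
        self.filter_errors = filter_errors
        self._static = static
        # static=True: compute once and keep the results in memory
        self.data = list(self._apply(data)) if static else data

    def _apply(self, records):
        for record in records:
            try:
                yield self.func(record)
            except Exception as err:
                if not self.filter_errors:
                    raise
                # Skip the failing record with a warning
                warnings.warn(f"Skipping record: {err!r}")

    def generate(self):
        if self._static:
            # Replay the stored results
            yield from self.data
        else:
            # Recompute on every pass over the data
            yield from self._apply(self.data)


def tens_per_residue(seq):
    # Toy function; raises ZeroDivisionError for an empty sequence
    return 10 // len(seq)


mapped = MapSketch(["MKV", "", "MKVLA"], tens_per_residue)
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    results = list(mapped.generate())
```

Here the empty string raises inside the mapped function, so with filter_errors=True it is skipped and a warning is recorded rather than the loop stopping.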

combine_filters(*args)

Combine a series of filters into a single function.

Create a function that applies multiple filter functions to a record and returns True only if they all do.

Parameters:

*args (Functions) – Functions taking a ProteinNetRecord and returning True to keep it or False to filter it.

Returns:

A function that applies all the functions in *args and returns the logical AND of their output.

Return type:

Function
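
The logical-AND behaviour described above can be sketched in a few lines. combine_filters_sketch and the two toy filters are hypothetical names, and strings stand in for records; whether the library short-circuits as all() does here is an assumption.

```python
def combine_filters_sketch(*filters):
    """Return a function that is True only if every filter is True."""

    def combined(record):
        # all() stops at the first filter that rejects the record
        return all(f(record) for f in filters)

    return combined


def starts_with_met(seq):
    # Toy filter: keep sequences beginning with methionine
    return seq.startswith("M")


def at_least_three(seq):
    # Toy filter: keep sequences of length 3 or more
    return len(seq) >= 3


keep = combine_filters_sketch(starts_with_met, at_least_three)
```

With no filters supplied, all() over an empty sequence is True, so this sketch keeps every record in that case.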

make_id_filter(pdb_ids, pdb_chains)

Generate a dataset filter only allowing specific PDB ID/Chains.

Parameters:
  • pdb_ids (list) – List of PDB IDs to accept.

  • pdb_chains (list) – List of PDB chains corresponding to the IDs in pdb_ids.

Returns:

A function returning True for the given PDB ID/Chain and False otherwise.

Return type:

Function
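
Since pdb_chains is matched to pdb_ids by position, the filter effectively tests membership of an (ID, chain) pair. The sketch below illustrates this under stated assumptions: ToyRecord and its pdb_id and pdb_chain attributes are hypothetical stand-ins, and the function is not the library's implementation.

```python
from collections import namedtuple

# Hypothetical stand-in for a record carrying a PDB ID and chain
ToyRecord = namedtuple("ToyRecord", ["pdb_id", "pdb_chain"])


def make_id_filter_sketch(pdb_ids, pdb_chains):
    """Return a filter accepting only the listed (ID, chain) pairs."""
    # Pair each ID with the chain at the same position
    allowed = set(zip(pdb_ids, pdb_chains))

    def id_filter(record):
        return (record.pdb_id, record.pdb_chain) in allowed

    return id_filter


keep = make_id_filter_sketch(["1ABC", "2XYZ"], ["A", "B"])
```

In this sketch chain B of 1ABC is rejected, because only the pairs (1ABC, A) and (2XYZ, B) were listed.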

make_length_filter(min_length=None, max_length=None)

Generate a filter function checking if a ProteinNetRecord is within length bounds.

Generate a filter function checking if a ProteinNetRecord’s sequence is within length bounds. Boundaries are inclusive, meaning sequences at the boundaries are included.

Parameters:
  • min_length (int or None) – Minimum acceptable length. If None no lower bound is applied.

  • max_length (int or None) – Maximum acceptable length. If None no upper bound is applied.

Returns:

A function returning True if min_length <= len(record) <= max_length.

Return type:

Function
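
The inclusive bounds and the None defaults can be illustrated with a small sketch. make_length_filter_sketch is a hypothetical name and strings stand in for records; it mirrors the documented behaviour but is not the library's code.

```python
def make_length_filter_sketch(min_length=None, max_length=None):
    """Return a filter checking len(record) against inclusive bounds."""

    def length_filter(record):
        n = len(record)
        # A bound of None is not applied
        if min_length is not None and n < min_length:
            return False
        if max_length is not None and n > max_length:
            return False
        return True

    return length_filter


between_3_and_5 = make_length_filter_sketch(min_length=3, max_length=5)
unbounded = make_length_filter_sketch()
```

Sequences of exactly 3 or exactly 5 residues pass between_3_and_5, while unbounded accepts any length.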

make_mask_filter(min_rama_prop=0, min_chi_prop=0, min_tertiary_prop=0)

Create a filter function requiring a minimum proportion of structural information to be present.

Create a filter function requiring at least the given proportion of positions to have structural information for the given types. Ramachandran and Chi1 angles can be added to records using the record.ProteinNetRecord.calculate_backbone_angles method or the add_angles_to_proteinnet script.

Parameters:
  • min_rama_prop (Float) – Minimum proportion of positions with backbone angle information present.

  • min_chi_prop (Float) – Minimum proportion of positions with Chi1 variable group angle information present.

  • min_tertiary_prop (Float) – Minimum proportion of positions with tertiary structure coordinate information present.

Returns:

A function returning True if all minimum proportions are met.

Return type:

Function
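
As a rough sketch of the proportion check, assume each kind of structural information has a per-position mask of 1 (present) or 0 (missing). ToyRecord, its rama_mask, chi_mask and tertiary_mask attributes, and the proportion helper are all hypothetical; how the library stores masks and treats records without them is not covered here.

```python
from collections import namedtuple

# Hypothetical record holding one 0/1 mask per information type
ToyRecord = namedtuple("ToyRecord", ["rama_mask", "chi_mask", "tertiary_mask"])


def proportion(mask):
    # Fraction of positions flagged as present; empty masks count as 0
    return sum(mask) / len(mask) if mask else 0.0


def make_mask_filter_sketch(min_rama_prop=0, min_chi_prop=0, min_tertiary_prop=0):
    """Return a filter requiring each proportion to reach its minimum."""

    def mask_filter(record):
        return (
            proportion(record.rama_mask) >= min_rama_prop
            and proportion(record.chi_mask) >= min_chi_prop
            and proportion(record.tertiary_mask) >= min_tertiary_prop
        )

    return mask_filter


rec = ToyRecord(rama_mask=[1, 1, 0, 0], chi_mask=[1, 0, 0, 0], tertiary_mask=[1, 1, 1, 1])
half_rama = make_mask_filter_sketch(min_rama_prop=0.5)
half_chi = make_mask_filter_sketch(min_chi_prop=0.5)
```

In this toy record half of the positions have backbone angles and a quarter have Chi1 angles, so half_rama accepts it and half_chi rejects it.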

make_nan_filter(rama=True, chi=True, profiles=False)

Generate filter function checking fields for NaN values.

Generate filter function checking fields for NaN values, which can cause numerical issues for downstream analysis. Other Record data should not contain such values, but the function can be easily extended to check other features if necessary.

Parameters:
  • rama (bool) – Check if any backbone angles are NaN.

  • chi (bool) – Check if any side chain Chi angles are NaN.

  • profiles (bool) – Check if positional profiles contain any NaN values.

Returns:

A function returning False if NaN values are present and True otherwise.

Return type:

Function
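
The flag-controlled NaN check can be sketched as below, using flat lists of floats in place of the arrays a real record would hold. ToyRecord, its rama, chi and profiles attributes, and has_nan are hypothetical names chosen for illustration; treating a missing field as NaN-free is an assumption.

```python
import math
from collections import namedtuple

# Hypothetical record with flat lists of floats for each field
ToyRecord = namedtuple("ToyRecord", ["rama", "chi", "profiles"])


def has_nan(values):
    # Assumption: a missing field (None) contains no NaN values
    return values is not None and any(math.isnan(v) for v in values)


def make_nan_filter_sketch(rama=True, chi=True, profiles=False):
    """Return a filter that is False if any checked field holds a NaN."""

    def nan_filter(record):
        if rama and has_nan(record.rama):
            return False
        if chi and has_nan(record.chi):
            return False
        if profiles and has_nan(record.profiles):
            return False
        return True

    return nan_filter


clean = ToyRecord(rama=[-1.0, 2.1], chi=[0.5, 1.2], profiles=None)
bad_chi = ToyRecord(rama=[-1.0, 2.1], chi=[float("nan"), 1.2], profiles=None)
bad_profiles = ToyRecord(rama=[-1.0], chi=[0.5], profiles=[float("nan")])
default_filter = make_nan_filter_sketch()
```

With the default flags only the angle fields are checked, so bad_profiles passes default_filter unless profiles=True is requested.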

profile_filter(rec)

Filter records without profiles.

Filter records without profiles, which are additional numerical data associated with each position in the sequence. These are added by the user and can hold, for example, the output of language models such as UniRep or AminoBert.

Parameters:

rec (record.Record) – ProteinNetRecord to test.

Returns:

True if profiles is not None, else False.

Return type:

bool