proteinnetpy.data
Module containing methods and classes to work with ProteinNet Data
- class LabeledFunction(func, output_types, output_shapes)
Bases:
object
Function labeled with output shape and type.
Functions labeled with their output shapes and data types, for passing to consumers such as tf.data.Dataset objects that need to know types and shapes in advance to initialise neural networks.
- func
Function called.
- Type:
Function
- output_types
Potentially nested tuple of strings describing the data types output by the function. These should be in the form recognised by tf.as_dtype to use with TensorFlow datasets.
- Type:
tuple
- output_shapes
Potentially nested tuple of integer/None lists that describe the array/tensor shapes output by the function. These should be in the form recognised by tf.TensorShape to use with TensorFlow datasets.
- Type:
tuple
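To make the role of the three attributes concrete, here is a minimal sketch of a labeled function. It does not import proteinnetpy: LabeledFunctionSketch is a hypothetical stand-in that only mirrors the documented attributes, and making the wrapper callable is an assumption rather than confirmed library behaviour.

```python
class LabeledFunctionSketch:
    """Hypothetical stand-in mirroring the documented attributes."""

    def __init__(self, func, output_types, output_shapes):
        self.func = func
        self.output_types = output_types
        self.output_shapes = output_shapes

    def __call__(self, *args, **kwargs):
        # Assumption: calling the wrapper forwards to the wrapped function.
        return self.func(*args, **kwargs)


def length_and_charges(sequence):
    """Toy function returning a scalar and a variable-length list."""
    charges = [1.0 if aa in "KR" else 0.0 for aa in sequence]
    return len(sequence), charges


labeled = LabeledFunctionSketch(
    length_and_charges,
    output_types=("int32", "float32"),  # strings in the form tf.as_dtype accepts
    output_shapes=([], [None]),         # scalar, then a 1D array of unknown length
)

length, charges = labeled("MKTAR")
```

The labels travel with the function, so code that builds a dataset can read labeled.output_types and labeled.output_shapes without calling the function first.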
- class ProteinNetDataset(path=None, data=None, filter_func=None, preload=True, **kwargs)
Bases:
object
Iterable container for ProteinNet records.
An iterable container for ProteinNet records, allowing looping over entries as record.ProteinNetRecord objects. It supports filtering, len() and indexing. Data can be loaded into memory or streamed during iteration to balance speed and RAM usage.
- path
Path to the ProteinNet file.
- Type:
str
- data
List of record.ProteinNetRecord objects if preload=True, else None, in which case records are loaded during iteration.
- Type:
list or None
- filter_func
Function returning a truthy value for the records to keep in the dataset.
- preload
Data is loaded into memory rather than streamed on iteration.
- Type:
bool
- parser_args
Dictionary of keyword arguments to pass to the record parser.
- Type:
dict
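The trade-off between preloading and streaming can be illustrated with a small stand-in container. This is a sketch under stated assumptions, not the library's implementation: RecordContainerSketch and its loader argument (a callable returning a fresh iterator of records) are hypothetical, plain strings stand in for records, and indexing is omitted.

```python
class RecordContainerSketch:
    """Hypothetical container illustrating preload versus streaming."""

    def __init__(self, loader, filter_func=None, preload=True):
        self.loader = loader            # hypothetical: returns a fresh record iterator
        self.filter_func = filter_func
        self.preload = preload
        # Preloading stores the filtered records; streaming stores nothing.
        self.data = list(self._stream()) if preload else None

    def _stream(self):
        for rec in self.loader():
            if self.filter_func is None or self.filter_func(rec):
                yield rec

    def __iter__(self):
        return iter(self.data) if self.preload else self._stream()

    def __len__(self):
        if self.preload:
            return len(self.data)
        # Streaming has to re-read the source to count the records.
        return sum(1 for _ in self._stream())


def toy_loader():
    return iter(["MKT", "MKTAYIAK", "M"])


def at_least_three(rec):
    return len(rec) >= 3


preloaded = RecordContainerSketch(toy_loader, at_least_three, preload=True)
streamed = RecordContainerSketch(toy_loader, at_least_three, preload=False)
```

Both containers yield the same records. The preloaded one answers len() and repeated loops from memory, while the streamed one re-reads its source each time and keeps data as None.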
- class ProteinNetMap(data, func, filter_errors=True, static=False)
Bases:
object
Map a function over a ProteinNetDataset.
Map a function over ProteinNetRecords, set up to interface with neural network training loops expecting a generator. Results can be stored to maximise speed on additional iterations, or the calculation can be repeated each time to minimise memory usage or to generate novel results on each iteration if the mapped function is stochastic.
- data
ProteinNetDataset mapped over if _static=False or the calculated result if _static=True.
- Type:
ProteinNetDataset
- func
Function mapped over the records.
- Type:
Function or LabeledFunction
- filter_errors
Records raising an error are skipped with a warning, rather than stopping the map.
- Type:
bool
- _static
The output is the same on each loop over the dataset.
- Type:
bool
- generate()
Apply func to each record in the dataset.
Apply func to each record in the dataset, yielding the results as a generator. If static=True this simply iterates over the precalculated results. This interface is provided as an entry point for functions expecting a generator function to call to access data, for example TensorFlow datasets.
- Yields:
Variable – Result of applying self.func to a ProteinNetRecord. See x.func.output_types and x.func.output_shapes to determine the types.
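The difference between a static and a non-static map is easiest to see with a stochastic function. The following is a minimal sketch, not the library's code: MapSketch is a hypothetical stand-in, plain strings stand in for records, and the filter_errors behaviour is left out.

```python
import random


class MapSketch:
    """Hypothetical stand-in showing static versus repeated mapping."""

    def __init__(self, data, func, static=False):
        self.func = func
        self._static = static
        # With static=True the results are computed once and stored
        # in place of the records.
        self.data = [func(rec) for rec in data] if static else data

    def generate(self):
        if self._static:
            yield from self.data
        else:
            for rec in self.data:
                yield self.func(rec)


rng = random.Random(0)


def noisy_length(sequence):
    """Stochastic toy function: sequence length plus random noise."""
    return len(sequence) + rng.random()


records = ["MKT", "MKTAYIAK"]
static_map = MapSketch(records, noisy_length, static=True)
dynamic_map = MapSketch(records, noisy_length, static=False)

static_first = list(static_map.generate())
static_second = list(static_map.generate())
dynamic_first = list(dynamic_map.generate())
dynamic_second = list(dynamic_map.generate())
```

The static map returns identical values on every pass, while the non-static map recomputes them and so draws fresh noise each time. Because generate is a method that returns a new generator on each call, it can be handed as a callable to consumers such as tf.data.Dataset.from_generator.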
- combine_filters(*args)
Combine a series of filters into a single function.
Create a function that applies multiple filter functions to a record and returns True only if they all do.
- Parameters:
*args (Functions) – Functions taking a ProteinNetRecord and returning True to keep it or False to filter it.
- Returns:
A function that applies all the functions in *args and returns the logical AND of their output.
- Return type:
Function
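The logical AND described above can be sketched in a few lines. This illustrates the behaviour rather than the library's implementation: combine_filters_sketch and the two toy filters are hypothetical, and strings stand in for records.

```python
def combine_filters_sketch(*filters):
    """Return a function that is True only if every filter is True."""
    def combined(record):
        # all() stops at the first failing filter; the real function
        # may or may not short-circuit in the same way.
        return all(f(record) for f in filters)
    return combined


def long_enough(sequence):
    return len(sequence) >= 3


def starts_with_met(sequence):
    return sequence.startswith("M")


keep = combine_filters_sketch(long_enough, starts_with_met)
kept = [s for s in ["MKT", "AKT", "M", "MKTAYIAK"] if keep(s)]
```

In this sketch, combining zero filters gives a function that keeps every record, since all() of an empty sequence is True.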
- make_id_filter(pdb_ids, pdb_chains)
Generate a dataset filter only allowing specific PDB ID/Chains.
- Parameters:
pdb_ids (list) – List of PDB IDs to accept.
pdb_chains (list) – List of PDB chains corresponding to the IDs in pdb_ids.
- Returns:
A function returning True for the given PDB ID/Chain and False otherwise.
- Return type:
Function
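One plausible reading of the description is that IDs and chains are matched as pairs, position by position. The sketch below follows that reading, which is an assumption. ToyRecord and its pdb_id and pdb_chain attribute names are hypothetical, and the identifiers are toy values.

```python
from collections import namedtuple

# Hypothetical record type; the attribute names are assumptions.
ToyRecord = namedtuple("ToyRecord", ["pdb_id", "pdb_chain"])


def make_id_filter_sketch(pdb_ids, pdb_chains):
    """Build a filter that accepts only the listed (ID, chain) pairs."""
    allowed = set(zip(pdb_ids, pdb_chains))

    def id_filter(record):
        return (record.pdb_id, record.pdb_chain) in allowed

    return id_filter


id_filter = make_id_filter_sketch(["1ABC", "2XYZ"], ["A", "B"])

matches_pair = id_filter(ToyRecord("1ABC", "A"))
wrong_chain = id_filter(ToyRecord("1ABC", "B"))
unknown_id = id_filter(ToyRecord("9ZZZ", "A"))
```

Under this pairing assumption, a record whose ID is listed but whose chain belongs to a different entry is rejected.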
- make_length_filter(min_length=None, max_length=None)
Generate a filter function checking if a ProteinNetRecord is within length bounds.
Generate a filter function checking if a ProteinNetRecord’s sequence is within length bounds. Boundaries are inclusive, meaning sequences whose length equals a boundary are kept.
- Parameters:
min_length (int or None) – Minimum acceptable length. If None no lower bound is applied.
max_length (int or None) – Maximum acceptable length. If None no upper bound is applied.
- Returns:
A function returning True if min_length <= len(record) <= max_length.
- Return type:
Function
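A minimal sketch of the boundary handling follows. It assumes, as stated above, that sequences sitting exactly on a bound are kept. make_length_filter_sketch is a hypothetical stand-in and strings stand in for records.

```python
def make_length_filter_sketch(min_length=None, max_length=None):
    """Build a filter keeping records whose length is within the bounds."""
    def length_filter(record):
        n = len(record)
        if min_length is not None and n < min_length:
            return False
        if max_length is not None and n > max_length:
            return False
        return True
    return length_filter


between_3_and_5 = make_length_filter_sketch(min_length=3, max_length=5)
results = {seq: between_3_and_5(seq) for seq in ["MK", "MKT", "MKTAY", "MKTAYI"]}

unbounded = make_length_filter_sketch()
```

Passing None for either argument removes that bound, so the filter built with no arguments accepts every record.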
- make_mask_filter(min_rama_prop=0, min_chi_prop=0, min_tertiary_prop=0)
Create a filter function requiring a minimum proportion of structural information to be present.
Create a filter function requiring at least the given proportion of positions to have structural information for the given types. Ramachandran and Chi1 angles can be added to records using the record.ProteinNetRecord.calculate_backbone_angles method or the add_angles_to_proteinnet script.
- Parameters:
min_rama_prop (Float) – Minimum proportion of positions with backbone angle information present.
min_chi_prop (Float) – Minimum proportion of positions with Chi1 variable group angle information present.
min_tertiary_prop (Float) – Minimum proportion of positions with tertiary structure coordinate information present.
- Returns:
A function returning True if all minimum proportions are met.
- Return type:
Function
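The proportion check can be sketched with 0/1 masks, where 1 marks a position that has the relevant structural information. This is an illustration only: the attribute names rama_mask, chi_mask and tertiary_mask are hypothetical, and treating a proportion equal to the minimum as passing is an assumption.

```python
from types import SimpleNamespace


def proportion_present(mask):
    """Fraction of positions marked 1; an empty mask counts as 0."""
    return sum(mask) / len(mask) if mask else 0.0


def make_mask_filter_sketch(min_rama_prop=0, min_chi_prop=0, min_tertiary_prop=0):
    """Build a filter requiring minimum proportions of structural data."""
    def mask_filter(record):
        return (
            proportion_present(record.rama_mask) >= min_rama_prop
            and proportion_present(record.chi_mask) >= min_chi_prop
            and proportion_present(record.tertiary_mask) >= min_tertiary_prop
        )
    return mask_filter


# Hypothetical record: 75% backbone angles, 25% Chi1 angles, 100% coordinates.
toy_record = SimpleNamespace(
    rama_mask=[1, 1, 1, 0],
    chi_mask=[1, 0, 0, 0],
    tertiary_mask=[1, 1, 1, 1],
)

passes_rama = make_mask_filter_sketch(min_rama_prop=0.5)(toy_record)
fails_chi = make_mask_filter_sketch(min_chi_prop=0.5)(toy_record)
passes_default = make_mask_filter_sketch()(toy_record)
```

With the default minimum of 0 for every type, the sketch keeps all records, so each threshold only has an effect when it is raised.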
- make_nan_filter(rama=True, chi=True, profiles=False)
Generate filter function checking fields for NaN values.
Generate filter function checking fields for NaN values, which can cause numerical issues for downstream analysis. Other Record data should not contain such values, but the function can be easily extended to check other features if necessary.
- Parameters:
rama (bool) – Check if any backbone angles are NaN.
chi (bool) – Check if any side chain Chi angles are NaN.
profiles (bool) – Check if positional profiles contain any NaN values.
- Returns:
A function returning False if NaN values are present and True otherwise.
- Return type:
Function
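The three switches can be illustrated with a small stand-in that uses flat lists of floats. It is a sketch under assumptions rather than the library's code: the rama, chi and profiles attribute names are hypothetical, real records are likely to hold arrays rather than lists, and a missing field (None) is treated here as containing no NaN values.

```python
import math
from types import SimpleNamespace


def has_nan(values):
    """True if the field is present and any of its values is NaN."""
    return values is not None and any(math.isnan(v) for v in values)


def make_nan_filter_sketch(rama=True, chi=True, profiles=False):
    """Build a filter rejecting records with NaN in the checked fields."""
    def nan_filter(record):
        if rama and has_nan(record.rama):
            return False
        if chi and has_nan(record.chi):
            return False
        if profiles and has_nan(record.profiles):
            return False
        return True
    return nan_filter


clean = SimpleNamespace(rama=[-1.0, 2.1], chi=[0.5, 1.2], profiles=None)
bad_chi = SimpleNamespace(rama=[-1.0, 2.1], chi=[0.5, float("nan")], profiles=None)
bad_profile = SimpleNamespace(rama=[-1.0], chi=[0.5], profiles=[float("nan")])

default_filter = make_nan_filter_sketch()
```

Only the fields switched on are inspected, so a record with NaN values in its profiles still passes the default filter in this sketch.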
- profile_filter(rec)
Filter records without profiles.
Filter records without profiles, which are additional numerical data associated with each position in the sequence. These are added by the user and can correspond to, for example, the output of language models such as UniRep or AminoBERT.
- Parameters:
rec (record.Record) – ProteinNetRecord to test.
- Returns:
True if profiles is not None, else False.
- Return type:
bool