Structural Motif Search

Biochemical and biological functions of proteins are the product of both the overall fold of the polypeptide chain, and, typically, structural motifs made up of smaller numbers of amino acids constituting a catalytic center or a binding site that may be remote from one another in amino acid sequence. Detection of such structural motifs can provide valuable insights into the function(s) of previously uncharacterized proteins.

Technically, this remains an extremely challenging problem because of the size of the Protein Data Bank (PDB) archive. We have developed a new approach that uses an inverted index strategy capable of analyzing >200,000 PDB structures with unmatched speed. The efficiency of our inverted index method depends critically on identifying the small number of structures containing the query motif and ignoring most of the structures that are irrelevant. Our approach enables real-time retrieval and superposition of structural motifs, either extracted from a reference structure or uploaded by the user.

See it in Action

Structural motif searching is available as part of the RCSB Advanced Search and RCSB Mol* plugin. Help documentation is available.

Performance

Current benchmark times to search in 208,702 PDB structures and 1,068,577 AlphaFold/RoseTTAFold predictions as of 8/16/23, obtained on an instance with 6 cores and 64 GB memory. All structure data is held in memory, inverted index data is read from an SSD.

Allowing only Experimental/Archived Structures

Motif	Definition	Found Assemblies	'Paths' Time [ms]	'Score' Time [ms]	Total Time [ms]
Serine Protease	4cha - His:B-42, Asp:B-87, Ser:C-47	5,309	618	22	673
Aminopeptidase	1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334	91	158	1	181
Zinc Fingers	1g2f - Cys:F-7 His:F-25 His:F-29	739	135	3	160
Enolase Superfamily	2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295	192	253	2	275
Enolase Superfamily (exchanges)	2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295	210	2,996	14	3,032
RNA G-Quadruplex	3ibk - G:A-4, G:A-10, G:B-4, G:B-10	85	2,364	236	2,622

Including Computed Structure Models

Motif	Definition	Found Assemblies	'Paths' Time [ms]	'Score' Time [ms]	Total Time [ms]
Serine Protease	4cha - His:B-42, Asp:B-87, Ser:C-47	10,254	1,710	125	1,988
Aminopeptidase	1lap - Lys:A-250, Asp:A-255, Asp:A-273, Asp:A-332, Glu:A-334	647	352	7	389
Zinc Fingers	1g2f - Cys:F-7 His:F-25 His:F-29	9,442	492	92	686
Enolase Superfamily	2mnr - Lys:A-162, Asp:A-193, Glu:A-219, Glu:A-245, His:A-295	328	659	5	689
Enolase Superfamily (exchanges)	2mnr - Lys/His:A-162, Asp:A-193, Glu:A-219, Glu/Asp/Asn:A-245, His/Lys:A-295	350	5,246	25	5,296
RNA G-Quadruplex	3ibk - G:A-4, G:A-10, G:B-4, G:B-10	85	2,453	253	2,742

Search for all assemblies that contain hits with an RMSD <2 Å. 'Paths' refers to the time spent on inverted index operations, which identify all candidate structures that contain the motif. 'Score' refers to the time spent on aligning candidate structures to the query and computing RMSD values.

Computed structure models ignore unreliable regions with pLDDT <70.

Features

nucleotide support
inter-chain & assembly support
position-specific exchanges
modified residues
support for computed structure models, like from AlphaFold
detect motifs in a structure of interest

Getting Started with a Dependency

strucmotif-search is distributed by maven and supports Java 11+. To get started, append your pom.xml by:

<dependency> <groupId>org.rcsb</groupId> <artifactId>strucmotif-search</artifactId> <version>0.22.0</version> </dependency>

Getting Started by Cloning

An alternative way to use the library is cloning this repository and building the corresponding Maven modules.

Search for Similar Structures by A Single Motif

The Strucmotif class provides a fluent API to process structural motif queries.

Strucmotif.searchForStructures() // several ways can be used to define the query motif - e.g., specify a PDB entry id .defineByPdbIdAndSelection("4cha", // and a collection of sequence positions to extract residues to use as motif List.of(new LabelSelection("B", "1", 42), // HIS new LabelSelection("B", "1", 87), // ASP new LabelSelection("C", "1", 47))) // SER .rmsdCutoff(1.0) .buildParameters() .buildContext() .run() .getHits() .stream() .map(hit -> hit.structureIdentifier() + "_" + hit.assemblyIdentifier() + " @ " + hit.labelSelections() + " - RMSD: " + hit.rmsd()) .forEach(System.out::println);

Detect if a Structure Contains Motifs of Interest

This process can also be reversed to detect whether a structure of unknown function contains characteristic motifs.

// acquire a collection of motifs to screen for Set<EnrichedMotifDefinition> motifs = Strucmotif.getMotifDefinitionRegistry().getEnrichedMotifDefinitions(); Strucmotif.detectMotifs() .defineByPdbIdAndAssemblyId("2mnr", "1") .withMotifs(motifs) .rmsdCutoff(1.0) .buildParameters() .buildContext() .run() .getHits() .stream() .map(hit -> hit.motifIdentifier() + " @ " + hit.labelSelections() + " - RMSD: " + hit.rmsd()) .forEach(System.out::println);

Configuration

Property	Action	Default Value/Behavior
`ccd-url`	URL to the chemical component dictionary	wwPDB
`decimal-places-score`	Number of decimal places reported for scores	`2`
`decimal-places-matrix`	Number of decimal places reported in transformation matrices	`3`
`in-memory-strategy`	Preload structure data for increased performance?	`off`
`loading-chunk-size`	Batch size when holding structure data in memory	`200,000`
`max-results`	Maximum number of results that will be returned	`50,000`
`max-motif-size`	Maximum number of residues that may define a motif	`10`
`per-query-threads`	Number of worker threads per query	available processors
`read-error-strategy`	Behavior upon file bundle read error	`exit`
`query-timeout`	Interrupt queries after `n` milliseconds	`none`
`root-path`	Path where data files will be written	`/opt/data/`

Configure by placing your application.properties on the classpath. All properties specific to this project must be prefixed with strucmotif..

ccd-url provides a reference to the Chemical Component Dictionary. This file is used to handle non-standard amino acids appropriately if the modified-residue-strategy property is set to CCD_PARENT. This property controls how non-standard components are handled and the CCD provides a parent component for most modified residues, which can be used to map them to an amino acid that is known to the codebase (i.e., registered as ResidueType). Note that this property is both needed during the update and at runtime to resolve residue types.

Index Structure Data and Run Updates

You will need to process your corpus of structure data before using the service. This will create an optimized version of all structure files and add them to an inverted index that allows efficient searching.

Details can be found in: UPDATE.md

Files Needed to Run Service

The update produces 5 files that are needed for the service to run:

File	Format	Details
known.list	human-readable TSV	current holdings, incl. identifiers of all index entries and their revision history
renumbered.ffindex	human-readable TSV	summary of all optimized 3D structure data as .bcif.gz
renumbered.data	BinaryCIF files	all optimized 3D structure data, concatenated into one file, separated by `\0`
index.ffindex	human-readable TSV	summary of all inverted index files (one per present residue-pair descriptor)
index.data	colfer files	all individual index files, concatenated into one file, separated by `\0`

Note that these files can't be mixed-and-matched. They contain cross-references and if you update or manipulate one, you'll need to edit all other files to ensure consistency. File bundles of .ffindex and .data can be read and manipulated using ffindex-java or similar implementations.

known.list contains tab-separated 4 values per line: entryId (original identifier), structureIndex (internal int identifier), majorRevision, and minorRevision (tracking the revision history and informing what needs updating).

.ffindex files contain 3 tab-separated values per line: filename, offset (long value that captures where this file starts), and length (filesize in bytes).

Implementation Details

Addressing Structures

Entries might have prohibitively long String identifiers. The library uses an internally managed StructureIndex to address individual structures by a simple int value. This mapping is informed by the content of known.list and established by StructureIndexProvider.

Addressing Residues

Two address schemes exist. LabelSelection is a high-level, object-based way of referencing individual residues. It uses a combination of mmCIF properties, namely label_asym_id, struct_oper_id, and label_seq_id:

LabelSelection ref = new LabelSelection("A", "1", 123);

Internally, access is facilitated using 32-bit unsigned primitive encoded integers (ResidueIndex). It doesn't follow any particular layout rather, all encountered residues are addressed by their index. Chain boundaries are ignored. Operations required for assemblies are honored as they occur in the source file and merely increment the counter. Additional work preserves information on chains and assemblies. Chain and operator names as well as boundaries are stored in memory and can be used to reconstruct LabelSelection instances if needed.

Residue pairs are identified by pairs of these int values. They can be stored as long value by chaining together 1st and 2nd value.

Residue Pair Descriptor

Residue pair descriptors capture the label_comp_id of both interacting residue, their backbone distance, their side-chain distance, and the angle defined between both.

They might look like AL-4-5-4, meaning that an alanine is in contact with a leucine, the distance of their CA atoms is in bin 4, i.e. [4.5, 5.5) A, the distance between their CB atoms is in bin 5, i.e. [5.5, 6.5) A, and the angle between both per-amino-acid vectors is in bin 4, i.e. [70, 90) deg.

These values are the Cartesian product of ResidueType (A, 36 states, 6 bits) x ResidueType (B, 36 states, 6 bits) x DistanceType (C, 32 states, 5 bits) x DistanceType (D, 32 states, 5 bits) x AngleType (E, 10 states, 4 bits) and are stored in an unsigned 32-bit integer. The 32-bit descriptors will use their 4th bit to store metadata (M) that tracks whether the identifier is flipped.

XXXMAAAA AABBBBBB XXCCCCCD DDDDEEEE

A second flavor exists that only tracks DistanceType x DistanceType x AngleType and can be held in an unsigned 16-bit short.

XXCCCCCD DDDDEEEE

Convenience functions to work with these descriptors are provided in the ResiduePairDescriptor class and provide ways to convert between the int representation and its human-readable form.

The Inverted Index

The inverted index is structured by residue pair descriptors. All residue pair descriptors with the same geometric properties (e.g., AL-4-5-4) are grouped together in one file.

Each file holds 3 pieces of data:

int[] structureIndices - references each entry that contains at least 1 occurrence of this residue pair (see StructureIndex above)
int[] positionOffsets - defines boundaries between data on individual structures
int[] identifierData - pairs of residues within a specific structure of a certain residue pair descriptor (see ResidueIndex above)

This data structure is encoded using colfer. This produces a lot of small-ish files. These files are bundled using ffindex-java.

Related Projects

ciftools-java: mmCIF parsing and BinaryCIF implementation
ffindex-java: bundle large amounts of small files together
rcsb-molstar: define motifs in 3D and visualize results

Publication

Bittrich S, Burley SK, Rose AS (2020) Real-time structural motif searching in proteins using an inverted index strategy. PLoS Comput Biol 16(12): e1008502. https://doi.org/10.1371/journal.pcbi.1008502

Name		Name	Last commit message	Last commit date
Latest commit History 994 Commits
.github/workflows		.github/workflows
.idea		.idea
strucmotif-search-benchmark		strucmotif-search-benchmark
strucmotif-search-core		strucmotif-search-core
strucmotif-search-update		strucmotif-search-update
.gitattributes		.gitattributes
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
LICENSE.md		LICENSE.md
README.md		README.md
motifs.png		motifs.png
pom.xml		pom.xml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Repository files navigation

Structural Motif Search

See it in Action

Performance

Allowing only Experimental/Archived Structures

Including Computed Structure Models

Features

Getting Started with a Dependency

Getting Started by Cloning

Search for Similar Structures by A Single Motif

Detect if a Structure Contains Motifs of Interest

Configuration

Index Structure Data and Run Updates

Files Needed to Run Service

Implementation Details

Addressing Structures

Addressing Residues

Residue Pair Descriptor

The Inverted Index

Related Projects

Publication

About

Uh oh!

Releases

Packages

Uh oh!

Contributors 3

Uh oh!

Languages

License

rcsb/strucmotif-search

Folders and files

Latest commit

History

Repository files navigation

Structural Motif Search

See it in Action

Performance

Allowing only Experimental/Archived Structures

Including Computed Structure Models

Features

Getting Started with a Dependency

Getting Started by Cloning

Search for Similar Structures by A Single Motif

Detect if a Structure Contains Motifs of Interest

Configuration

Index Structure Data and Run Updates

Files Needed to Run Service

Implementation Details

Addressing Structures

Addressing Residues

Residue Pair Descriptor

The Inverted Index

Related Projects

Publication

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors 3

Uh oh!

Languages

Packages