Conserved Domains and Protein Classification
 
   

This document includes help for the Conserved Domain Database (CDD) and the CD-Search Tool. Both resources can be used to help elucidate protein function. The data continue to evolve as research progresses. Comments about the data are welcome and can be sent to info@ncbi.nlm.nih.gov. The "How To" page provides quick start guides for some common types of searches.
Once records of interest are retrieved, follow Entrez's "Links" to discover associations among previously disparate data.

 
     
Conserved Domain Database (CDD) Help

 
BRIEF TABLE OF CONTENTS
 
  Database content
Source databases
Data processing
Unique features
Search tips
Protein query sequence
Text term search
Proteins → conserved domains
Domain architectures
Search results
Links to related data
Conserved domain record display
Text Summary
Structures
Conserved features
Multiple sequence alignment
References
 
 
 

What is a conserved domain?
Thumbnail image: 3D structure of type-1 insulin-like growth-factor receptor (IGF-1R), viewed in the free Cn3D structure viewing program and colored by domain.

3-D structures and conserved core motifs:
Thumbnail image: example of 3-dimensional structure, showing Cl- binding residues in Voltage-Gated Chloride Channel, cd00400.

Conserved features (binding and catalytic sites):
Thumbnail image: examples of Conserved Features (Sites) in Voltage-Gated Chloride Channel, cd00400, including Cl- selectivity filter, pore-gating glutamate residue, Cl- binding residues, and dimer interface.

Domain family hierarchies:
Thumbnail image: domain hierarchy showing divergence in a protein family based on phylogenetic relationships of protein sequences and functional properties.

CD-Search Help

CD-Search Results: Concise Display: top-scoring hits only
Thumbnail image: CD-Search results concise display (default), which shows only the top-scoring hits for each region of the query sequence.

CD-Search Results: Full Display: all hits
Thumbnail image: CD-Search results full display, which shows all hits on each region of the query sequence.

CD-Search Results: Small Triangles represent conserved features/sites
Thumbnail image: small triangles shown in CD-Search results. The triangles point to specific residues involved in conserved features, such as binding and catalytic sites, as mapped from a conserved domain to the query protein sequence.

Specific Hits must meet or exceed domain-specific threshold score
Thumbnail image: method for determining the domain-specific E-value threshold score for RPS-BLAST. Each protein sequence that was used to curate a domain model is RPS-BLASTed against the domain model's PSSM. The highest (i.e., weakest) E-value among the member sequences is the domain-specific threshold score. If a protein query sequence is RPS-BLASTed against CDD and receives an E-value equal to or lower than the threshold, that protein is considered a specific hit.
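
The threshold rule for specific hits can be sketched in a few lines of Python. This is only an illustration of the comparison described above: the function names and the E-values are made up for the example, and the real E-values come from RPS-BLAST runs that are not reproduced here.

```python
def domain_specific_threshold(member_evalues):
    """Return the weakest (numerically highest) E-value among the member sequences.

    member_evalues: E-values obtained when each member sequence of a domain
    model is searched against that model's PSSM (hypothetical input).
    """
    if not member_evalues:
        raise ValueError("at least one member E-value is required")
    return max(member_evalues)


def is_specific_hit(query_evalue, threshold):
    """A query is a specific hit if its E-value is equal to or lower than the threshold."""
    return query_evalue <= threshold


# Made-up E-values for three member sequences of one domain model.
threshold = domain_specific_threshold([1e-40, 3e-25, 5e-18])
print(threshold, is_specific_hit(1e-20, threshold), is_specific_hit(1e-10, threshold))
```

Lower E-values indicate stronger matches, which is why the weakest member match is the numerically largest value.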

 
 
Conserved Domain Database

What is a conserved domain?

Domains can be thought of as distinct functional and/or structural units of a protein. The two classifications often coincide: a unit of a polypeptide chain that folds independently frequently also carries a specific function. Domains are often identified as recurring (sequence or structure) units, which may exist in various contexts. The image below illustrates four "domains" identified as structural units in MMDB entry 1IGR, chain A, as segments colored in magenta, blue, brown, and green.

In molecular evolution such domains may have been utilized as building blocks, and may have been recombined in different arrangements to modulate protein function. We define conserved domains as recurring units in molecular evolution, the extents of which can be determined by sequence and structure analysis.

Conserved domains contain conserved sequence patterns or motifs, which allow for their detection in polypeptide sequences. The distinction between domains and motifs is not sharp, however, especially in the case of short repetitive units. Functional motifs are also present outside the scope of structurally conserved domains. The CD database is not meant to systematically collect such motifs.

3D structure of type-1 insulin-like growth-factor receptor (IGF-1R), viewed in the free Cn3D structure viewing program and colored by domain.
For this query sequence, a good correspondence exists between structural units (3D domains), identified by purely geometric criteria, and units asserted to be evolutionarily conserved (domain families). The region annotated as "FU" (furin-repeat like) overlaps with a domain split that was suggested by the MMDB domain parser.

Click anywhere on the image to open the complete, interactive record for this protein structure (1IGR) in Cn3D, a free helper application available for Windows, Macintosh, and Unix platforms. Cn3D installation takes only a couple of minutes and a tutorial describes the program's features and functions.

Open the 1IGR structure summary record in the Molecular Modeling Database (MMDB) to access more information about the protein, its conserved domains, and ligands (small molecules). Click on a conserved domain or ligand of interest to view its complete information in the Conserved Domain Database or PubChem, respectively. Click on the colored bar representing a 3D domain to retrieve similar 3D structures.

View the CD-Search help document for more details about the program that was used to identify the conserved domains in the protein chain. The concise display of the conserved domains is shown here and includes specific hits, superfamilies, and multi-domains. (Open the actual CD-Search results to view alignments of the query sequence to a conserved domain's consensus sequence, and/or to access a full display of all domain models found.)


Multiple sequence alignments provide basis for conserved domain models

The two types of domains shown in the 1IGR illustration above -- 3D domains and conserved domains (or "domain families") -- often coincide with each other. However, because they represent two distinct types of data -- 3D structures and protein sequences, respectively -- they reside in two distinct databases: the Entrez 3D Domains database and the Conserved Domain Database (CDD). The former shows the spatial (X,Y,Z) coordinates of each atom in a 3D domain, while the latter shows the span and composition of a conserved protein sequence region.

Specifically, conserved domain models are based on multiple sequence alignments of related proteins spanning a variety of organisms to reveal sequence regions containing the same, or similar, patterns of amino acids. The illustration below provides an example, showing the multiple sequence alignment for the Furin-like domain, which is present in the Type 1 Insulin-like Growth Factor Receptor (1IGR) protein. Click anywhere on the image to open the complete, interactive CDD record for that domain model, cd00064. A separate section of this help document provides additional information about multiple sequence alignment display options.

In the CDD database, protein sequences from three-dimensional structures are included in domain models whenever possible. One goal of the NCBI conserved domain curation effort is to make multiple sequence alignments agree with what we can infer from three-dimensional structure and three-dimensional structure superposition, in order to understand sequence/structure/function relationships. The sequence-based domain models and corresponding 3D structures are also cross-referenced to each other through Entrez "Links" between CDD and structure records.

Multiple sequence alignment for the Furin-like Repeats domain model, cd00064, showing the amino acids that have been conserved among related proteins from a wide variety of organisms. Click anywhere on the image to open the complete, interactive CDD record for this domain model.
This illustration shows the multiple sequence alignment for the Furin-like domain, which is present in the Type 1 Insulin-like Growth Factor Receptor (1IGR) protein. To view the complete multiple sequence alignment for the Furin-like domain model, open the CDD domain summary record for CD00064: FU, Furin-Like Repeats or click anywhere on the image above.

Separate sections of this CDD help document provide additional details about the source databases from which domain models are collected, the conserved domain assembly process, including the generation of a position-specific scoring matrix (PSSM) for each domain model, and multiple sequence alignment display options.


Source Databases:  Where does CDD content come from?

Conserved Domains can be described by local multiple sequence alignments (illustration) spanning a variety of organisms to reveal sequence regions that contain the same, or similar, patterns of amino acids. Computational biologists from all over the world have compiled collections of such alignments representing conserved domains. CDD includes domains curated at NCBI as well as data imported from the external sources listed below.

The source databases differ in their scope of coverage and the method by which they develop their models. Therefore, each source database may have its own model for a given conserved domain, in addition to some domain models found only in that database. To provide a non-redundant view of the data, CDD clusters similar domain models from various sources into superfamilies. The data sources include:
NCBI-Curated Domains
NCBI-curated domains use 3D-structure information explicitly to define domain boundaries and aligned blocks, and to amend alignment details. More details about the unique features of NCBI-curated domains are below.

The goal of the curation project is to provide CDD users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. For example, the presence of conserved features helps to affirm family membership in search results with borderline significance. NCBI CDD Curators provide feature annotation and associated evidence in a computer-friendly way, so that the scientific community can build software tools for the automation of tasks such as annotation transfer.
External Data Sources
In addition, CDD imports data from five other major sources:
  • SMART, the Simple Modular Architecture Research Tool
  • Pfam, Pfam-A seed alignments from the Protein families database of alignments and HMMs
  • COGs, Clusters of Orthologous Groups of proteins
  • PRK, PRotein K(c)lusters
  • TIGRFAM, The Institute for Genomic Research's database of protein families, a research project of the J. Craig Venter Institute
Pfam is a large collection of multiple sequence alignments and hidden Markov models covering many common protein domains and families. Pfam is maintained by Alex Bateman and colleagues, mainly at the Wellcome Trust Sanger Institute. CDD contains a large fraction of the Pfam collection.
SMART is a web tool for the identification and annotation of protein domains, and provides a platform for the comparative study of complex domain architectures in genes and proteins. SMART is maintained by Chris Ponting, Peer Bork and colleagues, mainly at the EMBL Heidelberg. CDD contains a large fraction of the SMART collection.
COG (Clusters of Orthologous Groups) is an NCBI-curated protein classification resource. Sequence alignments corresponding to COGs are created automatically from constituent sequences and have not been validated manually when imported into CDD.
PRK (Protein Clusters) is an NCBI collection of related protein sequences (clusters) consisting of Reference Sequence proteins encoded by complete prokaryotic and chloroplast plasmids and genomes. It includes both curated and non-curated (automatically generated) clusters.
TIGRFAM is a collection of manually curated protein families from The Institute for Genomic Research and consists of hidden Markov models (HMMs), multiple sequence alignments, Gene Ontology (GO) terminology, cross-references to related models in TIGRFAM and other databases, and pointers to literature.

CDD also contains data from additional research projects, such as KOGs (a eukaryotic counterpart to COGs) and the Library of Ancient Domains (LOAD), contributed by L. Aravind, E. Koonin, and colleagues. These two data sets are accessible as a separate CD-Search database and on the FTP site, respectively, but are not directly searchable by text term in Entrez CDD.

The content of imported domain models is determined by the providers of the source database, with slight modifications made at NCBI to link a domain model's member sequences to corresponding, complete protein sequence and 3D structure records in Entrez databases, when possible. The method by which imported domain models are integrated into the CDD database is described in the CD assembly process section of this help document.
Accession Prefixes indicate data sources:
Source databases are evident from CD accessions:

Accession starts with    Source Database
cd                       Curated at NCBI
pfam                     Pfam
smart                    SMART
COG                      COGs
KOG                      KOGs (available as a separate search set via CD-Search (RPS-BLAST); not searchable by text term in Entrez)
PRK                      PRotein K(c)lusters (Entrez database)
CHL                      Chloroplast and organelle proteins; subset of the PRK database
MTH                      Mitochondrial proteins; subset of the PRK database
PHA                      Phage proteins; subset of the PRK database
PTZ                      Protozoan proteins; subset of the PRK database
TIGR                     TIGRFAM
LOAD_                    Library of Ancient Domains (LOAD) data set (available as a separate data set via FTP; not searchable by text term in Entrez)

Accessions that start with "cl" are for superfamily cluster records and can contain domain models from one or more source databases.

When searching CDD, it is possible to limit search results to domains from any given source database by using the Database Search Field.
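
As a small illustration of how the accession prefixes can be used in a script, the following Python sketch maps an accession to its source database. The prefix table is transcribed from this document; the function name is hypothetical, and matching is assumed to be case-sensitive exactly as the prefixes are written here.

```python
# Prefixes and sources as listed in this help document ("cl" marks superfamily clusters).
ACCESSION_PREFIXES = {
    "cd": "Curated at NCBI",
    "cl": "Superfamily cluster",
    "pfam": "Pfam",
    "smart": "SMART",
    "COG": "COGs",
    "KOG": "KOGs",
    "PRK": "PRotein K(c)lusters",
    "CHL": "PRK subset: chloroplast and organelle proteins",
    "MTH": "PRK subset: mitochondrial proteins",
    "PHA": "PRK subset: phage proteins",
    "PTZ": "PRK subset: protozoan proteins",
    "TIGR": "TIGRFAM",
    "LOAD_": "Library of Ancient Domains (LOAD)",
}


def source_database(accession):
    """Return the source database for an accession, or None if the prefix is unknown."""
    # Check longer prefixes first so a short prefix can never shadow a longer one.
    for prefix in sorted(ACCESSION_PREFIXES, key=len, reverse=True):
        if accession.startswith(prefix):
            return ACCESSION_PREFIXES[prefix]
    return None


print(source_database("cd00400"))  # NCBI-curated model mentioned in this document
print(source_database("cl02915"))  # superfamily cluster mentioned in this document
```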

CD Assembly Process:  How have CDs been assembled?

NCBI-curated domain models are assembled using the methods briefly described in the source databases section of this document. More details about the NCBI curation process are provided by Marchler-Bauer, et al. (2007). An example of a multiple sequence alignment on which a model is based is shown in an illustration of the Furin-like domain.

Domain models from external data sources are assembled by various methods, ranging from automated processing to manual curation, depending on the individual source database. Upon import into CDD, protein sequence alignments (illustration) from each of the source databases are processed in an automated way to provide links from each aligned sequence to the corresponding, complete record in the Entrez Protein database. Occasionally, sequences that cannot be identified in Entrez's databases are omitted or replaced with closely related matches. Whenever possible, sequences in Pfam, SMART, and COGs alignments are replaced with closely related sequences that have direct links to three-dimensional structures in the Molecular Modeling Database (MMDB).

A representative sequence is chosen for each domain model, preferably with a structure-link, for technical reasons. The representative sequence is generally shown as the first member of the multiple sequence alignment for a domain model. By default, this representative is the 3D structure shown when CD alignments are visualized with Cn3D.

A consensus sequence is computed from the imported alignments. Alignment columns have to be represented in at least 50% of all aligned sequences (weighted by diversity) to determine the extent of the consensus. The most frequently occurring residue in each column (after weighting to account for redundancy) is reported.
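
The consensus rule can be sketched roughly as follows. This is a simplified illustration, not the CDD implementation: the per-sequence weights are passed in as an assumption (this document does not say how diversity weights are derived), gaps are assumed to be written as "-", and ties between residues are broken alphabetically only to make the output deterministic.

```python
from collections import defaultdict


def consensus(alignment, weights=None, min_coverage=0.5):
    """Build a consensus from equal-length aligned rows ('-' marks a gap)."""
    if not alignment:
        return ""
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned rows must have the same length")
    if weights is None:
        weights = [1.0] * len(alignment)  # unweighted fallback (assumption)
    total = sum(weights)

    result = []
    for col in range(length):
        # Weighted count of each residue observed in this column.
        counts = defaultdict(float)
        for row, weight in zip(alignment, weights):
            if row[col] != "-":
                counts[row[col]] += weight
        covered = sum(counts.values())
        # Keep the column only if enough of the weighted sequences are represented.
        if counts and total > 0 and covered / total >= min_coverage:
            result.append(max(sorted(counts), key=counts.get))
    return "".join(result)


print(consensus(["ACD-", "ACE-", "A-EK", "GC--"]))
```

In the example, the last column is dropped because only one of four rows has a residue there, so it falls below the 50% coverage requirement.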

A position-specific scoring matrix (PSSM) is calculated for the extent of the consensus sequence. The PSSM profiles the various amino acids that were present in a given position of the multiple sequence alignment and how frequently each one was observed. The consensus sequence does not contribute to the residue frequency statistics. Each PSSM receives a unique identifier (PSSM ID).
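
To make the idea of a per-position profile concrete, the sketch below tabulates the observed residue frequencies in each alignment column. It is a deliberately reduced illustration: a real PSSM is typically expressed as log-odds scores derived from such frequencies, together with sequence weighting and pseudocounts, and none of those steps are shown here. The function name is hypothetical.

```python
from collections import Counter


def column_frequencies(alignment):
    """Return one dict per column mapping residue -> observed fraction (gaps ignored)."""
    length = len(alignment[0]) if alignment else 0
    profile = []
    for col in range(length):
        residues = [row[col] for row in alignment if row[col] != "-"]
        counts = Counter(residues)
        n = len(residues)
        # An all-gap column yields an empty dict.
        profile.append({aa: count / n for aa, count in counts.items()} if n else {})
    return profile


for position, freqs in enumerate(column_frequencies(["AC", "AD", "G-"]), start=1):
    print(position, freqs)
```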

A PSSM ID is the unique identifier for a domain model's position-specific scoring matrix. If a domain model's PSSM changes in any way as a result of updates to its multiple sequence alignment, it receives a new PSSM ID. This happens because a conserved domain model can evolve over time. For example, as new sequence data become available, the curators of a source database might add sequences to a multiple sequence alignment or update the sequences already present. As a result of such changes to the domain model, the PSSM and its ID can change.

Search databases compiled of these PSSMs are available through the CD-Search service (help document) and on the NCBI FTP site as collections of pre-computed RPS-BLAST databases that can be used for locally installed versions of that program.

What is unique about NCBI-curated domains?

Example of 3-dimensional structure: Cl- binding residues in Voltage-Gated Chloride Channel, cd00400.

As noted in the section on CDD data sources, the goal of the NCBI conserved domain curation project is to provide database users with insights into how patterns of residue conservation and divergence in a family relate to functional properties, and to provide useful links to more detailed information that may help to understand those sequence/structure/function relationships. To do this, CDD Curators include the following types of information in order to supplement and enrich the traditional multiple sequence alignments that form the foundation of domain models:

3-dimensional structures and conserved core motifs:   NCBI Conserved Domain Curators have re-evaluated and modified multiple sequence alignments imported from outside sources, and made them agree with what we can infer from three-dimensional structure and three-dimensional structure superposition. Curated alignments contain aligned blocks spanning all rows (with no gaps allowed inside blocks) and unaligned regions between blocks. The blocks are meant to represent conserved structural core motifs of the corresponding domain family. The 3D structures can be viewed interactively with the Cn3D structure viewing program. More information about viewing structures is provided in the section of this document on CD summary pages, and the illustration at the right provides an example of a protein structure that has been annotated by NCBI curators to highlight the Cl- binding residues.

Conserved features/sites:   In addition to working on the alignment model (illustration), NCBI curators also record, when possible, the location and nature of features conserved in the domain family. Typically these would describe catalytic residues, binding sites, or motifs commonly referred to in the literature.

Examples of Conserved Features (Sites) in Voltage-Gated Chloride Channel, cd00400, including Cl- selectivity filter, pore-gating glutamate residue, Cl- binding residues, and dimer interface.  Click anywhere on the image to open the complete, interactive record for this domain model in the Conserved Domain Database (CDD).

Features are added if they seem applicable to the family described in the CD's scope and if there is evidence linking the feature to a set of addresses on the alignment. Such evidence is recorded and available for inspection; it may be free-text comments, citations linked to PubMed, or "structure evidence" - exemplifying the existence of a site by highlighting an actual molecular complex, for example. Both features and evidence can be visualized on CD summary pages (in the conserved features/sites summary box and as hash marks (#) in the multiple sequence alignment displays), and with the Cn3D structure viewing program. An example is shown in the illustration at the right. In addition, the CD-Search tool can be used to identify conserved features in a query protein sequence, designated by small triangles in the search results graphical summary, when such features can be mapped from the conserved domain annotations to the query sequence.

Phylogenetic organization:   Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies. (A separate illustration and additional information are provided below.) The CDTree program used by NCBI curators can be downloaded in order to view NCBI-curated domains interactively and in greater detail.

Links to electronic literature resources:   NCBI curated domains also provide links to citations in PubMed and NCBI Bookshelf that discuss the domain. These references are selected by curators and, whenever possible, include articles that provide evidence for the biological function of the domain and/or discuss the evolution and classification of a domain family.

NCBI-curated domains can be recognized in CDD search results by their "cd" accession number prefix. It is also possible to limit CDD search results to domain models from any given source database by using the Database Search Field.

What is a domain family hierarchy?

A domain family hierarchy is a set of related domains that share a common ancestor, a common set of conserved residues, and a common general function, but differ from each other in their specific phylogeny, specific functions, and additional spans of conserved residues. Domain hierarchies are present in NCBI-curated domains in order to provide insights into how patterns of residue conservation and divergence in a family relate to functional properties.

Some domain families have only a single node, while others have a hierarchy that is two or more levels deep, sometimes with numerous nodes at each level. Such hierarchies have generic "parent" models and more specific "children". The parent node contains a span of conserved residues that is also present in each of the children. Each of the child nodes can have additional conserved residues that extend beyond that span and help to further characterize the members of the child node.

NCBI CDD Curators attempt to split "children" nodes where they see evidence for ancient gene duplications resulting in orthologous groups, often occurring together with functional divergence. The CDTree program used by NCBI curators can be downloaded in order to view NCBI-curated domains interactively and in greater detail, with or without a query sequence embedded.

Image of domain hierarchy showing divergence in a protein family based on phylogenetic relationships of protein sequences and functional properties. Click anywhere on the image to open an interactive view of the domain model in the Conserved Domain Database (CDD).
Click anywhere on the image to open the complete, interactive record for this domain model (cd00400) in the Conserved Domain Database (CDD).

What is a superfamily?

A superfamily cluster is a set of conserved domain models that generate overlapping annotation on the same protein sequences. These models are assumed to represent evolutionarily related domains and may be redundant with each other. A superfamily accession number begins with the prefix "cl" for "cluster".
Clustering methodology:
Superfamily members are clustered through an automated process that involves the following steps:
  1. identify domain models that have overlapping hits on sequences in the Entrez Protein database from at least two different sentinel taxonomic nodes (e.g., high level nodes such as flowering plants, conifers, mollusks, flatworms, roundworms, annelid worms, insects, amphibians, mammals, etc.).
  2. pull those domain models into the superfamily
  3. if any domain models are part of an NCBI-curated family hierarchy, pull in all members of the hierarchy
  4. repeat steps 1-3 for each newly added domain until no more new models are pulled in
NOTE: Multi-domain models that were computationally detected are not included in Superfamily clusters. These models are likely to contain multiple single domains and might falsely join superfamily clusters.
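
The iterative procedure above amounts to growing a cluster until no new models are pulled in. The Python sketch below illustrates that loop under stated assumptions: `overlaps` is a hypothetical precomputed map from each model to the models it overlaps with on sequences from at least two sentinel taxonomic nodes, `hierarchies` is a hypothetical list of NCBI-curated family hierarchies given as sets of accessions, and the accessions in the example are invented placeholders.

```python
def build_superfamily(seed, overlaps, hierarchies, multi_domain=frozenset()):
    """Grow a superfamily cluster from one seed model until it stops changing."""
    cluster = set()
    pending = [seed]
    while pending:
        model = pending.pop()
        # Computationally detected multi-domain models are never included.
        if model in cluster or model in multi_domain:
            continue
        cluster.add(model)
        # Steps 1-2: pull in models with qualifying overlapping hits.
        pending.extend(overlaps.get(model, ()))
        # Step 3: pull in every member of any curated hierarchy containing the model.
        for members in hierarchies:
            if model in members:
                pending.extend(members)
        # Step 4 is the loop itself: newly added models are processed in turn.
    return cluster


overlaps = {
    "cd_parent": {"pfam_x"},
    "pfam_x": {"cd_parent", "smart_y"},
    "smart_y": {"pfam_x"},
    "COG_z": {"multi_m"},
    "multi_m": {"COG_z", "cd_parent"},
}
hierarchies = [{"cd_parent", "cd_child"}]
print(sorted(build_superfamily("cd_parent", overlaps, hierarchies)))
```

Excluding the multi-domain model in the example keeps "COG_z" from being joined to the first cluster through it, which is the false join the note above warns about.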
Rationale:
Superfamilies provide a method for organizing data within CDD in a non-redundant way. CDD contains conserved domains from a number of different source databases, each of which may have its own model for a given conserved domain. The models might share many similarities in their reported residue conservation patterns, but differ in the specific protein sequences used in the multiple alignment, their footprint length (domain boundaries), and biological annotations. Because of the similarities, RPS-BLAST might find that multiple domain models align to the same general region of a query protein, but have different footprints and E-value scores relative to the query protein. If the footprints of two or more domain models overlap on the query, those models are clustered into the same superfamily, then the superfamily continues to be extended using the methodology described above.
Example:
One example of a superfamily is Cluster ID cl02915, which contains various domain models for the voltage-gated chloride channel. Superfamily members include the NCBI-curated domain cd00400 and all members of that family hierarchy plus domain models from external resources.
Selection of Superfamily Representative:
A superfamily can contain one to many domain models. As of spring 2008, approximately 70% of the ~9,000 superfamilies contain a single model and the rest contain multiple models. Single model superfamilies often represent proteins specific to certain organisms or taxonomic lineages (for example, viruses). The numbers of superfamilies containing single or multiple domain models will continue to evolve as new domains are imported and new NCBI-curated hierarchies are added.

In superfamilies containing multiple domain models, one of the models is selected as the source of the superfamily name and description. The representative is one of the following, listed in priority order:
  • the parent node of an NCBI-curated domain family hierarchy, if one is present in the superfamily cluster. In the few cases where a superfamily contains more than one NCBI-curated domain, the parent of the hierarchy with the largest number of sequence hits is chosen as the superfamily representative.
  • the Pfam domain model that hits the largest number of Entrez protein sequences in an RPS-BLAST search
  • the SMART, COG, PRK, or CHL model that hits the largest number of Entrez protein sequences in an RPS-BLAST search
  • the sole member of a superfamily
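
The priority order above can be expressed as a short selection routine. The sketch below is an illustration only: the `Model` record and its fields are hypothetical, `hits` stands for the number of sequence hits attributed to the model (or, for a hierarchy parent, to its hierarchy), and the accessions and counts are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    accession: str
    source: str  # accession prefix, e.g. "cd", "pfam", "smart", "COG", "PRK", "CHL"
    hits: int    # number of sequence hits (hypothetical values)
    is_hierarchy_parent: bool = False


def pick_representative(models):
    """Pick the superfamily representative following the documented priority order."""
    if not models:
        raise ValueError("a superfamily needs at least one model")
    tiers = [
        lambda m: m.source == "cd" and m.is_hierarchy_parent,
        lambda m: m.source == "pfam",
        lambda m: m.source in {"smart", "COG", "PRK", "CHL"},
    ]
    for in_tier in tiers:
        candidates = [m for m in models if in_tier(m)]
        if candidates:
            # Within a tier, the model with the most hits wins.
            return max(candidates, key=lambda m: m.hits)
    # Last case in the list: the sole member of a superfamily.
    return models[0] if len(models) == 1 else None


family = [
    Model("pfam_a", "pfam", hits=900),
    Model("cd_parent", "cd", hits=500, is_hierarchy_parent=True),
    Model("smart_b", "smart", hits=1200),
]
print(pick_representative(family).accession)
```

Note that the curated parent is chosen even though other models have more hits, because the tiers are checked in order before hit counts are compared.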
Superfamily can change over time:
The composition of a cluster can change over time due to a variety of factors, such as:
  1. availability of new domain models as the Conserved Domain Database continues to grow
  2. changes to previously existing models
  3. new and/or updated sequence records in the Entrez Protein database
  4. refinements to the automated clustering procedures
A superfamily cluster accession number will remain the same if at least 50 percent of its member models (conserved domain accessions) have not changed relative to the previous version of the cluster.

If more than 50 percent of the conserved domain accessions from a previous version of a cluster are no longer present in the new build of that cluster, or if the cluster size more than doubles with a new build, then the superfamily cluster accession is retired and replaced by a new accession(s). If two previous clusters merge into a single new cluster, the superfamily cluster accession number of the larger component cluster is used for the new grouping.
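
The accession rules in the two paragraphs above can be checked mechanically. The following sketch assumes each cluster build is available as a set of conserved domain accessions; the function names are hypothetical, and how ties are resolved when two merging clusters have the same size is not stated in this document, so the merge helper simply returns the first largest cluster it encounters.

```python
def accession_is_retained(previous_members, current_members):
    """Return True if the superfamily accession survives the new build."""
    previous, current = set(previous_members), set(current_members)
    if not previous:
        raise ValueError("the previous build must contain at least one model")
    lost = previous - current
    # Retired if more than 50 percent of the previous accessions are gone ...
    more_than_half_lost = 2 * len(lost) > len(previous)
    # ... or if the cluster more than doubles in size.
    more_than_doubled = len(current) > 2 * len(previous)
    return not (more_than_half_lost or more_than_doubled)


def accession_after_merge(clusters):
    """For merging clusters {accession: members}, use the larger cluster's accession."""
    return max(clusters, key=lambda accession: len(clusters[accession]))


old = {"a", "b", "c", "d"}
print(accession_is_retained(old, {"a", "b", "e"}))   # half of the old members remain
print(accession_is_retained(old, {"a", "x", "y"}))   # three of four old members lost
print(accession_after_merge({"cl_one": {"a", "b"}, "cl_two": {"c", "d", "e"}}))
```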

A superfamily also has a PSSM ID, which refers to the specific set of conserved domain PSSM IDs that comprise the superfamily, rather than to an actual position-specific scoring matrix for the overall superfamily.

The superfamily PSSM ID will change if there is any change to the set of member PSSM IDs relative to the previous version of the cluster (e.g., if a member conserved domain gets a new PSSM ID due to changes in its multiple sequence alignment, or if a new conserved domain model is added to the superfamily as the result of a CDD database update).

Search Tips: How to find conserved domains

| Protein query sequence | Text term search | Protein → Conserved Domains links | Domain architecture |

Protein Query Sequence (CD-Search):
Most users will explore conserved domains starting from CD-Search results for a protein of interest.

The query can be a protein sequence in FASTA format or the GI or Accession of a protein sequence that exists in the Entrez Protein database.

The search results will show the conserved domains found in the protein. The colored bars that depict the domain footprints (shown in both the concise display and full display of CD-Search results) are active hotlinks that open the corresponding CD summary pages with your query sequence embedded in the multiple sequence alignment of proteins used to create the domain model.

The second half of this help document provides details on how to use the CD-Search service, including input required and output shown.

Text Term Search in Entrez CDD:
| allowable search terms | search results | additional tips: advanced search methods, search fields, quotes, wild card * |
Allowable search terms
Conserved domains can also be searched by text term in the Entrez CDD database. The Entrez query interface allows searching for keywords, publication dates, taxonomic span, accession numbers, and more. The search field summary table in this document shows the variety of terms that can be used to query the database. It is also possible to use quotes to force multiple terms to be searched as a phrase, and to use an asterisk (*) as a wild card to search for a word stem.

For example, search the Entrez CDD database for strings like "Kinase" or "pfam023*" or "Tetratrico*" to see how it works.
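
The same example terms can also be sent to Entrez from a script. The sketch below only builds the request URLs and does not contact the server; it assumes programmatic access through the NCBI E-utilities `esearch` endpoint with the database name `cdd`, which this help document does not itself describe, and the helper name is hypothetical.

```python
from urllib.parse import urlencode

# Assumed E-utilities search endpoint (not described in this help document).
EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def cdd_esearch_url(term, retmax=20):
    """Build an esearch URL for a text-term query against the CDD database."""
    params = {"db": "cdd", "term": term, "retmax": retmax}
    # urlencode percent-encodes special characters such as the wild card "*".
    return EUTILS_ESEARCH + "?" + urlencode(params)


for term in ("Kinase", "pfam023*", "Tetratrico*"):
    print(cdd_esearch_url(term))
```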

A number of techniques can be used to search the database, offering varying degrees of control over your query. The search methods summary table provides examples of basic and advanced searches. In basic searches, you can just enter one or more search terms without specifying search fields, Boolean operators, or other search criteria. These searches are quick and easy but can result in some extraneous hits. Advanced search methods, on the other hand, allow you to exercise greater control over your search, for example, by specifying which search field to use for each query term, limiting search results to a particular type of record or source database, or refining your search in other ways. A separate section of this help document describes the CDD search results.

(The PubMed help document and Entrez help document provide additional, general information about using the Entrez search system.)

Search Results: Document Summary (DocSum) Page
The initial search results provide a list (document summary, or "DocSum") of the conserved domain records that contain your search term, which can appear in any field of the record, unless a search field was specified in the query.

Click on the accession number or thumbnail image of any record on the DocSum page to view its conserved domain (CD) summary page.

If desired, you can narrow your search by restricting the query to a search field of interest or adding more terms with a Boolean AND.

Alternatively, you can broaden your search by adding more terms (e.g., synonyms) to your query with a Boolean OR, or by following links to Superfamily Members.
Search Results: "Display" menu options:
The "Display" menu on the DocSum (search results) page allows you to view output in the formats below. The "Display" menu options act upon all of the CDD records in the current window (default) or on the subset selected with checkboxes.

Format Description
Summary shows the conserved domain's:
  • accession number
  • thumbnail image indicating whether the conserved domain includes a protein sequence from a 3D structure.
    If a 3D structure is included, the thumbnail will be a still graphic of the actual domain structure.
    If no 3D structure is available for the protein family from which the domain model was created, the thumbnail icon will show a schematic of a multiple sequence alignment.
  • short name, which concisely defines the domain
  • a text summary, which provides a synopsis of biological function and salient features of the domain
  • PSSMid
Brief shows the conserved domain's:
  • accession number
  • short name, which concisely defines the domain
  • PSSMid
UI List shows only the conserved domain's:
  • PSSMid
Additional options The other options in the Display menu are described in the section of this help document on Links to related data in Entrez.

    The detailed view ("CD Summary page") for a conserved domain can only be viewed for one record at a time by following the link for that record's accession number.

    In addition to displaying CDD search results in various formats, the Display menu can also be used to retrieve related data in CDD and in other Entrez databases by selecting the "xxxxx Links" menu items. It therefore provides integrated access to many different data types.

    For example, the Superfamily Member Links option will retrieve the other domain models in CDD that appear to be evolutionarily related to or redundant with the domains listed/selected on the page. The Protein, Structure, Gene, PubMed, etc. links traverse to associated data in those Entrez databases.
    Search Results: "Links" to related data:
    The "Links" pop-up menus for an individual CDD record on the Docsum search results page allow you to retrieve related data for that particular domain model within the current database as well as related data from other databases in the Entrez system. Depending on the data available for a given conserved domain, these links can include Related CDs, Literature, Sequence, Structure, BioSystems, and Other Links.

    If you want to view the related data for a single conserved domain record (e.g., cd00400), select the desired option from the pop-up links menus that appear beside the domain's accession number. For example, if you open the "Related CDs" menu and select the link for Superfamily Members, you will retrieve the other domain models in the Conserved Domain Database that appear to be evolutionarily related to or redundant with cd00400.

    If you prefer to retrieve related data for multiple records of interest (rather than related data for only one domain record), use the "Links" options that appear beneath the "Display" menu near the top of the search results page. They will retrieve related data for all of the CDD records in the current window (default) or for the subset you have selected with checkboxes.

    The links also appear on the detailed view of a conserved domain record and are described in the help document section on "CDD Record (CD Summary page): What information is displayed for each domain model on its CD Summary page?" : "Links to related data in Entrez". Most links are accessible on both the DocSum search results page and the detailed view, with the exception of the Architecture and Books links, which are available only on the detailed views. The number and type of links that exist vary among CDD records, depending on the related data that are available for any given record.
    Search Methods
    A variety of techniques can be used to search the database, offering varying degrees of control over your query. In some cases, they offer alternative ways of executing the same search (as is true for sample searches #4, #5, and #6 below), with each method offering different benefits. The search methods include:

    Method Description Example
    Basic
  • Just enter search terms without specifying search fields, other limits, or Boolean operators.


  • Open the Details folder tab on the search results page to see exactly how Entrez parsed and handled your query.
  • Search #1:

    mismatch repair eukaryotes

    will retrieve conserved domain records with those terms anywhere in the record.

    Some of the records might include aligned sequences from organisms other than eukaryotes because we did not limit that search term to the Organism search field. Because of this, we might also retrieve conserved domain records that happen to contain the term "eukaryotes" in a comment or some other field of the record.

    Similarly, the term "mismatch repair" can appear anywhere in the record.
    Advanced Advanced search methods allow you to exercise greater control over your search, for example, by specifying which search field to use for each query term, limiting search results to a particular source database, or refining your search in other ways. This can be done by using the folder tabs:

    Limits, Preview/Index, and History, or by entering a Complex Boolean query.

    Limits

  • The Limits folder tab allows you to view the list of search fields. You can enter each search term or phrase separately, selecting the desired search field for each one. Then use History to combine the searches, as shown below.


  • IMPORTANT NOTE: Once you have used a particular Limit, a small check box on the Limits folder tab will appear. As long as the checkbox remains activated, the Limit(s) that were just applied to your search will continue to be applied to all subsequent searches. Therefore, remember to deselect that checkbox if you do not want the limit applied to your next search. You can then select new limits for subsequent searches, if desired.

    image of the Limits folder that displays an activated check box, showing that the limits you most recently selected for a search are still in effect activated

    image of the Limits folder that displays a deactivated check box, showing that the limits you most recently selected for a search are no longer in effect deactivated


  • Search #2:

    On the Entrez CDD search page, click on the Limits folder tab, select the Text Word search field, and enter the following query:

    "mismatch repair"

    and press "GO". That will retrieve only records which contain those terms in the conserved domain's description. The quotes surrounding the terms force them to be searched as a phrase.


    Search #3:

    Open the Limits tab again and clear your previous search. Change the search field selection to Organism, and enter the following query:

    eukaryotes

    and press "GO". That will retrieve conserved domain models containing only eukaryotic sequences in their multiple sequence alignments (i.e., eukaryota will be the root taxonomic node of the sequences in a domain model's alignment).

    Now you can use the History folder tab to combine the searches, as shown in Search #4 below.

    History


  • The History folder tab displays the searches you have done in the current database.


  • Search History will be lost after eight hours of inactivity. (To save a search indefinitely, click on the search # and select "Save in My NCBI.")


  • You can combine or subtract searches from each other by entering the search numbers and the AND, OR, or NOT Boolean operators in the query box, for example: #2 AND #3. If the query contains several search numbers and Boolean operators, the Boolean operators are processed from left to right unless parentheses are used for nesting. If parentheses are used, the portions of the query in parentheses will be processed first, then the remaining Boolean operators will be processed from left to right.


  • The search numbers might not be consecutive if you have done intervening searches in other databases. This is because search numbers are assigned consecutively to all the searches you have done across the Entrez system, while "History" only shows the subset of searches you have done in a particular database.


  • See the PubMed help document for additional details about History.
  • Search #4:

    You can enter the following character string in the query box at the top of the History page to combine the two searches above:

    #2 AND #3

    That will retrieve records containing "mismatch repair" in the Text Word field and "eukaryotes" in the Organism field. Compare the retrieval from this search with that of the sample basic search above.

    (Note that your actual search numbers might be different from those shown here, if you did earlier searches in the Entrez system before trying these examples.)
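The left-to-right processing of Boolean operators described in the History notes above can be sketched with Python sets. This is only an illustration of the stated evaluation order, not Entrez's actual implementation; the function name `evaluate`, the record IDs, and the search numbers are hypothetical, and malformed queries are not handled.

```python
import re


def evaluate(query, history):
    """Evaluate a History-style query such as '#2 AND #3' over sets of record IDs.

    Operators are applied strictly left to right, except that parenthesized
    portions are evaluated first, as described in the text above.
    """
    tokens = re.findall(r"#\d+|AND|OR|NOT|[()]", query)
    pos = 0

    def operand():
        nonlocal pos
        tok = tokens[pos]
        pos += 1
        if tok == "(":
            result = expression()
            pos += 1  # skip the closing ")"
            return result
        return set(history[int(tok[1:])])  # copy, so history is never modified

    def expression():
        nonlocal pos
        result = operand()
        while pos < len(tokens) and tokens[pos] != ")":
            op = tokens[pos]
            pos += 1
            right = operand()
            if op == "AND":
                result &= right
            elif op == "OR":
                result |= right
            else:  # NOT subtracts the right-hand search from the result so far
                result -= right
        return result

    return expression()


# Hypothetical search history: search number -> set of record IDs.
history = {1: {"a", "b"}, 2: {"b", "c"}, 3: {"c", "d"}}
print(evaluate("#2 AND #3", history))
print(evaluate("#1 OR #2 AND #3", history))    # (#1 OR #2) AND #3
print(evaluate("#1 OR (#2 AND #3)", history))  # parentheses change the result
```

Comparing the last two calls shows why parentheses matter when a query mixes several operators.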

    Preview/Index


  • The Preview/Index page allows you to build your query step by step, adding a new search term and selecting a new search field at each step.


  • The text box at the top of the page shows your active query.
    The Preview button displays the number of hits your search retrieves and keeps you on the "Preview/Index" page so you can continue building your query.
    The Go button displays the records retrieved by your search (as a DocSum search results page).
  • The bottom area of the page is like a worksheet, where you can:
    (1) Select the Search Field of interest using the pull-down menu.

    (2) Enter a term(s) in the text box beside the search field menu.

    (3) Then press any one of the following buttons:

    The Preview button will simply add the new term/searchfield combination to the active query at the top of the page. (The term will be added with a Boolean AND; use the "OR" or "NOT" buttons if one of those Boolean operators is desired instead.)

    The Index will allow you to first browse the selected search field's index before adding a term(s) to the active query (see tips below).

    The AND, OR, NOT buttons will append the selected term(s) to the active query. If multiple terms are selected from the index window, they will be ORed together before being appended to the active query.
  • Tips on using the Index button on the "Preview/Index" page:
    The Index button allows you to browse the index of any Search Field. If you select a search field and press the Index button without entering a term in the box, you will be taken to the top of the index. If you enter a term first, you will be taken to the part of the index that contains your term (or the closest alphabetical location, if your term is not present in the index).

    The number of records that contain the term will appear in parentheses. You can also browse the index to explore the variety of terms available (for example, select "All Fields", enter "Huntington", and press the Index button to see additional spellings and/or related terms, such as Huntington disease, Huntington's, Huntington's disease).

    illustration showing how the Index button can be used to view the list of terms that are available in the selected search field

    To select a range of terms from the index, use the Shift key while selecting the first and last term. Then use the AND, OR, or NOT buttons to add that group of terms to the active query.

    To select multiple terms that do not fall within a continuous range from the index, use the Control key while selecting the terms of interest. Then use the AND, OR, or NOT buttons to add that group of terms to the active query.
  • Search #5:

    On the Entrez CDD search page, click on the Preview/Index folder tab, make sure the active query box at the top of the page is clear, then build your search one step at a time:

    (a) using the text box near the bottom of the page, select the Text Word search field and enter the following query:

    "mismatch repair"

    Press the "AND" button to move that term/search field selection up to the active query box at the top of the page.


    (b) again, using the text box near the bottom of the page, select the Organism search field and enter the following query:

    eukaryotes

    Then press the "AND" button to move that term/search field selection up to the active query box at the top of the page.


    (c) your query will now appear as:

    "mismatch repair"[Text Word] AND eukaryotes[Organism]

    Press the Preview button at the top of the page if you first want to see the number of documents your query retrieves, or press the Go button to view the records retrieved by your search (as a DocSum search results page).


    Note that this search will produce the same results as sample searches #4 and #6. It is simply executed in a different way, i.e., you remain on a single query page (Preview/Index) and can browse the index of any search field as you build your query one step at a time.
    Complex Boolean
  • Enter your search directly into the query box using command language to indicate search terms and the desired search fields. The syntax is:

    term[field] BOOLEAN term[field] BOOLEAN term[field] etc.

    An example is shown in Search #6 below.


  • Place Search Field qualifiers in square brackets [].  See also some additional tips for specifying search fields.


  • Boolean operators (AND, OR, NOT) must be written in UPPER CASE.


  • Boolean operators are processed from left to right unless parentheses are used for nesting. If parentheses are used, the portions of the query in parentheses will be processed first, then the remaining Boolean operators will be processed from left to right.


  • Boolean operators can also be used to combine or subtract searches from each other on the History page by entering the search numbers and desired Boolean operators in the query box, for example: #2 AND #3.


  • Search #6:

    Simply enter all search terms and search fields as a single statement into the query box:

    "mismatch repair"[Text Word] AND eukaryotes[Organism]

    Note that this search will produce the same results as sample searches #4 and #5, but it takes only a single step when entered directly into the search box as a Boolean query.


    Search #7:

    ("chloride channel"[All] OR ClC[All]) AND (cdd[Database] OR pfam[Database])

    This search will retrieve conserved domain records that contain the phrase "chloride channel" or the abbreviation "ClC" in any field of the record, and that are from the NCBI-curated or PFAM source databases.

    Additional details about search methods and options, such as the Clipboard, Details, and My NCBI functions, are provided in the PubMed help document and Entrez help document.
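For scripted work, the term[field] BOOLEAN term[field] syntax can be assembled from small helpers. The following is a minimal sketch; the helper names `field_term`, `join_terms`, and `group` are hypothetical and are not part of Entrez. It reproduces the query strings of sample searches #6 and #7 as plain text.

```python
def field_term(term, field):
    """Format one term with its search field, quoting multi-word phrases."""
    if " " in term and not term.startswith('"'):
        term = f'"{term}"'  # quotes force the words to be searched as a phrase
    return f"{term}[{field}]"


def join_terms(operator, *parts):
    """Join query parts with a single upper-case Boolean operator."""
    if operator not in ("AND", "OR", "NOT"):
        raise ValueError("Boolean operators must be AND, OR, or NOT in upper case")
    return f" {operator} ".join(parts)


def group(expression):
    """Wrap an expression in parentheses so that it is processed first."""
    return f"({expression})"


# Sample search #6
search6 = join_terms(
    "AND",
    field_term("mismatch repair", "Text Word"),
    field_term("eukaryotes", "Organism"),
)

# Sample search #7
search7 = join_terms(
    "AND",
    group(join_terms("OR", field_term("chloride channel", "All"), field_term("ClC", "All"))),
    group(join_terms("OR", field_term("cdd", "Database"), field_term("pfam", "Database"))),
)

print(search6)
print(search7)
```

Keeping grouping explicit, rather than relying on left-to-right processing, makes the intended nesting visible in the generated query.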


    Search Fields
    Search fields can be selected from pop-up menus on either the Limits or Preview/Index page, or can be typed directly in your query (surrounding field names with square brackets [], for example, [Organism] or [Orgn]).* The Index button on the Preview/Index page allows you to browse the index of each search field, where you can see the available terms, the number of records containing each term or phrase, as well as the syntax for entering values in search fields such as Modification Date or Publication Date.

    The currently available fields include:

    Field name Abbreviation* Description Sample Search
    All Fields [all] Searches the complete database record "chloride channel"[All]

    will retrieve the CDD records that contain the phrase "chloride channel" in any field of the record.

    The quotes surrounding the search terms ensure they are searched as a phrase.**
    Accession [accn] Searches only the accession number of the record, which is always an alphanumeric combination. The accession number prefix indicates the source database. The accession number applies to the complete conserved domain record.

    Note: An additional unique identifier, the PSSM ID, is assigned to the position specific scoring matrix that is derived from the conserved domain's multiple sequence alignment. Conserved domains can also be retrieved by entering their PSSM ID (without a search field specifier).
    cd00400[Accn]

    will retrieve the CDD record that contains the specified unique identifier in the accession number field.
    Alternative Accession [AltAccn] Native accession format from an external source database. For example, the PFAM database uses accessions with a format such as pf08617. When these are imported into CDD, the accessions are represented in a format such as pfam08617. Similarly, the SMART database uses a format such as sm00100, while records that have been imported into CDD have a format such as smart00100. This is primarily done to indicate that SMART and PFAM domain alignments may have been modified slightly by NCBI staff, for example by the substitution of a protein sequence that does not have 3D structure with a highly similar one that does (as explained in the help document section on the CD assembly process). pf08617[AltAccn]

    will retrieve the pfam08617 record from CDD.
    Database [db] Use this field to limit your search to a particular source database. cdd[db]

    will retrieve the NCBI curated domain models and superfamily records, which are also created at NCBI, from CDD.

    pfam[db]

    will retrieve the domain models that were imported from the PFAM database.
    Filter [filt] The "Filter" search field allows you to narrow your retrieval to records that have certain attributes, such as curated or uncurated, or records that have links to other Entrez databases of interest.

    Many attributes from the Filter field are provided in the "Links" menus that are present on an Entrez search results page, and in the "Links" box on an individual CD Summary page. A detailed explanation of each type of link is provided in the description of the "Links" box.
    cdd_gene[filt]

    will retrieve the CDD records that have associated data in the Entrez Gene database.

    On the CDD search results page, you can then open "Display" menu and select the Gene Links option to view the corresponding Entrez Gene records.
    Modification Date [mdat] Date of the most recent changes to the alignment model and/or descriptive information  
    Number of Sites [ns] The number of conserved features, such as catalytic or binding sites, that have been annotated on a domain. Conserved features are available on NCBI-curated domains.

    As of April 2008, this ranges from zero to 21 sites. (To see the current range, select the "Number of Sites" search field on the "Preview/Index" page, then use the "Index" button to view the index of that search field and see available values.)
    4[ns]

    will retrieve the NCBI curated domain models that contain four sites (i.e., four conserved features).
    Organism [Orgn] The root taxonomy node of a conserved domain. This is the highest node in the NCBI Taxonomy database that encompasses all organisms whose protein sequences are in the multiple sequence alignment for a domain model. eukaryotes[orgn]

    will retrieve conserved domains found in eukaryotes.
    PSSM Length [plen] Length of the PSSM or domain search model. This is the same as the length of the consensus sequence.  
    Publication Date [pdat] Date a CD was published. (By comparison, the Created date is the date at which the seed (or de novo) alignment was imported into CDD.)  
    Structure Representative [strp] The number of structures that have protein sequences in the multiple sequence alignment for a domain model.

    As of April 2008, this ranges from zero to 72 protein sequences from structures. (To see the current range, select the "Structure Representative" search field on the "Preview/Index" page, then use the "Index" button to view the index of that search field and see available values.)
    6[strp]

    will retrieve domain models that contain six protein sequences from 3D structures in their multiple sequence alignment.
    Text Word [word] The long description (text summary) of the conserved domain.    
    The Description of Sites [sd] Brief descriptions of conserved features.  
    Title [titl] The short name of a conserved domain, which concisely defines the domain.
    Example:  Voltage gated ClC (voltage gated chloride channel)
    voltage[titl]

    will retrieve the CDD records that have the term "voltage" as part of their short name, such as cd00400: Voltage gated ClC and pfam00654: Voltage CLC, which represent NCBI-curated and externally imported domain models, respectively, for the voltage gated chloride channel.
    UID [UID] Retrieves a conserved domain record by its PSSM ID. If you enter a string of digits as a query and do not specify a search field, the UID field will be searched by default. 79359[UID]

    will retrieve the conserved domain record cd00400, whose PSSM ID is 79359.

    79359

    will also retrieve that same conserved domain record, because the UID field is searched by default for queries that are only a string of digits.
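Two of the conventions in the table above lend themselves to a short sketch: the mapping from native accessions to CDD accessions (pf08617 to pfam08617, sm00100 to smart00100) and the default handling of digits-only queries. The code below is illustrative only. The function names are hypothetical, the prefix table covers just the two examples given in the Alternative Accession row, and the "All Fields" fallback reflects the statement that an unqualified term can match any field of the record.

```python
import re

# Prefix pairs taken from the Alternative Accession examples above.
NATIVE_TO_CDD_PREFIX = {"pf": "pfam", "sm": "smart"}


def to_cdd_accession(native_accession):
    """Convert a native accession (e.g. pf08617) to its CDD form (pfam08617)."""
    match = re.fullmatch(r"([A-Za-z]+)(\d+)", native_accession)
    if match is None:
        raise ValueError(f"not an accession: {native_accession!r}")
    prefix, digits = match.group(1).lower(), match.group(2)
    # Prefixes that are not in the table (e.g. cd) are returned unchanged.
    return NATIVE_TO_CDD_PREFIX.get(prefix, prefix) + digits


def default_field(query):
    """Return the field searched when no field specifier is given."""
    if re.fullmatch(r"[0-9]+", query):
        return "UID"  # a string of digits is treated as a PSSM ID
    return "All Fields"


print(to_cdd_accession("pf08617"))
print(default_field("79359"))
```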

    * In a query, the field name may be typed as the full name or abbreviation, in upper, lower, or mixed case, and must be surrounded by square brackets []. A space between the search term and the field specifier is optional. If desired, surround a phrase with quotes to force an adjacency search. For example, the sample queries below are equivalent:
          "chloride channel" [WORD]
          "chloride channel"[WORD]
          "chloride channel" [word]
          "chloride channel"[Text Word]
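The equivalence of these four spellings can be illustrated by normalizing each one to a single canonical form. This is a sketch of the stated rules (case-insensitive field name, full name or abbreviation, optional space), not Entrez's own parser; the function name `normalize` is hypothetical and the lookup table lists only a few of the fields described above.

```python
import re

# Full field names and their abbreviations (a subset of the table above).
FIELD_ABBREVIATIONS = {
    "all fields": "all",
    "accession": "accn",
    "database": "db",
    "organism": "orgn",
    "text word": "word",
    "title": "titl",
}


def normalize(query):
    """Rewrite 'term [Field]' as 'term[abbr]' with a lower-case abbreviation."""
    match = re.fullmatch(r"\s*(.+?)\s*\[([^\]]+)\]\s*", query)
    if match is None:
        raise ValueError("expected a query of the form term[field]")
    term = match.group(1)
    field = match.group(2).strip().lower()
    # Full names are mapped to abbreviations; abbreviations pass through.
    return f"{term}[{FIELD_ABBREVIATIONS.get(field, field)}]"


samples = [
    '"chloride channel" [WORD]',
    '"chloride channel"[WORD]',
    '"chloride channel" [word]',
    '"chloride channel"[Text Word]',
]
print({normalize(sample) for sample in samples})
```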

    ** The quotes surrounding the search terms in the All Fields example ensure the terms are searched as a phrase. If quotes are not used and the terms are not automatically recognized as a phrase by the Entrez system, Entrez will insert a Boolean AND between the terms and they may or may not appear adjacent to each other in the retrieved records. More search tips are provided in the PubMed help document and Entrez help document.

    It is also possible to search for a word stem by using an asterisk (*) as a wild card; for example, chlori* will retrieve records with terms such as chloride, chlorin, chlorinate, chlorite. The Entrez Help document provides additional information about truncating search terms in this way.
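The effect of a trailing asterisk can be imitated locally with Python's `fnmatch` module, which uses the same `*` convention. The index terms below are made up for illustration, and the function name `expand_stem` is hypothetical; Entrez's own term expansion may differ in detail (for example, in how many variants it expands).

```python
from fnmatch import fnmatchcase


def expand_stem(pattern, index_terms):
    """Return the index terms matched by a wild-card pattern such as 'chlori*'."""
    lowered = pattern.lower()
    return [term for term in index_terms if fnmatchcase(term.lower(), lowered)]


# Hypothetical slice of a search field index.
index_terms = ["channel", "chloride", "chlorin", "chlorinate", "chlorite", "chlorophyll"]
print(expand_stem("chlori*", index_terms))
```

Note that "chlorophyll" is not matched, because the stem "chlori" must appear in full before the wild card.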

    Entrez Protein links to Conserved Domains:
    All sequence records in the Entrez Protein database have been RPS-BLASTed against the Conserved Domain database. These pre-calculated search results are available as "Conserved Domains" links from protein sequence records, making protein functional information one click away from the sequence record.

    Domain architecture: CDART:
    The Conserved Domain Architecture Retrieval Tool (CDART) program has been used to analyze the domain architecture of all sequence records in the Entrez Protein database, and to identify proteins with similar architecture. Those proteins are accessible by selecting "Domain Relatives" in the "Links" menu of a protein sequence record of interest (illustrated example).

    Or, you can search CDART directly by entering a query protein sequence in FASTA format, or entering the GI or Accession number of a protein sequence that already exists in the Entrez Protein database. CDART will then retrieve proteins that contain one or more of the domains present in the query sequence.
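The idea of retrieving proteins that contain one or more of the query's domains can be sketched as a set overlap. This is a toy illustration, not the CDART algorithm: the function name `domain_relatives`, the protein labels, and the pairing of accessions with proteins are all hypothetical, and ranking by the number of shared domains is an assumption made for the example.

```python
def domain_relatives(query_domains, architectures):
    """Return (protein, shared_domains) pairs for proteins sharing >= 1 domain."""
    query = set(query_domains)
    hits = []
    for protein, domains in architectures.items():
        shared = sorted(query & set(domains))
        if shared:
            hits.append((protein, shared))
    # Assumed ordering: most shared domains first, ties broken by protein label.
    hits.sort(key=lambda hit: (-len(hit[1]), hit[0]))
    return hits


# Hypothetical architectures: protein label -> domain accessions it contains.
architectures = {
    "protein_A": ["cd00400", "pfam00571"],
    "protein_B": ["pfam00571"],
    "protein_C": ["smart00100"],
}
print(domain_relatives(["cd00400", "pfam00571"], architectures))
```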


    CDD Record (CD Summary page):   What information is displayed for each domain model?

    A CD-summary page provides the following information for a domain model (example: cd00400: voltage-gated chloride channel):
    1. text summary (synopsis of function)
    2. links to the source database, literature references, and related data in Entrez, as available
    3. conserved features (available for NCBI-Curated domains only)
    4. sequence cluster phylogenetic tree for protein sequences used to curate the domain (available for NCBI-Curated domains only)
    5. domain family hierarchy (available for NCBI-Curated domains only)
    6. multiple sequence alignments of the proteins used to develop the domain model.
    Text Summary (synopsis of function):
    • The text summary shown at the top of a CD summary page was written by curators at the source database and provides a synopsis of the domain's biological function. In NCBI curated domains, it also describes the taxonomic extent of the domain, whether it is a monomer or dimer, and any salient features. The text summary in a superfamily record is derived from the representative domain.
    Links to related data in Entrez:
    • The "Links" for an individual CDD record allow you to retrieve related data for that particular domain model within the current database as well as related data from other databases in the Entrez system. The links are accessible from both the DocSum page of search results, as pop-up menus, and from the "Links" box on an individual CD Summary page. Links that are present on only one of the pages are noted with an asterisk, below. The number and type of links that exist vary among CDD records, depending on the related data that are available for any given record.
    Link Name Description
    Representatives * The set of protein sequences that are present in the multiple sequence alignment for the domain model. Following the "Representatives" link will retrieve the complete sequence records from the Entrez Protein database. The number of records retrieved will be identical or similar to the number of Aligned Rows shown in the Statistics box of the CD Summary page.
    Specific Protein The set of protein sequences found by RPS BLAST to contain the domain (with an E-value that is equal to or lower than a domain-specific threshold E-value). These are called specific hits and represent very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model. The number of proteins you will retrieve by following this link is greater than retrieval from the "Representatives" link, but less than retrieval from the "Related Protein" link.
    Related Protein
    (Protein †)
    Superset of all protein sequences found by RPS BLAST to contain the domain (with an E-value equal to or better than the default cutoff of 0.01). Therefore, this superset includes two CD-Search hit types: specific hits and non-specific hits.
    Superfamily This links to the record for the CDD superfamily to which this domain belongs.
    Superfamily Members This retrieves all the other domain models that belong to the superfamily.
    Architectures
    (Architecture †)
    Proteins found by CDART to contain one or more of the domains present in the proteins that are hit by domains found in the domain superfamily
    Related Structure
    (Structure †)
    All of the protein sequences from 3D structure records that were hit by RPS BLAST by this domain model's PSSM
    BioSystems * BioSystems containing proteins that have specific hits to the conserved domain. The proteins have been associated with the BioSystem via the method described in data processing/create direct links/proteins.
    Gene Links from the RPS BLAST concise display hits to the protein sequences listed in Entrez Gene records.

    Details: Each protein listed in an Entrez Gene record has been RPS BLASTed against the domain models in CDD. Links are then created between specific regions of those protein sequences and top-scoring domain models which align to them. Top-scoring domain models are shown either as specific-hits, or as the superfamily to which the highest-ranking non-specific hit belongs.
    HomoloGene Links from the RPS BLAST concise display hits to the protein sequences listed in HomoloGene records. (The details provided for Gene links, above, also apply to HomoloGene links.)
    PubMed PubMed citations annotated on the domain. All references have been identified by curators, either by NCBI staff for the NCBI-curated domains, or by the staff of the external databases represented in CDD.

    For NCBI-curated domains, the PubMed link leads to the citations that have been annotated on that particular node of a domain family hierarchy, not for all nodes in the tree. Whenever possible, the citations include articles that provide evidence for the biological function of a domain and/or discuss the evolution and classification of a domain family.
    Free in PMC The subset of PubMed links that are available as free full text in PubMed Central.
    Books * Full text information in the Entrez Books database that further clarifies or elucidates the domain's function, the protein's role in metabolic pathways, and other broad overview information, including diagrams and illustrations.
    Taxonomy The highest node in the NCBI Taxonomy database that encompasses all organisms whose protein sequences are in the multiple sequence alignment for a domain model. The taxonomy link for a superfamily retrieves the highest taxonomic node for all of its constituent domain models.

    * The "Representatives" and "Books" links are found only on the CD summary page for a conserved domain. They are not present in the "Links" menus on the DocSum page that shows initial search results. Conversely, "BioSystems" links currently appear only on the DocSum page.

    † For brevity, the "Related Protein," "Architectures," and "Related Structure" links are simply called "Protein," "Architecture," and "Structure", respectively, in the "Display" and "Links" menus on the DocSum page that shows initial search results.
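The distinction between the "Specific Protein" and "Related Protein" sets in the table above comes down to two E-value comparisons. The sketch below encodes only what the table states: a hit is specific when its E-value is at or below the domain-specific threshold, and otherwise non-specific when it is at or below the default cutoff of 0.01. The function name `classify_hit` and the numeric values in the example are hypothetical, and the sketch assumes the domain-specific threshold is no larger than the default cutoff.

```python
DEFAULT_EVALUE_CUTOFF = 0.01  # default cutoff stated in the table above


def classify_hit(evalue, domain_threshold, cutoff=DEFAULT_EVALUE_CUTOFF):
    """Classify an RPS-BLAST hit as 'specific', 'non-specific', or None."""
    if evalue <= domain_threshold:
        return "specific"       # counted under "Specific Protein"
    if evalue <= cutoff:
        return "non-specific"   # counted only under "Related Protein"
    return None                 # not reported at the default cutoff


# Hypothetical hits against a domain whose threshold E-value is 1e-20.
for evalue in (1e-35, 1e-6, 0.5):
    print(evalue, classify_hit(evalue, domain_threshold=1e-20))
```

Both "specific" and "non-specific" hits belong to the "Related Protein" superset, which is why that link retrieves more proteins than "Specific Protein".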
    Statistics:
    Item Description
    PSSM-ID the unique identifier for the position-specific scoring matrix (PSSM) generated by RPS-BLAST for a given multiple sequence alignment. If the sequence alignment changes in any way, for example, if new sequences are added to the alignment, a new PSSM will be generated and will receive a new PSSM-ID.
    View PSSM Opens a separate window with a graphical view of the domain model's PSSM, showing the relative frequencies of various residues at each position of the domain model. This viewer was prepared as part of an NCBI course on Exploring 3D Molecular Structures.
    Aligned lists the number of rows in the sequence alignment. In general, each row comes from a different sequence record. However, sometimes two or more rows can be from the same GI number (i.e., same sequence record), if the sequence contains multiple instances of the domain.
    Status information about the CD's curation status. Curated models have been realigned by NCBI with consideration of 3D structure. Alignments imported from outside sources have not been changed (except for the import process detailed above).
    Created date at which the seed (or de-novo) alignment was imported into CDD
    Updated date of the most recent changes to the alignment model and/or descriptive information


    Structure:
    Item Description
    "Structure View" Button The "Structure View" button in a conserved domain record opens the 3D structure(s), if available, of protein sequences used to curate the domain model. In order for the button to work, the Cn3D program must be installed on your computer. It is a free helper application available for Windows, Macintosh, and Unix platforms. Installation takes only a couple of minutes and a tutorial describes the program's features and functions.

    In addition to displaying an interactive view of the 3D structure(s), Cn3D will also display the multiple sequence alignment of those and other proteins used in the curation of the domain model. The Cn3D structure view and sequence view windows communicate with each other, so highlighting residues in one window will also highlight those residues in the other window.

    As noted in the sections on the CD assembly process and unique features of curated domains, NCBI staff include protein sequences from resolved 3D structures (illustration) whenever possible in the multiple sequence alignment of a domain model.

    In a multi-level domain hierarchy, the 3D structures might be present in the parent node (e.g., cd00400) if they are not present in an intermediate or terminal node (e.g., cd03683). In that case, click on the parent node to view structures that have been specially annotated to highlight the conserved feature.

    You can click on any of the thumbnail structure images on a CD summary page to launch Cn3D. The thumbnail images in the conserved features summary box will launch a specially annotated view of the structure that highlights the particular feature of interest.

    However, 3D structures are not always available. If a domain model does not include any structure-based protein sequences, the "Structure View" button will still open Cn3D, but only the sequence viewer window will be populated with data.

    Controls in Cn3D will then allow you to manipulate the sequence alignment in various ways, if desired. For example, Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views. This is a convenient feature to study sequence conservation within a CD-alignment and to find out how well the aligned query fits the existing patterns of conservation and variability. The Cn3D tutorial provides more information on the controls available.

    Program Although the Structure View button provides the option of using an older version of Cn3D (3.0), the default choice is recommended because it uses the most recent public version of the program (currently Cn3D 4.1).
    Drawing Structures, when available, can be displayed in varying levels of detail. All Atoms will load a detailed model. This option transmits a large amount of structure data and loading the structures may therefore take some time. The Virtual Bonds setting displays C-alpha atoms only, with virtual bonds connecting them, and therefore transmits and loads more quickly.
    Aligned Rows By default, Cn3D will display a multiple sequence alignment of up to 10 proteins, starting with sequences whose 3D structures are shown, and then also including sequences from proteins that do not yet have a resolved structure. Use the "aligned rows" menu to increase that number up to 100 rows.


    Conserved Features/Sites summary box (available for NCBI-Curated domains only):back to top
    Sequence Cluster Phylogenetic Tree (available for NCBI-Curated domains only):back to top
    • Based on evidence from sequence comparison, NCBI Conserved Domain Curators attempt to organize related domain models into phylogenetic family hierarchies (details and illustration). Colors used in the sequence cluster phylogenetic tree correspond to colors used in the domain family hierarchy display.

      The Detailed View button on the CD summary page launches the sequence cluster view in a separate browser window, with more options for coloring and shading.

      Alternatively, you can download the CDTree program used by the NCBI curators in order to view the complete domain hierarchy interactively and in greater detail.


    • To view a query protein embedded into the sequence tree of a domain model, first use the CD Search tool to identify the conserved domains in the query sequence. Then click on the cartoon (colored bar) representing a domain of interest in either the Concise Display or Full Display of the CD-Search results page. That will open a CD Summary Page, which shows detailed information about the domain and provides an Interactive Display option for viewing the Hierarchy (an illustrated example is provided in the "How To" pages).

      To embed your query in the hierarchy, simply check the box for Add Query Sequence before pressing the "Interactive Display" button. (The free CDTree program must be loaded onto your computer in order for that button to work.) When the CDTree program opens, your query sequence will be highlighted in red. If the sequence tree is large, you might need to de-select the View/Fit to Screen option in CDTree's Sequence Tree window in order to see the tree, and the placement of your query sequence, in detail. The CDTree help document is packaged with the software and provides details on how to use the program.


    • Algorithms used to generate the cluster diagram in CDTree: The sequence tree viewer in CDTree calculates and displays sequence trees for a set of selected alignment models, which may or may not be linked in a hierarchical fashion. Sequence trees are the graphical depiction of results from simple phylogenetic analysis of the alignment data. Methods available for distance calculation are percent identity, Kimura-corrected percent identity, score of aligned residues, score of optimally extended blocks, BLAST score for the aligned footprints, and BLAST score for full-length sequences; a variety of commonly used scoring matrices can be selected. For the sequence trees displayed on CDD web pages, we commonly use "score of aligned residues", where pair-wise alignment scores derived from our multiple sequence alignments, and scored via BLOSUM62, are converted into distances. Trees can be constructed via single-linkage clustering, neighbor joining, or the FastME method. We use neighbor joining for all of the sequence trees displayed on web pages.
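The two simplest distance measures named above, percent identity and Kimura-corrected percent identity, can be sketched as follows. This is only an illustration, not CDTree's code: the function names are hypothetical, gaps are assumed to be written as "-", and the correction uses Kimura's 1983 approximation d = -ln(1 - D - 0.2 D^2), which this help document does not confirm as the exact formula CDTree applies.

```python
import math
from typing import List


def percent_identity(row_a: str, row_b: str) -> float:
    """Fraction of identical residues over columns where both rows have a residue."""
    pairs = [(a, b) for a, b in zip(row_a, row_b) if a != "-" and b != "-"]
    if not pairs:
        raise ValueError("rows share no aligned columns")
    matches = sum(1 for a, b in pairs if a == b)
    return matches / len(pairs)


def identity_distance(row_a: str, row_b: str) -> float:
    """Uncorrected distance: the fraction of differing aligned residues."""
    return 1.0 - percent_identity(row_a, row_b)


def kimura_distance(row_a: str, row_b: str) -> float:
    """Kimura (1983) approximation for protein distances (ASSUMED formula)."""
    d = identity_distance(row_a, row_b)
    arg = 1.0 - d - 0.2 * d * d
    if arg <= 0.0:
        raise ValueError("sequences too divergent for the Kimura correction")
    return -math.log(arg)


def distance_matrix(rows: List[str]) -> List[List[float]]:
    """Symmetric matrix of Kimura distances, usable as input to a tree builder."""
    n = len(rows)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            matrix[i][j] = matrix[j][i] = kimura_distance(rows[i], rows[j])
    return matrix
```

A matrix like this is the kind of input that single-linkage clustering or neighbor joining would consume. The correction always makes the distance at least as large as the uncorrected one, since it tries to account for multiple substitutions at the same site.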
    Domain Family Hierarchy (available for NCBI-Curated domains only):back to top
    • As noted in the description of NCBI curated domains, the goal of the curation project is to provide CDD users with insights into how patterns of residue conservation and divergence in a family relate to functional properties. The CD summary page for an NCBI-curated domain shows the hierarchy (details and illustration) to which the currently viewed domain belongs.

      Some hierarchies have only one node, while others have many nodes organized into two or more levels. If a hierarchy has multiple nodes, you can click on another node of interest to view the CD summary page for that domain.

      Alternatively, you can download the CDTree program used by the NCBI curators in order to view the complete domain hierarchy interactively and in greater detail, with or without a query sequence embedded.
    Multiple Sequence Alignment Displays: back to top
    • Member proteins used to create domain model:
      By default, the sequence alignment display at the bottom of a CD summary page shows 10 of the most diverse members from the cluster of sequences used to create a domain model. (A sample multiple sequence alignment is shown in the illustration of cd00064: Furin-like domain in this help document, or you can open a domain model directly in CDD, such as cd00400: voltage-gated chloride channel.) The multiple sequence alignment display options (below) can be used to change the quantity and appearance of data displayed, and the CD-Search tool can be used if you'd like to embed a query sequence within the alignment.


    • Protein query sequence embedded in alignment:
      To view a query protein embedded into the multiple sequence alignment of a domain model, first use the CD Search tool to identify the conserved domains in the query sequence. Then click on the cartoon (colored bar) representing a domain of interest in either the Concise Display or Full Display of the CD-Search results page.


    • Display Options:
      By default, the multiple sequence alignment on a CD summary page is shown in hypertext format and displays up to 10 sequences that were used to curate the domain. The display format, number and type of sequence rows, and color scheme can be changed in the following ways:


    Display Option Description
    Format
    Hypertext Interactive view in which each accession or GI number links to the corresponding complete sequence record in the Entrez Protein database. Displays all residues in each sequence row, with aligned residues shown in upper case, unaligned residues in lower case, and variation in sequence length shown as dashes. A horizontal scale indicates the number of residues in the overall alignment. The numbers at the beginning and end of each sequence row indicate the span of sequence data that was imported from the complete protein sequence record.
    Plain Text This view contains the same content as "Hypertext" but is rendered in ASCII format.
    Compact Hypertext Interactive view in which each accession or GI number links to the corresponding complete sequence record in the Entrez Protein database. Shows only aligned residues (as upper case letters), plus the number of intervening unaligned residues in each sequence row (shown in square brackets []). Does not show the unaligned residues themselves; those are shown only in the "Hypertext" and "Plain Text" formats.
    Compact Text This view contains the same content as "Compact Hypertext" but is rendered in ASCII format.
    mFASTA Multiple FASTA (mFASTA) format is useful for importing the data into sequence analysis programs. For each sequence row in the alignment, it provides a FASTA-formatted definition line ("FASTA defline") followed by up to 80 characters of sequence data on each subsequent line. mFASTA format displays all residues in each sequence row, with aligned residues shown in upper case, unaligned residues in lower case, and variations in length filled in with dashes.
    Row Display
    Number of rows in a domain model The total number of sequence data rows aligned in a domain model is shown in the statistics portion of that model's CD summary page.
    Default number shown By default, 10 rows of sequence data are shown, including the representative sequence plus nine others.
    Maximum number shown You can change the number of sequence rows displayed using the Row Display pop-up menu. If the Type Selection is set to Most Diverse Members, a maximum of 100 rows can be displayed. If a domain model contains more than 100 rows, the Type Selection Top Listed Sequences allows the display of more than 100 rows. If a model is NCBI-curated, you can also use the CDTree program to view the complete set of rows. Simply install the program, which is free, then press the Interactive Display button in the hierarchy section of the domain model's CD summary page to view all the sequence rows.

    Note: In general, each row comes from a different sequence record. However, sometimes two or more rows can be from the same GI number (i.e., same sequence record), if the sequence contains multiple instances of the domain.
    Type Selection
    Most Diverse Members Lists the representative sequence followed by the most dissimilar protein sequences, as determined from the domain model multiple sequence alignment. They are listed from most to least dissimilar with respect to the representative sequence.
    Top Listed Sequences Merely refers to the order in which the sequences are listed in the multiple alignment; this may or may not be meaningful, depending on the approach used by the source database in curating a particular domain model.

    In NCBI-curated domain models, protein sequences from resolved 3-D structures are generally listed first, so the "Top Listed Sequences" display option is useful for bringing these structure-based protein sequences to the top when viewing NCBI-curated domains. The remaining sequences in NCBI-curated domain models are listed in order of increasing GI number or some other non-biological criterion. (This is because the composition of the member sequences, not their order, is important in determining a domain model's position-specific scoring matrix, or PSSM. The other important factor is the degree of residue conservation in any given column of the alignment, which can be visualized with the Color Bits setting, described below.)

    The biological relationships among the member sequences of an NCBI-curated domain model are displayed in the sequence cluster phylogenetic tree and the domain family hierarchy on the domain model's CD summary page. Both of these displays can also be viewed interactively using the CDTree program.
    Color Bits
    General Color Bits allow you to adjust the red <-> blue balance of color used to depict the degree of conservation among aligned (upper case) residues. In general, red indicates highly conserved and blue indicates less conserved. Unaligned (lower case) residues are shown in grey.

    The color bit settings can be used to select a threshold for determining which columns are colored in red.

    Numerical settings Higher numbers require higher degrees of conservation within an alignment column (i.e., less residue variation) in order to display that column in red font.

    Background: Each column in the multiple sequence alignment display receives a score that indicates that column's "information content" -- its contribution to the overall alignment score -- indicating how important the column is as an "anchor" for the alignment. The higher the score, the more important that column is in the alignment.

    We use a fairly standard definition of "information content" for an aligned column:
               \sum_{i} f(i) \, \log_2 \frac{f(i)}{q(i)}     (summed over all residue types i)
    where f(i) is the observed relative residue frequency, and q(i) is the background/reference relative frequency for that residue type (based on the table that accompanies the BLOSUM62 matrix). This is also called "relative entropy", which is a popular way to measure the distributions of nucleotide bases or amino acids.

    A column's score is calculated on the fly, based on the sequence rows currently shown in the display. As the number and type of sequence rows in the display change, the column's score, and therefore its color, can change.

    The score threshold that must be met in order for an alignment column to be displayed in red can be adjusted from a low of 0.5 to a high of 4.0. As the threshold increases, the number of columns shown in red will decrease.
    Identity setting The Identity setting uses red font only in columns that contain the same residue in all of the sequence rows displayed. All other aligned columns are colored in blue and unaligned columns are shown in grey.
    "Feature" hash marks (#) Hash-marks (#) in the top row of a multiple sequence alignment display indicate the specific residues involved in a conserved feature, such as a binding or catalytic site, that has been annotated on an NCBI-curated domain.

    Although multiple features may have been annotated, only one feature at a time is shown in the multiple sequence alignment display.

    A conserved features/sites summary box (illustration) lists the features that have been annotated. Clicking on the tab for a feature of interest will show its details. It will also refresh the multiple sequence alignment display to mark the residues involved in the currently viewed feature (as depicted in the bottom of the illustration).
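The Color Bits scoring described in the table above can be illustrated with a short sketch that computes the relative entropy of each alignment column and compares it to a threshold between 0.5 and 4.0. Several things here are assumptions made for the example: the function names are hypothetical, the background frequencies q(i) are uniform (1/20) placeholders rather than the table that accompanies BLOSUM62, gaps and lower-case (unaligned) characters are simply ignored, and the default threshold of 2.0 is arbitrary.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# ASSUMPTION: uniform background, a stand-in for the BLOSUM62-based table.
UNIFORM_BACKGROUND: Dict[str, float] = {aa: 1 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_information(column: Sequence[str],
                       background: Dict[str, float] = UNIFORM_BACKGROUND) -> float:
    """Relative entropy, in bits, of the residues observed in one column."""
    residues = [c for c in column if c in background]  # skips gaps and lower case
    if not residues:
        return 0.0
    total = len(residues)
    info = 0.0
    for aa, count in Counter(residues).items():
        f = count / total  # observed relative frequency f(i)
        info += f * math.log2(f / background[aa])
    return info


def color_columns(rows: List[str], threshold: float = 2.0) -> List[str]:
    """Label each column 'red' (score meets the threshold) or 'blue'."""
    if not 0.5 <= threshold <= 4.0:
        raise ValueError("threshold must lie between 0.5 and 4.0")
    if len({len(row) for row in rows}) > 1:
        raise ValueError("all rows must have the same length")
    return ["red" if column_information(col) >= threshold else "blue"
            for col in zip(*rows)]
```

Because the score is computed from the rows passed in, changing which rows are displayed changes the result, which mirrors the "calculated on the fly" behavior described above. Raising the threshold can only turn red columns blue, never the reverse.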


    How and when is CDD updated? back to top

    CDD is updated several times a year. We no longer try to follow updates of the source databases on a regular basis, but will re-import source database content occasionally. CDD continues to grow, however, through NCBI's curation effort. At the moment, CDD curators focus on capturing and describing hierarchies of related domain families, which are, for the most part, covered by the imported un-curated models as well. The current curation effort is restricted to ancient domain families with wide phylogenetic distribution, and focuses on families with at least one 3D structure representative.

    Where can I send comments or feedback about the data? back to top

    The scientific community's understanding of molecular data continues to evolve as research progresses. Some domain models in CDD are generated through automated processes and others are curated. All are fluid and revised as new data become available and as new protein family clustering methods are developed. Because of this, we welcome your feedback on the data at info@ncbi.nlm.nih.gov, including information/annotations you find particularly helpful as well as any discrepancies you may notice.

     
     
      CD-Search Help back to top  
     

    What is CD-Search, and what information can it provide about a protein? back to top

    The CD-Search service is a web-based tool for the detection of conserved domains in protein sequences. It can therefore help to elucidate the protein's function.

    The CD-Search service uses RPS-BLAST to compare a query protein sequence against conserved domain models that have been collected from a number of source databases, and presents results as a concise display (default) or full display.

    If CD-Search finds a specific hit, there is a high confidence in the association between the protein query sequence and a conserved domain, resulting in a high confidence level for the inferred function of the protein query sequence. The other types of hits that can be found also shed light on the putative function of the query protein.

    The CD-Search tool can also identify putative conserved features in a query protein sequence, when such features can be mapped from the conserved domain annotations to the query sequence. If conserved features are found, they are designated by small triangles in the search results graphical summary, indicating the specific amino acids likely involved in functions such as catalysis or binding.

    What is RPS-BLAST? back to top

    The CD-Search service uses RPS-BLAST, which stands for "Reverse Position-Specific BLAST". This is a variant of the popular PSI-BLAST program ("Position-Specific Iterated BLAST"). PSI-BLAST finds sequences significantly similar to the query in a database search and uses the resulting alignments to build a Position-Specific Score Matrix (PSSM) for the query. With this PSSM the database is scanned again to eventually pull in more significant hits, and further refine the scoring model.

    RPS-BLAST uses the query sequence to search a database of pre-calculated PSSMs, and reports significant hits in a single pass. The role of the PSSM has changed from "query" to "subject", hence the term "reverse" in RPS-BLAST.

    RPS-BLAST is the search tool used in the CD-Search service. The CD-Search service provides a web-interface to the RPS-BLAST program, the CD search databases, and interactive alignment visualization including 3D structures. A standalone version of the RPS-BLAST program is available as part of the NCBI toolkit distribution.
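For readers who want to try the standalone program, the sketch below assembles a command line for it. It assumes the BLAST+ command-line version of rpsblast and its -query, -db, -evalue, -outfmt, and -out options; the helper name, the file names, and the database name "Cdd" are hypothetical placeholders, and the code only builds the command without running it.

```python
import shlex
from typing import List, Optional


def build_rpsblast_command(query_fasta: str,
                           database: str,
                           evalue: float = 0.01,
                           out_path: Optional[str] = None) -> List[str]:
    """Assemble an argument list for a BLAST+ style rpsblast run (not executed here)."""
    command = [
        "rpsblast",
        "-query", query_fasta,    # protein query in FASTA format
        "-db", database,          # pre-formatted database of PSSMs
        "-evalue", str(evalue),   # E-value cutoff, 0.01 as in CD-Search
        "-outfmt", "6",           # tabular output
    ]
    if out_path is not None:
        command += ["-out", out_path]
    return command


if __name__ == "__main__":
    # Hypothetical file and database names, for illustration only.
    cmd = build_rpsblast_command("query.fasta", "Cdd", out_path="hits.tsv")
    print(shlex.join(cmd))
```

The resulting list could be passed to subprocess.run on a machine where the program and a formatted PSSM database are installed. Results from a local run will not necessarily match the web service, which adds its own post-processing and display logic.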

    What input is required to do a CD-Search? back to top

    Query Sequence: back to top
    To submit a query sequence to CD-Search, you only need to provide the sequence, either as raw sequence data, in FASTA format, or as a GI or accession number (valid in the NCBI Entrez system). Hitting the submit button will start CD-Search with default settings for search sensitivity and display options. Note that CD-Search only works for protein sequences.
    • Force Live Search - Use this option if your query is a GI or accession number of a protein sequence already in the Entrez Protein database and you prefer to see live rather than precalculated CD-Search results.
      • Normally, CD-Search will display precalculated search results for queries that contain a GI or accession number of a sequence already in the Entrez Protein database. This is because CD-Searches are done as part of the automated processing of the Entrez Protein database, and the stored search results are readily available. If that is true for your query, the BLAST parameters information shown at the bottom of a Full Display of search results will say: Data Source: Precalculated Data.
      • A Live search is done automatically IF your query (a) is a FASTA formatted sequence, and the FASTA defline does not include a GI or accession number of a sequence record in Entrez protein, or (b) includes a GI or accession number but the output parameters have been changed from the default. If a Live Search was done, the BLAST parameters information shown at the bottom of a Full Display of search results will say: Data Source: Live blast search, RID = XXNNNXXNNN. The RID is a "request ID" and will enable you to retrieve the results of that particular search for the next 36 hours.
    Database Selection:back to top
    Currently, CD-Search is offered with the following search databases:
    • CDD - this is a superset including NCBI-curated domains and data imported from Pfam, SMART, COG, and PRK.
    • Pfam - a mirror of a recent Pfam-A database of curated seed alignments. Pfam version numbers do change with incremental updates. As with SMART, families describing very short motifs or peptides may be missing from the mirror. An HMM-based search engine is offered on the Pfam site.
    • SMART - a mirror of a recent SMART set of domain alignments. Note that some SMART families may be missing from the mirror due to update delays or because they describe very short conserved peptides and/or motifs, which would be difficult to detect using the CD-Search service. You may want to try the HMM-based search service offered on the SMART site. Note also that some SMART domains are not mirrored in CD because they represent "superfamilies" encompassing several individual, but related, domains; the corresponding seed alignments may not be available from the source database in these cases. Note also that SMART version numbers do not change with incremental updates of the source database (and the mirrored CD-Search database).
    • TIGRFAM - a mirror of a recent TIGRFAM set of domain alignments. An HMM-based search engine is offered on the TIGRFAM site.
    • COG - a mirror of the current COG database of orthologous protein families focusing on prokaryotes. Seed alignments have been generated by an automated process. An alternative search engine, "Cognitor", which runs protein-BLAST against a database of COG-assigned sequences, is offered on the COG site.
    • KOG - a eukaryotic counterpart to the COG database. KOGs are not included in the CDD superset, but are searchable as a separate data set.
    More information about each database is provided in the section on Where does CDD content come from?
    Advanced search options:back to top
    Note that advanced search options are only available when using the actual CD-Search form. Searches launched from the CDD Home page or together with protein BLAST requests use default search parameters.
    • Maximal Hits: limits the size of the hit list produced by CD-Search. Typically, for average sized proteins, the number of expected domain-hits is small and the default setting should be more than sufficient.


    • Expect Value: modifies the E-value threshold used for filtering results. False positive results should be very rare with the default setting of 0.01; use a more conservative (i.e., lower) setting for more reliable results. Results with E-values of 1 and above should be considered putative false positives.


    • Low Complexity Filter: By default, query sequences are filtered for compositionally biased regions. These are flagged as such and largely ignored during the search phase. If filtering is turned on, the graphical display of results highlights filtered-out regions on the query.
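The effect of the Maximal Hits and Expect Value options above can be sketched as a simple post-filter on a list of hits. This is not how the service is implemented; the DomainHit class, the function names, the default of 500 for maximal hits, and the example E-values are all hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DomainHit:
    accession: str   # domain model accession, e.g. "cd00400"
    evalue: float    # E-value reported for the hit


def filter_hits(hits: List[DomainHit],
                expect_value: float = 0.01,
                maximal_hits: int = 500) -> List[DomainHit]:
    """Keep hits at or below the E-value threshold, best first, capped in number."""
    kept = sorted((h for h in hits if h.evalue <= expect_value),
                  key=lambda h: h.evalue)
    return kept[:maximal_hits]


def is_putative_false_positive(hit: DomainHit) -> bool:
    """Hits with E-values of 1 and above should be treated with suspicion."""
    return hit.evalue >= 1.0


if __name__ == "__main__":
    # Made-up values for illustration only.
    hits = [DomainHit("cd00400", 1e-30), DomainHit("modelB", 0.005),
            DomainHit("modelC", 0.5), DomainHit("modelD", 3.0)]
    print([h.accession for h in filter_hits(hits)])
```

Loosening the threshold (for example, to 10) lets more hits through, at which point a check such as is_putative_false_positive becomes useful for flagging the weakest ones.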


    Retrieve previous search with RID#:back to top
    If the CD-Search system did a live search, the BLAST parameters information shown at the bottom of a Full Display of search results will say: Data Source: Live blast search, RID = XXNNNXXNNN. The RID is a "request ID" and will enable you to retrieve the results of that particular search for the next 36 hours.


    What output is shown on the CD-Search results page? back to top

    The CD-Search results page provides the following display options and information for the conserved domains that align to your query sequence:

    Global Options: back to top
    The upper right corner of a search results page has a Show Concise Display / Show Full Display toggle switch that controls the level of detail shown in both the Graphical Summary (shown in the illustrations below) and List of Domain Hits (not shown in the illustrations below for brevity, but available on the actual, interactive CD-Search results page for the example featured in the illustrations).
    CD-Search results can include up to four hit types that represent various confidence levels (specific hits, non-specific hits) and scope (superfamilies, multi-domains) of domain hits.

    Concise Display: back to top
    The Concise display is the default output for CD-Search results and shows only the best scoring domain model, as available for each region on the query sequence, in each of three hit types: specific hits, the superfamily to which the highest-ranking hit belongs, and multi-domain models.
    If CD-Search finds both specific and non-specific hits for a region of a protein query sequence, only the highest ranking specific hit and its superfamily will be shown. If CD-Search finds only non-specific hits for a region of a protein query sequence, only the superfamily to which the hits belong will be shown, but not the non-specific hits themselves. The latter are provided only in the full display.

    [Figure: CD-Search results concise display (default), which shows only the top-scoring hits for each region of the query sequence. The example shows the search results, as of July 6, 2009, for protein GI 157830769 (Cyclodextrin Glucanotransferase).]
    Hit types in the concise display can include specific hits, the superfamily to which the highest-ranking hit belongs, and multi-domain models. A separate section of this help document provides more information about the small triangles that represent conserved features/sites.
     

    Full Display: back to top
    The Full display shows all domain models, as available for each region on the query sequence, that meet or exceed the RPS-BLAST threshold for statistical significance (i.e., the E-value cutoff). The hit types can include specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models.

    The bottom of the Full Display also summarizes the BLAST search parameters, such as the database searched, whether the low complexity filter was used, the E-value cutoff, the BLAST software version number, and whether RPS-BLAST did a live search or retrieved precalculated search results. If a live search was done, the BLAST Request ID (RID) is shown at the bottom of the Full Display and allows you to retrieve the search results by RID anytime within 36 hours following the search, without having to re-execute it. (Only the top portion of the full display is shown in the image below, illustrating the components of the graphical summary.)

    [Figure: CD-Search results full display, which shows all hits on each region of the query sequence. The example shows the search results, as of July 6, 2009, for protein GI 157830769 (Cyclodextrin Glucanotransferase).]
    Hit types in the full display can include specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. A separate section of this help document provides more information about the small triangles that represent conserved features/sites.
     

    Types of RPS-BLAST hits: back to top

    CD-Search results can include hit types that represent various confidence levels (specific hits, non-specific hits) and domain model scope (superfamilies, multi-domains). They can be seen in both the Concise display and Full display, except for non-specific hits, which are shown only in the Full Display.
    1. Specific hit meets or exceeds a domain-specific E-value threshold (details and illustration) and represents a very high confidence that the query sequence belongs to the same protein family as the sequences used to create the domain model. Therefore, there is also a high confidence level for the inferred function of the protein query sequence.


    2. Non-specific hit meets or exceeds the RPS-BLAST threshold for statistical significance (default E-value cutoff of 0.01, or an E-value selected by the user with advanced search options). (NOTE: Non-specific hits are shown only in the full display (illustration) of search results. In contrast, the concise display (illustration) shows only the superfamily to which the top-scoring non-specific hit for a given sequence region belongs.)


    3. Superfamily is the domain cluster to which the specific and/or non-specific hits belong. This is a set of conserved domain models that generate overlapping annotation on the same protein sequences and are assumed to represent evolutionarily related domains. See additional details, including information about clustering methodology, under "What is a superfamily?"


    4. Multi-domains are domain models that were computationally detected and are likely to contain multiple single domains. They are typically shown as grey-colored bars. (Examples are shown in the concise display and full display illustrations.)
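The distinction between specific and non-specific hits, and the way the concise display collapses them, can be sketched as below. This is a simplified reading of the rules in this help document, not the service's actual logic: the Hit class, its field names, and the model and superfamily names in the usage example are hypothetical; "meets or exceeds" a threshold is interpreted as an E-value at or below it; and multi-domain models are left out for brevity.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Hit:
    model: str                 # domain model name
    superfamily: str           # cluster the model belongs to
    evalue: float              # E-value of the alignment to the query
    specific_threshold: float  # ASSUMED per-model threshold for a specific hit


def hit_type(hit: Hit, evalue_cutoff: float = 0.01) -> Optional[str]:
    """Return 'specific', 'non-specific', or None if the hit is not significant."""
    if hit.evalue > evalue_cutoff:
        return None
    if hit.evalue <= hit.specific_threshold:
        return "specific"
    return "non-specific"


def concise_rows(region_hits: List[Hit], evalue_cutoff: float = 0.01) -> List[str]:
    """Rows a concise display would show for one region of the query."""
    significant = [h for h in region_hits if hit_type(h, evalue_cutoff) is not None]
    if not significant:
        return []
    specific = [h for h in significant if hit_type(h, evalue_cutoff) == "specific"]
    if specific:
        best = min(specific, key=lambda h: h.evalue)
        return [f"specific hit: {best.model}", f"superfamily: {best.superfamily}"]
    # Only non-specific hits: show just the superfamily of the top-scoring one.
    best = min(significant, key=lambda h: h.evalue)
    return [f"superfamily: {best.superfamily}"]
```

For example, a region with one specific and one non-specific hit from the same superfamily yields two rows (the specific hit and its superfamily), while a region with only non-specific hits yields the superfamily row alone. A full display would list every significant hit instead.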

    Jagged Edges: back to top
    • What do domain cartoons with jagged edges mean?  Occasionally domain cartoons have jagged edges. This means that the alignment found by RPS-BLAST omitted more than 20% of the CD's extent at the N- or C-terminus (or both, as indicated by the cartoons). This feature may give hints towards truncated query sequences, false-positive hits, or unusual domain architectures involving long insertions. The exact percentage of the CD's extent used in the alignment is listed in detail in the pairwise alignment section.
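The jagged-edge rule can be expressed as a small check on the alignment coordinates. The sketch assumes 1-based, inclusive positions on the domain model and applies the 20% limit to each terminus separately, which is one plausible reading of the description above; the function and parameter names are hypothetical.

```python
from typing import Dict


def jagged_edges(aligned_from: int, aligned_to: int, model_length: int,
                 max_missing: float = 0.20) -> Dict[str, bool]:
    """Report which ends of the domain model are missing from the alignment.

    aligned_from and aligned_to are 1-based, inclusive positions on the model.
    """
    if not 1 <= aligned_from <= aligned_to <= model_length:
        raise ValueError("alignment coordinates fall outside the domain model")
    n_missing = (aligned_from - 1) / model_length
    c_missing = (model_length - aligned_to) / model_length
    return {
        "n_terminus": n_missing > max_missing,
        "c_terminus": c_missing > max_missing,
    }


def fraction_aligned(aligned_from: int, aligned_to: int, model_length: int) -> float:
    """Fraction of the model's extent covered by the alignment."""
    return (aligned_to - aligned_from + 1) / model_length
```

For a model of length 100, an alignment covering positions 31 to 100 would be drawn with a jagged N-terminal edge, while one covering positions 21 to 100 would not, because exactly 20% missing does not exceed the limit.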

    Small Triangles back to top
    • What do the small triangles mean in the Graphical Summary?  

      The small triangles beneath the query protein on a CD-Search results page indicate the residues that comprise conserved features/sites, such as binding or catalytic sites, as mapped from the conserved domain annotations to the query sequence.

      The triangles appear if a region of the query protein sequence either:
      The triangles are shown in the same color as the domain on which they have been annotated.

      Click on the triangles to view details about the feature, including a multiple sequence alignment of your query sequence and the protein sequences used to curate the domain model, where hash marks (#) above the aligned sequences (illustration) show the location of the conserved feature residues. A thumbnail image, if present, provides an approximate view of the feature's location in three dimensions and options for interactive 3D structure viewing.

      Conserved features/sites, if present, are shown by default in the graphical display. If desired, they can be hidden by clicking on show options in the graphical summary header bar, then deactivating the show site features checkbox and pressing the update button.

    Image showing small triangles that sometimes appear in CD-Search results.  The triangles point to specific residues involved in conserved features, such as binding and catalytic sites, as mapped from a conserved domain to the query protein sequence.
      The example above shows the search results, as of August 25, 2008, for protein GI 4557757 (human MutL protein homolog 1, associated with colon cancer). Click anywhere on the graphic to view the actual, interactive CD-Search results page.
    Hit types in the full display can include specific hits, non-specific hits, the superfamily(ies) to which those hits belong, and multi-domain models. If conserved features/sites are present, triangles are shown in both the concise display and full display.
     

    Horizontal Zoom back to top
    • If a query sequence is very long and contains many domains (e.g., human titin isoform N2-B, gi 110349715), the details of the graphical summary might be difficult to read. In that case, you can click on show options in the graphical summary header bar, enter the desired magnification level in the horizontal zoom box, and press the update button to refresh the display.

      There is no specific maximum value that can be entered in the horizontal zoom box. Rather, the limit is determined by the pixel width of the graphic image displayed.
      • If the zoom value you enter is too large, the system will display the message: "invalid zoom factor". In that case, enter a smaller zoom value.
      • There might be other cases in which the zoom value is acceptable but it takes some time to generate the display. In such cases, you might get an option to stop script or continue. Choose the latter if you would like the process of generating the enlarged graphic display to continue.

    Refine Search back to top
    • The Refine Search button on a CD-Search results page allows you to modify your query to search against a different database and/or use advanced search options.

    Search for Similar Domain Architectures back to top

    What is a Specific Hit? back to top

    A specific hit is a high confidence association between a protein query sequence and a conserved domain, resulting in a high confidence level for the inferred function of the protein query sequence. It is one of four types of RPS-BLAST Hits. (See illustrations of CD-Search results concise display and full display for examples.)
    In order to be considered a specific hit, an alignment of a domain model to a query protein sequence must meet two criteria:
    1. The domain model must be an NCBI-Curated domain. These are domains for which fine-grained evolutionary relationships, conserved sequence blocks, and specific functions have been annotated based on careful review of sequence data, 3D structures, and literature. NCBI-curated domains are also annotated with conserved features, noting the specific residues within the domain that are involved in catalysis or binding.  If multiple NCBI-curated domain models align to a given interval on the query protein sequence, then the highest-scoring model from that set is the specific hit.


    2. The E-value of that RPS-BLAST hit must be equal to or lower than a domain-specific threshold E-value.
      That threshold is the largest E-value obtained when each of the protein sequences used to curate a domain is RPS-BLASTed against that domain's Position-Specific Scoring Matrix (PSSM). In other words, the threshold is the weakest E-value among self-hits of a domain’s member protein sequences to the resulting domain model. (see illustration)
    If both conditions ARE met, then:
    • There is a high confidence level that the query protein sequence is a member of the protein family represented by the domain model and has the specific function annotated on that domain.
    • If the query sequence resides in the Entrez Protein database, the inferred function is annotated as "region" on the protein sequence record, showing the name of the high-scoring domain model and its base span. If the domain model includes conserved features (residues involved in catalysis or binding), those are annotated on the protein sequence record as "sites".
    If the two conditions are NOT both met, but the query sequence has an otherwise statistically significant hit (E-value cutoff of 0.01) to any domain in CDD, it is regarded as a non-specific hit. In that case:
    • The general function of the domain superfamily can be inferred for the query protein sequence, but the specific function is less certain.
    • If the query protein sequence resides in the Entrez Protein database, the name and general function of the domain superfamily is annotated in the protein sequence record (as a "region"). That text is derived from the domain model which has been selected as the superfamily representative. (The process by which the representative is selected is described in the help document section on superfamily.) Conserved features ("sites") are also annotated on the protein sequence record if the superfamily representative is an NCBI-curated domain that has such annotations.
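    The decision rule described above can be summarized in a minimal sketch. The `Hit` class, its field names, and the E-values in the example are hypothetical; the sketch also ignores the case in which several NCBI-curated models align to the same interval (where only the highest-scoring model is the specific hit).

```python
from dataclasses import dataclass
from typing import Optional

NON_SPECIFIC_EVALUE_CUTOFF = 0.01  # cutoff quoted in the text above


@dataclass
class Hit:
    """One RPS-BLAST hit (hypothetical structure for illustration)."""
    domain: str
    evalue: float
    ncbi_curated: bool
    threshold_evalue: Optional[float] = None  # domain-specific threshold


def classify_hit(hit):
    """Apply the two criteria for a specific hit, then the 0.01 cutoff."""
    meets_threshold = (
        hit.threshold_evalue is not None and hit.evalue <= hit.threshold_evalue
    )
    if hit.ncbi_curated and meets_threshold:
        return "specific"
    if hit.evalue <= NON_SPECIFIC_EVALUE_CUTOFF:
        return "non-specific"
    return "not significant"


# Made-up E-values; cd03683 is used only as a familiar accession.
print(classify_hit(Hit("cd03683", 1e-40, True, 1e-30)))
print(classify_hit(Hit("cd03683", 1e-10, True, 1e-30)))
print(classify_hit(Hit("imported_model", 1e-40, False)))
```

    The second call shows a curated model whose hit is significant but weaker than the domain-specific threshold, so it is reported as non-specific.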
    Method for determining the domain-specific E-value threshold score for RPS-BLAST.  Each protein sequence that was used to curate a domain model is RPS-BLASTed against the domain model's PSSM.  The highest E-value (i.e., the weakest) among the member sequences is the domain-specific threshold score. If a protein query sequence is RPS-BLASTed against CDD and receives an E-value equal to or lower than the threshold, that protein is considered a specific hit.
    * In the actual calculation of domain-specific thresholds, bit scores are used rather than E-values. (Bit score is defined in the BLAST glossary and Field Guide glossary.)
    NOTE: This image reflects the cd03683 domain alignment as of April 20, 2008. The scientific community's understanding of molecular data continues to evolve as research progresses, and as new and updated sequence data are regularly deposited into the databases. If a member sequence used in a domain alignment is later superseded by an updated version, the new sequence data and gi number will replace the old one during review/update cycles of curated domains. Some revisions to sequence data, such as upstream or downstream extensions, do not affect the domain model, but the gi number and amino acid span will change to reflect the updated sequence record.
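    The threshold rule described for specific hits can be sketched as follows. The self-hit values are invented for illustration. The bit-score variant reflects the footnote above and assumes that the weakest self-hit is the one with the lowest bit score, since higher bit scores indicate stronger hits.

```python
def evalue_threshold(self_hit_evalues):
    """Weakest self-hit expressed as an E-value: the largest one."""
    return max(self_hit_evalues)


def bit_score_threshold(self_hit_bit_scores):
    """Weakest self-hit expressed as a bit score: the smallest one."""
    return min(self_hit_bit_scores)


# Invented self-hit results for the member sequences of one domain model.
member_evalues = [3e-85, 1e-72, 4e-60]
member_bit_scores = [310.0, 268.5, 229.0]

print(evalue_threshold(member_evalues))        # weakest E-value among self-hits
print(bit_score_threshold(member_bit_scores))  # weakest bit score among self-hits

# A query hit passes the E-value form of the test when it is at least as
# strong as the weakest self-hit.
query_evalue = 2e-65
print(query_evalue <= evalue_threshold(member_evalues))
```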

    How long do I have to wait for CD-Search results? back to top

    CD-Search requests are submitted to the BLAST servers immediately. A typical search should take only a few seconds, depending on the size of the search database chosen, the length of the query sequence, and the load on the servers. Click here to test response time with a typical query.
    CD-Search requests can also be sent to the BLAST queuing system (this happens by default for searches launched in parallel with protein BLAST requests). To do so, use the optional button at the bottom of the CD-Search page. Requests sent to the queue will take longer, but the results can be retrieved at a later time using the RID ("Request ID"), without having to re-calculate the search. A form at the bottom of the CD-Search page can be used to retrieve earlier search results by RID.

    When do search requests end up in the BLAST-Queue? back to top

    When CD-Search is run as an integral part of protein BLAST search requests, the jobs are put in the BLAST queue and may take a little longer to complete (depending on the system load and the length of the query sequence). A queued CD-Search will try to retrieve the finished results every few seconds until they are available. You may also store the request ID (RID) and retrieve results later here.
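    The "retry every few seconds until the results are available" behavior is a standard polling loop, sketched below. The sketch does not call any NCBI service: `fetch` is a hypothetical caller-supplied function that returns None while a job is still queued and the results once they are ready, and the RID string in the example is a placeholder.

```python
import time


def poll_for_results(fetch, interval_seconds=3.0, max_attempts=20, sleep=time.sleep):
    """Call fetch() until it returns something other than None.

    fetch is a hypothetical callable supplied by the caller. sleep is
    injectable so the loop can be exercised without real waiting.
    """
    for attempt in range(1, max_attempts + 1):
        result = fetch()
        if result is not None:
            return result
        if attempt < max_attempts:
            sleep(interval_seconds)
    raise TimeoutError(f"no results after {max_attempts} attempts")


# Simulated queue: the job is pending twice, then finished.
responses = iter([None, None, "results for RID PLACEHOLDER"])
print(poll_for_results(lambda: next(responses), sleep=lambda seconds: None))
```

    Capping the number of attempts keeps a stalled request from looping forever; a stored RID can still be used to retrieve the results later.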

    How can I view multiple sequence alignments with my query sequence embedded? back to top

    When you click on the cartoon (colored bar representing a domain footprint) in the graphical display on the CD-Search results page, an alignment view opens that adds the query sequence to the multiple CD-alignment. It is possible to modify the number and type of sequences shown, as described in the help document section on CDD Record: multiple sequence alignment displays.

    Alignment visualization including 3D-structures back to top

    If you display an alignment view that includes a query sequence, you can also view the same alignment in the Cn3D program by pressing the Structure View button. (Cn3D installation takes only a couple of minutes and a tutorial describes the program's features and functions. The program must be installed in order for the Structure View button to work.)

    If a protein sequence from a 3D structure is included among the sequences used to curate a domain model, Cn3D will show the 3D structure as well. If the domain model includes sequences from more than one 3D structure, all of the structures will be displayed, superimposed upon each other, and their sequences will be displayed in the multiple sequence alignment.

    Cn3D offers column-specific coloring by sequence conservation when invoked with multiple alignment views. This is a convenient feature to study sequence conservation within a CD-alignment and to find out how well the aligned query fits the existing patterns of conservation and variability.

    Can I run RPS-BLAST locally? back to top
    How can I make my own search database for local searching?
    How can I get NCBI's CDD search database for local searching?

    Yes, you can run RPS-BLAST locally. A standalone version of RPS-BLAST is packaged with the BLAST executables available on the NCBI FTP site, and is also available as part of the NCBI toolkit distribution (see ftp://ftp.ncbi.nih.gov/toolbox).

    Separate directories on the FTP site provide documents that describe each of the BLAST applications, including documents for RPS-BLAST and a Formatrpsdb application that can be used to build search databases that are properly formatted for use with RPS-BLAST.

    Pre-formatted search databases, which have already been processed by Formatrpsdb, are available on the CDD FTP site. A README file on the CDD FTP site also provides more details about customizing search databases.
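    As a rough illustration of a local search, the sketch below assembles a command line for the `rpsblast` program from the newer BLAST+ package, which uses the options -query, -db, -evalue, -outfmt, and -out. This is an assumption about the installed software: the legacy executables described above use different option names, so consult the documents on the FTP site for those. The file names and the database name "Cdd" are placeholders.

```python
import shlex


def build_rpsblast_command(query_fasta, database, evalue=0.01, out_path=None):
    """Assemble a BLAST+ rpsblast command line (tabular output, -outfmt 6)."""
    cmd = [
        "rpsblast",
        "-query", query_fasta,   # FASTA file with the protein query
        "-db", database,         # pre-formatted RPS-BLAST search database
        "-evalue", str(evalue),  # expectation value cutoff
        "-outfmt", "6",          # tab-separated hit table
    ]
    if out_path is not None:
        cmd += ["-out", out_path]
    return cmd


cmd = build_rpsblast_command("query.fasta", "Cdd", out_path="hits.tsv")
print(shlex.join(cmd))
```

    The resulting list can be passed to `subprocess.run(cmd, check=True)` on a machine where `rpsblast` and the database files are installed.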

     
     
      References back to top  
     

    Citing the Conserved Domain Database (CDD): back to top

    Marchler-Bauer A, Anderson JB, Chitsaz F, Derbyshire MK, DeWeese-Scott C, Fong JH, Geer LY, Geer RC, Gonzales NR, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, Lanczycki CJ, Liebert CA, Liu C, Lu F, Lu S, Marchler GH, Mullokandov M, Song JS, Tasneem A, Thanki N, Yamashita RA, Zhang D, Zhang N, Bryant SH. CDD: specific functional annotation with the Conserved Domain Database. Nucleic Acids Res. 2009 Jan; 37(Database issue): D205-10.

    Citing the CD-Search tool: back to top

    Marchler-Bauer A, Bryant SH. CD-Search: protein domain annotations on the fly. Nucleic Acids Res. 2004 Jul; 32(Web Server issue): W327-31.

    Additional References: back to top

    Additional articles are noted on the publications page for the conserved domains group of resources.

     
     
     Revised 07 December 2009