The goal of ClusterCAD is to facilitate the informed design of chimeric type I modular PKSs and NRPSs in order to achieve the microbial production of novel drug analogs and industrially relevant small molecules. The synthetic biology community has long maintained an interest in polyketide synthases and related megasynthases, as their modular nature suggests that their biosynthetic power can be harnessed for combinatorial biosynthesis. For example, previous work has demonstrated that it is possible to construct functional chimeric PKSs by exchanging catalytic domains between heterologous PKS modules. However, limitations in the theoretical understanding of the complex protein-protein interactions that govern the fold and function of natural and engineered PKSs mean that identifying strategies to reliably design functional PKS chimeras remains an open research problem. In this work, we assume a specific paradigm for domain exchanges, and further provide a database with advanced search capabilities aimed at smoothing the process of testing PKS and NRPS design strategies.


  1. Domain exchange paradigm
  2. Description of database
  3. Browsing clusters
  4. Domain architecture search (Domain search)
  5. Structure search
  6. Sequence search
  7. Citing ClusterCAD

Domain exchange paradigm

Type I modular polyketide synthases and nonribosomal peptide synthetases have a unique modular structure in which the product of each module, and therefore each megasynthase, is determined by the catalytic domains that comprise each module. This modular nature allows us to predict polyketide intermediates and final products using only the sequence of catalytic domains in the cognate PKS. This unusual property has fueled significant research efforts to use engineered PKSs and NRPSs in combinatorial biosynthesis.
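As a rough illustration of how a module's domain composition determines its product, the sketch below maps the reductive domains present in a PKS module to the expected functionality at the beta-carbon. It relies on the textbook rule that KR reduces the ketone to a hydroxyl, DH dehydrates it to an alkene, and ER reduces the alkene to a saturated carbon. The function name and the plain list-of-strings module representation are assumptions made for this example, and inactive domains are ignored.

```python
def beta_carbon_state(domains):
    """Return the expected beta-carbon functionality for one PKS module.

    `domains` is an iterable of domain abbreviations, e.g. ["KS", "AT", "KR", "ACP"].
    Assumption: every listed reductive domain is catalytically active.
    """
    present = set(domains)
    if "KR" not in present:
        return "ketone"      # no reduction takes place
    if "DH" not in present:
        return "hydroxyl"    # KR only
    if "ER" not in present:
        return "alkene"      # KR + DH
    return "alkane"          # KR + DH + ER (full reduction)


# Hypothetical modules, listed by domain content only.
modules = [
    ["KS", "AT", "ACP"],
    ["KS", "AT", "KR", "ACP"],
    ["KS", "AT", "DH", "KR", "ACP"],
    ["KS", "AT", "DH", "ER", "KR", "ACP"],
]
states = [beta_carbon_state(m) for m in modules]
print(states)
```

A full prediction would also need the AT substrate and KR stereochemistry annotations, which this sketch leaves out.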

Due to the complicated nature of protein-protein interactions, one heuristic that is commonly used in PKS/NRPS engineering is to design chimeric PKS/NRPSs that are as close to a naturally occurring PKS/NRPS as possible. Using this guiding principle, we propose the following paradigm for designing a chimeric PKS/NRPS capable of producing a small molecule compound of interest:

  1. Identify a truncated PKS/NRPS as a starting point for engineering by searching for a module with a known intermediate that is as structurally similar to the compound of interest as possible
  2. Determine what domain exchanges are required to obtain the product of interest
  3. Choose donor catalytic domains on the basis of sequence similarity to the truncated PKS/NRPS starting point

If the target is a natural product analog, a starting point does not need to be identified, and ClusterCAD can be used simply to select the donor catalytic domains required to effect the desired structural changes in the final polyketide or nonribosomal peptide product. While the goal of ClusterCAD is to identify potential parent PKS/NRPS starting points and donor catalytic domains, it will likely prove important to consider additional factors when designing chimeric PKS/NRPSs. For example, modules from well-characterized clusters, especially modules that have previously been shown to be well-expressed in the host organism of choice, are particularly attractive choices for engineering. We therefore emphasize that ClusterCAD is intended to augment, rather than supersede, the expert domain knowledge of the experienced PKS or NRPS researcher.

Description of database

ClusterCAD is based on the Minimum Information about a Biosynthetic Gene cluster (MIBiG) database. In order to construct the database entries for ClusterCAD, we first identified the MIBiG entries that were annotated as type I modular PKS or nonribosomal peptide clusters. Annotations for these clusters were generated using the antibiotics and Secondary Metabolite Analysis SHell (antiSMASH) software. The resulting output was parsed using a whitelist of recognized catalytic domains in order to refine analysis of each cluster based on supported PKS/NRPS catalytic domains. Domain annotations, which include predictions for acyltransferase (AT) and adenylation (A) domain substrate specificity and ketoreductase (KR) domain stereochemical outcome, were then used to generate predictions of the polyketide or nonribosomal peptide intermediates expected to be produced by each module in the cluster.

In order to validate the intermediate and final structure predictions, the predicted final structure was compared against the known final structure. SMILES structures for known final products were taken from the MIBiG database, or were generated from the text description of the final structure in MIBiG using the ChemAxon Naming tool. Finally, additional structures were obtained by a literature search and manually incorporated into ClusterCAD.

A comparison between the predicted and known final structures was used to manually curate each ClusterCAD entry to perform the following corrections:

Database entries

The entry for each cluster contains links to the corresponding MIBiG database and NCBI Nucleotide database entries, as well as an indication of whether the cluster has been manually reviewed for consistency with experimental evidence. Cluster entries may also provide Cluster Notes, where curation notes and/or relevant publications and references may be viewed. Buttons to display annotations of AT or A substrate specificity and KR stereochemical outcome are also provided. Clicking on the final product or polyketide/peptide intermediate chemical structures will display SMILES representations of these structures. Further, clicking on the name of a module will provide links to the NCBI Protein database entry for that module, the nucleotide and amino acid sequences for the module, and precomputed secondary structure and relative solvent accessibility annotations if available.

Note that AT substrates with an "_ACP" suffix in the name represent ACP-linked substrates, whereas those without a suffix represent CoA-linked substrates.
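The naming convention above can be applied programmatically when processing AT annotations. The helper below is a minimal sketch; the function name and the example substrate labels are hypothetical and are not taken from ClusterCAD itself.

```python
def substrate_carrier(name):
    """Split an AT substrate label into (substrate, carrier).

    Labels ending in "_ACP" are treated as ACP-linked; all others as CoA-linked.
    """
    suffix = "_ACP"
    if name.endswith(suffix):
        return name[: -len(suffix)], "ACP"
    return name, "CoA"


# Hypothetical labels, used only to illustrate the convention.
print(substrate_carrier("methoxymalonyl_ACP"))
print(substrate_carrier("malonyl"))
```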

Domain architecture search (Domain search)

The domain architecture search supports the design of a custom megasynthase enzyme. It takes a desired domain architecture (sequence of modules and domains) as input and searches ClusterCAD for the natural gene cluster that most closely matches the design query, i.e., the one requiring the fewest modifications. Matches are ranked by Levenshtein distance, which counts the domain-level deletions, insertions, or replacements needed to convert each hit into the query design. Note that only PKS domains are currently supported; we intend to release NRPS support shortly.
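To make the ranking criterion concrete, the following sketch computes a Levenshtein distance in which each edit is a whole domain rather than a single character. It is an illustration only and not the ClusterCAD implementation: architectures are assumed to be flattened into plain lists of domain abbreviations, and the candidate names are hypothetical.

```python
def levenshtein(query, hit):
    """Number of domain insertions, deletions, or replacements turning `hit` into `query`."""
    # prev[j] = distance between an empty hit and the first j query domains.
    prev = list(range(len(query) + 1))
    for i, h in enumerate(hit, start=1):
        curr = [i]  # distance between the first i hit domains and an empty query
        for j, q in enumerate(query, start=1):
            cost = 0 if h == q else 1
            curr.append(min(prev[j] + 1,          # delete h from the hit
                            curr[j - 1] + 1,      # insert q into the hit
                            prev[j - 1] + cost))  # keep or replace
        prev = curr
    return prev[-1]


query = ["KS", "AT", "KR", "ACP"]
candidates = {  # hypothetical database entries
    "hit_a": ["KS", "AT", "ACP"],
    "hit_b": ["KS", "AT", "DH", "ER", "KR", "ACP"],
    "hit_c": ["KS", "AT", "KR", "ACP"],
}
ranked = sorted(candidates, key=lambda name: levenshtein(query, candidates[name]))
print(ranked)
```

Sorting by ascending distance places the natural architecture that needs the fewest domain-level changes first.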

Structure search

The structure search tool was designed to enable the identification of a truncated PKS/NRPS to use as a starting point for PKS/NRPS engineering, and takes as input a small molecule chemical structure in the form of a SMILES string or a structure that is drawn in an interactive GUI. Matches to the query structure are ranked using AP (atom pair) descriptors and the Tanimoto coefficient similarity metric.
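The idea behind this ranking can be sketched in plain Python. The example below uses a deliberately simplified atom pair descriptor (element, element, shortest path length in bonds) on hydrogen-suppressed graphs and a Tanimoto coefficient over descriptor counts. Real atom pair descriptors encode richer atom types, and ClusterCAD's own computation may differ. The molecule representation (a list of element symbols plus a list of bonded index pairs) and the function names are assumptions made for this illustration.

```python
from collections import Counter, deque


def atom_pairs(atoms, bonds):
    """Count (element, element, shortest path length) triples over all atom pairs."""
    neighbors = {i: [] for i in range(len(atoms))}
    for a, b in bonds:
        neighbors[a].append(b)
        neighbors[b].append(a)
    counts = Counter()
    for start in range(len(atoms)):
        # Breadth-first search gives shortest path lengths from `start`.
        dist = {start: 0}
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbors[node]:
                if nxt not in dist:
                    dist[nxt] = dist[node] + 1
                    queue.append(nxt)
        for other, d in dist.items():
            if other > start:  # count each unordered pair once
                first, second = sorted((atoms[start], atoms[other]))
                counts[(first, second, d)] += 1
    return counts


def tanimoto(a, b):
    """Tanimoto coefficient on count vectors: shared counts over combined counts."""
    shared = sum((a & b).values())  # element-wise minimum
    total = sum((a | b).values())   # element-wise maximum
    return shared / total if total else 1.0


ethanol = atom_pairs(["C", "C", "O"], [(0, 1), (1, 2)])
propanol = atom_pairs(["C", "C", "C", "O"], [(0, 1), (1, 2), (2, 3)])
print(tanimoto(ethanol, propanol))
```

Identical structures score 1.0, and the score falls toward 0 as fewer atom pairs are shared, so database intermediates can be sorted by descending similarity to the query.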

Sequence search

The sequence search tool was designed to enable researchers to select donor catalytic domains for domain exchange experiments. The tool was designed to enable flexible queries, allowing researchers to test hypotheses regarding which domain-domain interactions may be important in facilitating successful domain exchanges. The sequence search tool takes as input a valid amino acid sequence, and performs a BLAST search against a BLAST database containing all of the subunits in ClusterCAD.
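A comparable search can be approximated locally with the NCBI BLAST+ command line tools. The sketch below builds a `blastp` command and parses its default tabular output (`-outfmt 6`). It assumes a protein database; the E-value cutoff, file names, and the sample hit lines are hypothetical, and the way ClusterCAD invokes BLAST internally is not specified here.

```python
# Column names of BLAST+ tabular output with default fields (-outfmt 6).
FIELDS = ("qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore")


def blastp_command(query_fasta, database, evalue=1e-5):
    """Build an argument list suitable for subprocess.run (paths are placeholders)."""
    return ["blastp", "-query", query_fasta, "-db", database,
            "-evalue", str(evalue), "-outfmt", "6"]


def parse_tabular(text):
    """Parse tabular BLAST output into dicts, best bit score first."""
    hits = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue
        row = dict(zip(FIELDS, line.split("\t")))
        for key in ("pident", "evalue", "bitscore"):
            row[key] = float(row[key])
        hits.append(row)
    return sorted(hits, key=lambda h: h["bitscore"], reverse=True)


# Made-up output lines, used only to exercise the parser.
sample = (
    "query1\tsubunit_A\t45.2\t400\t210\t5\t1\t400\t10\t405\t1e-90\t280\n"
    "query1\tsubunit_B\t62.0\t410\t150\t3\t1\t410\t5\t412\t1e-150\t510\n"
)
hits = parse_tabular(sample)
print(hits[0]["sseqid"], hits[0]["pident"])
```

Ranking by bit score is one reasonable choice for shortlisting donor domains; percent identity over the aligned region is another.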

Citing ClusterCAD

If you use ClusterCAD for published research, please cite:

ClusterCAD 2.0: an updated computational platform for chimeric type I polyketide synthase and nonribosomal peptide synthetase design.
Tao, X.B., LaFrance, S., Xing, Y., Nava, A.A., Martin, H.G., Keasling, J.D., Backman, T.W.H.
Nucleic Acids Research, 2022 Nov.

ClusterCAD: a computational platform for type I modular polyketide synthase design.
Eng, C.H.*, Backman, T.W.H.*, Bailey, C.B., Magnan, C., Martin, H.G., Katz, L., Baldi, P., Keasling, J.D.
Nucleic Acids Research, 2017 Oct.
*co-first authors

ClusterCAD is open source software that can be freely downloaded under a BSD-style license.

ClusterCAD utilizes the following open source software:
  • Bootstrap
  • ChemDoodle
  • PostgreSQL
  • Python
  • Django
  • RDKit