The goal of ClusterCAD is to facilitate the informed design of chimeric type I modular PKSs in order to achieve the microbial production of novel drug analogs and industrially relevant small molecules. The synthetic biology community has long maintained an interest in polyketide synthases, as their modular nature suggests that their biosynthetic power can be harnessed for combinatorial biosynthesis. Previous work has demonstrated that it is possible to construct functional chimeric PKSs by exchanging catalytic domains between heterologous PKS modules. However, limitations in the theoretical understanding of the complex protein-protein interactions that govern the fold and function of natural and engineered PKSs mean that identifying strategies to reliably design functional PKS chimeras remains an open research problem. In this work, we assume a specific paradigm for domain exchanges, and further provide a database with advanced search capabilities aimed at smoothing the process of testing PKS design strategies.


  1. Domain exchange paradigm
  2. Description of database
  3. Browsing clusters
  4. Structure search
  5. Sequence search
  6. Development of ClusterCAD

Domain exchange paradigm

Type I modular polyketide synthases have a unique modular structure in which the product of each module, and therefore each megasynthase, is determined by the catalytic domains that comprise each module. This modular nature allows us to predict polyketide intermediates and final products using only the sequence of catalytic domains in the cognate PKS. This unusual property has fueled significant research efforts to use engineered PKSs for combinatorial biosynthesis.
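As a minimal illustration of how a domain sequence maps to a predicted product, the sketch below derives the expected β-carbon state of each extension step from the reductive domains present in the module. It is a toy simplification, not ClusterCAD's prediction code: the function names are hypothetical, AT substrate choice and KR stereochemistry are ignored, and a DH or ER without the preceding reductive domain is assumed to have no effect.

```python
def beta_carbon_state(domains):
    """Return the expected beta-carbon state for one module.

    Assumes the canonical reductive loop: KR reduces the ketone to a
    hydroxyl, DH dehydrates it to an alkene, ER reduces it to an alkane.
    """
    present = set(domains)
    if "KR" not in present:
        return "ketone"      # no reduction at all
    if "DH" not in present:
        return "hydroxyl"    # KR only
    if "ER" not in present:
        return "alkene"      # KR + DH
    return "alkane"          # KR + DH + ER


def predict_chain(modules):
    """Map each module (a list of domain names) to its beta-carbon state."""
    return [beta_carbon_state(domains) for domains in modules]


# Hypothetical three-module PKS used only for illustration.
example_pks = [
    ["KS", "AT", "KR", "ACP"],
    ["KS", "AT", "ACP"],
    ["KS", "AT", "DH", "ER", "KR", "ACP"],
]
states = predict_chain(example_pks)
```

Here `states` lists one β-carbon state per module, in the order the modules act on the growing chain.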

Due to the complicated nature of protein-protein interactions, one heuristic commonly used in PKS engineering is to design chimeric PKSs that are as close as possible to a naturally occurring PKS. Using this guiding principle, we propose the following paradigm for designing a chimeric PKS capable of producing a small molecule compound of interest:

  1. Identify a truncated PKS as a starting point for engineering by searching for a module with a known intermediate that is as structurally similar to the compound of interest as possible
  2. Determine what domain exchanges are required to obtain the product of interest
  3. Choose donor catalytic domains on the basis of sequence similarity to the truncated PKS starting point

If the target is a natural product analog, a starting point does not need to be identified, and ClusterCAD can be applied simply to select the donor catalytic domains required to effect the desired structural changes in the final polyketide product. While the goal of ClusterCAD is to identify potential parent PKS starting points and donor catalytic domains, it will likely prove important to consider additional factors when designing a chimeric PKS. For example, modules from well-characterized clusters, especially modules that have previously been shown to be well expressed in the host organism of choice, are particularly attractive choices for engineering. We therefore emphasize that ClusterCAD is intended to augment, rather than supersede, the expert domain knowledge of the experienced PKS researcher.

Description of database

ClusterCAD is based on the Minimum Information about a Biosynthetic Gene cluster (MIBiG) database. In order to construct the database entries for ClusterCAD, we first identified the MIBiG entries that were annotated as type I modular PKS clusters. Annotations for these clusters were generated using the antibiotics and Secondary Metabolite Analysis SHell (antiSMASH) software. The resulting output was parsed using a whitelist of recognized catalytic domains in order to truncate analysis of each cluster at a subunit containing a non-ribosomal peptide synthetase (NRPS) domain or another unusual catalytic domain that is not supported by ClusterCAD. Domain annotations, which include predictions for acyltransferase (AT) domain substrate specificity and ketoreductase (KR) domain stereochemical outcome, were then used to generate predictions of the polyketide intermediates expected to be produced by each module in the PKS cluster.
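The whitelist-based truncation described above can be sketched as follows. This is an illustrative sketch rather than ClusterCAD's parser: the whitelist contents, the function name, and the (name, domains) data layout are assumptions, as is the choice to exclude the first subunit that contains an unsupported domain.

```python
# Hypothetical whitelist; the real list of supported domains may differ.
SUPPORTED_DOMAINS = frozenset({"KS", "AT", "KR", "DH", "ER", "ACP", "TE"})


def truncate_cluster(subunits, whitelist=SUPPORTED_DOMAINS):
    """Keep subunits up to, but not including, the first unsupported one.

    `subunits` is an ordered list of (subunit_name, [domain_names]) pairs.
    """
    kept = []
    for name, domains in subunits:
        if any(domain not in whitelist for domain in domains):
            break  # stop analysis at the first unsupported subunit
        kept.append((name, domains))
    return kept


# Made-up hybrid cluster: the third subunit carries NRPS-style domains.
cluster = [
    ("subunitA", ["KS", "AT", "KR", "ACP"]),
    ("subunitB", ["KS", "AT", "DH", "KR", "ACP"]),
    ("subunitC", ["C", "A", "PCP"]),
    ("subunitD", ["KS", "AT", "ACP", "TE"]),
]
analyzed = truncate_cluster(cluster)
```

In this example only `subunitA` and `subunitB` are retained; `subunitD` is dropped even though all of its domains are supported, because it lies downstream of the truncation point.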

In order to validate the intermediate and final structure predictions, the predicted final structure was compared against the known final structure. SMILES structures for known final products were taken from the MIBiG database, or were identified by applying the ChemAxon Naming tool to the text description of the final structure from MIBiG. Finally, additional structures were obtained by a literature search and manually incorporated into ClusterCAD.

A comparison between the predicted and known final structures was used to manually curate each ClusterCAD entry and perform the following corrections:

Database entries

The entry for each cluster contains links to the corresponding MIBiG database and NCBI Nucleotide database entries. Buttons to display annotations of AT substrate specificity and KR stereochemical outcome are also provided. Clicking on the final product or polyketide intermediate chemical structures will display SMILES representations of these structures. Further, clicking on the name of a module will provide links to the NCBI Protein database entry for that module, the nucleotide and amino acid sequences for the module, and precomputed secondary structure and relative solvent accessibility annotations.

Note that AT substrates with an "_ACP" suffix in the name represent ACP-linked substrates, whereas those without a suffix represent CoA-linked substrates.
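For anyone processing these annotations programmatically, the suffix convention can be decoded with a small helper such as the one below. The function name and the example substrate labels are hypothetical and chosen only to show the convention.

```python
ACP_SUFFIX = "_ACP"


def parse_at_substrate(annotation):
    """Split an AT substrate annotation into (substrate, carrier).

    Names ending in "_ACP" are ACP-linked; all others are CoA-linked.
    """
    if annotation.endswith(ACP_SUFFIX):
        return annotation[: -len(ACP_SUFFIX)], "ACP"
    return annotation, "CoA"


# Illustrative labels only; actual annotation names may differ.
coa_linked = parse_at_substrate("mmal")
acp_linked = parse_at_substrate("example_ACP")
```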

Structure search

The structure search tool was designed to enable the identification of a truncated PKS to use as a starting point for PKS engineering, and takes as input a small molecule chemical structure in the form of a SMILES string or a structure that is drawn in an interactive GUI. Matches to the query structure are ranked using AP (atom pair) descriptors and the Tanimoto coefficient similarity metric.
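To make the ranking step concrete, the sketch below computes the Tanimoto coefficient between sets of descriptors and sorts candidate intermediates by similarity to a query. It is a simplified stand-in, not the search implementation: the descriptor strings are toy placeholders rather than real atom pair descriptors, the set-based form of the coefficient is an assumption, and all names are hypothetical.

```python
def tanimoto(descriptors_a, descriptors_b):
    """Tanimoto coefficient |A & B| / |A | B| for two descriptor sets."""
    a, b = set(descriptors_a), set(descriptors_b)
    union = a | b
    if not union:
        return 0.0  # convention chosen here for two empty sets
    return len(a & b) / len(union)


def rank_matches(query, library):
    """Return (name, score) pairs sorted from most to least similar."""
    scored = [(name, tanimoto(query, desc)) for name, desc in library.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Toy descriptors standing in for atom pairs (atom types plus distance).
query = {"C-C:1", "C-O:1", "C-O:2", "C-C:3"}
library = {
    "intermediate_1": {"C-C:1", "C-O:1", "C-O:2", "C-C:2"},
    "intermediate_2": {"C-C:1", "C-N:1"},
    "intermediate_3": {"C-C:1", "C-O:1", "C-O:2", "C-C:3"},
}
ranking = rank_matches(query, library)
```

A score of 1.0 means the two descriptor sets are identical, and 0.0 means they share nothing.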

Sequence search

The sequence search tool was designed to enable researchers to select donor catalytic domains for domain exchange experiments. The tool was designed to enable flexible queries, allowing researchers to test hypotheses regarding which domain-domain interactions may be important in facilitating successful domain exchanges. The sequence search tool takes as input a valid amino acid sequence, and performs a BLAST search against a BLAST database containing all of the subunits in ClusterCAD.
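A comparable search can be approximated locally along the following lines. This is a sketch under stated assumptions, not the tool's implementation: it assumes the NCBI BLAST+ `blastp` program, accepts only the 20 standard one-letter residue codes, and uses a hypothetical database name and query file path. The command is only constructed here, not executed.

```python
AMINO_ACIDS = frozenset("ACDEFGHIKLMNPQRSTVWY")  # 20 standard residues only


def clean_sequence(raw):
    """Strip whitespace, upper-case, and validate an amino acid sequence."""
    sequence = "".join(raw.split()).upper()
    if not sequence:
        raise ValueError("empty sequence")
    invalid = sorted(set(sequence) - AMINO_ACIDS)
    if invalid:
        raise ValueError("invalid residue codes: " + ", ".join(invalid))
    return sequence


def blastp_command(query_path, db_name, evalue=10.0):
    """Build an argument list for a BLAST+ blastp run with tabular output."""
    return [
        "blastp",
        "-query", query_path,
        "-db", db_name,            # hypothetical database name
        "-evalue", str(evalue),
        "-outfmt", "6",            # tab-separated hit table
    ]


sequence = clean_sequence("mkt ay\nIAKQR")
command = blastp_command("query.fasta", "clustercad_subunits")
```

The resulting list could be passed to `subprocess.run` once the validated sequence has been written to `query.fasta` in FASTA format.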

Development of ClusterCAD


ClusterCAD: a computational platform for type I modular polyketide synthase design.
Eng, C.H.*, Backman, T.W.H.*, Bailey, C.B., Magnan, C., Martin, H.G., Katz, L., Baldi, P., Keasling, J.D.
Nucleic Acids Research, 2017 Oct.
*co-first authors

ClusterCAD is open source software that can be freely downloaded under a BSD-style license. ClusterCAD utilizes the following open source software:
  • Bootstrap
  • ChemDoodle
  • PostgreSQL
  • Python
  • Django
  • RDKit