CORE v1.1: onepot CORE is no longer just enumerated

When we launched the original onepot CORE, the goal was clear: build a chemical space that is both large and makeable. The first release encompassed 3.4 billion molecules, built on medicinal-chemistry-relevant reactions, ML-based feasibility scoring, and an automated synthesis stack that could take compounds of interest from route selection through purification and QC.

Today, we are launching onepot CORE v1.1. CORE v1.1 is a clear upgrade over our original CORE. The models are better, the lab is faster and more reliable, management of building blocks is significantly streamlined, and the CORE no longer comes in just one form. It is now available both as an enumerated dataset and as a synthon-based representation.

The original fully-enumerated version is still powerful. It gives you a classical representation of the CORE molecules and remains the fastest path when you already know what compounds you want to make. But billions of compounds are not easy to handle in downstream workflows. For things like molecular docking, brute-force exploration of the full dataset quickly becomes impractical.

To reduce that overhead, we have added a synthon-based representation of the CORE. Instead of forcing users to work only with billions of final SMILES, it exposes the structure of the underlying reagent space, grouped by the reaction classes used to construct it. For reactions such as amide coupling, this means organizing compatible building blocks into reaction-aware subsets in which many amines and carboxylic acids can be paired with a high likelihood of success. The synthon-based representation is much more compact, and screening-driven workflows can target synthons first before expanding to full target molecules. The natural problem with a synthon-based representation is that not every possible reactant pair leads to successful product formation, so naively pairing all reactant groups yields many infeasible molecules.

We designed a way to carry our ML feasibility models, and the experimental data collected across our synthesis runs, into the synthon form while keeping the CORE compact. We did this by identifying groups of building blocks that genuinely tend to form products together. The synthon-based CORE is therefore built by clustering building blocks in a reaction-aware way. In practice, that means a more compressed and more searchable representation of the CORE compound space that plugs naturally into existing software built around synthons and reaction-aware search.
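To make the compactness argument concrete, here is a minimal sketch of how a reaction-aware synthon block could be stored and expanded on demand. The `SynthonBlock` class, its method names, and the reagent identifiers are all hypothetical and chosen only for illustration; they are not the CORE v1.1 data format.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Tuple


@dataclass(frozen=True)
class SynthonBlock:
    """Hypothetical container: one reaction class plus two compatible reagent groups."""
    reaction: str
    left: Tuple[str, ...]
    right: Tuple[str, ...]

    def n_products(self) -> int:
        # Number of products the block represents implicitly.
        return len(self.left) * len(self.right)

    def n_stored(self) -> int:
        # Number of reagent entries that actually have to be stored.
        return len(self.left) + len(self.right)

    def expand(self) -> Iterator[Tuple[str, str]]:
        # Lazily enumerate reagent pairs only when full products are needed.
        return product(self.left, self.right)


# Placeholder identifiers, not real building blocks.
blocks = [
    SynthonBlock("amide coupling",
                 tuple(f"acid_{i}" for i in range(3)),
                 tuple(f"amine_{i}" for i in range(4))),
    SynthonBlock("amide coupling",
                 tuple(f"acid_{i}" for i in range(3, 5)),
                 tuple(f"amine_{i}" for i in range(4, 9))),
]

total_products = sum(b.n_products() for b in blocks)
total_stored = sum(b.n_stored() for b in blocks)
first_pair = next(blocks[0].expand())
```

Storage grows with the sum of the group sizes while the represented products grow with their product, which is why a screening workflow can score the reagent groups first and call `expand()` only for the blocks that look promising.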

Synthon Compression: How the approximation is built

From an enormous reagent-pair matrix to a compact reaction-aware synthon view

The pipeline starts from sparse pair-presence matrices, compresses the dominant co-reactivity structure with truncated SVD, and then clusters the left and right reagent spaces into high-coverage subsets that can be searched far more efficiently than the fully enumerated product list.

1. Enumerated space

Sparse reagent-pair matrix

amide coupling example

Each nonzero entry records that a specific left/right building-block pair appears in the enumerated CORE. For amide coupling alone, this starts as a 25,749 by 44,374 matrix. It is huge and sparse, but structured: compatible regions form pockets rather than random noise.
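As a rough illustration of this step, the sketch below builds a binary pair-presence matrix from a list of (left index, right index) pairs using SciPy's sparse CSR format. The function name `build_pair_matrix` and the toy pairs are assumptions made for the example; the real input would be the enumerated CORE itself.

```python
import numpy as np
from scipy.sparse import csr_matrix


def build_pair_matrix(pairs, n_left, n_right):
    """Binary CSR matrix with M[i, j] = 1 when the pair (i, j) is enumerated."""
    rows = np.array([i for i, _ in pairs], dtype=np.int64)
    cols = np.array([j for _, j in pairs], dtype=np.int64)
    data = np.ones(len(pairs), dtype=np.float64)
    m = csr_matrix((data, (rows, cols)), shape=(n_left, n_right))
    # The constructor sums duplicate entries; reset to presence/absence.
    m.data[:] = 1.0
    return m


# Toy data: 4 left reagents x 5 right reagents with two compatible pockets.
# The last pair is a deliberate duplicate of the first.
toy_pairs = [(0, 0), (0, 1), (1, 0), (1, 1),
             (2, 3), (2, 4), (3, 3), (3, 4),
             (0, 0)]
M = build_pair_matrix(toy_pairs, n_left=4, n_right=5)
density = M.nnz / (M.shape[0] * M.shape[1])
```

At the amide-coupling scale quoted above (25,749 by 44,374), a dense array would hold over a billion cells, so a sparse format that stores only the nonzero entries is the natural choice.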

2. Truncated SVD

Compress the co-reactivity signal

77% signal retained

The pair-presence matrix is projected into a lower-dimensional representation. Keeping the dominant directions preserves the shared reactivity pattern while discarding the noise and raw dimensionality that make the full space unwieldy.

[Figure: largest singular values kept, with panels labeled left and right]
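The projection can be sketched with `scipy.sparse.linalg.svds`, which computes only the largest singular triplets of a sparse matrix. Two choices here are assumptions rather than details from the post: embeddings are scaled by the singular values, and "signal retained" is measured as the share of the squared Frobenius norm captured by the kept singular values. The helper name `truncated_embeddings` is hypothetical.

```python
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.linalg import svds


def truncated_embeddings(M, k):
    """Return left/right embeddings, singular values, and retained energy for rank k."""
    u, s, vt = svds(M.astype(np.float64), k=k)
    # svds does not promise descending order, so sort explicitly.
    order = np.argsort(s)[::-1]
    u, s, vt = u[:, order], s[order], vt[order, :]
    left = u * s        # one row per left reagent
    right = vt.T * s    # one row per right reagent
    # Assumed definition of retained signal: captured share of ||M||_F^2.
    retained = float((s ** 2).sum() / M.multiply(M).sum())
    return left, right, s, retained


# Toy matrix with two compatible pockets: a 3x4 block and a 3x5 block.
dense = np.zeros((6, 9))
dense[:3, :4] = 1.0
dense[3:, 4:] = 1.0
M = csr_matrix(dense)

left1, right1, s1, retained1 = truncated_embeddings(M, k=1)
left2, right2, s2, retained2 = truncated_embeddings(M, k=2)
```

On this toy matrix one component captures only the larger pocket, while two components recover the full structure, which mirrors the trade-off between rank and retained signal described above.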

3. Synthon subsets

Cluster into searchable reaction blocks

50 × 50 clustered view

K-means is then run on the left and right SVD embeddings, and cluster-pair coverage is ranked. The result is a compact set of reaction-aware synthons that preserves most of the useful enumerated space while being far easier to search, dock, and prioritize. Below are rendered structures from selected amide-coupling clusters.

[Figure: cluster-pair coverage across the full 50 × 50 view, with rendered structures from left cluster 21 (306 members) and right cluster 24 (455 members)]
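The clustering and ranking step can be sketched as follows, using `scipy.cluster.vq.kmeans2` on toy embeddings. Coverage of a cluster pair is taken here to mean the fraction of its left/right reagent pairs that appear in the enumerated space; that definition, the helper name `cluster_pair_coverage`, and the toy matrix are assumptions for illustration. A dense SVD stands in for the truncated solver because the example is tiny.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

# Toy pair-presence matrix with two compatible pockets (3x4 and 3x5 blocks).
M = np.zeros((6, 9))
M[:3, :4] = 1.0
M[3:, 4:] = 1.0

# Dense SVD is fine at toy scale; keep the two dominant directions.
U, s, Vt = np.linalg.svd(M, full_matrices=False)
k = 2
left_emb = U[:, :k] * s[:k]
right_emb = Vt[:k, :].T * s[:k]

# The post uses a 50 x 50 clustered view; 2 x 2 is enough for this toy case.
n_clusters = 2
_, left_labels = kmeans2(left_emb, n_clusters, minit="++", seed=0)
_, right_labels = kmeans2(right_emb, n_clusters, minit="++", seed=0)


def cluster_pair_coverage(M, left_labels, right_labels, k_left, k_right):
    """Count enumerated pairs per cluster pair and normalize by block size."""
    L = np.eye(k_left)[left_labels]      # one-hot, shape (n_left, k_left)
    R = np.eye(k_right)[right_labels]    # one-hot, shape (n_right, k_right)
    counts = L.T @ M @ R                 # enumerated pairs in each block
    sizes = np.outer(L.sum(axis=0), R.sum(axis=0))
    density = np.divide(counts, sizes, out=np.zeros_like(counts), where=sizes > 0)
    return counts, sizes, density


counts, sizes, density = cluster_pair_coverage(
    M, left_labels, right_labels, n_clusters, n_clusters)

# Rank cluster pairs from highest to lowest coverage.
flat_order = np.argsort(-density, axis=None)
ranked = [tuple(int(x) for x in np.unravel_index(i, density.shape)) for i in flat_order]
```

In this toy case the two top-ranked cluster pairs are fully covered and the remaining two are empty, so keeping only the top pairs reproduces the enumerated space exactly. Real data is noisier, which is why coverage is ranked rather than thresholded at 100%.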

Shown here with amide coupling: CORE v1.1 converts sparse pair-presence matrices into searchable synthon subsets through truncated SVD, left/right clustering, and cluster-pair aggregation across many high-coverage pairings.

We are also reporting benchmarking results for this approximation. In the amide-coupling analysis shown here, selecting the top 992 cluster pairs reaches 90.1% recall at 80.8% precision. That is still a substantial compression of the enumerated space without giving up the practical coverage that makes the CORE valuable in the first place.
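For readers who want to reproduce this kind of number on their own data, here is one plausible way to compute it. The definitions are assumptions, since the post does not spell them out: precision is the share of reagent pairs inside the selected cluster pairs that are actually enumerated, and recall is the share of all enumerated pairs that fall inside the selected cluster pairs. The function name and the toy counts are hypothetical and unrelated to the reported 90.1% and 80.8% figures.

```python
import numpy as np


def precision_recall_at_top_n(counts, sizes, n):
    """Precision and recall when keeping the n densest cluster pairs.

    counts[a, b]: enumerated pairs inside cluster pair (a, b)
    sizes[a, b]:  all possible pairs inside cluster pair (a, b)
    """
    counts = np.asarray(counts, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    density = np.divide(counts, sizes, out=np.zeros_like(counts), where=sizes > 0)
    top = np.argsort(-density, axis=None)[:n]
    hits = counts.ravel()[top].sum()       # enumerated pairs that were kept
    proposed = sizes.ravel()[top].sum()    # all pairs implied by the kept blocks
    precision = hits / proposed
    recall = hits / counts.sum()
    return precision, recall


# Toy 2 x 2 example with made-up counts.
toy_counts = [[90, 5], [10, 40]]
toy_sizes = [[100, 100], [100, 50]]
precision, recall = precision_recall_at_top_n(toy_counts, toy_sizes, n=2)
```

Here the two densest blocks keep 130 of 145 enumerated pairs (recall of about 0.90) while proposing 150 pairs in total (precision of about 0.87). Raising `n` trades precision for recall.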

We have also substantially improved the models behind the CORE. They now do a better job ranking makeable compounds, identifying where reaction protocols are likely to generalize, and prioritizing chemistry with the highest expected probability of success on our stack.

ML Progress

Model performance improved across all four reaction classes

[Figure: relative performance by reaction class, previous vs. updated models, for amide coupling, Buchwald-Hartwig, Suzuki-Miyaura, and urea synthesis]

In onepot CORE v1, we set out to make a large and diverse chemical space. In CORE v1.1, we made it more usable while maintaining the same scale. CORE v1.1 is built on more trustworthy feasibility estimates and more reliable reaction execution. Meanwhile, the synthon-based representation makes the compound space something you can actually compute over, not just admire.

That is where we think this category is going. Not bigger libraries for the sake of bigger libraries, but better interfaces between virtual chemicals and practical molecular execution.

If you want access to CORE v1.1 or want to talk through how the synthon representation fits into your workflows, reach out to hello@onepot.ai.