Simulating 500 million years of evolution with a language model
GFP search term yields lots of ads with Google Search
https://colab.research.google.com/github/evolutionaryscale/esm/blob/main/examples/generate.ipynb
https://github.com/evolutionaryscale/esm
https://ca.finance.yahoo.com/news/evolutionaryscale-launches-esm3-milestone-ai-100000341.html
https://www.reddit.com/r/singularity/comments/1dole9a/esm3simulating500millionyearsofevolution/
https://www.php.cn/faq/1796510087.html
https://venturebeat.com/ai/meta-alum-launches-ai-biology-model-that-simulates-500-million-years-of-evolution/
https://press.aboutamazon.com/aws/2024/6/evolutionaryscale-launches-with-esm3-a-milestone-ai-model-for-biology
https://www.youtube.com/watch?v=aPCqzrscY4w&ab_channel=WesRoth
https://www.reddit.com/r/MachineLearning/comments/1do91g9/nesm3simulating500millionyearsof_evolution/
https://axios.com/2024/06/25/ai-biotech-generative-model-protein-design
https://evolutionaryscale-public.s3.us-east-2.amazonaws.com/research/esm3.pdf
https://www.youtube.com/watch?v=TiDo7xXMbUI
https://www.youtube.com/watch?v=N-eisTvUYrk
https://www.youtube.com/watch?v=19fy0we14XM
https://www.youtube.com/watch?v=7szFo_IPUcE
https://www.nature.com/articles/s41592-023-01790-6
https://www.recursion.com/news/demystifying-protein-structure-prediction-models-alphafold-rosettafold-esmfold-and-beyond
https://310.ai/2023/05/17/benchmarking-machine-learning-methods-for-protein-folding-a-comparative-study-of-esmfold-omegafold-and-alphafold/
https://analyticsindiamag.com/protein-wars-its-esmfold-vs-alphafold/
https://salvatore-raieli.medium.com/metas-esmfold-the-rival-of-alpahfold2-2223b67f6021
We used ESMFold in HTGAA 2023.
here is a good student:
https://complex-bike-918.notion.site/Protein-Design-df2978ed760e4f368b0d236b40212b01
here is me:
https://sness.notion.site/Assignment-4-PDB-57ff9bcd127443408ba3766dba69052c
https://www.sciencedirect.com/science/article/pii/S0959440X23000684
OmegaFold [43] and ESMfold [44] are two implementations that seem similar to AlphaFold but without using the MSA. However, whether the predictions using these models use a single sequence can be questioned. The performance is significantly worse for (orphan) proteins that do not have many homologs in the sequence databases, i.e. the language models appear to memorise the MSA. ESMfold is computationally efficient and has been used to predict the structure of all proteins from an extensive meta-genomics database [45]. At CASP15, these methods performed significantly worse than AlphaFold.
Chain-of-thought sounds like just some scripts linked together.
Basically: find a design that is similar to a known GFP but has poor fluorescence, and then make it somewhat better. It still matures over days though, so not so great. Is that useful at all?
Why did they release this? Why the rush to release it? (It's not on bioRxiv yet.) Funding; they need funding before AI crashes. By the way, IA (Intelligence Amplification/Augmentation) is the new AI.
Thomas Hayes 1&, Roshan Rao 1&, Halil Akin 1&, Nicholas James Sofroniew 1&, Deniz Oktay 1&, Zeming Lin 1&, Robert Verkuil 1&, Vincent Quy Tran 2,3, Jonathan Deaton 1, Marius Wiggert 1, Rohil Badkundri 1, Irhum Shafkat 1, Jun Gong 1, Alexander Derry 1, Raul Santiago Molina 1, Neil Thomas 1, Yousuf Khan 4, Chetan Mishra 1, Carolyn Kim 1, Liam J. Bartie 2, Patrick D. Hsu 2,3, Tom Sercu 1, Salvatore Candido 1, Alexander Rives 1,†
1 EvolutionaryScale, PBC; 2 Arc Institute; 3 University of California, Berkeley; 4 Work done during internship at EvolutionaryScale, PBC
†Correspondence to arives@evolutionaryscale.ai.
Abstract
More than three billion years of evolution have produced an image of biology encoded into the space of natural proteins.
Here we show that language models trained on tokens generated by evolution can act as evolutionary simulators to generate functional proteins that are far away from known proteins.
We present ESM3, a frontier multimodal generative language model that reasons over the sequence, structure, and function of proteins.
ESM3 can follow complex prompts combining its modalities and is highly responsive to biological alignment.
We have prompted ESM3 to generate fluorescent proteins with a chain of thought.
Among the generations that we synthesized, we found a bright fluorescent protein at a far distance (58% sequence identity) from known fluorescent proteins.
Similarly distant natural fluorescent proteins are separated by over five hundred million years of evolution.
& Equal contribution
Preview 2024-06-25. Pending submission to bioRxiv. Copyright 2024 by the authors.
Introduction
The proteins that exist today have developed into their present forms over the course of billions of years of natural evolution, passing through a vast evolutionary sieve.
In parallel experiments conducted over geological time, nature creates random mutations and applies selection, filtering proteins by their myriad sequences, structures, and functions.
As a result, the patterns in the proteins we observe reflect the action of the deep hidden variables of biology that have shaped their evolution across time.
Gene sequencing surveys of Earth’s natural diversity are cataloging the sequences (1–3) and structures (4, 5) of proteins, containing billions of sequences and hundreds of millions of structures that illuminate patterns of variation across life.
A consensus is building that underlying these sequences is a fundamental language of protein biology that can be understood using large language models (6–10).
A number of language models of protein sequences have now been developed and evaluated (9, 11–14).
It has been found that the representations that emerge within language models reflect the biological structure and function of proteins (6, 15, 16), and are learned without any supervision on those properties, improving with scale (5, 17, 18).
In artificial intelligence, scaling laws have been found that predict the growth in capabilities with increasing scale, describing a frontier in compute, parameters and data (19–21).
We present ESM3, a frontier multimodal generative model that reasons over the sequences, structures, and functions of proteins.
ESM3 is trained as a generative masked language model over discrete tokens for each modality.
Structural reasoning is achieved by encoding three-dimensional atomic structure as discrete tokens rather than with the complex architecture and diffusion in three-dimensional space employed in recent predictive (22) and generative models (14, 23–25) of proteins.
All-to-all modeling of discrete tokens is scalable, and allows ESM3 to be prompted with any combination of its modalities, enabling controllable generation of new proteins that respect combinations of prompts.
ESM3 at its largest scale was trained with 1.07×10^24 FLOPs on 2.78 billion proteins and 771 billion unique tokens, and has 98 billion parameters.
Scaling ESM3 to this 98 billion parameter size results in improvements in the representation of sequence, structure, and function, as well as on generative evaluations.
We find that ESM3 is highly responsive to prompts, and finds creative solutions to complex combinations of prompts, including solutions for which we can find no matching structure in nature.
We find that models at all scales can be aligned to better follow prompts.
Larger models are far more responsive to alignment, and show greater capability to solve the hardest prompts after alignment.
We report the generation of a new green fluorescent protein (GFP) with ESM3.
Fluorescent proteins are responsible for the glowing colors of jellyfish and corals (26) and are important tools in modern biotechnology (27).
They share an elegant structure: an eleven stranded beta barrel with a helix that threads its center, which scaffolds the formation of a light-emitting chromophore out of the protein’s own atoms.
This mechanism is unique in nature—no other protein spontaneously forms a fluorescent chromophore out of its own structure—suggesting that producing fluorescence is hard even for nature.
Our new protein, which we have named esmGFP, has 36% sequence identity to Aequorea victoria GFP, and 58% sequence identity to the most similar known fluorescent protein.
Despite GFP’s intense focus as a target for protein engineering over several decades, as far as we are aware, proteins this distant have only been found through the discovery of new GFPs in nature.
Similar amounts of diversification among natural GFPs have occurred over predictable timescales.
Understood in these terms, the generation of a new fluorescent protein at this distance from existing proteins appears to be equivalent to simulating over 500 million years of evolution.
ESM3 reasons over the sequence, structure, and function of proteins.
All three modalities are represented by tokens, and are input and output as separate tracks that are fused into a single latent space within the model.
ESM3 is trained with a generative masked language modeling objective:
$$\mathcal{L} = -\,\mathbb{E}_{x,m}\!\left[\frac{1}{|m|}\sum_{i\in m}\log p\left(x_i \mid x_{\setminus m}\right)\right]$$
A random mask m is applied to the tokens x describing the protein, and the model is supervised to predict the identity of the tokens that have been masked.
During training, the mask is sampled from a noise schedule so that ESM3 sees many different combinations of masked sequence, structure, and function, and predicts completions of any combination of the modalities from any other.
This differs from the classical masked language modeling (28) in that the supervision is applied across all possible masking rates rather than a single fixed masking rate.
This supervision factorizes the probability distribution over all possible predictions of the next token given any combination of previous tokens, ensuring that tokens can be generated in any order from any starting point (29–31).
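As a concrete illustration (a sketch, not the released implementation), the objective above reduces to a cross-entropy over masked positions, with the mask rate itself sampled per example; the `model` callable, mask token id, and mask-rate distribution below are placeholders.

```python
import torch
import torch.nn.functional as F

MASK_ID = 0  # hypothetical mask token id

def masked_lm_loss(model, tokens, mask_rate_dist):
    """One training step of the variable-mask-rate objective (sketch).

    tokens: LongTensor (batch, length) of discrete track tokens.
    mask_rate_dist: callable returning a mask rate in (0, 1).
    """
    batch, length = tokens.shape
    # Sample a mask rate per example, then a Bernoulli mask at that rate.
    rates = torch.tensor([float(mask_rate_dist()) for _ in range(batch)])
    mask = torch.rand(batch, length, device=tokens.device) < rates.unsqueeze(1).to(tokens.device)
    corrupted = tokens.masked_fill(mask, MASK_ID)

    logits = model(corrupted)                                        # (batch, length, vocab)
    logp = F.log_softmax(logits, dim=-1)
    token_logp = logp.gather(-1, tokens.unsqueeze(-1)).squeeze(-1)   # (batch, length)
    # Cross-entropy only on masked positions, normalized by |m| per example.
    per_example = -(token_logp * mask).sum(dim=1) / mask.sum(dim=1).clamp(min=1)
    return per_example.mean()
```

In ESM3 the same idea is applied per track (sequence, structure, function), with track-specific noise schedules (Appendix A.2.2).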
To generate from ESM3, tokens are iteratively sampled.
Starting from a sequence of all mask tokens, tokens can be sampled one at a time, or in parallel, in any order, until all tokens are fully unmasked (Fig. 1A).
Masking is applied independently to sequence, structure, and function tracks, which enables generation from any combination of empty, partial, or complete inputs.
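A minimal sketch of this iterative unmasking loop follows; the step count and the confidence-based commit heuristic are illustrative choices rather than the paper's exact sampling procedure, and `model` is a placeholder returning per-position logits.

```python
import torch

MASK_ID = 0  # hypothetical mask token id

@torch.no_grad()
def iterative_unmask(model, length, steps=8):
    """Generate a track by progressively unmasking positions (sketch)."""
    tokens = torch.full((1, length), MASK_ID, dtype=torch.long)
    for step in range(steps):
        still_masked = tokens.eq(MASK_ID)
        if not still_masked.any():
            break
        logits = model(tokens)                      # (1, length, vocab)
        conf, pred = logits.softmax(dim=-1).max(dim=-1)
        # Commit the most confident fraction of the still-masked positions.
        n_commit = max(1, int(still_masked.sum().item() / (steps - step)))
        conf = conf.masked_fill(~still_masked, float("-inf"))
        commit_idx = conf.topk(n_commit, dim=-1).indices
        tokens.scatter_(1, commit_idx, pred.gather(1, commit_idx))
    return tokens
```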
ESM3’s training objective is also effective for representation learning.
We choose a noise schedule that balances generative capabilities with representation learning (Appendix A.2.2).
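The specific schedule used for the sequence track is described in Appendix A.2.2.1 as a mixture (80% of the time Beta(3, 9), 20% of the time uniform, called "betalinear30"); a minimal sampler under that description:

```python
import torch

def sample_mask_rate():
    """Sample a mask rate from the betalinear30 mixture described in
    Appendix A.2.2.1: 80% Beta(3, 9) (mean ~0.25), 20% Uniform(0, 1)."""
    if torch.rand(()) < 0.8:
        return torch.distributions.Beta(3.0, 9.0).sample()
    return torch.rand(())
```

A sampler like this could serve as the mask-rate distribution in the loss sketch above.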
Tokenization enables efficient reasoning over structure.
Protein structures are tokenized by a discrete auto-encoder (32), which is trained to compress the high dimensional space of three-dimensional structure into discrete tokens (Fig. 1C).
We propose an invariant geometric attention mechanism to efficiently process three-dimensional structure.
The mechanism operates in local reference frames defined by the bond geometry at each amino acid, and allows local frames to interact globally through a transformation into the global frame (Appendix A.1.6).
This mechanism can be efficiently realized through the same computational primitives as attention (33), and is readily scalable.
The local structural neighborhoods around each amino acid are encoded into a sequence of discrete tokens, one for each amino acid.
When predicting or generating protein structure, structure tokens output by ESM3 are passed to the decoder, which reconstructs the all-atom structure.
The autoencoder is trained to encode and reconstruct atomic coordinates with a geometric loss that supervises the pairwise distances and relative orientations of bond vectors and normals (Appendix A.1.7.3.1).
This tokenization delivers near-perfect reconstruction of protein structure (< 0.3 Å RMSD on CAMEO, Fig. S3), enabling representation of structure at the input and output with atomic accuracy.
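For intuition, here is a sketch of the pairwise-distance term only; the published loss (Appendix A.1.7.3.1) additionally supervises relative orientations of bond vectors and normals, which is omitted here.

```python
import torch

def pairwise_distance_loss(pred_xyz, true_xyz):
    """Pairwise-distance term of a geometric reconstruction loss (sketch).

    pred_xyz, true_xyz: (L, 3) float coordinates (e.g., C-alpha atoms).
    """
    d_pred = torch.cdist(pred_xyz, pred_xyz)   # (L, L) distance matrices
    d_true = torch.cdist(true_xyz, true_xyz)
    return (d_pred - d_true).abs().mean()
```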
We also find that providing ESM3 direct access to atomic coordinates in the input via a geometric attention projection into the transformer improves the response to atomic coordinate prompts.
ESM3 can be conditioned on either or both of tokenized structure and atomic coordinates.
We supplement these structure representations with coarse grained tokens encoding secondary structure state (SS8) and solvent accessible surface area (SASA).
Function is presented to the model in the form of tokenized keyword sets for each position in the sequence.
ESM3 is a bidirectional transformer.
While extensive research has gone into creating specialized architectures and training objectives for proteins, we find that tokenization paired with a standard masked language modeling objective and the basic transformer architecture is highly effective for both representation learning and generative modeling.
Sequence, structure, and function tracks are input as tokens, which are embedded and fused, then processed through a stack of transformer blocks.
Figure 1.
ESM3 is a generative language model that reasons over the sequence, structure, and function of proteins.
(A) Iterative sampling with ESM3.
Sequence, structure, and function can all be used to prompt the model.
At each timestep t, a fraction of the masked positions are sampled until all positions are unmasked.
(B) ESM3 architecture. Sequence, structure, and function are represented as tracks of discrete tokens at the input and output.
The model is a series of transformer blocks, where all tracks are fused within a single latent space; geometric attention in the first block allows conditioning on atomic coordinates.
ESM3 is supervised to predict masked tokens.
(C) Structure tokenization.
Local atomic structure around each amino acid is encoded into tokens.
(D) Models are trained at three scales: 1.4B, 7B, and 98B parameters.
Negative log likelihood on test set as a function of training FLOPs shows response to conditioning on each of the input tracks, improving with increasing FLOPs.
(E) Unconditional generations from ESM3 98B (colored by sequence identity to the nearest sequence in the training set), embedded by ESM3, and projected by UMAP alongside randomly sampled sequences from UniProt (in gray).
Generations are diverse, high quality, and cover the distribution of natural sequences.
The first transformer block also includes a geometric attention layer for atomic structure coordinate conditioning.
At the output of the model, shallow MLP heads project the final layer representation into token probabilities for each of the tracks.
The largest ESM3 model is trained on 2.78 billion natural proteins derived from sequence and structure databases (2, 34–37).
As a small fraction of structures have been experimentally determined relative to sequences, we leverage predicted structures (4, 5).
We also generate synthetic sequences with an inverse folding model (described in Appendix A.2.1.3) for all structures, including predicted ones. Function keywords are derived by predicting functional annotations from sequence using a library of hidden Markov models (38).
Overall this increased training data to 3.15 billion protein sequences, 236 million protein structures, and 539 million proteins with function annotations, totaling 771 billion unique tokens.
Full details of the training dataset are described in Appendix A.2.1.8. We train ESM3 models at three scales: 1.4 billion, 7 billion, and 98 billion parameters.
In an initial series of experiments to evaluate representation learning performance in response to architecture hyperparameters, we find a greater response to increasing depth than to width.
This informed the choice of relatively deep networks for the final architectures, with the 98 billion parameter model incorporating 216 Transformer blocks (Appendix A.1.5). Scaling ESM3 from 1.4 billion to 98 billion parameters results in substantial improvements in the validation loss for all tracks, with the greatest improvements observed in sequence loss (Fig. 1D, Fig. S11).
These gains in validation loss lead to better representation learning (Table S7 and Fig. S8).
In single sequence structure prediction (Table S8) on CAMEO, ESM3 98B obtains 0.895 mean local distance difference test (LDDT) and surpasses ESMFold (0.865 LDDT).
Unconditional generation produces high-quality proteins, with a mean predicted LDDT (pLDDT) of 0.84 and predicted template modeling score (pTM) of 0.52, that are diverse in both sequence (mean pairwise sequence identity 0.155) and structure (mean pairwise TM score 0.48), spanning the distribution of known proteins (Fig. 1E, Fig. S13).
Programmable design with ESM3
We explore the ability of ESM3 to follow complex prompts with different compositions. ESM3 can be prompted with instructions from each of its input tracks: sequence, structure coordinates, secondary structure (SS8), solvent-accessible surface area (SASA), and function keywords. This allows prompts to be specified at multiple levels of abstraction, from atomic-level structure to high-level keywords describing the function and fold topology, using the learned generative model to find a coherent solution that respects the prompt.
We evaluate ESM3's ability to follow prompts in each of the tracks independently. A set of prompts is constructed for each of the tracks using a temporally held out test set of natural proteins (Appendix A.3.7). We evaluated the resulting generations for consistency with the prompt and for foldability, the confidence of the structure prediction TM-score (pTM) under ESMFold. We define consistency metrics for each track: constrained site RMSD (cRMSD) is the RMSD between the prompt coordinates and the corresponding coordinates in the generation; SS3 accuracy is the fraction of residues where the three-class secondary structure of the prompt and the generation match; SASA Spearman ρ is the correlation between the SASA prompt and the corresponding region of the generation; keyword recovery is the fraction of prompt keywords recovered by InterProScan (38). Across all tracks, ESM3 finds solutions that follow the prompt and have confidently predicted structures by ESMFold (pTM > 0.8) (Fig. 2A).
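A sketch of how the three numeric consistency metrics could be computed; superposition details and keyword recovery via InterProScan are omitted, and these helpers are illustrative rather than the paper's evaluation code.

```python
import numpy as np
from scipy.stats import spearmanr

def c_rmsd(prompt_xyz, gen_xyz):
    """Constrained-site RMSD between prompt coordinates and the corresponding
    generated coordinates (assumes both are already in the same frame)."""
    prompt_xyz, gen_xyz = np.asarray(prompt_xyz), np.asarray(gen_xyz)
    return float(np.sqrt(((prompt_xyz - gen_xyz) ** 2).sum(axis=-1).mean()))

def ss3_accuracy(prompt_ss3, gen_ss3):
    """Fraction of residues whose three-class secondary structure matches the prompt."""
    prompt_ss3, gen_ss3 = np.asarray(list(prompt_ss3)), np.asarray(list(gen_ss3))
    return float((prompt_ss3 == gen_ss3).mean())

def sasa_spearman(prompt_sasa, gen_sasa):
    """Spearman correlation between prompted and generated SASA values."""
    rho, _ = spearmanr(prompt_sasa, gen_sasa)
    return float(rho)
```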
Unconditional generations reflect the distribution of natural proteins. Since we observed that ESM3 can faithfully follow prompts, we reasoned that prompting could steer the model to generate proteins that differ from natural proteins. First we test the ability of the model to follow out-of-distribution prompts. We construct a set of prompts combining SS8 and SASA from held-out structures (TM < 0.7 to the training set). Under these prompts, while the model continues to generate coherent globular structures (mean pTM 0.85 ± 0.03), the distribution of similarities to the training set (as measured by TM-score and sequence identity) shifts to be more novel (average sequence identity to the nearest training set protein < 20% and mean TM-score 0.48 ± 0.09; Fig. 2B, top). To test the ability to generalize to structures beyond the distribution of natural proteins, we use secondary structure prompts derived from a dataset of artificial symmetric protein designs distinct from the natural proteins found in the training dataset (Appendix A.3.8). Similarly, ESM3 produces high-confidence generations (pTM > 0.8, pLDDT > 0.8) with low sequence and structure similarity to proteins in the training set (sequence identity < 20% and TM-score 0.52 ± 0.10; Fig. 2B, bottom), indicating that the model can be used to generate protein sequences and structures highly distinct from those that exist in nature.
Figure 2. Generative programming with ESM3. (A) ESM3 can follow prompts from each of its input tracks. Density of faithfulness to prompting for each of the tracks is shown. Generations achieve consistency with the prompt and high foldability (pTM). (B) ESM3 can be prompted to generate proteins that differ in structure (left) and sequence (right) from natural proteins. Prompted generations (blue) shift toward a more novel space vs. unconditional generations (red), in response to prompts derived from out-of-distribution natural structures (upper panel) and computationally designed symmetric proteins (lower panel). (C) ESM3 generates creative solutions to a variety of combinations of complex prompts. We show compositions of atomic-level motifs with high-level instructions specified through keywords or secondary structure. Fidelity to the prompt is shown via similarity to a reference structure (for keyword prompts) and all-atom RMSD to the prompted structure (for atomic coordination prompts). Solutions differ from the scaffolds from which the motifs were derived (median TM-score 0.36 ± 0.14), and for many motifs (e.g. serotonin, calcium, protease inhibitor, and Mcl-1 inhibitor binding sites), we could find no significant similarity to other proteins that contain the same motif. (D) An example of especially creative behavior: ESM3 compresses a serine protease by 33% while maintaining the active site structure.
ESM3 is able to follow complex prompts, and has the ability to compose prompts from different tracks, and at different levels of abstraction. To evaluate this ability, we prompt ESM3 with motifs that require the model to solve for spatial coordination of individual atoms, including ones requiring tertiary coordination between residues far apart in the sequence, such as catalytic centers and ligand binding sites. We combine these with prompts that specify the fold architecture.
For each unique combination of motif and scaffold, we generate samples until the prompt is satisfied (cRMSD < 1.5 Å for coordinates; TM > 0.6 to a representative structure for fold-level prompts; and SS3 accuracy > 80% for secondary structure prompts) with high confidence (pTM > 0.8, pLDDT > 0.8). We find that ESM3 is able to solve a wide variety of such tasks (Fig. 2C). It does so without retrieving the motif's original scaffold (median TM-score of 0.40 ± 0.10 to the reference protein; Appendix A.3.9). In some cases, the scaffolds are transferred from existing proteins which have similar motifs (for example, the ESM3-designed alpha-helical scaffold for the zinc-binding motif has high similarity to Ni2+-binding proteins, PDB: 5DQW, 5DQY; Fig. 2C, row 3, column 1). For many motifs (e.g., binding sites for serotonin, calcium, protease inhibitor, and Mcl-1 inhibitor), Foldseek (39) finds no significant similarity to other proteins that contain the same motif. In these cases we observe that sometimes the motif has been grafted into entirely different folds (e.g. a protease inhibitor binding site motif in a beta barrel which is most similar to a membrane-bound copper transporter, PDB: 7PGE; Fig. 2C, row 3, column 3). At other times, the scaffold appears to be entirely novel, such as an alpha/beta protein designed to scaffold the Mcl-1 inhibitor binding motif, which has low structural similarity to all known proteins in the PDB, ESMAtlas, and the AlphaFold databases (max. TM-score < 0.5; Fig. 2C, row 4, column 1). Overall, the generated solutions have high designability, i.e. confident recovery of the original structure after inverse folding with ESMFold (median pTM 0.80 ± 0.08; scTM 0.96 ± 0.04; Appendix A.3.9).
Through experiments with prompt engineering, we have observed especially creative responses to prompts. Here, we highlight an example of protein compression. Starting from a natural trypsin (PDB 1Y3V), we prompt with the sequence and coordinates of the catalytic triad as well as functional keywords describing trypsin, but reduce the overall generation length by a third (from 223 to 150 residues). ESM3 maintains the coordination of the active site (cRMSD 0.73 Å) and the overall fold with high designability (pTM 0.84, scTM mean 0.97, std 0.006), despite the significant reduction in sequence length and the fold only being specified by the function keyword prompt (Fig. 2D). These examples illustrate ESM3's ability to find creative solutions to prompts specified in any of its input tracks, individually or in combination. This capability enables a rational approach to protein design, providing control at various levels of abstraction, from high-level topology to atomic coordinates, using a generative model to bridge the gap between the prompt and biological complexity.
Biological alignment
While we have observed meaningful increases in performance in the base models with scale, larger models could have even greater latent capabilities that we do not observe. The base ESM3 models can be prompted to perform difficult tasks such as atomic coordination and composition of prompts, despite the fact that the models have not been explicitly optimized for these objectives. Likewise, the properties we evaluate generative outputs on, such as high pTM, low cRMSD, and adherence to multimodal prompting, are only seen by the model indirectly during pre-training. Aligning the model directly to these tasks with finetuning could elicit even greater capability differences with larger models.
We study how the base models can be aligned (40) to generate proteins that satisfy challenging prompts. To do this, for each model we construct a dataset of partial structure prompts, generate multiple protein sequences for each prompt, and then fold and score each of the sequences using ESM3 for consistency with the prompt (cRMSD) and foldability (pTM). High quality samples are paired with low quality samples for the same prompt to construct a preference dataset (Appendix A.4). ESM3 is then tuned to optimize a preference tuning loss, which incentivizes the model to put higher likelihood on the high quality samples compared to the low quality samples (Appendix A.4) (41, 42).
After aligning the ESM3 1.4B, 7B, and 98B base models, we evaluate their absolute performance, and the shift in the distribution of generations. To measure consistency of a generation with a prompt, the generated sequence is folded and success is measured based on structural metrics (backbone cRMSD < 1.5 Å) and foldability (pTM > 0.8). To ensure that the model used for evaluation is orthogonal to that used for creating the preference dataset, we conduct these evaluations using ESMFold.
We examine the ability of the model to generate high-quality scaffolds using challenging tertiary motif scaffolding prompts. We prompt ESM3 with the amino acid identities and atomic coordinates of residues derived from a dataset of 46 ligand binding motifs in a set of temporally held out proteins (Appendix A.4.5). For each motif task, we create 1024 prompts by permuting the order of the residues, varying their position in the sequence, and varying the length of the sequence. A single protein is generated per prompt. We evaluate success using the percentage of tasks solved (backbone cRMSD < 1.5 Å, pTM > 0.8) after 128 generations (Appendix A.4.5).
Preference tuned models solve double the atomic coordination tasks compared to base models (Fig. 3A). While the base models show differences in the fraction of tasks solved (9.5% for 1.4B, 19.0% for 7B, 26.8% for 98B; Fig. 3A), a much larger capability difference is revealed through alignment (9.5% to 18.8%, 19.0% to 37.4%, and 26.8% to 65.5% for the 1.4B, 7B, and 98B models, respectively).
Figure 3. The ability to solve complex tasks increases with scale through alignment. ESM3 is aligned to follow atomic coordination prompts with a dataset of preference pairs constructed from prompted generations, where positive samples with good scores for desired properties (high pTM, low cRMSD) are paired with negative samples with worse scores. The preference tuning loss encourages the model to put higher likelihood on the positive samples. After training, models are evaluated by prompting with coordinates in tertiary contact. (A) We show the effect of finetuning on the fraction of tasks solved with 128 generations (Pass@128). A large gap opens between the models with scale. The response to alignment shows a latent capability to solve complex tasks in the largest model. Error bars show 2 standard deviations. (B) Number of distinct solutions (clustered at TM > 0.8) generated for each tertiary motif. After finetuning we often see a number of unique structures for ligands for which we have successes. (C) Densities of prompted generations are shown for the base model (left) and the aligned model (right) at the 98B scale for a number of randomly selected ligands. After alignment, the fidelity to the prompt (cRMSD) and the quality of generations (pTM) tend to improve meaningfully.
Preference-tuned models not only solve a greater proportion of tasks, but also find a greater number of solutions per task, as evaluated by the number of distinct structural clusters (TM > 0.8) with backbone cRMSD < 1.5 Å and pTM > 0.8 (Fig. 3B). A shift in the distribution of ESMFold pTM and backbone cRMSD for each ligand binding motif is observed (Fig. 3C; Fig. S17). At the 98B scale, the finetuned model produces more distinct successful clusters than the base model on 37 of the 46 tested ligands, while the remaining 9 ligands were not solved by either the base or the aligned model, indicating that alignment almost universally improves the faithfulness to the prompt and the foldability of the generated proteins. Compared to a supervised finetuning baseline, which only maximizes the likelihood of the positive examples, preference tuning leads to larger improvements at all scales (Appendix A.4.6).
These results demonstrate that preference tuning extracts latent capability in the models. The capability of larger models to solve challenging tasks becomes far more apparent after alignment. Since alignment can be performed with arbitrary objectives, this is an indication of a general ability to respond to finetuning that greatly improves with scale.
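The text specifies only that the preference loss puts higher likelihood on preferred samples (Appendix A.4; refs. 41, 42). A DPO-style pairwise loss is one common form consistent with that description; the sketch below assumes that form and is not necessarily the exact objective used for ESM3.

```python
import torch
import torch.nn.functional as F

def preference_loss(logp_chosen, logp_rejected,
                    ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style pairwise preference loss (sketch; the exact objective used
    for ESM3 alignment is specified in Appendix A.4).

    Each argument is the summed log-likelihood of a sampled protein under the
    model being tuned (logp_*) or a frozen reference model (ref_logp_*).
    """
    chosen_margin = logp_chosen - ref_logp_chosen
    rejected_margin = logp_rejected - ref_logp_rejected
    return -F.logsigmoid(beta * (chosen_margin - rejected_margin)).mean()
```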
We sought to understand if the base pre-trained ESM3 model has sufficient biological fidelity to generate functional proteins.
We set out to create a functional green fluorescent protein (GFP) with low sequence similarity to existing ones. We chose the functionality of fluorescence because it is difficult to achieve, easy to measure, and one of the most beautiful mechanisms in nature. Responsible for the fluorescence of jellyfish and the vivid colors of coral (43), proteins in the GFP family are unique in their ability to form a fluorescent chromophore without cofactors or substrates (27).
This property allows the GFP sequence to be inserted into the genomes of other organisms to visibly label molecules, cellular structures, or processes, providing a foundational toolkit that has been broadly applied across the biosciences. The GFP family has been the subject of decades of protein engineering efforts, but still the vast majority of functional variants have come from prospecting the natural world.
Rational design and machine learning-assisted high-throughput screening have yielded GFP sequences with improved properties, such as higher brightness or stability, or differently colored variants, that incorporate small numbers of mutations (typically 5 to 15, out of the 238 amino acid coding sequence) from the originating sequence. Studies have shown that only a few random mutations reduce fluorescence to zero (44–46), whereas in rare cases, leveraging high-throughput experimentation, scientists have been able to introduce up to 40–50 mutations, i.e. a 20% difference in total sequence identity (44, 47, 48), while retaining GFP fluorescence.
Generating a new GFP would require materialization of the complex biochemistry and physics that underlie its fluorescence. In all GFPs, an autocatalytic process forms the chromophore from three key amino acids in the core of the protein. The unique structure of GFP, a kinked central alpha helix surrounded by an eleven-stranded beta barrel with inward-facing coordinating residues, enables this reaction (49).
Figure 4.
Generating a new fluorescent protein with a chain of thought.
(A) We prompt ESM3 with the sequence and structure of residues required for forming and catalyzing the chromophore reaction, as well as the structure of part of the central alpha helix from a natural fluorescent protein (left).
Through a chain of thought, ESM3 generates design candidates (right).
(B) ESM3 found a bright GFP distant from other known GFPs in two experiments.
We measured fluorescence in E. coli lysate.
Top row, photograph of plates.
Bottom row, plate reader fluorescence quantification.
Positive controls of known GFPs are marked with purple circles; negative controls with no GFP sequence or no E. coli are marked with red circles.
In the first experiment (left) we expressed designs with a range of sequence identities. A notable design with low sequence identity to known fluorescent proteins appears in the well labeled B8 (highlighted in a black circle bottom, white circle top).
We continue the chain of thought from the protein in B8 for the second experiment (right).
A bright design appears in the well labeled C10 (black circle bottom, white circle top) which we designate esmGFP.
(C) esmGFP exhibits fluorescence intensity similar to common GFPs.
Normalized fluorescence is shown for a subset of proteins in experiment 2.
(D) Excitation and emission spectra for esmGFP overlaid on the spectra of EGFP.
(E) Two cutout views of the central alpha helix and the inside of the beta barrel of a predicted structure of esmGFP.
The 96 mutations esmGFP has relative to its nearest neighbor, tagRFP, are shown in blue.
(F) Cumulative density of sequence identity between fluorescent proteins across taxa.
esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across orders, but within the same class.
(G) Evolutionary distance by time in millions of years (MY) and sequence identities for three example anthozoa GFPs and esmGFP.
(H) Estimator of evolutionary distance by time (MY) from GFP sequence identity.
We estimate esmGFP is over 500 million years of natural evolution removed from the closest known protein.
Once formed, the chromophore must not just absorb light but also emit it in order to be fluorescent.
Light emission is highly sensitive to the local electronic environment of the chromophore.
For these reasons, obtaining a new functional GFP would require precise configuration of both the active site and the surrounding long-range tertiary interactions throughout the beta barrel.
In an effort to generate new GFP sequences, we directly prompt the base pretrained 7B parameter ESM3 to generate a 229-residue protein conditioned on the positions Thr62, Thr65, Tyr66, Gly67, Arg96, and Glu222, which are critical residues for forming and catalyzing the chromophore reaction (Fig. 4A).
We additionally condition on the structure of residues 58 through 71 from the experimental structure in 1QY3, which are known to be structurally important for the energetic favorability of chromophore formation (50). Specifically, sequence tokens, structure tokens, and atomic coordinates of the backbone are provided at the input and generation begins from a nearly completely masked array of tokens corresponding to 229 residues, except for the token positions used for conditioning. We generate designs using a chain-of-thought procedure as follows.
The model first generates structure tokens, effectively creating a protein backbone.
Backbones that have sufficiently good atomic coordination of the active site but differentiated overall structure from the 1QY3 backbone pass through a filter to the next step of the chain.
We add the generated structure to the original prompt to generate a sequence conditioned on the new prompt.
We then perform an iterative joint optimization, alternating between optimizing the sequence and the structure.
We reject chains of thought that lose atomic coordination of the active site (Appendix A.5.1).
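Putting the chain-of-thought procedure into pseudocode (one reading of the steps above; the model interface, filters, and thresholds are hypothetical stand-ins, and the real filters are specified in Appendix A.5.1):

```python
def gfp_chain_of_thought(model, prompt, n_rounds=3,
                         keeps_active_site=lambda design: True,
                         differs_from_template=lambda design: True):
    """Sketch of the chain-of-thought generation loop described in the text.

    `model.generate_structure`, `model.generate_sequence`, and the two filter
    callables are hypothetical stand-ins; the real filters check active-site
    atomic coordination and divergence from the 1QY3 backbone.
    """
    candidates = []
    # Step 1: propose a backbone (structure tokens) from the active-site prompt.
    backbone = model.generate_structure(prompt)
    if not (keeps_active_site(backbone) and differs_from_template(backbone)):
        return candidates
    # Step 2: design a sequence conditioned on prompt + generated backbone.
    design = model.generate_sequence(prompt, backbone)
    # Step 3: iterative joint optimization, alternating structure and sequence.
    for _ in range(n_rounds):
        backbone = model.generate_structure(prompt, sequence=design)
        design = model.generate_sequence(prompt, backbone)
        if not keeps_active_site(backbone):
            break  # reject chains of thought that lose active-site coordination
        candidates.append(design)
    return candidates
```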
We draw a computational pool of tens of thousands of candidate GFP designs from the intermediate and final points in the iterative joint optimization stage of the generation protocol.
We then bucket the designs by sequence similarity to known fluorescent proteins and filter and rank designs using a variety of metrics (details in Appendix A.5.1.5).
We performed a first experiment with 88 designs on a 96 well plate, with the top generations in each sequence similarity bucket.
Each generated protein was synthesized, expressed in E. coli, and measured for fluorescence activity at an excitation wavelength of 485 nm (Fig. 4B, left). We measured brightness similar to positive controls from a number of designs that have higher sequence identity with naturally occurring GFPs.
We also identify a design in well B8 (highlighted in a black circle) with only 36% sequence identity to the 1QY3 sequence and 57% sequence identity to the nearest existing fluorescent protein, tagRFP.
This design was 50x less bright than natural GFPs and its chromophore matured over the course of a week, instead of in under a day, but it presents a signal of function in a new portion of sequence space that to our knowledge has not been found in nature or through protein engineering.
We continue the chain of thought starting from the sequence of the design in well B8 to generate a protein with improved brightness, using the same iterative joint optimization and ranking procedure as above.
We create a second 96 well plate of designs, and using the same plate reader assay we find that several designs in this cohort have a brightness in the range of GFPs found in nature.
The best design, located in well C10 of the second plate (Fig. 4B right), we designate esmGFP.
We find esmGFP exhibits brightness in the distribution of natural GFPs.
We evaluated the fluorescence intensity at 0, 2, and 7 days of chromophore maturation, and plot these measurements for esmGFP, a replicate of B8, and a chromophore knockout of B8, along with three natural GFPs (avGFP, cgreGFP, ppluGFP) (Fig. 4C). esmGFP takes longer to mature than the known GFPs that we measured, but achieves a comparable brightness after two days.
To validate that fluorescence was mediated by the intended Thr65 and Tyr66, we show that B8 and esmGFP variants where these residues were mutated to glycine lost fluorescence activity (Fig.S21).
Analysis of the excitation and emission spectra of esmGFP reveals that its peak excitation occurs at 496 nm, which is shifted 7 nm relative to the 489 nm peak for EGFP, while both proteins emit at a peak of 512 nm (Fig. 4D).
The shapes of the spectra indicate a narrower full-width half-maximum (FWHM) for the excitation spectrum of esmGFP (39 nm for esmGFP vs. 56 nm for EGFP), whereas the FWHMs of their emission spectra are highly comparable (35 nm and 39 nm, respectively).
Overall esmGFP exhibits spectral properties consistent with known GFPs.
We next sought to understand how the sequence and structure of esmGFP compares to known proteins.
A BLAST (51) search against the non-redundant protein sequences database and an MMseqs (52) search of ESM3's training set report the same top hit, tagRFP (which was also the nearest neighbor to B8), with 58% sequence identity, representing 96 mutations throughout the sequence. tagRFP is a designed variant, and the closest wildtype sequence to esmGFP from the natural world is eqFP578, a red fluorescent protein, which differs from esmGFP at 107 sequence positions (53% identity). Sequence differences between esmGFP and tagRFP occur throughout the structure (Fig. 4E), with 22 mutations occurring in the protein's interior, which is known to be intensely sensitive to mutations due to chromophore proximity and a high density of interactions (46).
Examination of a sequence alignment of 648 natural and designed GFP-like fluorescent proteins revealed that esmGFP has the level of similarity to all other FPs that is typically found when comparing sequences across taxonomic orders, but within the same taxonomic class (Fig. 4F). For example, the difference of esmGFP from other FPs is similar to the level of difference between FPs belonging to the orders Scleractinia (stony corals) and Actiniaria (sea anemones), both of which belong to the larger class Anthozoa of marine invertebrates (Fig. 4G). The closest FPs to esmGFP come from the Anthozoa class (corals and anemones), with an average sequence identity of 51.4%, but esmGFP also shares some sequence identity with FPs from the Hydrozoa (jellyfish), where the famous avGFP was discovered, with an average sequence identity of 33.4% (Fig. S22).
We can draw insight from evolutionary biology on the amount of time it would take for a protein with similar sequence identity to arise through natural evolution. In Fig. 4G we show esmGFP alongside three Anthozoan GFPs. We use a recent time-calibrated phylogenetic analysis of the Anthozoans (53) that estimated the millions of years ago (MYA) to last common ancestors to estimate evolutionary time between each pair of these species. Using a larger dataset of six Anthozoan GFPs and species for which we have accurate MYA to last common ancestors and GFP sequence identities, we construct a simple estimator that correlates sequence identity between FPs to MY of evolutionary time between the species (Fig. 4H) to calibrate against natural evolution. Based on this analysis we estimate esmGFP represents an equivalent of over 500 million years of evolution from the closest protein that has been found in nature.
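The estimator is described only as a correlation between FP sequence identity and MY of divergence, calibrated on six Anthozoan GFP pairs; a minimal sketch assuming a simple linear fit (the calibration values themselves are not reproduced here):

```python
import numpy as np

def fit_identity_to_mya(identities, mya):
    """Fit a simple linear map from pairwise GFP sequence identity to millions
    of years (MY) of divergence, given calibration pairs such as the six
    Anthozoan GFPs used in Fig. 4H (values not reproduced here). The linear
    functional form is an assumption for illustration."""
    slope, intercept = np.polyfit(np.asarray(identities, dtype=float),
                                  np.asarray(mya, dtype=float), deg=1)
    return lambda identity: slope * identity + intercept
```

Given calibration arrays, `estimate = fit_identity_to_mya(identities, mya)` followed by `estimate(0.58)` would give the estimate for esmGFP's 58% identity to its nearest neighbor.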
We have found that language models can reach a design space of proteins that is distant from the space explored by natural evolution, and generate functional proteins that would take evolution hundreds of millions of years to discover. Protein language models do not explicitly work within the physical constraints of evolution, but instead can implicitly construct a model of the multitude of potential paths evolution could have followed.
Proteins can be seen as existing within an organized space where each protein is neighbored by every other that is one mutational event away (54). The structure of evolution appears as a network within this space, connecting all proteins by the paths that evolution can take between them. The paths that evolution can follow are the ones by which each protein transforms into the next without the collective loss of function of the system it is a part of.
It is in this space that a language model sees proteins. It sees the data of proteins as filling this space, densely in some regions, and sparsely in others, revealing the parts that are accessible to evolution. Since the next token is generated by evolution, it follows that to solve the training task of predicting the next token, a language model must predict how evolution moves through the space of possible proteins. To do so it will need to learn what determines whether a path is feasible for evolution.
Simulations are computational representations of reality. In that sense a language model which can predict possible outcomes of evolution can be said to be a simulator of it. ESM3 is an emergent simulator that has been learned from solving a token prediction task on data generated by evolution. It has been theorized that neural networks discover the underlying structure of the data they are trained to predict (55, 56). In this way, solving the token prediction task would require the model to learn the deep structure that determines which steps evolution can take, i.e. the fundamental biology of proteins.
In ESM3's generation of a new fluorescent protein, it is the first chain of thought to B8 that is the most intriguing. At 96 mutations from B8's closest neighbor there are (229 choose 96) × 19^96 possible proteins, an astronomical number out of which only a vanishingly small fraction can have function, since fluorescence falls off sharply even after just a few random mutations. The existence of C10 and other bright designs in the neighborhood of B8 confirms that in the first chain of thought to B8, ESM3 found a new part of the space of proteins that, although unexplored by nature, is dense with fluorescent proteins.
ACKNOWLEDGEMENTS
We thank Eric Schreiter, Karel Svoboda, and Srinivas Turaga for feedback on the properties of esmGFP. We thank Marko Iskander, Vishvajit Kher, and the Andromeda cluster team for support on compute infrastructure. We thank April Pawluk for assistance with manuscript preparation. We also thank the experts who provided feedback on our approach to responsible development, and the experts who participated in the review of the risks and benefits of releasing ESM3-open.
CONTRIBUTIONS
Data: H.A., Z.L., R.R., A.R., T.S., N.T., R.V.
Pre-training: H.A., S.C., J.D., T.H., Z.L., D.O., R.R., A.R., T.S., I.S., R.V., M.W.
Post-training: H.A., S.C., A.D., J.G., T.H., D.O., R.R., A.R., M.W.
Evaluation and Analysis: R.B., J.D., A.D., T.H., Y.K., C.K., Z.L., R.S.M., A.R., N.J.S.
Open Model & Responsible Development: J.G., I.S., N.J.S., T.S., R.S.M., Z.L., R.R., A.R., N.T.
API & Deployment: J.G., C.M., R.S.M., Z.L., T.S.
GFP Computational: S.C., T.H., N.J.S., A.R., R.V.
GFP Experimental Validation: L.J.B., P.D.H., Y.K., N.J.S., N.T., V.Q.T.
COMPETING INTERESTS
Authors H.A., R.B., S.C., J.D., A.D., J.G., T.H., C.K., Z.L., R.S.M., C.M., D.O., R.R., A.R., N.J.S., T.S., I.S., N.T., R.V., and M.W. are employees of EvolutionaryScale, PBC. P.D.H. is a cofounder of Stylus Medicine, Circle Labs, and Spotlight Therapeutics, serves on the board of directors at Stylus Medicine, is a board observer at EvolutionaryScale, Circle Labs, and Spotlight Therapeutics, a scientific advisory board member at Arbor Biosciences and Veda Bio, and an advisor to NFDG, Varda Space, and Vial Health. Patents have been filed related to aspects of this work.
MODEL AND DATA AVAILABILITY
Weights and code for ESM3-open are provided for academic research use. The ESM3-open model was reviewed by a committee of technical experts who found that the benefits of releasing the model greatly outweighed any potential risks. ESM3 models will be available via API with a free access tier for academic research. The sequence of esmGFP (along with the other GFPs generated for this work) is committed to the public domain. Plasmids for esmGFP-C10 and esmGFP-B8 will be made available.
References
[1] UniProt Consortium. UniProt: a hub for protein information. Nucleic Acids Research, 43(D1):D204–D212, 2015.
[2] Igor V Grigoriev, Henrik Nordberg, Igor Shabalov, Andrea Aerts, Mike Cantor, David Goodstein, Alan Kuo, Simon Minovitsky, Roman Nikitin, Robin A Ohm, et al. The genome portal of the Department of Energy Joint Genome Institute. Nucleic Acids Research, 40(D1):D26–D32, 2012.
[3] Alex L Mitchell, Alexandre Almeida, Martin Beracochea, Miguel Boland, Josephine Burgin, Guy Cochrane, Michael R Crusoe, Varsha Kale, Simon C Potter, Lorna J Richardson, Ekaterina Sakharova, Maxim Scheremetjew, Anton Korobeynikov, Alex Shlemov, Olga Kunyavskaya, Alla Lapidus, and Robert D Finn. MGnify: the microbiome analysis resource in 2020. Nucleic Acids Research, 48(D1):D570–D578, January 2020. doi: 10.1093/nar/gkz1035.
[4] Mihaly Varadi, Damian Bertoni, Paulyna Magana, Urmila Paramval, Ivanna Pidruchna, Malarvizhi Radhakrishnan, Maxim Tsenkov, Sreenath Nair, Milot Mirdita, Jingi Yeo, Oleg Kovalevskiy, Kathryn Tunyasuvunakool, Agata Laydon, Augustin Žídek, Hamish Tomlinson, Dhavanthi Hariharan, Josh Abrahamson, Tim Green, John Jumper, Ewan Birney, Martin Steinegger, Demis Hassabis, and Sameer Velankar. AlphaFold Protein Structure Database in 2024: providing structure coverage for over 214 million protein sequences. Nucleic Acids Research, 52(D1):D368–D375, January 2024. doi: 10.1093/nar/gkad1011.
[5] Zeming Lin, Halil Akin, Roshan Rao, Brian Hie, Zhongkai Zhu, Wenting Lu, Nikita Smetanin, Robert Verkuil, Ori Kabeli, Yaniv Shmueli, et al. Evolutionary-scale prediction of atomic-level protein structure with a language model. Science, 379(6637):1123–1130, 2023.
[6] Ethan C Alley, Grigory Khimulya, Surojit Biswas, Mohammed AlQuraishi, and George M Church. Unified rational protein engineering with sequence-based deep representation learning. Nature Methods, 16(12):1–8, 2019.
[7] Alexander Rives, Joshua Meier, Tom Sercu, Siddharth Goyal, Zeming Lin, Jason Liu, Demi Guo, Myle Ott, C Lawrence Zitnick, Jerry Ma, et al. Biological structure and function emerge from scaling unsupervised learning to 250 million protein sequences. Proceedings of the National Academy of Sciences, 118(15):e2016239118, April 2021. doi: 10.1073/pnas.
A.1.7. Structure Tokenizer
Each residue is associated with one of 4,096 structure tokens (+4 special tokens), designed to provide a rich, learned representation of its local neighborhood.
The tokens are generated with a VQ-VAE encoder, with a corresponding decoder to enable decoding of generated tokens back to 3D coordinates.
A.1.7.1. ENCODER
The VQ-VAE encoder f_enc consists of two geometric attention blocks (Transformer blocks, but with self-attention replaced by geometric_mha) with an embedding width of 1024 and 128 geometric heads per geometric attention layer. The VQ-VAE encoder reasons over the backbone frames and the relative sequence position of residues in the local structure.
Relative sequence positions are encoded through a learned positional embedding.
Sequence positions are determined relative to the query residue (i.e., if the query residue has residue index 56, then the residue in index 58 has a +2 sequence position).
Relative sequence positions are clamped to +/- 32 before encoding, meaning long-range contacts share sequence positional embeddings.
Relative sequence positional embeddings define the initial encoder state N, which has shape L × 16 × d (Algorithm 7, line 4). Note that this means the input to the VQ-VAE encoder is purely structural: no sequence (amino acid), function, or other information is used here.
Furthermore, each neighborhood is processed completely independently; for each residue, the encoder only uses the information of its 16 nearest neighbors. Geometric attention blocks operate similarly to Transformer blocks in that they transform a state according to an attention operation (geometric_mha) and a feedforward network (SwiGLU MLP).
As such, the output has the same shape as the input.
In this case, this means that the encoder outputs 16 latents per residue.
However, we want to learn a single token, i.e., a single latent per residue, hence we take the embedding corresponding to the query residue position, N_{:,0,:}.
The process of generating structure tokens (Algorithm 7) from the full 3D coordinates of the protein is as follows:
1. Local neighborhood: For each residue, obtain the indices N_idx ∈ {0..L−1}^{L×16} of the 16 nearest residues (as measured by Cα distance). The first of the 16 neighbors is always the residue itself. We also obtain the frames for each residue in a local neighborhood with T_knn.
2. Embed neighbors: Embed the relative distance in sequence space for each neighbor, ∆i = clamp(N_idx − i, −32, 32), to form N ∈ R^{L×16×d}.
3. Encode: Pass N through a shallow encoder f_enc consisting of 2 Transformer blocks, with regular multi-head self-attention swapped for geometric_mha. The attention is unmasked, all-to-all over the entire neighborhood.
4. Quantize: Extract the first element N_{:,0,:} from the neighborhood, which corresponds to the residue itself. Project it linearly, and quantize by replacing it with the nearest vector in a codebook. This yields the structure token per residue.
Algorithm 7: structure_encode
Input: x_Cα ∈ R^{L×3}, T ∈ SE(3)^L
1: N_idx = knn(x_Cα)              ▷ {0..L−1}^{L×16}
2: T_knn = T[N_idx]               ▷ SE(3)^{L×16}
3: ∆i = clamp(N_idx − i, −32, 32)
4: N = embed(∆i)                  ▷ R^{L×16×d}
5: N = f_enc(N, T_knn)            ▷ R^{L×16×d}
6: z = Linear(N_{:,0,:})          ▷ R^{L×d′}
7: z = quantize(z)                ▷ {0..4095}^L
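A sketch mirroring Algorithm 7's data flow in PyTorch; `embed`, `f_enc`, `linear`, and `codebook` are stand-in modules/tensors, and the local SE(3) frames (T_knn) that the real geometric attention consumes are omitted here.

```python
import torch

def structure_encode(x_ca, embed, f_enc, linear, codebook, k=16, clamp=32):
    """Sketch of Algorithm 7 (structure_encode), frames omitted.

    x_ca: (L, 3) C-alpha coordinates, L >= k.
    embed: e.g. nn.Embedding(2 * clamp + 1, d).
    f_enc: stand-in encoder mapping (L, k, d) -> (L, k, d).
    linear: projection (d -> d_code). codebook: (4096, d_code).
    Returns: (L,) structure token indices.
    """
    L = x_ca.shape[0]
    # 1. Local neighborhood: indices of the k nearest residues (self included).
    dists = torch.cdist(x_ca, x_ca)                          # (L, L)
    n_idx = dists.topk(k, largest=False).indices             # (L, k), column 0 is self
    # 2. Embed clamped relative sequence positions.
    rel = (n_idx - torch.arange(L).unsqueeze(1)).clamp(-clamp, clamp)
    n = embed(rel + clamp)                                    # (L, k, d)
    # 3. Encode each neighborhood independently.
    n = f_enc(n)                                              # (L, k, d)
    # 4. Quantize the query-residue slot against the codebook.
    z = linear(n[:, 0, :])                                    # (L, d_code)
    tokens = torch.cdist(z, codebook).argmin(dim=-1)          # (L,)
    return tokens
```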
A.1.7.1.1. CODEBOOK LEARNING
quantize transforms the L latents into L discrete tokens. Since the VQ-VAE was initially proposed (67), numerous approaches and tricks have been developed to address issues with poor codebook utilization and unstable training. We chose to learn the codebook as an exponential moving average of encoder outputs (67–69). To improve codebook utilization, unused codes are re-initialized to encoder outputs.
A.1.7.1.2. PARALLEL ENCODING
To improve training and inference efficiency, we encode all local structure graphs within a protein in parallel. In practice, this means that given a batch of B proteins with average sequence length L, the inputs to the structure encoder will have shape BL × 16 × d.
A.1.7.2. DECODER
While the encoder independently processes all local structures in parallel, the decoder f_dec attends over the entire set of L tokens to reconstruct the full structure. It is composed of a stack of bidirectional Transformer blocks with regular self-attention. As discussed in Appendix A.1.7.3, the VQ-VAE is trained in two stages. In the first stage, a smaller decoder trunk consisting of 8 Transformer blocks with width 1024, rotary positional embeddings, and MLPs is trained to predict only backbone coordinates. In the second stage, the decoder weights are re-initialized and the network size is expanded to 30 layers, each with an embedding dimension of 1280 (~600M parameters), to predict all-atom coordinates. The exact steps to convert structure tokens back to 3D all-atom coordinates using the decoder are provided in Algorithm 8 and detailed as follows.
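For the codebook learning in A.1.7.1.1, here is a sketch of an exponential-moving-average update with re-initialization of unused codes, in the spirit of (67–69); the decay value and dead-code threshold are illustrative choices, not the values used for ESM3.

```python
import torch

@torch.no_grad()
def ema_codebook_update(codebook, ema_counts, ema_sums, z, decay=0.99):
    """One EMA codebook update step (sketch).

    codebook: (K, d) code vectors; ema_counts: (K,); ema_sums: (K, d);
    z: (N, d) encoder outputs for the current batch.
    """
    assignments = torch.cdist(z, codebook).argmin(dim=-1)                 # (N,)
    one_hot = torch.nn.functional.one_hot(assignments, codebook.shape[0]).float()
    # Exponential moving averages of per-code counts and summed embeddings.
    ema_counts.mul_(decay).add_(one_hot.sum(dim=0), alpha=1 - decay)
    ema_sums.mul_(decay).add_(one_hot.t() @ z, alpha=1 - decay)
    codebook.copy_(ema_sums / ema_counts.clamp(min=1e-5).unsqueeze(1))
    # Re-initialize unused codes to random encoder outputs (utilization trick).
    dead = ema_counts < 1e-3
    if dead.any():
        rand_idx = torch.randint(0, z.shape[0], (int(dead.sum()),))
        codebook[dead] = z[rand_idx]
    return assignments
```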
1 − (1 − 1/n)^k, with a cluster of size n and k items selected. Computing this on the size of each cluster and the number of dataset repeats results in the approximate number of tokens presented in Table S4. Our largest model is trained on all of this data, while our smaller models use a portion of it depending on the model's token budget.
A.2.2. Pre-training Tasks
A.2.2.1. NOISE SCHEDULE
In the masked generative framework, corruption is applied to each input to the model. To enable generation, the amount of noise applied to an input is sampled from a distribution with probability mass on all values between 0 and 1.
We select various noise schedules for different tracks with several goals in mind. First, ESM3 should see all combinations of tracks as input and output, enabling it to generate and predict based on arbitrary inputs. Second, ESM3 should maintain a balance of strong representation learning and high quality generations. Third, the types of inputs provided should be representative of what users would like to prompt the model with.
In initial experimentation, we found that a fixed 15% noise schedule led to poor generation results, while a linear noise schedule, where the probability of each mask rate was constant, led to good generation but poor representation learning results. We find a good trade-off between representation learning and generation by sampling the noise schedule from a mixture distribution: 80% of the time, the mask rate is sampled from a β(3, 9) distribution with mean mask rate 25%; 20% of the time, the mask rate is sampled from a uniform distribution, resulting in an average overall mask rate of 30%. The noise schedules applied to each input are listed in Table S6.
For the structure coordinate track, we also modify the noise to be applied as span dropping, as opposed to i.i.d. over the sequence, with 50% probability. This ensures that the model sees contiguous regions of masked and provided coordinates, which better mimics the types of inputs users may provide.
A.2.2.2. TRACK DROPOUT
Along with applying noise to each track, we want to ensure ESM3 is able to perform well when some tracks are not provided at all (e.g. to perform structure prediction when no structure is provided as input). We enable this by wholly dropping out some tracks with varying probabilities, listed in Table S6.
A.2.2.3. STRUCTURE NOISE
We apply Gaussian noise with standard deviation 0.1 to all coordinates the model takes as input.
A.2.2.4. ATOMIC COORDINATION SAMPLING
An interesting use case of generative protein models involves conditioning on key structural information, such as an active site, and generating the sequence and structure of a protein that contains this information. It is possible to define an atomic coordination task as 3 residues which are mutually in contact in structure space (Cα–Cα distance < 6 Å) but distant in sequence space (≥ 10 positions apart) (23). Training on this conditioning may enable the model to better perform the type of atomic coordination required for active site sampling.
While this task will be sampled with some probability under the standard noise schedules, we also manually sample the task with 5% probability whenever a structure is available. If the task is sampled and a valid atomic coordination triplet is found, the structure coordinates for that triplet are shown to the model. For each residue in the triplet, the adjacent residues are also independently shown with 50% probability, which leads to a total size of between 3 and 9 residues. All other structure coordinates are masked; normal masking is applied to the other tracks.
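A brute-force sketch of finding atomic coordination triplets as defined in A.2.2.4 (mutually in contact in structure, distant in sequence); a production pipeline would prune or vectorize this, and the thresholds are taken from the text above.

```python
import itertools
import numpy as np

def find_coordination_triplets(ca_xyz, contact_cutoff=6.0, min_seq_sep=10):
    """Return residue triplets that are mutually in contact in structure
    (C-alpha distance < 6 A) but distant in sequence (>= 10 positions apart).
    Brute force for clarity.
    """
    ca_xyz = np.asarray(ca_xyz, dtype=float)
    d = np.linalg.norm(ca_xyz[:, None, :] - ca_xyz[None, :, :], axis=-1)
    triplets = []
    for i, j, k in itertools.combinations(range(len(ca_xyz)), 3):
        seq_ok = (j - i >= min_seq_sep) and (k - j >= min_seq_sep)
        struct_ok = (d[i, j] < contact_cutoff and d[j, k] < contact_cutoff
                     and d[i, k] < contact_cutoff)
        if seq_ok and struct_ok:
            triplets.append((i, j, k))
    return triplets
```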
All other structure coordinates are masked. Normal masking is applied to the other tracks.

Dataset | Type | Clustering Level | Expansion Level | Tokens | Release
UniRef | Sequence | 70% (83M) | 90% (156M) | 54.6B | 2023_02
MGnify | Sequence | 70% (372M) | 90% (621M) | 105.5B | 2023_02
JGI | Sequence | 70% (2029M) | - | 256B | All non-restricted studies available on July 30th, 2023
OAS | Sequence | 95% (1192M) | - | 132B | All sequences available on July 30th, 2023
PDB | Structure | - (203K) | - | 0.054B | All chains available on RCSB prior to May 1st, 2020
PDB Clustered | Structure | 70% (46K) | 100% (100K) | 0.027B | -
AlphaFoldDB | Structure | 70% (36M) | 90% (69M) | 40.5B | v4
ESMAtlas | Structure | 70% (87M) | 90% (179M) | 23.5B | v0, v2023_02

Table S3. Pre-training dataset statistics. Includes number of tokens, release, and clustering level. Numbers are derived after dataset filtering.

Dataset Name | Unique Samples (M) | Unique Tokens (M)
UniRef | 133 | 40,177
MGnify | 406 | 65,780
JGI | 2,039 | 265,070
OAS | 203 | 22,363
PDB | 0.2 | 55
AFDB | 68 | 20,510
ESMAtlas | 168 | 38,674
AFDB inverse folded | 111 | 33,300
ESMAtlas inverse folded | 251 | 57,730
Sequence | 3,143 | 484,441
Structure | 236 | 177,710
Annotation | 539 | 105,957
Total unique training tokens | | 768,109

Table S4. Pre-training unique token statistics. Broken down by token type and dataset type.

Dataset | Inverse Folding | Function Labels | SASA | Secondary Structure
UniRef | ✓ | ✓ | - | -
MGnify | ✓ | ✓ | - | -
JGI | ✗ | ✗ | - | -
OAS | ✗ | ✗ | - | -
PDB | ✗ | ✗ | ✗ | ✗
AlphaFoldDB | ✓ | ✓ | ✓ | ✓
ESMAtlas | ✓ | ✓ | ✓ | ✓

Table S5. Data augmentation and conditioning information applied to each dataset.

Track | Noise Schedule | Dropout Prob
Sequence | betalinear30 | 0
Structure Tokens | cosine | 0.25
Structure Coordinates | cubic | 0.5
Secondary Structure (8-class) | square root | 0.9
SASA | square root | 0.9
Function Tokens | square root | 0.9
Residue Annotations | square root | 0.9

Table S6. Noise schedules and dropout probabilities.

Figure S9. Visualization of noise schedules used. Left: the probability density function of all noise schedules used. Right: the betalinear30 distribution (drawn from β(3, 9) with 80% probability and a linear distribution with 20% probability) against a beta30 distribution (defined by β(3, 7)) and a linear distribution.

A.2.2.5. TERTIARY INTERFACE SAMPLING

Predicting and generating binding interfaces is another important task for generative protein models. To help with this capability, we add a computational data augmentation that simulates the binding interface task. We define a tertiary interface as one involving a long-range contact (Cα-Cα distance < 8 Å, ≥ 24 sequence positions apart). When this task is sampled (with 5% probability whenever a structure is present), a long-range contact is found and the chain is split into two chains, each containing one side of the contact interface. Suppose the contacting positions are given by the indices i, j. Then the first chain contains the residues between [RANDINT(1, i − 3), RANDINT(i + 3, j − 15)], while the second chain contains the residues between [RANDINT(i + 15, j − 3), RANDINT(j + 15, L)]. This ensures there is always a residue gap between the two pseudo-chains. A chainbreak token "—" is inserted to represent the residue gap.

A.2.2.6. RESIDUE GAP AUGMENTATION

To encourage the model to learn to represent residue gaps using the chainbreak token, we introduce a task which randomly splits a single chain into multiple subchains.
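A sketch of the tertiary interface splitting described in Appendix A.2.2.5 above follows. The index ranges are taken verbatim from the text (1-based, inclusive); the guard conditions, rejection of overlapping draws, helper name, and chainbreak character are assumptions made to keep the sketch self-contained.

import random

CHAINBREAK = "|"  # placeholder; the text writes the chainbreak token as an em dash

def split_on_tertiary_interface(sequence, i, j):
    """Split a chain into two pseudo-chains around a long-range contact (i, j),
    following the ranges in Appendix A.2.2.5 (sketch, 1-based indices).

    Returns the concatenated pseudo-chains separated by a chainbreak token,
    or None if the contact is not compatible with the sampled ranges.
    """
    L = len(sequence)
    if i - 3 < 1 or i + 3 > j - 15 or i + 15 > j - 3 or j + 15 > L:
        return None  # ranges not valid for this (i, j, L); resample in practice

    start1, end1 = random.randint(1, i - 3), random.randint(i + 3, j - 15)
    start2, end2 = random.randint(i + 15, j - 3), random.randint(j + 15, L)
    if end1 >= start2:
        return None  # overlapping pseudo-chains; reject and resample

    chain_a = sequence[start1 - 1:end1]
    chain_b = sequence[start2 - 1:end2]
    return chain_a + CHAINBREAK + chain_b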
First, the number of chains to sample is drawn from a geometric distribution with probability 0.9, up to a maximum of 9 chains. If the number of chains sampled is 1, no additional transformation is applied. A minimum separation of 10 residues between chains is enforced. The sequence lengths of the chains, along with the gaps between them, are sampled from a Dirichlet distribution so that chain lengths are identically distributed. This transformation is applied to all samples.

A.2.2.7. GEOMETRIC ATTENTION MASKING

In the case that multiple chains are provided to the model, from either the interface sampling or the pseudo-multimer augmentation task, we mask the geometric attention layer to prevent the model from attending to cross-chain coordinates. This simulates tasks where the structure of the individual chains is known but the interface is unknown.

A.2.3. Training Details

A.2.3.1. HYPERPARAMETERS

We train all models using the AdamW optimizer (77) with β1 = 0.9 and β2 = 0.95. We use a weight decay of 0.01 and gradient clipping of 1.0. We employ 5K to 20K warmup steps until reaching the maximum learning rate, and use a cosine decay scheduler to decay the learning rate to 10% of its maximum by the end of training.

A.2.3.2. INFRASTRUCTURE

Our training codebase uses PyTorch. We use PyTorch's FSDP (78) implementation for data parallelism. We also use custom components from the TransformerEngine library (79).

We have made several optimizations to increase the training speed of our models. For multi-head attention, we use the memory-efficient implementation from the xFormers library (80). We also save activations that are expensive to compute during training when necessary. We employ mixed-precision training, utilizing FP8, BF16, and FP32 as needed based on accuracy requirements and kernel availability throughout our network.

A.2.3.3. STABILITY

Scaling ESM3 to 98 billion parameters with its novel architecture, multi-modal inputs, and low-precision computation requirements poses significant training stability challenges. Our model is significantly deeper than its NLP counterparts, and the literature has shown that deeper networks are harder to train due to attention collapse (81).

We observed training instability early in the architectural innovation phase, which we addressed through several changes. We apply layer normalization to the query and key vectors within the attention mechanism (82). We observe that a longer warmup helps (83). Another source of instability is the masking rate in pre-training tasks. We found that a very high masking rate is more likely to cause training divergences than a lower one, especially early in training. Choosing a masking schedule biased towards lower mask rates improved both performance and training stability. Interestingly, the introduction of conditioning from other modalities also improves training stability, perhaps suggesting that stability is related to the degree to which a task is underspecified.

An incorrectly set learning rate is another source of instability. To ensure the right balance between learning effectiveness and stability, we optimized the learning rate on smaller models and scaled it according to best practices as outlined in (84, 85). We find empirically that initialization has a small effect on model stability, and that the majority of stabilization can be gained from simply scaling the learning rate at the appropriate rate.
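A minimal PyTorch sketch of the optimizer configuration from Appendix A.2.3.1 above (AdamW with β1 = 0.9, β2 = 0.95, weight decay 0.01, gradient clipping 1.0, linear warmup followed by cosine decay to 10% of the peak learning rate). The peak learning rate and step counts are placeholders, not values from the paper.

import math
import torch

def build_optimizer(model, max_lr=1e-4, warmup_steps=5_000, total_steps=100_000):
    # AdamW with the hyperparameters listed in Appendix A.2.3.1.
    optimizer = torch.optim.AdamW(
        model.parameters(), lr=max_lr, betas=(0.9, 0.95), weight_decay=0.01
    )

    def lr_lambda(step):
        if step < warmup_steps:
            return step / max(1, warmup_steps)              # linear warmup to max_lr
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        cosine = 0.5 * (1.0 + math.cos(math.pi * min(1.0, progress)))
        return 0.1 + 0.9 * cosine                           # decay to 10% of max_lr

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler

# In the training loop, gradients are clipped before each optimizer step:
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)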
By applying the rules in both width-µP and depth-µP, we can simply scale the learning rate in inverse proportion to the square root of the number of parameters, and we find that this results in stable training. Following these modifications, we successfully trained our 98-billion-parameter model without any issues related to training instability.

A.2.3.4. STAGED TRAINING

We stage training to alter the dataset composition, to train on longer contexts that would be too expensive for the entire pre-training run, or to introduce features such as the taxonomy track (Appendix A.1.9.2).

A.3. MODEL EVALUATIONS

ESM3 is both a generative model and a representation learning model that can be adapted for predictive tasks. In this section, we present benchmarking results for both capabilities.

A.3.1. Models

ESM3 models are trained at three scales—1.4B, 7B, and 98B parameters—on approximately 75B, 560B, and 1.8T training tokens, respectively.

The ESM3 1.4B model, trained on 75B tokens and noted for its small size and speed, allows rapid iteration both during training and at inference. Optimal model size and number of training tokens are studied by extrapolating from a series of smaller runs, given a training compute budget, model architecture, and dataset characteristics (19, 21). After determining compute optimality for training, a variety of factors such as release frequency, amount of inference, ease of use, and usage patterns are also taken into account to determine the ideal number of tokens on which to train the model. To enable efficient inference for the benefit of the research community, we have trained two additional versions of ESM3 1.4B, named 1.4B Overtrained and 1.4B Open, which are trained on 300B tokens, far beyond their compute-optimal token budget.

A.3.2. Data

In the benchmarks in this section, unless otherwise noted, models are evaluated on a test set of 902 proteins whose structures are temporally held out from the ESM3 training set. The proteins were sourced from the Continuous Automated Model EvaluatiOn (CAMEO) targets released from May 1, 2020 through Aug 1, 2023 (86). For contact and structure prediction evaluations, we also evaluate on the CASP14 (71 proteins) and CASP15 (70 proteins) structure prediction benchmarks (87, 88). The CASP14 and CASP15 sets are obtained directly from the organizers.

A.3.3. Representation Learning

The contact prediction model is a multilayer perceptron (MLP) head that operates independently over the representations of each amino acid pair, outputting the probability of contact between them. We use LoRA (89) for finetuning, which is a common alternative to full-weight finetuning that uses much less memory while attaining strong performance. LoRA is applied to the base model for finetuning, and the MLP along with the LoRA weights are trained end-to-end using the cross-entropy loss with respect to the ground-truth contact map. For the ground truth, all residue pairs at least 6 positions apart in the sequence and within an 8 Å Cα-Cα distance are labeled as a contact. All models are trained with LoRA rank 4, batch size 64, and a learning rate of 1e-3 for 10k steps on a mix of sequence and structure data from PDB, AlphaFold-DB, ESMAtlas, and OAS Predicted Structures. Data are sampled in a ratio of 1:3:3:0.03 from these datasets.
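A short sketch of the ground-truth contact labeling rule described above (residue pairs at least 6 positions apart in sequence and within an 8 Å Cα-Cα distance), assuming a NumPy array of Cα coordinates; the function name is illustrative.

import numpy as np

def contact_map(ca_coords, min_sep=6, cutoff=8.0):
    """Binary ground-truth contact map: |i - j| >= min_sep and d(Ca_i, Ca_j) < cutoff."""
    ca_coords = np.asarray(ca_coords, dtype=float)
    L = len(ca_coords)
    dist = np.linalg.norm(ca_coords[:, None, :] - ca_coords[None, :, :], axis=-1)
    sep = np.abs(np.arange(L)[:, None] - np.arange(L)[None, :])
    return (dist < cutoff) & (sep >= min_sep)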
Table S7 shows the performance on each structural test set using precision at L (P@L), which evaluates the precision of the top-L most confident predictions, where L is the length of the protein. The smallest ESM3 model, with 1.4B parameters, achieves a P@L of 0.76 ± 0.02 on the CAMEO test set, which is higher than the 3B-parameter ESM2 model (0.75 ± 0.02). Furthermore, it is trained on an order of magnitude less pre-training compute ($6.72 \times 10^{20}$ FLOPS vs. $1.8 \times 10^{22}$ FLOPS), demonstrating the benefits of multimodal pre-training.

A.3.4. Structure Prediction

ESM3 can directly predict protein structures without additional finetuning by first predicting structure tokens and then decoding these tokens into coordinates. When predicting structure tokens, we follow the strategy outlined in Appendix A.1.10 and test both argmax decoding and full iterative decoding. For more difficult datasets, such as CASP14 and CASP15, iterative decoding has an outsized impact (see Table S8), whereas for easier datasets like CAMEO, argmax prediction is sufficient. On both the CAMEO and CASP15 datasets, argmax prediction for the 7B model is comparable to ESMFold, and iterative decoding with ESM3 98B closes the gap between ESMFold and AlphaFold2. Structure prediction scaling curves as a function of training compute are provided in Fig. S10.

Figure S10. Scaling curves for structure prediction. Error bars are single standard deviations.

A.3.5. Conditional Likelihood

The conditional likelihood of an output given a prompt serves as a proxy for the generative capabilities of a model. Fig. S11 and Table S9 evaluate the performance of ESM3 as a conditional generative model, using its negative log-likelihood (NLL) on the test set. For each track - sequence, structure, function, SASA, and secondary structure - the NLL is evaluated both unconditionally and conditioned on each of the other tracks.

Unlike, for example, an autoregressive model, ESM3 is a generative model over masking patterns, so it is trained to predict tokens given any masking pattern. The NLL of a sample under ESM3 is given by

$$\frac{1}{L!} \sum_{o \in O} \frac{1}{L} \sum_{i=1}^{L} \log p\left(x_{o_i} \mid x_{o_1}, \ldots, x_{o_{i-1}}\right),$$

where $O$ is the set of all decoding orders, with normalization constant $Z = \frac{1}{L!}$. This computation is intractable (the set of all decoding orders is exponential in the length of a protein), but it can be approximated by sampling a single decoding order $o$ for each $x$ in our dataset. At each step, teacher forcing is used to replace the masked token with the ground-truth token, and we report the mean NLL over the output tokens.

There are many straightforward relationships in this data. For example, the unconditional NLL (Fig. S11, black lines) is always higher than the conditional NLL, and conditioning on the full 3D structure reduces the loss on secondary structure prediction to nearly zero (1.4B: 0.24, 7B: 0.19, 98B: 0.16). Other trends may be more surprising. Conditioning on sequence results in a lower structure prediction loss than conditioning on secondary structure (98B: sequence 3.13, secondary structure 3.37). There are some diminishing returns to scale for the prediction of structure, function, SASA, and secondary structure. However, this diminishing is not observed for sequences, where we observe a clear log-linear relationship between pre-training FLOPS and NLL, regardless of conditioning.
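The single-decoding-order approximation described above can be written as a small sketch. This is not the evaluation code: logprob_fn stands in for a model call that returns the log-probability of the ground-truth token at one position, given that only the positions in `visible` are unmasked.

import random

def approx_nll(logprob_fn, tokens):
    """Approximate the NLL of a sequence under a masked generative model by
    sampling a single random decoding order and using teacher forcing (sketch)."""
    L = len(tokens)
    order = list(range(L))
    random.shuffle(order)                  # one decoding order o sampled uniformly

    visible = set()
    total = 0.0
    for pos in order:
        total += logprob_fn(visible, pos)  # predict the still-masked position
        visible.add(pos)                   # teacher forcing: reveal the ground truth
    return -total / L                      # mean NLL over the output tokens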
Model | CASP14 | CASP15 | CAMEO
ESM2 3B | 0.57 (0.49 - 0.64) | 0.57 (0.48 - 0.65) | 0.75 (0.73 - 0.77)
ESM3 1.4B | 0.56 (0.48 - 0.64) | 0.59 (0.50 - 0.66) | 0.76 (0.74 - 0.78)
ESM3 7B | 0.62 (0.54 - 0.70) | 0.64 (0.56 - 0.73) | 0.82 (0.80 - 0.84)
ESM3 98B | 0.66 (0.57 - 0.74) | 0.66 (0.57 - 0.75) | 0.85 (0.83 - 0.86)

Table S7. Precision @ L results. Measured on CASP14, CASP15 and CAMEO for the ESM3 model family. Intervals represent bootstrapped 95% confidence intervals.

Iterative / O(L³) decoding:
Model | CAMEO | CASP14 | CASP15
1.4B Open | 0.830 | 0.705 | 0.733
1.4B Overtrained | 0.846 | 0.714 | 0.750
1.4B | 0.807 | 0.693 | 0.697
7B | 0.870 | 0.742 | 0.764
98B | 0.895 | 0.763 | 0.801
ESMFold | 0.865 | 0.728 | 0.735
AlphaFold2 | 0.904 | 0.846 | 0.826

Argmax / O(L²) decoding:
Model | CAMEO | CASP14 | CASP15
1.4B Open | 0.805 | 0.640 | 0.677
1.4B Overtrained | 0.825 | 0.651 | 0.700
1.4B | 0.775 | 0.608 | 0.636
7B | 0.852 | 0.607 | 0.726
98B | 0.884 | 0.719 | 0.770

Table S8. Protein structure prediction results. We benchmark ESMFold, ESM3 models, and AlphaFold2. Upper block: ESM3 iterative inference of structure tokens conditioned on sequence. Because iterative inference is O(L³) in the length of a protein sequence, it is comparable to ESMFold and AF2, both of which share the same runtime complexity. Lower block: single-pass argmax structure tokens given sequence. In all cases, the more difficult the dataset, the more iterative decoding appears to help - 98B has a +4.4 LDDT boost on CASP14, compared to a +1.0 LDDT boost on CAMEO. The Open and Overtrained models are both trained up to 200k steps. The plain 1.4B model is used for scaling comparisons and is trained to 50k steps.

Generated Track | Model | Sequence | Structure | Function | SASA | Secondary Structure
Sequence | 1.4B | 2.31 | 1.71 | 2.28 | 1.81 | 2.02
Sequence | 7B | 2.04 | 1.43 | 2.00 | 1.47 | 1.74
Sequence | 98B | 1.84 | 1.21 | 1.76 | 1.21 | 1.50
Structure | 1.4B | 4.09 | 4.98 | 4.93 | 4.39 | 4.42
Structure | 7B | 3.42 | 4.2 | 4.18 | 3.62 | 3.71
Structure | 98B | 3.13 | 3.85 | 3.8 | 3.24 | 3.37
Function | 1.4B | 1.81 | 1.98 | 4.52 | 2.29 | 2.24
Function | 7B | 1.22 | 1.47 | 3.75 | 1.67 | 1.70
Function | 98B | 0.93 | 1.20 | 3.63 | 1.41 | 1.40
SASA | 1.4B | 1.78 | 1.81 | 2.42 | 2.48 | 2.10
SASA | 7B | 1.57 | 1.66 | 2.26 | 2.31 | 1.92
SASA | 98B | 1.46 | 1.56 | 2.15 | 2.23 | 1.82
Secondary Structure | 1.4B | 0.42 | 0.24 | 0.70 | 0.50 | 0.83
Secondary Structure | 7B | 0.31 | 0.19 | 0.57 | 0.31 | 0.6
Secondary Structure | 98B | 0.26 | 0.16 | 0.50 | 0.25 | 0.54

Table S9. Negative log-likelihood of each track conditioned on other tracks. Each row corresponds to a model size generating a particular track; each column gives the conditioning track. The diagonal entries, where the generated and conditioning tracks coincide, are the unconditional NLL of each track. We observe that adding conditioning improves NLL in all cases.

Figure S11. Conditional and unconditional scaling behavior for each track. Loss is shown on CAMEO (Appendix A.3.2).

Figure S12. Distribution of pTM and pLDDT. Measured on natural (left) and generated (right) sequences under ESM3 7B structure prediction. Generated sequences show a clearly lower correlation (Pearson r 0.79 vs. 0.85) as well as a mode of sequences with high pLDDT but low pTM. Natural sequences are from the test set (Appendix A.3.2); generations are unconditional generations from ESM3 98B.

A.3.6. Unconditional Generation

To assess the model's unconditional generation capability, we sampled 100 protein lengths randomly from the PDB and generated 1,024 sequences for each using ESM3 98B with a constant temperature of 0.7. The sampled length distribution is shown in Fig. S13A. Structures for each sequence were predicted using ESM3 7B, and the distributions of pTM and pLDDT are shown in Fig. S13B.
ESM3 generates more high-quality structures than ESM2, which was trained using a simple MLM objective over sequence only with a fixed mask rate. Sequence similarity to the training set was computed using mmseqs2 (73) with the following parameters: --cov-mode 2 -c 0.8 -s 6.0. Proteins generated unconditionally are similar—but not identical—to proteins found in the training set (Fig. S15) and have high coverage of the training set (Fig. 1E), demonstrating that the model has properly fit the training distribution and does not exhibit mode collapse. We observe a cluster of generations with very high sequence identity to the training set; these correspond to antibody sequences, with the framework regions accounting for the high sequence identity. We use pTM for evaluating structure predictions from ESM3 instead of pLDDT. This is because pLDDT can be miscalibrated for generated structures and can overestimate the confidence of a prediction. pLDDT is biased towards local structural confidence, which can result in pathologies such as very long alpha helices with high pLDDT at all positions. pTM is a more global measure of structural confidence, and is more robust to these pathologies. Fig. S12 shows that pTM and pLDDT correlation drops for generated sequences (Pearson r: natural = 0.85, generation = 0.79), and a clear pattern of high pLDDT (> 0.8) but low pTM (< 0.6) emerges. To visualize the distribution of unconditional generations, we compute sequence embeddings by extracting the final layer outputs produced by running ESM3 7B with sequence inputs only. Protein-level embeddings are computed by averaging over all positions in the sequence to produce a 2560-dim embedding. We then project these embeddings into two dimensions using a UMAP projection (90) fit on a background distribution of 50,000 randomly sampled sequences from UniProt with minimum distance 0.1 and number of neighbors 25. Examples are selected by computing structural clusters with Foldseek-cluster (using default parameters) and sampling the example with highest ESM3 pTM from each cluster. A subset of these cluster representatives are shown in Fig. 1E. To assess whether ESM3 is biased towards particular secondary structures, we use DSSP to predict the three-class secondary structure of the high-confidence (pTM > 0.8, mean pLDDT > 0.8) generations and measure the percentage of residues that form alpha helices and beta sheets. When compared to a background distribution computed over the PDB, we find that ESM3 closely matches the secondary structure distribution of known proteins (Fig. S13D), unlike other methods which preferentially generate helical structures (14, 23, 25). Finally, to confirm that the structures predicted with high confidence by ESM3 are designable, we inverse folded and re-folded each using ESM3 7B. The majority of generations successfully re-folded with TM-score of greater than 0.8 to the hallucinated structures, demonstrating that ESM3 has high self-consistency for its own high-confidence designs (Fig. S13C). To explore alternative ways of generating proteins, we assess the quality of proteins generated by a chain-of-thought (CoT) procedure in which ESM3 7B generates the secondary structure (SS8 tokens), then the 3-D backbone coordinates (structure tokens), followed by the amino acid sequence (sequence tokens) (Fig. S14). We compare the quality of amino acid sequences generated from this CoT procedure with the above method of unconditionally directly generating amino acid sequences. 
We find that the CoT procedure generates sequences whose ESM3-predicted structures have higher confidence than those of the directly generated sequences, as measured by pTM and mean pLDDT (Fig. S14A). Compared to high-confidence (pTM > 0.8, mean pLDDT > 0.8) directly generated sequences, the high-confidence subset of CoT-generated sequences is also more designable: the CoT-generated sequences have predicted structures whose inverse folded, then re-folded structures have higher TM-score to the originally predicted structure (Fig. S14C). The CoT-generated sequences show a small bias towards higher alpha and beta proportion compared to those generated directly (Fig. S14D).

Figure S13. Unconditional generation of high-quality and diverse proteins using ESM3. (A) Distribution of sequence length in the unconditional generation dataset. (B) Mean pLDDT and pTM of unconditional generations from ESM3 compared to sequences designed using the 3B-parameter ESM2 model. (C) Round-trip success rate of high-confidence generations using ESM3. Predicted structures were inverse folded to predict a new sequence and then re-folded to produce a new structure. Success was measured by a TM-score of greater than 0.8 between the original and refolded designs. (D) Secondary structure composition of unconditional generations relative to the distribution of proteins in the PDB, which is shown in gray.

Figure S14. Generation of sequences using chain-of-thought. SS8 tokens are generated first, followed by structure tokens, then the amino acid sequence, with the ESM3 7B model. (A) Distribution of mean pLDDT and pTM of sequences generated by chain-of-thought ("ss8 first") compared to directly generating the sequence ("sequence only"). (B) Sample generations of SS8 tokens and the predicted structure of the corresponding CoT sequence. (C) TM-score between the predicted structures of high-confidence (pTM > 0.8, mean pLDDT > 0.8) generated sequences and their corresponding inverse folded, then re-folded structures. (D) Comparison of the secondary structure composition of high-confidence generated sequences to the distribution of proteins in the PDB.

A.3.7. Prompt-following Evaluations

To evaluate ESM3's ability to follow prompts, we use a set of held-out proteins as described in Appendix A.3.2. The test set is further filtered to remove proteins with length greater than 1024, which removes 7 proteins from the test set. To construct prompts for the structure coordinate, secondary structure, and SASA tracks, we sample a random span of length 15% of the original protein length. The model is then shown the corresponding track for the randomly sampled span and is tasked with generating the sequence for the entire protein. For example, for the structure track, for a protein of length 100, we may sample a random span of 15 residues from residues 20-35. The model would then have to generate a protein sequence of length 100 conditioned on structure coordinates for residues 20-35 derived from the original test protein. The same procedure is applied for the secondary structure and SASA tracks. For the function track, we form the prompt by tokenizing the keywords from the InterProScan annotations associated with each sequence.

The ESM3 7B model is used for all generations with a temperature of 0.7 and L decoding steps (where L is the length of the sequence). The model generates 64 sequences per prompt, which we use to compute pass64.

To evaluate the generations, we use ESMFold to fold the sequences generated by ESM3. For the structure coordinate, secondary structure, and SASA tracks, the relevant alignment metrics (backbone cRMSD, 3-class secondary structure accuracy, and SASA Spearman ρ) can be calculated on the relevant span in the ESMFold-predicted structure and the original template protein. Continuing the previous example for the structure track, we would compute the RMSD between residues 20-35 in the ESMFold-predicted structure of the ESM3-generated sequence and residues 20-35 of the original test protein. For the function annotation track, we run InterProScan (38) on each generated sequence and extract function keywords from the emitted annotations. We report function keyword recovery at the protein level, computing the proportion of all function keywords in the prompt which appear anywhere in the function keywords from the InterProScan annotations of the generation.

A.3.8. Steerable Design

To test the ability of ESM3 to generalize beyond its training distribution under prompting, we evaluate two prompting scenarios. First, we identify proteins which were deposited in the PDB after our training cutoff (December 2020) and choose eight with TM < 0.7 to any structure in our training dataset (PDB IDs: 2JVN chain A, 2KAF chain A, 2L8K chain A, 2MJM chain A, 7ZUO chain A, 8EXF chain B). Using DSSP, we compute the residue-level SS8 and SASA for each of these proteins to prompt ESM3, masking all other tracks. We show in Fig. S15A that the generated proteins are diverse, globular, and closely follow the SS8 and SASA prompts while having no close sequence or structure neighbors in the training set. Interestingly, these proteins are not folded with high confidence or accuracy by ESMFold (mean pTM 0.44, mean TM-score to reference 0.33), suggesting that these are challenging proteins to fold. The ESM3-generated sequences have a similar confidence (mean pTM 0.45) but much higher accuracy (mean TM-score 0.64).

Second, we classify the residue-level secondary structure for a set of eight symmetric protein backbones using DSSP. These proteins were previously designed using ESMFold (5, 91) and have varying secondary structure (alpha and beta) and varying symmetries (5-fold and 8-fold). Again, ESM3 is able to design these proteins successfully with high confidence (pTM > 0.8, pLDDT > 0.8) and low sequence similarity to the training set (Fig. S15B). The structural similarity is moderate for these designs due to the high structural conservation of the protomer units in each design.

All designs are generated using a constant temperature of 0.7 with L/2 decoding steps, where L is the protein length. We sample 256 sequences for each prompt and filter generations by pTM (> 0.8), pLDDT (> 0.8), and accuracy in satisfying the SS8 prompts (> 0.8). Final examples were selected from these filtered designs by visual inspection. Sequence similarity to the training set was computed using the same procedure as for the unconditional generations, and structure similarity was computed using Foldseek (39) in TM-score mode (alignment-type 1) with sensitivity -s 7.5.
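A minimal sketch of the protein-level function keyword recovery metric from Appendix A.3.7 above; the function name is illustrative, and the inputs are assumed to be lists of InterProScan-derived keywords for the prompt and the generation.

def keyword_recovery(prompt_keywords, generated_keywords):
    """Proportion of prompt keywords appearing anywhere in the keywords
    extracted from the InterProScan annotations of the generation (sketch)."""
    prompt = set(prompt_keywords)
    if not prompt:
        return float("nan")
    return len(prompt & set(generated_keywords)) / len(prompt)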
A.3.9. Composing Prompts

ESM3 is able to compose multimodal prompts across its input tracks—sequence, structure, SS8, SASA, and function keywords—to generate proteins with novel characteristics. To demonstrate this, we augment the standard functional motif scaffolding task (i.e., partial structure and sequence prompts) with additional conditioning to specify the type of scaffold for ESM3 to design. The functional sites comprise a combination of ligand binding sites coordinated by residues that are remote in sequence and sites defined by short local motifs. For each motif, the coordinates and amino acid identities of all residues from the reference PDB structures are input to the model, with random shuffling and augmentation of the gaps between each active site. See Appendix A.4.5 for a description of this augmentation procedure and the specifications of the ligand-binding sites chosen. In addition to these sites, we also create a set of 12 partial sequence and structure prompts derived from conserved functional motifs (Table S10). These motifs are defined using a combination of the benchmark dataset in Watson et al. (23) and conserved sequence patterns from the Prosite database (92).

The scaffold conditioning is defined using either SS8 tokens (to specify secondary structure composition) or function keywords defined by InterPro accession numbers (to specify a particular fold). For each combination of functional site and scaffold prompt, we sample between 256 and 2048 times to generate proteins with diverse and novel characteristics. All designs were generated with the 7B-parameter model, a constant temperature of 0.7, and L/2 decoding steps for a protein of length L.

Secondary structure prompting. We generated proteins under four main classes of secondary structure composition: mostly alpha helices, mostly beta sheets, and mixed alpha-beta proteins (split into alpha/beta, alpha/beta/alpha, and beta/alpha/beta topologies). For each generation, we prompt the model with a random set of SS8 spans up to a total length L, with mask tokens in between. For example, an all-alpha SS8 prompt for a protein of length L = 20 might look like _HHHHHHHHHHH and a beta-alpha-beta prompt might look like _EEEHHHHHEE_, where H is a residue within an alpha helix and E is a residue in a beta strand. We then combine this with the augmented partial structure and sequence tracks given by a functional site motif. To increase the diversity of the scaffolds and maximize the probability of generating physically realizable prompt combinations, we generate between 256 and 1024 designs for each combination of SS8 and functional site motif. For each generation, we uniformly sample a random length L between 150 and 400. Then, we produce a set of secondary structure spans of length 5-20 residues, each separated by a gap of 3-10 residues, such that the total length adds up to L. Finally, to avoid incompatibility between the partial structure and secondary structure constraints, we also mask the SS8 tokens at positions where structure is specified by the functional site prompt.

Secondary structure-prompted designs were assessed by running DSSP on the predicted structure of each designed sequence and measuring the fraction of prompted residues assigned the correct secondary structure. Success was determined by pTM > 0.8, all-atom cRMSD < 1.5 Å for the functional site, and SS8 accuracy > 0.8.

Keyword prompting. To prompt the model to generate proteins with a specific fold, we extracted the set of InterPro tags associated with a set of proteins from the CAMEO test set for which ESM3 achieved keyword recovery of greater than 80% (Fig. 2A). These tags were then converted into keywords and used to prompt the model in combination with the partial sequence and structure constraints. The list of prompts and function tags is given in Table S11. Keyword-prompted designs were assessed using a self-consistency evaluation, i.e., whether the model successfully predicts any of the prompted InterPro accessions for the designed sequence. Success was determined by pTM > 0.8, all-atom cRMSD < 2.0 Å, and a number of recovered InterPro accessions greater than 0.

We assess the novelty of each motif-scaffold combination by measuring the TM-score between the generated scaffold and the chain from which the motif is derived (Table S12). This confirms that the model is not retrieving the original motif scaffold, particularly for secondary structure-prompted scaffolds, where we do not provide any explicit instructions to produce diverse designs. For the motifs derived from ligand binding residues (magnesium, serotonin, calcium, zinc, protease inhibitor 017, and Mcl-1 inhibitor YLT), we additionally use Foldseek to search the PDB for any other proteins which share that motif (as defined by BioLiP (93)), as a more stringent evaluation of novelty. For all but the zinc-binding and magnesium-binding motifs, Foldseek finds no significant hits at an E-value threshold of 1.0. The hits discovered for zinc and magnesium have only modest TM-scores (0.76 and 0.64), demonstrating that the model still finds novel scaffolding solutions for these ligands. To assess whether the generated scaffolds are likely to be designable, we measure a self-consistency TM-score under orthogonal computational models by inverse folding the designed structure with ESM-IF (94) (using a temperature of 0.5) and re-folding with ESMFold (5). We report the best scTM over 8 inverse folding designs in Table S12.

Figure S15. Prompting ESM3 to generalize beyond its training distribution. (A) Proteins designed using SS8 and SASA prompts derived from recent structures in the PDB with low structural similarity to the training set. Prompts along the protein length are visualized above each generation; secondary structure is shown using three classes (alpha = blue, beta = orange, coil = gray) and SASA is shown as a line plot colored by residue index to match the cartoon below. (B) Symmetric proteins designed using SS8 prompting. Histograms show the similarity to the nearest training set protein by structure (TM-score) and sequence (sequence identity), compared to unconditional generation.

Motif | PDB ID | Chain ID | PDB Residue Identifiers
ACE2 binding | 6vw1 | A | 19-89, 319-366
Ferredoxin | 6e6r | A | 1-44
Barstar binding | 7mrx | B | 25-47
P53 binding | 1ycr | B | 19-28
PD-1 binding | 5ius | A | 63-83, 119-141
DNA-binding helix-turn-helix | 1lcc | A | 1-52
P-loop | 5ze9 | A | 229-243
Double EF-hand | 1a2x | A | 103-115, 139-152
Lactate dehydrogenase | 1ldb | A | 186-206
Renal dipeptidase | 1itu | A | 124-147
Ubiquitin-activating enzyme E1C binding | 1yov | B | 213-223
DNA topoisomerase | 1a41 | A | 248-280

Table S10. Functional motif definitions for conserved regions.

A.3.10. Multimodal Editing Examples

First, we describe the procedure for generating the protein compression example shown in Fig. 2D. A series of prompts of length 150 were constructed.
The sequence and structure of the catalytic triad of trypsin (PDB 1Y3V) (H57, D102, S195) were placed in the prompt using the following procedure: three random residue numbers between 20 and 130 were sampled such that the minimum pairwise difference in position between the residues was no less than 20. Then, H57 from the template trypsin was placed at the lowest sampled number, D102 at the second lowest, and S195 at the largest number, thus respecting the left-to-right ordering of the catalytic triad in the template trypsin. 128 prompts were generated by this procedure. Each of these prompts was combined with a function keyword prompt derived from the template protein, specifically InterPro (38) tags IPR001254 (serine proteases, trypsin domain) and IPR009003 (peptidase S1, PA clan), to arrive at a final set of 128 prompts. The base ESM3 7B model was then prompted to generate the sequence of the remaining 147 residues of the protein conditioned on the randomly placed catalytic triad sequence and structure coordinates and the function keywords. L = 150 decoding steps were used with a temperature of 0.7, with 32 generations per prompt. Generations were then filtered by active site cRMSD, ESM3 pTM, and InterProScan keyword outputs, with the generation shown in Fig. 2D selected finally by visual inspection.

Generation quality was measured using the ESMFold (5) pTM of the generated sequence, in addition to self-consistency. For self-consistency, we inverse fold the ESM3-predicted structure of the generation with ESM-IF1 (94) 8 times and re-fold with ESMFold, reporting the mean and standard deviation of the TM-scores between the 8 ESMFold-predicted structures and the ESM3-predicted structure. To perform a BLAST search of the sequence, we use a standard Protein BLAST search (51). We set the max target sequences parameter to 5000 and sort results by sequence length and sequence identity, selecting the first sequence that is a serine protease. This yields the reference WP_260327207, which is 164 residues long and shares 33% sequence identity with the generation.

We showcase two further examples of protein editing. First, ESM3 is prompted to expose a buried helix in a protein with an alternating alpha-beta sandwich fold. The prompt is constructed as follows: the prompt is of the same length as the template protein (PDB 1LBS). We identify a buried helix (mean SASA 0.32 Å²) between residues 106-116 of the template protein. Structure coordinates from this region are placed in the prompt at the same residue indices, to prompt ESM3 to generate the same helix. This is composed with a SASA prompt of 40.0 for each of the 11 helix residues, prompting ESM3 to place this helix on the surface of the protein. Finally, we prompt with the secondary structure of the 5 central beta strands surrounding the buried helix, residues 33-36, 62-65, 99-103, 125-130, and 179-182. ESM3 7B is then used to generate 512 protein sequences conditioned on this prompt using L/2 decoding steps and a temperature of 0.7.
Designs are filtered by ESM3 pTM and adherence to the SASA prompt. The final generation is chosen by visual inspection. The generation is evaluated as described above (ESMFold pTM 0.71, scTM mean 0.82, std 0.045). Examining the generation, ESM3 is able to satisfy the input constraints: the generated protein maintains the structure of the helix (cRMSD 0.18 Å) and the alternating alpha-beta fold (both the generation and the template have 7 strands alternating with helices), while exposing the helix motif to the surface (mean SASA 28.35 Å²). Furthermore, the generation is structurally distinct: a Foldseek search (39) of AlphaFold-DB, ESMAtlas, and the PDB in TM-align mode reveals no hit with a TM-score greater than 0.76.

We also use ESM3 to generate an idealized TIM barrel with 11-fold symmetry. This generation is undertaken in two steps. First, we derive a secondary structure and function keyword prompt from a reference TIM barrel (PDB 5EKY). The secondary structure of the reference protein is computed using DSSP and then idealized to construct a prompt for ESM3. To construct the secondary structure prompt, the length of each helix and strand is fixed at 7 residues. Each helix and strand region is then separated by 3 mask tokens, with a mask token appended to the N and C termini of the prompt as well. This yields a secondary structure prompt of total length 159, which is combined with a function keyword prompt derived from the reference protein: keywords are derived from IPR013785 (aldolase-type TIM barrel) and IPR000887 (KDPG/KHG aldolase). ESM3 7B is then used to generate 256 samples with L decoding steps and a temperature of 0.7. The design shown is chosen by filtering by ESM3 pTM and visual inspection.

Scaffold | Reference | InterPro tags | Total Length
Beta propeller | 8siuA | IPR001680 (1-350), IPR036322 (1-350), IPR015943 (1-350) | 353
TIM barrel | 7rpnA | IPR000652 (0-248), IPR020861 (164-175), IPR035990 (0-249), IPR013785 (0-251), IPR000652 (2-249), IPR022896 (1-249) | 252
MFS transporter | 4ikvA | IPR011701 (1-380), IPR020846 (1-380), IPR036259 (1-380) | 380
Immunoglobulin | 7sbdH | IPR036179 (0-116; 124-199), IPR013783 (0-206), IPR003597 (124-202), IPR007110 (0-115; 121-207), IPR003599 (6-115), IPR013106 (11-114) | 209
Histidine kinase | 8dvqA | IPR003594 (47-156), IPR003594 (47-158), IPR004358 (118-137), IPR004358 (141-155), IPR004358 (101-112), IPR005467 (0-158), IPR036890 (4-159), IPR036890 (3-156) | 166
Alpha/beta hydrolase | 7yiiA | IPR029058 (0-274), IPR000073 (26-265) | 276

Table S11. InterPro tags extracted from CAMEO test set proteins for prompting with fold specification.

Site | Scaffold | Novelty (TM to original) | Designability (scTM)
017 | beta | 0.264 | 0.967
ACE2 | alpha | 0.606 | 0.871
CA | Immunoglobulin | 0.441 | 0.781
Double-EF-hand | ab-hydrolase | 0.293 | 0.969
MG | TIM-barrel | 0.328 | 0.980
Renal-dipeptidase | alpha-beta-alpha | 0.644 | 0.933
SRO | mfs-transporter | 0.345 | 0.992
Topoisomerase | histidine-kinase | 0.269 | 0.948
YLT | alpha-beta | 0.229 | 0.899
ZN | alpha | 0.567 | 0.996

Table S12. Novelty and designability metrics. Metrics are shown for the motif scaffolds shown in Fig. 2C. Novelty is measured by computing the TM-score to the original scaffold from which the motif is derived. Designability is measured by self-consistency TM-score over eight samples obtained by inverse folding with ESM-IF and refolding with ESMFold. All designs are distinct from their original scaffolds while retaining high designability.
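The idealized secondary-structure prompt from the first step above can be sketched as follows. The mask character, helper name, and the strand-before-helix ordering are illustrative assumptions; the length arithmetic matches the text (8 strand-helix units give a prompt of length 159, and the second step's 11 units with 4 mask tokens at each terminus give 225).

MASK = "_"  # illustrative mask token

def idealized_ss8_prompt(n_units=8, helix_len=7, strand_len=7, gap=3, term_pad=1):
    """Build an idealized SS8 prompt of alternating strand/helix units (sketch)."""
    elements = []
    for _ in range(n_units):
        elements.append("E" * strand_len)   # beta strand, 7 residues
        elements.append("H" * helix_len)    # alpha helix, 7 residues
    body = (MASK * gap).join(elements)      # 3 mask tokens between elements
    return MASK * term_pad + body + MASK * term_pad

assert len(idealized_ss8_prompt()) == 159                          # step 1: 8 units
assert len(idealized_ss8_prompt(n_units=11, term_pad=4)) == 225    # step 2: 11 units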
In the second step, the secondary structure prompt from the first step is expanded to contain 11 helix-strand subunits, for a total prompt length of 225 residues (4 mask tokens are now appended to the N and C termini, rather than just 1). ESM3 7B is then used to generate 256 samples with L decoding steps and a temperature of 0.7, with generations filtered by ESM3 pTM and visual inspection. The generation is evaluated as described above (ESMFold pTM 0.69, scTM mean 0.97, std 0.011). The generation is structurally distinct: a Foldseek search (39) of AlphaFold-DB, ESMAtlas, and the PDB in TM-align mode reveals no hit with a TM-score greater than 0.61.

A.4. ALIGNMENT

A.4.1. Algorithm

Since the introduction of RLHF (40), a number of algorithms have been developed to tune large models trained via unsupervised learning to better follow instructions and generally align their generations to user preferences (41, 42, 95, 96). We use IRPO (Iterative Reasoning Preference Optimization) due to its simplicity of implementation and good performance. The IRPO loss combines supervised finetuning with contrastive learning from preference pairs.

IRPO operates on a dataset $D$ of triples $(y_w, y_l, x)$ consisting of a prompt $x$ and a pair of completions $y_w$ (preferred) and $y_l$ (not preferred). It also operates on two separate models: the reference model $\pi_{\text{ref}}$ and the current model $\pi_\theta$. The reference model $\pi_{\text{ref}}$ is the fixed base model of the same scale, and the current model $\pi_\theta$ is the model being optimized.

$$\mathcal{L}_{\text{IRPO}}(\pi_\theta; \pi_{\text{ref}}) = \mathcal{L}_{\text{NLL}} + \alpha \mathcal{L}_{\text{DPO}} = -\mathbb{E}_{(x, y_w, y_l) \sim D}\left[\frac{\log \pi_\theta(y_w \mid x)}{|y_w| + |x|} + \alpha \log \sigma\left(\beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)}\right)\right] \quad (2)$$

The IRPO loss contains two terms. The $\mathcal{L}_{\text{NLL}}$ term maximizes the log-likelihood of the preferred example, normalized by the length of the sequence, providing signal to reinforce the good generations from the model. The $\mathcal{L}_{\text{DPO}}$ term is the contrastive preference tuning term, which increases the difference in log-likelihoods between the preferred and not-preferred examples while staying close to the reference model (41). The use of the reference model serves as a regularizer to prevent overfitting to the preference dataset, which can often be small. There are two hyperparameters, $\alpha$ and $\beta$: $\alpha$ weights the relative importance of the supervised loss and the preference loss, and $\beta$ controls how close we stay to the reference model (the higher $\beta$, the closer we stay). We minimize this loss with respect to the current model parameters $\theta$.

ESM3 is a multi-modal model, so the prompt can be any combination of the input tracks of (partial) sequence, structure, and function, and the generation $y$ can be any of the output tracks. In our experiments we always generate the amino-acid sequence, so this will be our running example from now on. Since an amino-acid sequence $y$ can be generated from prompt $x$ in many multi-step ways, computing the full likelihood $\pi(y \mid x)$ would involve integrating over all possible multi-step decoding paths. Since this is intractable, we use a surrogate that mirrors pre-training, shown in Eq. (3) and described below.

$$\log \pi(y \mid x) \approx \mathbb{E}_m\left[\sum_{i \in m} \log p(y_i \mid y_{\setminus m}, x)\right] \quad (3)$$

To approximate the likelihood of a generation $y$ from prompt $x$, we mask $y$ with a mask sampled from a linear noise schedule, prompt ESM3 with $\{y_{\setminus m}, x\}$, and compute the cross-entropy of the ESM3 logits with the masked positions of $y$. During training, the same mask is used to compute the likelihoods for the reference policy vs. the current policy, as well as for the preferred sample vs. the non-preferred sample.
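Eq. (2) can be written as a compact PyTorch sketch, assuming the per-sequence surrogate log-likelihoods of Eq. (3) have already been computed for the preferred and non-preferred completions under both the current and the reference model. Tensor names are illustrative and this is not the training code.

import torch
import torch.nn.functional as F

def irpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, len_w, len_x,
              alpha=0.8, beta=0.05):
    """IRPO = length-normalized NLL on the preferred sample + alpha * DPO term.

    logp_*      : surrogate log-likelihoods log pi_theta(y|x), shape (batch,)
    ref_logp_*  : the same quantities under the frozen reference model, shape (batch,)
    len_w, len_x: lengths |y_w| and |x| used for normalization, shape (batch,)
    """
    nll = -(logp_w / (len_w + len_x))
    margin = beta * (logp_w - ref_logp_w) - beta * (logp_l - ref_logp_l)
    dpo = -F.logsigmoid(margin)
    return (nll + alpha * dpo).mean()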
Figure S16. Multimodal protein editing with ESM3. (A) ESM3 exposes a buried helix in a protein while maintaining the alternating alpha-beta sandwich fold of the protein. (B) ESM3 is used in a two-step iterative edit, where first secondary structure prompting and function prompting are used to idealize a reference TIM barrel. Secondary structure prompting is then used to increase the number of subunits in the TIM barrel from 8 to 11.

A.4.2. Preference Tuning Intuition

Rearranging the DPO term of the loss function gives some insight into how it finetunes the model on the preference pairs.

$$\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = \mathbb{E}_{(x, y_w, y_l) \sim D}\left[-\log \sigma\left(-\beta z_\theta(x, y_l, y_w)\right)\right]$$

where

$$z_\theta(x, y_l, y_w) = \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} = \log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_{\text{ref}}(y_l \mid x)} - \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}$$

The function $f(z) = -\log \sigma(-\beta z) = \log(1 + \exp(\beta z))$ is the softplus function, an approximation of the hinge function; in other words, $f(z) \approx \beta z$ when $z \gg 0$ and $f(z) \approx 0$ when $z \ll 0$. Because of this property, there are two cases. In the case where

$$\log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \gg \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}, \quad (5)$$

$f(z)$ is in the linear regime, so the loss function simply maximizes the likelihood ratio $\log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}$. In the case where

$$\log \frac{\pi_{\text{ref}}(y_w \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \ll \log \frac{\pi_\theta(y_w \mid x)}{\pi_\theta(y_l \mid x)}, \quad (6)$$

the loss has saturated. This ensures that we do not deviate too far from the reference model.

These dynamics also hold in the case of ESM3 finetuning. Although we use a surrogate instead of the true likelihood, the loss will increase the surrogate of the preferred pair over the non-preferred pair until the current model deviates too much from the reference model.

A.4.3. Evaluation Metrics

Possibly the most important part of preference tuning is deciding how to bucket generations into preferences. The desired objectives for a generation are quality and correctness. Quality refers to the viability of the sequence as a stable protein. Correctness refers to the extent to which it follows the given prompt, also called prompt consistency. This section only deals with structure coordinate prompts, so prompt consistency can be measured via constrained site RMSD (cRMSD), which is the RMSD between the prompt coordinates and the corresponding coordinates in the predicted structure of the generated sequence. Sequence quality can be measured via the predicted TM-score (pTM) of a structure predictor on the generated sequence.

As with any metric, especially one which is really a surrogate such as a structure predictor, there is a risk of over-optimization: the model keeps improving the specific metric (in our case pTM) while the actual property of interest, the viability of the sequence as a stable protein, stops correlating with the metric (97). Using one model to rank our training dataset and an orthogonal model to perform evaluation helps mitigate this. To create the training datasets, generations are evaluated according to the cRMSD and pTM of ESM3 7B, to maintain a consistent structure predictor across all datasets. After the preference tuning phase, the generations from the tuned models are evaluated with ESMFold cRMSD and pTM as an orthogonal model. Training on ESM3-derived metrics while evaluating on ESMFold-derived metrics should reduce the risk of over-optimization towards adversarial generations.

A.4.4. Training Dataset

All ESM3 model scales are trained with the IRPO loss (Eq. (2)) on their respective pre-constructed training datasets, consisting of structure coordinate prompts and generations of various difficulty. The datasets have 16 generations each for 30,000 prompts from the respective ESM3 model. Preference selection is determined via a threshold of metrics. A sample is considered "good" if it has ESM3 7B pTM > 0.8 and backbone cRMSD to its structure prompt < 1.5 Å.

Each "good" sample is paired with a "bad" sample to create a preference pair. We found that enforcing a gap between the metrics of paired generations improves results, so to qualify as a "bad" sample the generation must satisfy ΔpTM = pTM_good − pTM_bad ≥ 0.2 and Δ backbone cRMSD = cRMSD_good − cRMSD_bad < −2 Å. Each prompt can have multiple preference pairs, and prompts with no valid preference pair are discarded.

The structure prompts are composed of a variety of proteins adapted from our pre-training pipeline. 50% of the prompts are synthetic active sites, while the other 50% are structure coordinates randomly masked with a noise schedule. All of the structure prompts are derived from PDB structures with a temporal cutoff of before May 1st, 2020.
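A sketch of the preference-pair construction just described, using the stated thresholds; the dictionary keys and helper name are illustrative.

def build_preference_pairs(samples, ptm_thresh=0.8, crmsd_thresh=1.5,
                           min_dptm=0.2, min_dcrmsd=2.0):
    """Pair 'good' and 'bad' generations for one prompt (sketch).

    samples: list of dicts with keys 'ptm' and 'crmsd' (ESM3 7B pTM and backbone
    cRMSD to the structure prompt). A generation is 'good' if pTM > 0.8 and
    cRMSD < 1.5 A; a paired 'bad' generation must be worse by at least 0.2 pTM
    and 2 A cRMSD, enforcing a gap between the members of each pair.
    """
    good = [s for s in samples if s["ptm"] > ptm_thresh and s["crmsd"] < crmsd_thresh]
    pairs = []
    for g in good:
        for b in samples:
            if (g["ptm"] - b["ptm"] >= min_dptm
                    and g["crmsd"] - b["crmsd"] <= -min_dcrmsd):
                pairs.append((g, b))   # (preferred, not preferred)
    return pairs  # prompts with no valid pair are discarded upstream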
The synthetic active sites are derived by finding sequences from the PDB with coordinating residues. For these structures, the amino acid identities are included in the prompt. The remaining structure track prompts are masked according to a cosine noise schedule. 50% of the noise-scheduled prompts are masked at completely random positions, and the other 50% are masked according to an autocorrelation mechanism that prefers sequentially masked positions.

Each model's training dataset consists of generations of its own reference model. For each prompt, we generate samples from the corresponding ESM3 model scale using iterative decoding with L/4 steps, where L is the length of the prompt. We anneal the temperature from 1.0 to 0.5 over the decoding steps.

A.4.5. Evaluation Dataset: Atomic Coordination

Atomic coordination tasks require the generation of proteins which satisfy challenging tertiary interaction constraints. The model is prompted with the sequence and coordinates of a set of residues which are near in 3D space but distant in sequence. To evaluate performance on these tasks, we curate a dataset of 46 proteins with ligand binding sites from the BioLiP dataset (93). All selected proteins were deposited in the PDB after the training set cutoff date (2020-12-01). The coordinating residues shown to the model are given by the ligand binding sites defined in the BioLiP dataset (Table S13).

ESM3 is prompted with the sequence and coordinates of the residues for a particular ligand binding site. We ask ESM3 to generate novel structures by applying multiple transformations to the prompt. The total sequence length is sampled evenly to be 150, 250, or 350 residues (regardless of the original sequence length). Next, we define a contiguous span of coordinating residues to be prompt residues with fewer than 5 sequence positions between them. The order of, and the distance between, contiguous spans of residues are shuffled. Together, this ensures that, for example, the original protein will no longer satisfy the prompt. We consider a generation a success if backbone cRMSD < 1.5 Å and pTM > 0.8.

We construct a total of 1024 prompts for each ligand and generate a completion for each prompt with the model we are evaluating. We report Pass@128, which is an estimate of the fraction of ligands with at least one successful completion after 128 prompts per ligand. We estimate this using an unbiased estimator (Chen et al. (98), Page 3) based on the success rate over 1024 prompts. We visualize randomly selected successful generations for both the base model and the finetuned model in Fig. S18.

A.4.6. Supervised Finetuning

To judge the value of preference tuning, we also train a supervised finetuning (SFT) baseline in which we finetune the model to increase the likelihood of the high-quality samples without the preference tuning loss. The 1.4B, 7B, and 98B models solve 14.2%, 33.7%, and 44.6% of atomic coordination tasks at 128 generations, respectively, which improves upon the base models but is much lower than their corresponding preference-tuned versions.

A.4.7. Training Hyperparameters

Each IRPO model is trained for 1000 steps using RMSProp. The learning rates are 1e-5, 1e-5, and 5e-6 for the 1.4B, 7B, and 98B models, respectively, annealed using a cosine schedule after a 150-step warmup. Gradient norms are clipped to 1.0. For all IRPO runs, β = 0.05 and α = 0.8. The SFT baseline uses the same hyperparameters, but with α = 0.0 to disregard the preference tuning term.
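The Pass@128 numbers reported in Appendix A.4.5 above use the unbiased estimator of Chen et al. (98). A standard implementation is sketched below, where n is the number of generations per ligand (1024), c the number of successes, and k = 128; the example values in the usage line are illustrative only.

import numpy as np

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al.): 1 - C(n - c, k) / C(n, k)."""
    if n - c < k:
        return 1.0
    return 1.0 - np.prod(1.0 - k / np.arange(n - c + 1, n + 1))

# Hypothetical example: a ligand with 37 successes out of 1024 generations, k = 128.
print(pass_at_k(n=1024, c=37, k=128))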
A.5. GFP

ESM3 generates a dim distant GFP, B8, and a bright distant protein, esmGFP. Details are provided below on computational methods, experimental protocols, results, and post-experiment analyses.

PDB ID | Coordinating Residues | Ligand ID
7map | D25 G27 A28 D29 D30 G48 G49 V50 | 017
7n3u | I305 F310 V313 A326 K328 N376 C379 G382 D386 F433 | 05J
7exd | D103 I104 C107 T108 I174 H176 T182 W306 F309 E313 Y337 | 05X
8gxp | W317 C320 A321 H323 V376 F377 L396 I400 H479 Y502 | 06L
7n4z | M66 C67 R124 L130 C134 Y135 D152 F155 | 08N
7vrd | A40 S41 H161 Q169 E170 E213 D248 D324 K349 H377 R378 S379 K400 | 2PG
7zyk | V53 V66 V116 H160 N161 I174 D175 | ADP
6yj7 | K23 V24 A25 Y45 T46 A47 F115 I128 | AMP
8ppb | H185 F198 K209 Q249 D250 L251 D262 K336 I415 D416 | ATP
7knv | E33 F94 E95 D125 | CA
7xer | Y466 L505 T525 | CLR
7tj6 | F366 G367 T378 R418 | CMP
6xm7 | H167 H218 H284 H476 | CO
7bfr | Q62 X126 H248 | CO3
6xlr | X272 Y495 H496 H581 | CU
6tnh | N40 A41 S127 T128 Q187 L191 C201 T202 V236 | DGP
7ndr | F73 S101 F102 D103 R106 | EDO
8axy | H68 H109 E144 | FE
7o6c | E62 E107 Q141 | FE2
8aul | P31 M32 T33 Q106 H185 R237 S319 G320 G321 G342 R343 F369 Y370 | FMN
7vcp | N37 D38 Q54 F97 S98 R159 D160 E214 Y276 W297 | FRU
7b7f | G167 T168 G189 W195 | FUC
8d0w | F73 L136 E137 F329 | GAL
7yua | T13 T14 I15 D40 H85 S86 D87 D110 N290 | GDP
7w1a | L44 Y88 L91 I212 | GMP
7ljn | G71 S72 D91 K236 S253 V254 D309 R310 | GTP
6s4f | Y84 N87 K88 V131 Q132 L133 D155 F157 I276 P309 G310 G313 P314 V317 | KUN
7mg7 | Y12 G98 L99 Y100 A207 D208 G227 R228 | MAN
7qow | D12 T118 E268 | MG
7dmm | E181 E217 D245 D287 | MN
7qoz | G11 G12 I13 Y34 D35 V36 A86 G87 V126 T127 N128 H185 M235 | NAD
7v2r | G89 F93 K98 F101 E121 Y204 E209 F229 | NAI
7a7b | F51 Y128 K165 N166 S167 Y186 R187 I248 G249 A299 | NAP
7pae | M20 L22 L38 V49 I53 C56 K57 R61 Q78 V80 W90 I109 M117 I129 L147 Y149 | O7T
8egy | H82 K83 S186 G230 S231 N232 E345 S368 G369 | PLP
7qow | S65 R129 D273 H465 | PO4
7wmk | E77 L124 R129 S174 T189 Q191 W241 D304 E306 K349 D410 W411 Y486 | PQQ
7pl9 | D607 A608 Y637 M638 Y705 G706 M735 K736 | RET
7yf2 | G153 E174 L175 L209 N210 L211 Y295 | SAH
7v6j | G207 D230 L231 D250 M251 K264 | SAM
7ys6 | D106 C110 N288 | SRO
6w8m | A22 A23 G70 S110 T111 G112 V113 Y114 | TJY
8g27 | S258 D294 K435 R717 | UDP
7xyk | R24 C170 R190 S191 D193 N201 H231 Y233 | UMP
8g3s | H224 F228 V249 M250 V253 R263 T266 L267 F270 | YLT
8it9 | T92 P93 R96 Y108 L109 K216 V228 S229 H231 H232 | ZL6

Table S13. Atomic coordination dataset. Selected PDBs and coordinating residues (along with the binding ligand) for each protein sample in the atomic coordination dataset.

Figure S17. Alignment improves model generations. pTM and cRMSD distributions of generations from the 98B base model and the aligned model for all ligands in the atomic coordination dataset. Each ligand/model pair has 1024 generations.

Figure S18. Randomly selected successful generations from the base model and the finetuned model. A random sample of ligands is selected and visualized with the ground-truth PDB chain from which the ligand was taken. Solutions produced by ESM3 are diverse, and the finetuned model gives significantly more successes (out of 1024 total samples).

A.5.1. Generation and Selection

The base ESM3 7B model generates candidate GFP designs for laboratory testing using a single prompt and a chain of thought over sequence and structure tokens.
Candidates are filtered and ranked by metrics at several steps in the process. Experiment 1 tests candidates across a range of sequence identities to a template, yielding multiple GFPs including the dim hit B8. Experiment 2 consists of designs generated by starting a chain of thought from the sequence of B8, yielding numerous bright GFPs including C10, which we term esmGFP.

This section details the computational protocol that generated and selected candidate GFP designs for Experiments 1 and 2, shown in Fig. 4B. Protocols, metrics, and selection conventions are separately introduced and then synthesized in the descriptions of the two experiments at the end of the section.

A.5.1.1. MODEL

All candidate GFP designs were created using the base ESM3 7B model with no finetuning. Throughout generation, the model is prevented from decoding cysteine residues.

A.5.1.2. PROMPT

All candidate GFP designs in Experiment 1 are produced with a chain of thought beginning from a single prompt. The goal of the prompt is to capture essential residue identities and structural features needed for chromophore formation and fluorescence, leaving other degrees of freedom open for the model to generate diverse designs.

Template. To this end, we prompt ESM3 with a minimal set of sequence and structure information from 16 residues near the chromophore formation site of a template protein. We select a pre-cyclized intermediate crystal structure from (50), PDB ID 1QY3, as our template. We reverse the chromophore-maturation-slowing mutation R96A in 1QY3 so that the prompt contains Arg96. We subsequently refer to the full sequence and structure of 1QY3 with mutation A96R as 1QY3 A96R, or the template.

Sequence prompt. The sequence portion of our prompt consists of 7 template residues: Met1, Thr62, Thr65, Tyr66, Gly67, Arg96, and Glu222. Residues 65-67 form the chromophore. Met1 ensures proper start codon placement. Residues 62, 96, and 222 are described in (50) and other works as having key catalytic roles in chromophore formation.

Structure prompt. The structure portion of our prompt consists of structure tokens and backbone atomic coordinates taken from 16 template residues at positions 96, 222, and 58-71 (inclusive), which roughly capture the central alpha helix. The unique geometry of the central alpha helix is known to be crucial for chromophore formation (50).

All other positions and tracks in the prompt are masked. The overall prompt length is 229, matching that of the template. Residue indices are contiguous and begin from 1.

A.5.1.3. JOINT SEQUENCE STRUCTURE OPTIMIZATION

We employ the following procedure to jointly optimize the sequence and structure of designs throughout our experiments: while annealing the temperature linearly from 1 to 0, we perform multiple iterations of first predicting the structure of a designed sequence and then Gibbs sampling each position in the sequence for that predicted structure. In algorithmic form:

Algorithm 15 gibbs_seq_given_struct
Input: ESM3 f, sequence x ∈ {0..20}^L, structure y, temperature t
1: for i in shuffle({1, ..., L}) do
2:   x_i ∼ exp(log f(x_i | x_{\i}, y) / t)
3: end for
4: return x

Algorithm 16 joint_optimize
Input: ESM3 f, initial sequence x_1, iterations I, initial temperature t_1, final temperature t_f
1: for i = 1, ..., I do
2:   t_i = (t_f − t_1) · (i / (I − 1)) + t_1
3:   y_i = generate_struct(f, x_i, len(x_i), T = 0)
4:   x_{i+1} = gibbs_seq_given_struct(f, x_i, y_i, t_i)
5: end for
6: return x_{I+1}

Three variants of gibbs_seq_given_struct in joint_optimize were employed for Experiments 1 and 2.
Three variants of gibbs_seq_given_struct in joint_optimize were employed for Experiments 1 and 2. Joint optimization occasionally produces repetitive spans of amino acids when the temperature is annealed to low values. Variants 1 and 2 are intended to address this in differing ways; Variant 3 is an experiment in biasing the logits with a PSSM of known natural GFPs. Half of the candidates in Experiment 2 were produced using Variant 3; this half did not include esmGFP.
- Variant 1: Negative Local Sequence Guidance. We bias the logits of the model away from those produced by only a highly local span of the sequence. Specifically, we use classifier-free guidance (99),

logits' = weight * (logits_cond - logits_uncond) + logits_uncond,

but push away from the logits produced by inputting only the 7 residues centered on the position being sampled (all other sequence positions and all other model inputs are left blank), with weight 2:

logits' = 2 * (logits_cond - logits_local_seq) + logits_local_seq
- Variant 2: Max Decoding Entropy Threshold. We optionally skip resampling of the sequence during Gibbs sampling at positions whose entropy over sequence tokens exceeds a user-specified threshold.
- Variant 3: PSSM Bias. In Experiment 2 only, we experiment with both including and excluding a PSSM-based bias during Gibbs sequence sampling. Specifically, we add a PSSM constructed from 71 natural GFPs (see Appendix A.5.1.4 for details) directly to the sequence output logits of the model, with a user-specified weight. esmGFP did not use this option; it was produced with weight 0.

A.5.1.4. METRICS

GFP designs are produced and scored by a number of ESM3-derived and independent metrics. Unless otherwise noted, designed structures are predicted using ESM3 with only sequence as input, using iterative decoding of structure tokens with temperature 0 and subsequent decoding of backbone coordinates with an older version of the structure token decoder. The following is an exhaustive list of the metrics used. An exact breakdown of where and how specific metrics are used can be found in Appendix A.5.1.5, Appendix A.5.1.6, and Appendix A.5.1.7.

Template Chromophore Site RMSD is calculated via an optimal alignment (100) of N, C, CA, and inferred CB atoms at positions 62, 65, 66, 67, 96, and 222 in the predicted structure of a design and the template (crystal) structure.

Template Helix RMSD is calculated in the same way, but for N, C, CA atoms only, at design and template positions 58-71 (inclusive).

1EMA Helix RMSD is a metric proposed in (101). An RMSD is calculated between alpha helix residues in the predicted designed structure and a specific crystal structure of avGFP, PDB ID 1EMA. Our calculation differs slightly from (101): we calculate RMSD for N, C, CA, and inferred O atoms, and consider only positions 60-64 and 68-74 (both ranges inclusive) to exclude chromophore positions 65-67.

Sequence Pseudo-perplexity is calculated as defined in (102). Given a protein sequence, positions are masked one at a time, negative log-likelihoods of input tokens at masked positions are averaged across all positions in the sequence, and the result is exponentiated.

Round-trip Perplexity is calculated for a designed sequence by predicting its structure with ESM3 and then evaluating the perplexity of the sequence given that predicted structure under a single forward pass of ESM3.

N-gram Score is calculated as the E_ngram term defined in (10). This score assesses the divergence between the N-gram frequencies of residues in the designed sequence and those found in a background distribution derived from UniRef50 2018_03. Specifically, for a function ngram_i that takes in a sequence x and an N-gram order i, and a precomputed distribution of background N-gram frequencies ngram_{i,bg}, the score is calculated as:

E_ngram = Σ_{i ∈ {1,2,3}} D_KL(ngram_i(x), ngram_{i,bg})    (7)

PSSM: A position-specific scoring matrix (PSSM) is constructed from an MSA of 71 natural GFPs (103). Specifically, at positions aligned to our template, frequencies for the 20 canonical amino acids (excluding gaps) are transformed to log odds by dividing by the uniform background (p(aa) = 0.05), adding an epsilon of 1e-9, and applying log base 2. This produces a matrix of scores of size 229 x 20.

PSSM score: We extract from the PSSM the values at the (position, amino acid) pairs occurring in an input sequence. These are averaged to produce a score.
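A minimal sketch of the PSSM construction and PSSM score just described is given below; the alignment handling and array conventions are assumptions for illustration, not the authors' code.

```python
import numpy as np

AA = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {a: i for i, a in enumerate(AA)}

def build_pssm(aligned_seqs, background=0.05, eps=1e-9):
    """Column-wise log-odds PSSM from an MSA aligned to the template.
    Gaps and non-canonical residues are ignored when counting; the result is an
    (L, 20) matrix (L = 229 for the template used here)."""
    L = len(aligned_seqs[0])
    counts = np.zeros((L, len(AA)))
    for seq in aligned_seqs:
        for pos, aa in enumerate(seq):
            if aa in AA_INDEX:
                counts[pos, AA_INDEX[aa]] += 1
    freqs = counts / np.maximum(counts.sum(axis=1, keepdims=True), 1)
    return np.log2(freqs / background + eps)   # divide by background, add eps, log2

def pssm_score(pssm, seq):
    """Average PSSM value over the (position, amino acid) pairs of a design;
    assumes `seq` has the same length as the PSSM (i.e. template numbering)."""
    vals = [pssm[pos, AA_INDEX[aa]] for pos, aa in enumerate(seq) if aa in AA_INDEX]
    return float(np.mean(vals))
```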
N-terminus Coil Count is a metric intended to measure structural disorder at the N-terminus of a design. We observed that predicted structures have various levels of disorder in this region. To quantify it for possible filtering, we apply mkdssp (76) to the ESM3-predicted structure of a design and record how many of the first 12 positions are reported as having SS8 labels in {S, T, C}.

A.5.1.5. SELECTION CRITERIA

In Experiments 1 and 2, designs are selected for testing by first applying a set of filters and then selecting the top N designs according to a score-based ranking. Scores are calculated by summing the values of several metrics, each normalized across designs to have zero mean and unit variance, and negated when appropriate so that lower values are always better.

Common Filters: The following filters are applied in both Experiments 1 and 2.

• Template Chromophore Site RMSD < 1.5 Å
• Template Helix RMSD < 1.5 Å
• N-gram Score < 5

Common Score Terms: The following score terms are used in both Experiments 1 and 2.

• Sequence Pseudo-perplexity
• Round-trip Perplexity
• ESM3 pTM

A.5.1.6. GENERATION AND SELECTION OF DESIGNS FOR EXPERIMENT 1

In this experiment, we generate a set of GFP designs for experimental testing with a range of sequence identities to our template. Designs are generated by a chain of thought: from the prompt, ESM3 decodes all masked structure tokens, then all masked sequence tokens; lastly, sequence and structure tokens are jointly optimized.

Initial Generation: Starting from the prompt, we first generate 38k structures by decoding masked structure tokens one at a time, using a fixed temperature sampled uniformly from the range (0, 1.25) for each generation. To focus compute on the most promising structures, we filter according to Template Chromophore Site RMSD < 1 Å, yielding 24k selected structures. We next generate ≈4 sequences for each structure with a temperature uniformly sampled from the range (0, 0.6), yielding 92k total sequences.

Selection: We select a subset of promising initial generations for further optimization by applying the Common Filters with the N-gram Score threshold modified to < 5.5, ranking designs according to {Common Score Terms, mean ESM3 pLDDT, mean ESMFold pLDDT, and ESMFold pTM}, and selecting the best 40 designs in each interval of 0.1 sequence identity to the template sequence in [0.2, 1.0], 320 in total.

Joint Sequence Structure Optimization: We then jointly optimize the sequence and structure of designs. Using 30 iterations in each case, we run 5 seeds of optimization with max decoding entropy threshold = 1.5 and 2 seeds of optimization with negative local sequence guidance = 2.0, yielding 67k total designs. Designs from every iteration are included in this pool.

Selection: To select a set of designs for laboratory testing, we apply {Common Filters, N-terminus Coil Count < 6}, rank designs according to {Common Score Terms, ESMFold pTM, 15 * PSSM Score}, and select the best 88 designs across 8 buckets of sequence identity to our template among intervals of width 0.1 in the range [0.2, 1.0].
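The filter-then-rank selection of Appendix A.5.1.5 and A.5.1.6 can be sketched as follows, assuming a pandas DataFrame of per-design metrics; the column names and weights are illustrative, not the authors' code.

```python
import pandas as pd

def select_designs(df, filters, score_terms, n_select):
    """Apply boolean filters, z-score normalize the chosen metrics, sum them with
    signs such that lower is always better, and keep the top-N designs."""
    for col, keep in filters.items():
        df = df[df[col].map(keep)]
    cols = list(score_terms)
    z = (df[cols] - df[cols].mean()) / df[cols].std()        # zero mean, unit variance
    combined = sum(sign * z[col] for col, sign in score_terms.items())
    return df.assign(score=combined).nsmallest(n_select, "score")

# Illustrative use with metric names from this section (column names assumed):
# selected = select_designs(
#     designs,
#     filters={"template_chromophore_rmsd": lambda v: v < 1.5,
#              "template_helix_rmsd": lambda v: v < 1.5,
#              "ngram_score": lambda v: v < 5.0},
#     score_terms={"seq_pseudo_perplexity": +1,   # lower is better
#                  "round_trip_perplexity": +1,
#                  "esm3_ptm": -1},               # higher is better, so negated
#     n_select=88,
# )
```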
A.5.1.7. GENERATION AND SELECTION OF DESIGNS FOR EXPERIMENT 2

In this experiment, we perform further refinement of the dim, distant GFP found in Experiment 1, B10. To produce a diversity of designs, we sweep over a number of settings: two variations of refinement are performed, and two selection protocols are used.

Local Joint Optimization: Starting from our dim GFP design, B10, we perform joint_optimize using a full grid sweep over the following sets of settings: initial temperatures {0.001, 0.01, 0.05, 0.1, 0.5}, PSSM bias weights {0, 0.01, 0.05, 0.1, 0.5}, and max decoding entropy thresholds {0.8, 1, 1.25, 1.5, 2.0}. For each unique combination of settings, we use 20 iterations of optimization with 3 seeds, continuing the final step of Gibbs sampling until convergence. After accounting for some distributed-system machine failures, this yields 6.3k total candidate designs.
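The sweep above amounts to a full cross product of settings and seeds; a sketch is shown below. The extended keyword arguments of `joint_optimize_fn` (PSSM bias weight, entropy threshold, seed, final temperature of 0) are assumptions for illustration, not the authors' interface.

```python
from itertools import product

def experiment2_sweep(model, start_seq, joint_optimize_fn):
    """Full grid sweep over the Experiment 2 refinement settings described above.
    `joint_optimize_fn` stands in for Algorithm 16 extended with the Variant 2/3
    options; keyword names below are illustrative only."""
    initial_temps = [0.001, 0.01, 0.05, 0.1, 0.5]
    pssm_bias_weights = [0, 0.01, 0.05, 0.1, 0.5]
    max_entropy_thresholds = [0.8, 1, 1.25, 1.5, 2.0]
    runs = []
    for t0, w, h, seed in product(initial_temps, pssm_bias_weights,
                                  max_entropy_thresholds, range(3)):
        runs.append(joint_optimize_fn(model, start_seq, iters=20,
                                      t_start=t0, t_end=0.0,   # final temperature assumed 0
                                      pssm_bias_weight=w,      # Variant 3
                                      max_entropy_threshold=h, # Variant 2
                                      seed=seed))
    return runs  # 5 x 5 x 5 x 3 = 375 optimization runs in total
```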
Selection: We select two sets of 45 designs for laboratory testing via two filters and a shared set of ranking criteria.

- Set 1: We filter according to {PSSM Bias ≠ 0, Common Filters, RMSD to starting structure < 1 Å, Identity to starting sequence in (0.7, 1.0)}.
- Set 2: We filter according to {PSSM Bias = 0 (no bias), Common Filters, RMSD to starting structure < 1 Å, Identity to starting sequence in (0.9, 1.0)}. esmGFP comes from this pool.

For each set, we rank according to {Common Score Terms, 8 * PSSM Score, 15 * 1EMA Helix RMSD} and select 45 designs each for testing.

A.5.2. Experimental Methods and Data Analysis

A.5.2.1. STRAINS AND PLASMIDS

We designed a custom bacterial expression vector containing an ampicillin-resistance gene, the BBa_R0040 TetR promoter, the BBa_B0015 terminator, and a BsaI golden gate site between the promoter and terminator. GFP designs were codon optimized for E. coli expression and ordered from IDT (Integrated DNA Technologies) containing compatible golden gate overhangs. They were then cloned by golden gate assembly into the vector. We evaluated our GFP designs in the E. coli host Mach1.

A.5.2.2. FLUORESCENCE ASSAYS OF GFP DESIGNS

To evaluate the fluorescence of our GFP designs, we transformed our designs into Mach1 cells. For each of two replicates of a design, a colony was seeded into a 1 mL TB culture containing 50 µg/mL carbenicillin. Cultures were grown in 96 deep-well blocks at 37 °C in an Infors HT Multitron shaker at 1000 RPM for 24 hours. After 24 hours, 1 µL of each culture was diluted in 200 µL of 0.2 µm filtered DPBS, and fluorescence intensity of the samples was quantified at the single-cell level using a NovoCyte Quanteon flow cytometer (Fig. S19). The remaining cultures were spun down at 4000 g for 10 minutes, resuspended and lysed with 300 µL lysis buffer (1x BugBuster, 500 mM NaCl, 20 mM Tris-HCl pH 8, 10% glycerol, cOmplete EDTA-free Protease Inhibitor Cocktail), incubated at room temperature on a Belly Dancer orbital shaker for 10 minutes, and the lysate clarified by centrifugation at 4000 g for 20 minutes. 100-120 µL of lysate was transferred to a 96-well black clear-bottom plate, and GFP fluorescence was measured using a Tecan Spark reader. Fluorescence emission was captured at 515 nm with a 10 nm bandwidth, with excitation at 485 nm with a 10 nm bandwidth. Absorbance was captured at 280 nm with a 3.5 nm bandwidth to assess total protein content per well. For longer time points, plates containing lysate were sealed and incubated at 37 °C for up to 7 days prior to measuring fluorescence. GFP fluorescence values were first ratio normalized within a well by their absorbance at 280 nm, and then further ratio normalized across wells using the measured values from a negative control E. coli containing the vector without GFP. Data from two replicates was then averaged for Fig. 4B (bottom) and Fig. 4C. Overview photos of the plates (Fig. 4B top) were taken with an iPhone 12 mini under blue light illumination from an Invitrogen Safe Imager 2.0 Blue Light Transilluminator. For excitation spectra, emission was captured at 570 nm with a 50 nm bandwidth while the excitation wavelength was varied from 350 to 520 nm with a 10 nm bandwidth. For emission spectra, an excitation wavelength of 430 nm was used with a 50 nm bandwidth, while emission was captured at varying wavelengths from 480 to 650 nm with a 10 nm bandwidth. Excitation and emission spectra were normalized by their maximum values (Fig. 4C).
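The plate-reader normalization just described is two ratio normalizations followed by averaging of replicates; a sketch with assumed column names (not the authors' analysis code) is given below.

```python
import pandas as pd

def normalize_plate(df):
    """`df` is assumed to have one row per well with columns 'design', 'gfp_fluor',
    'a280', and boolean 'is_negative_control'; returns per-design averages."""
    df = df.assign(per_a280=df["gfp_fluor"] / df["a280"])        # within-well ratio
    neg = df.loc[df["is_negative_control"], "per_a280"].mean()   # vector-only control
    df = df.assign(normalized=df["per_a280"] / neg)              # across-well ratio
    return df.groupby("design")["normalized"].mean()             # average replicates
```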
A.5.2.3. ADDITIONAL GFP EXPERIMENTS

Plate overview photographs (Fig. 4B top) were taken more than two weeks after the initial lysate was created and more than one week after the final plate reader quantification, and so may show additional brightness from designs with slowly maturing chromophores. We observed some low-level contamination of wells H11 (vector with no GFP or designs) and H12 (lysis buffer only) in the photograph of Experiment 1 (Fig. 4B top left). Some of this contamination is already visible in well H12 during the initial plate reader quantification (Fig. 4B bottom left). To address potential contamination concerns, we performed an additional replication of B8 and observed a similar level of brightness to Experiment 1 (50x less bright than natural GFPs) (Fig. S20).

Chromophore knockout versions of 1QY3 A96R and esmGFP were created through additional T65G and Y66G mutations. These variants, along with 1QY3 and esmGFP, were synthesized and measured as part of an independent replicate performed by Genscript following the E. coli-based fluorescent plate reader assay described above. Normalization was performed with an OD600 measurement of the cells prior to lysis. Analysis otherwise proceeded as above. Two replicates were performed for each design and results were averaged. Chromophore knockout reduced fluorescence to background levels (Fig. S21).

A.5.3. Sequence searches and comparisons

A.5.3.1. DATABASE SEARCHES

BLAST nr search: esmGFP's sequence was searched with BLAST's online server using the non-redundant sequence database nr with all default settings. tagRFP's sequence was taken from the top hit. The exact top hit found was TagRFP [Cloning vector pLX-B2-TagRFP-T], Sequence ID ASG92118.1, and is shown in its entirety in Table S14.

Train set search: MMseqs2 (73), version 15.6f452, was used to search all datasets that ESM3 was trained on at the maximum available expansion level; for cluster-resampling datasets all cluster members are searched, not just cluster centers. The goal is to search against every possible sequence that ESM3 may have seen during pre-training. Settings are selected for conducting a high-sensitivity search: -s 6 -a --max-seqs 10000.

A.5.3.2. SEQUENCE IDENTITY CALCULATIONS

To calculate sequence identities involving the two highlighted GFP designs (B8, esmGFP) and select reference proteins, the following procedure is used. MAFFT (104) v7.525 is applied with all default settings to the sequences of B8, esmGFP, the top tagRFP sequence found by BLAST, eqFP578 (from FPBase (105)), the template (PDB ID 1QY3, with mutation A96R), and avGFP (from FPBase). Identities between two sequences are calculated as the number of matching non-gap residues at aligned positions divided by the minimum non-gapped length of the query and target protein. This is the same sequence identity formula used in Appendix A.5.4. Aligned sequences, identities, and mutation counts to esmGFP are provided in Table S14.

Figure S19. Flow cytometry data confirms cells expressing esmGFP can be detected at the single cell level. Forward Scatter-Area (FSC-A), a measure of cell size, vs. Fluorescein Isothiocyanate-Area (FITC-A), a measure of GFP-like fluorescent signal, for cells expressing 1QY3 A96R, esmGFP, and a negative control that does not express any GFP. A gate was set at the 99.9% quantile for the negative control data, and the fraction of cells passing the gate was quantified for each sample.

Figure S20. Replication of design B8 and select controls. Results are averages of eight wells across two plates.
A.5.3.3. INNER-BARREL MUTATION COUNT

Positions in esmGFP are described as internal if they have SASA < 5 in their predicted structure. SASA is calculated as in Appendix A.2.1.6 from the all-atom structure of esmGFP, predicted with ESM3 7B.
Sequences and metadata of natural and designed fluorescent proteins were obtained from FPBase (105). An initial set of 1000 proteins was filtered to those containing the following metadata: a specified parent organism, an amino acid sequence between 200 and 300 residues long, a specified emission maximum, and no cofactors. The NCBI taxonomy database was used to obtain taxonomic information about each species. The sequences were further filtered to keep those whose species were found by NCBI and were eukaryotic but not from Chlorophyta (to exclude channelrhodopsin-like proteins). The 648 sequences that passed these criteria, along with the sequence for esmGFP, were aligned into a multiple sequence alignment using MAFFT, and sequence identity was computed between each pair of sequences as described above. All pairs within and across taxa were considered for Fig. 4F. All designed sequences were considered to belong to the species annotated as their parent organism.
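A minimal sketch of the pairwise identity and mutation-count conventions used here and in Table S14 is given below; it operates on rows of a MAFFT alignment, and the function names are illustrative only.

```python
def aligned_identity(a, b, gap="-"):
    """Matching non-gap residues at aligned positions, divided by the minimum
    non-gapped length of the two sequences (Appendix A.5.3.2)."""
    assert len(a) == len(b), "inputs must be rows of the same alignment"
    matches = sum(1 for x, y in zip(a, b) if x != gap and y != gap and x == y)
    min_len = min(len(a.replace(gap, "")), len(b.replace(gap, "")))
    return matches / min_len

def mutations_to_target(query, target, gap="-"):
    """Mismatches at aligned positions where the target (e.g. esmGFP) has no gap,
    as used for the mutation counts in Table S14."""
    return sum(1 for q, t in zip(query, target) if t != gap and q != t)
```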
Figure S21. Chromophore knockout mutations T65G and Y66G reduce fluorescence of both 1QY3 A96R and esmGFP to background levels.
Figure S22. Sequence identity of esmGFP with natural and designed GFPs from the four major classes found in nature.
All 648 sequences used belonged to the Leptocardii (e.g., laGFP), Hexanauplia (e.g., ppluGFP), Hydrozoa (e.g., avGFP), or Anthozoa (e.g., efasGFP) classes. The sequence identity of esmGFP was computed against each protein in these classes (Fig. S22). esmGFP was found to be closest to Anthozoan GFPs (average sequence identity 51.4%) but also shares some sequence identity with Hydrozoan GFPs (average sequence identity 33.4%).
To estimate the evolutionary distance in time between esmGFP and known fluorescent proteins, we built an estimator mapping sequence identity between pairs of GFPs to millions of years (MY) of separation. We used the following six Anthozoan species: Acropora millepora, Ricordea florida, Montastraea cavernosa, Porites porites, Discosoma sp., and Eusmilia fastigiata, along with the six GFPs amilGFP, rfloGFP, mcavGFP, pporGFP, dis3GFP, and efasGFP, respectively. These species and GFPs were chosen because they were annotated in both a recent time-calibrated phylogenetic analysis of the Anthozoans (53) and a recent study of GFPs (44). Each of these species contains multiple GFP-like sequences, including red and cyan FPs; these particular GFPs were chosen as they were annotated to be the main GFP in each species. The divergence time between each pair of species was estimated as twice the time to their last common ancestor annotated in the time-calibrated phylogenetic analysis. Using statsmodels (106), a line of best fit was fit between MY and sequence identity; the line was required to pass through a sequence identity of 1.0 at 0 MY. The MY to esmGFP was then estimated using this line and the sequence identity of esmGFP to the nearest known protein.
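A sketch of this estimator is given below, assuming paired lists of pairwise identities and divergence times for the six reference GFPs; the variable names and wrapper are illustrative, not the authors' analysis code.

```python
import numpy as np
import statsmodels.api as sm

def fit_my_vs_identity(identities, my_separations):
    """Least-squares line constrained through (identity = 1.0, MY = 0), i.e.
    MY = slope * (1 - identity), fit with statsmodels (no intercept term).
    Returns a function mapping a sequence identity to an MY estimate."""
    x = 1.0 - np.asarray(identities, dtype=float)   # 0 at identity 1.0
    y = np.asarray(my_separations, dtype=float)
    fit = sm.OLS(y, x).fit()                         # regression through the origin
    slope = float(fit.params[0])
    return lambda identity: slope * (1.0 - identity)
```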
A.6. OPEN MODEL

We are releasing the ESM3 source code and model weights of an open model, ESM3-open. ESM3-open is a 1.4B-parameter model we trained without OAS antibody sequences and with precautionary risk mitigations for release to the academic research community. As part of this release, we follow guidance from the Principles for the Responsible Development of AI for Biological Design (107). We adopted precautionary risk mitigations, described in Appendix A.6.1, and performed risk evaluations, detailed in Appendix A.6.2. Additionally, we conducted a review of the risks and benefits of releasing ESM3-open with experts from the scientific community. We provided reviewers access to ESM3-open, along with a detailed technical report on our risk evaluations. We received unanimous feedback from our reviewers that the benefits of releasing the model greatly outweigh any potential risks.

We see this release as a first step and plan to work with the scientific community to continue to improve processes around responsible development. Open models enable the scientific community to better understand and reduce any potential risks of biological design tools. As our understanding develops alongside the capabilities of future models, we plan to continuously improve our evaluation frameworks, safeguards, and mitigation strategies.

A.6.1. ESM3-open Mitigations

As a precaution, we filtered the training data of ESM3-open to minimize model performance on sequences of potential concern while otherwise maintaining performance. We also removed the capability for the model to follow prompts related to viruses and toxins.

Filtering sequences of potential concern. Previous work has shown that the performance of protein language models is closely related to the number of similar sequences present in the training data (5). We therefore removed sequences aligned to potentially concerning proteins from the training data in order to reduce the capability of ESM3-open on these sequences. We identified and removed sequences unique to viruses, as well as viral and non-viral sequences from the Select Agents and Toxins List (108) maintained by the CDC and USDA. The U.S. Department of Health & Human Services recommends filtering based on the Select Agents list as part of their Screening Framework Guidance for Providers and Users of Synthetic Nucleic Acids (109).

Table S14. Multiple sequence alignment of select GFP designs (B8, esmGFP) and reference proteins. Template is the full sequence of our template structure (PDB ID 1QY3), with the chromophore-maturation-slowing mutation removed via A96R. tagRFP is the full sequence of the top hit returned by BLAST search of the non-redundant database nr; avGFP and eqFP578 are from FPBase. Sequence identities for GFP designs are in general calculated as the number of non-gap matches at aligned positions, divided by the minimum length of the query and target ungapped sequences. Here, only sequence identities to esmGFP are shown. Similarly, the number of mutations to esmGFP is calculated as the number of mismatches at aligned positions where esmGFP does not have a gap.

B8 (sequence identity to esmGFP: 0.93; mutations to esmGFP: 15)
-MSKVEELIKPEMKMKLEMEGEVNGHKFSIEAEGEGKPYEGKQTIKAWSTT-GKLPFAW
DILSTSLTYGFRMFTKYPEGLEEHDYFKQSFPEGYSWERTITYEDGATVKVTSDISLED
GVLINKIKFKGTNFPSDGPVM-QKKTTGWEPSTELITPDPATGGLKGEVKMRLKLEGGG
esmGFP (sequence identity to esmGFP: 1.0; mutations to esmGFP: 0)
-MSKVEELIKPDMKMKLEMEGEVNGHKFSIEAEGEGKPYEGKQTIKAWSTT-GKLPFAW
DILSTSLTYGNRAFTKYPEGLEQHDFFKQSFPEGYSWERTITYEDGATVKVTADISLED
GVLINKVKFKGENFPSDGPVM-QKKTTGWEASTELITPDPATGGLKGEVKMRLKLEGGG
tagRFP (sequence identity to esmGFP: 0.58; mutations to esmGFP: 96)
MVSKGEELIKENMHMKLYMEGTVNNHHFKCTSEGEGKPYEGTQTMRIKVVEGGPLPFAF
DILATSFMYGSRTFINHTQGIP--DFFKQSFPEGFTWERVTTYEDGGVLTATQDTSLQD
GCLIYNVKIRGVNFPSNGPVM-QKKTLGWEANTEMLY--PADGGLEGRTDMALKLVGGG
HLICNFKTTYRSKKPAKNLKMPGVYYVDHRL--ERIKEADKETYVEQHEVAVARYCDLP
SKLGHKLN

eqFP578 (sequence identity to esmGFP: 0.53; mutations to esmGFP: 107)
----MSELIKENMHMKLYMEGTVNNHHFKCTSEGERKPYEGTQTMKIKVVEGGPLPFAF
DILATSFMYGSKTFINHTQGIP--DLFKQSFPEGFTWERITTYEDGGVLTATQDTSLQN
GCIIYNVKINGVNFPSNGSVM-QKKTLGWEANTEMLY--PADGGLRGHSQMALKLVGGG
YLHCSFKTTYRSKKPAKNLKMPGFHFVDHRL--ERIKEADKETYVEQHEMAVAKYCDLP
SKLGHR--

template (sequence identity to esmGFP: 0.38; mutations to esmGFP: 143)
-MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTT-GKLPVPW
PTLVTTLTYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTISFKDDGNYKTRAEVKFEG
DTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYITADKQKNGIKANFKIRHNIEDGS
avGFP (sequence identity to esmGFP: 0.36; mutations to esmGFP: 146)
-MSKGEELFTGVVPILVELDGDVNGHKFSVSGEGEGDATYGKLTLKFICTT-GKLPVPW
PTLVTTFSYGVQCFSRYPDHMKQHDFFKSAMPEGYVQERTIFFKDDGNYKTRAEVKFEG
DTLVNRIELKGIDFKEDGNILGHKLEYNYNSHNVYIMADKQKNGIKVNFKIRHNIEDGS
VQLADHYQQNTPIGDGP-VLLPDNHYLSTQSALSKDPN-EKRDHMVLLEFVTAAGITHG
MDELYK--

Figure S23. ESM3-open is a powerful predictor of structure and function trained for open release. A: Structure Prediction. ESM3-open (blue) is competitive with ESMFold (orange) on structure prediction as measured by LDDT on CAMEO and CASP14/15. See Appendix A.3.4 for details on this evaluation. B: Representation Learning. ESM3-open (blue) is competitive with ESM2-3B (orange) on representation learning as measured by contact prediction P@L for finetuned representations. See Appendix A.3.3 for details on this evaluation. C: Function Keyword Prediction. ESM3-open function prediction performance, as measured by Mean Average Precision across function keywords. ESM3-open achieves 0.81 precision across all keywords, and 0.89 for the top 1K most prevalent keywords in the validation set (CAMEO). We use the same evaluation framework as in Appendix A.1.8.2.2. We report both the macro and micro averages as in Fig. S8. In each of the preceding evaluations, the data mitigation minimally impacted performance, as compared to a compute-matched model without data mitigations (hatched blue). D: Zero-shot Fitness Prediction. Fitness prediction performance as measured by correlation (Spearman ρ) across 217 Deep Mutational Scanning datasets collated in ProteinGym. Left and right subplots indicate viral (left) and non-viral (right) DMS datasets. The four columns per group indicate different models. ESM3-open performs substantially worse than EVMutation (purple) on viral fitness prediction, while being competitive with ESM2 (orange) on non-viral fitness prediction. Viral fitness prediction was substantially impacted by the data mitigation, while non-viral fitness prediction was not (hatched blue).

To filter data, we create two denylists: the Viral Denylist and the Select Agent Denylist. We then remove all sequences from the training set that are detected to align to those in the denylists by MMseqs2 at or above a given sequence identity threshold. To create the Viral Denylist, we identify ~4M sequences that are annotated as viral in UniProt and align almost exclusively to other viral sequences in UniProt. This gives us a procedure that removes viral proteins with both high sensitivity and specificity (as measured by UniProt taxonomic annotations). To create the Select Agents Denylist, we identify all sequences in UniProt belonging to organisms on the Select Agents and Toxins List (108).
This process gives us 147K non-viral sequences and 40K additional viral sequences. For each denylist, MMseqs was used to query against the full set of training databases (including PDB, UniRef, MGnify, and JGI), and all hits were removed from the training set. This filter removes a total of 10.6M sequences across all training sets.

Removal of keywords of concern. There are a number of keyword prompts associated with viruses and toxins that we aim to remove. We first identify a list of harmful keywords with the following steps:
Function Keyword Prediction. ESM3-open is able to predict function keywords for proteins in a validation set derived from UniRef and annotated with InterProScan; see Fig. S23C. ESM3-open achieves a Mean Average Precision for all keywords of 0.81 (macro average), and a precision of 0.89 (micro average) for the top 1000 keywords, discarding common terms such as "the". The evaluation framework is the same as that described in Appendix A.1.8.2.2.
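The macro and micro Mean Average Precision reported here can be reproduced in outline with scikit-learn's multilabel average precision; the label matrices below are assumed inputs for illustration, not released data.

```python
from sklearn.metrics import average_precision_score

def keyword_map(y_true, y_score):
    """Mean Average Precision over function keywords, reported as macro (average
    over keywords) and micro (pooled) variants. `y_true` is an assumed
    (n_proteins, n_keywords) binary label matrix; `y_score` holds model scores
    of the same shape."""
    return {
        "macro": average_precision_score(y_true, y_score, average="macro"),
        "micro": average_precision_score(y_true, y_score, average="micro"),
    }
```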
Zero-shot Viral Fitness Prediction. We measure the ability of ESM3 to identify viable sequences and understand the effects of mutations on viral proteins. The evaluation consists of the single mutant variants from 217 Deep Mutational Scanning (DMS) datasets collected in ProteinGym (110). This includes 28 DMS landscapes from viral proteins and 189 from other proteins. We evaluate the correlation (Spearman ρ) between the predicted variant effect and measured variant effect. The predicted variant effect is measured as the difference between the logit value for the variant allele and the logit value of the wildtype allele at a given masked position (16).
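A sketch of the masked-marginal variant scoring just described is given below; `model.token_logits` is an assumed wrapper, not a specific ESM3/ESM2 function.

```python
def variant_effect_score(model, sequence, position, wt_aa, mut_aa, mask_token="<mask>"):
    """Mask the mutated position, run a single forward pass, and score the variant
    as logit(variant allele) - logit(wild-type allele) at that position.
    `model.token_logits(tokens, position)` is assumed to return a mapping from
    amino acid letters to logits at `position`."""
    tokens = list(sequence)
    assert tokens[position] == wt_aa, "wild-type residue mismatch at this position"
    tokens[position] = mask_token
    logits = model.token_logits(tokens, position)   # assumed API
    return logits[mut_aa] - logits[wt_aa]
```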
First, we compare the performance of ESM3-open to a compute-matched version of ESM3-open which did not undergo any data filtering. Applying data filtering as a mitigation reduces average Spearman ρ performance on viral fitness prediction from 0.28 (ESM3-small) to 0.17 (ESM3-open), while performance on non-viral proteins is not adversely affected, changing from 0.46 (ESM3-small) to 0.45 (ESM3-open). We also compare the performance of ESM3-open to existing open model baselines. Fig. S23D assesses performance relative to the EVMutation (111) baseline. EVMutation is a Markov Random Field model (not deep-learning-based) trained on a multiple sequence alignment of the target protein. BLOSUM62 is a baseline based on amino acid substitution frequencies. After mitigations, ESM3-open performance on viral landscapes is low compared to EVMutation and on par with BLOSUM62.

List of Figures

S1 The ESM3 architecture
S2 Geometric Attention
S3 Structure tokenizer reconstruction quality
S4 Visualization of structure tokenizer reconstructions
S5 Visualization of local neighborhoods which map to the same learned structure token
S6 pTM and pLDDT calibration
S7 Schematic of function tokenization
S8 Function prediction benchmarking results
S9 Visualization of noise schedules used
S10 Scaling curves for structure prediction
S11 Conditional and unconditional scaling behavior for each track
S12 Distribution of pTM and pLDDT
S13 Unconditional generation of high-quality and diverse proteins using ESM3
S14 Generation of sequences using chain-of-thought
S15 Prompting ESM3 to generalize beyond its training distribution
S16 Multimodal protein editing with ESM3
S17 Alignment improves model generations
S18 Randomly selected successful generations from the base model and finetuned model
S19 Flow cytometry data confirms cells expressing esmGFP can be detected at the single cell level
S20 B8 Replication
S21 Chromophore knockout mutations
S22 Sequence identity of esmGFP
S23 ESM3-open is a powerful predictor of structure and function trained for open release

List of Tables

S1 Parameter details for different model configurations
S2 Training details for stage 2 training of an all-atom structure token decoder
S3 Pre-training dataset statistics
S4 Pre-training unique token statistics
S5 Data augmentation and conditioning information applied to each dataset
S6 Noise Schedules and Dropout Probabilities
S7 Precision @ L
S8 Protein structure prediction results
S9 Negative log-likelihood of each track conditioned on other tracks
S10 Functional motif definitions for conserved region
S11 InterPro tags extracted from CAMEO test set proteins for prompting with fold specification
S12 Novelty and designability metrics
S13 Atomic coordination dataset
S14 Multiple sequence alignment of select GFP designs (B8, esmGFP) and reference proteins