Portal Weekly #65: protein design competition, one-shot design of binders, sequence-structure co-generation for design, and more.
Join Portal and stay up to date on research happening at the intersection of tech x bio 🧬
Hi everyone 👋
Welcome to another issue of the Portal newsletter where we provide weekly updates on talks, events, and research in the TechBio space!
📅 Protein Design Competition Kickoff
Our friends at Polaris, in collaboration with Adaptyv Bio & Dimension, are launching round 2 of their EGFR binder design challenge! You’ll get access to newly generated data from round 1 to refine your designs, and the top 3 participants will earn a speaking slot at the AIDrugX workshop at NeurIPS.
To kick things off, they’re hosting in-person hackathons on Saturday, October 12th in NY, SF, Montreal, London, and Lausanne. Connect with the community, hear from experts, and start your design journey. Anyone can register!
Sign up here: https://protein.polarishub.io/kick-off-events/
Inspired by the competition, this week we’re highlighting recent work related to proteins! Have you seen any cool methods for protein design recently? Drop us a comment. 👇
💬 Upcoming talks
LoGG continues next week with a talk by Ali Saberi, who will present a paper introducing LoRNASH, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture. He will be able to answer all your questions!
Join us live on Zoom on Monday at 12pm ET. Find more details here. Make sure to check the LoGG calendar for updates on future talks.
You can also follow us on Twitter and LinkedIn!
If you enjoy this newsletter, we’d appreciate it if you could forward it to some of your friends!
Let’s jump right in!👇
📚 Community Reads
LLMs for Proteins
Protein Language Model Fitness is a Matter of Preference
This paper looks into how protein language models (pLMs), which are trained on huge amounts of evolutionary data, can help us better understand and design proteins that actually work. To figure out how good these models are, the researchers used data from deep mutational scanning experiments. In these experiments, you take a protein, make changes to specific positions, and then test the new versions to see how well they perform on things like binding, fluorescence, or stability. Each version gets a measured fitness score, and comparing those scores with the model’s predictions shows how well the model can anticipate the effect of each change. What they found is pretty interesting: the model’s ability to predict protein fitness really depends on how “familiar” the original sequence is to the model. If the sequence lines up with what the model already knows from its training data, it’s great at making predictions. But if the sequence is way off from what the model has seen, the predictions aren’t as reliable. This means that pLMs don’t just capture natural patterns; they also reflect the biases of the data they’ve been trained on. That said, the study also shows that you can improve the model’s performance on those tricky, less familiar sequences with some fine-tuning, making these models even more helpful for protein design. In the end, it turns out that predicting how mutations will affect proteins isn’t just about following natural mutation patterns. It also depends on how well the model’s preferences, shaped by its training, align with the original sequence.
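To make the evaluation idea concrete, here is a minimal sketch of how agreement between model scores and measured fitness is commonly summarized with a Spearman rank correlation. The per-variant scores and fitness values below are made up for illustration, and this is not necessarily the exact metric or pipeline used in the paper.

```python
def ranks(values):
    """Return 1-based ranks, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(x, y):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical data: one model score and one assay measurement per variant.
model_scores = [-0.2, -1.5, -3.1, -0.9, -2.4]
measured_fitness = [0.95, 0.60, 0.10, 0.40, 0.30]

rho = spearman(model_scores, measured_fitness)
print(round(rho, 3))
```

A value near 1 means the model ranks variants in roughly the same order as the experiment; a value near 0 means its ranking carries little information about fitness.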
ML for Protein Binding
BindCraft: one-shot design of functional protein binders
This paper introduces BindCraft, an open-source tool that leverages deep learning models like AlphaFold2 to design protein binders. Protein-protein interactions (PPIs) are crucial for most biological processes but notoriously difficult to design. BindCraft tackles this by automating the creation of binders with nanomolar affinities and success rates between 10% and 100%, all without requiring high-throughput screening or extensive experimental optimization. The tool has proven effective across a variety of challenging targets, including cell-surface receptors, allergens, and CRISPR-Cas9 proteins, highlighting its potential for therapeutic applications. For example, BindCraft-designed binders significantly reduced IgE binding to birch allergens in patient samples, showcasing its functional impact. The process behind BindCraft is based on several key steps, as illustrated in the binder design pipeline. Starting with a target protein structure, a binder backbone and sequence are generated and then optimized using a combination of deep learning tools while keeping the crucial interface between binder and target intact. Only the most promising designs are selected for further testing based on their predicted performance. The results have been promising: many of the designs showed binding in experimental measurements like SPR, and some reached impressive binding affinities without any need for further sequence optimization, illustrating the efficiency and precision of the approach.
Allo-Allo: Data-efficient prediction of allosteric sites
This paper introduces Allo-Allo, a cutting-edge, sequence-based method designed to predict allosteric sites—specific regions in proteins where binding of a ligand can influence distant functional areas. Allostery plays a critical role in many drug-target proteins, such as GPCRs, but predicting these sites has been historically difficult due to limited experimental data. Allo-Allo tackles this problem by harnessing protein language models (PLMs), specifically using the attention heads from ESM-2, to detect allosteric residue interactions with significantly improved accuracy. The method shows a 67% higher area under the precision-recall curve (AUPRC) compared to existing techniques, making it far more reliable. What makes Allo-Allo unique is its data efficiency, outperforming other PLM-based methods while generalizing well to a wide range of proteins. The inner workings of Allo-Allo are based on a detailed analysis of how PLM attention heads process pairwise residue relationships. As depicted in the method’s pipeline, each of the attention heads within the PLM layers is evaluated for its contribution to the overall allosteric signal. Allosteric sites with a high impact are identified and used to select attention heads that are sensitive to allosteric changes. This information is then fed into a random forest classifier to make precise predictions about allosteric sites.
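The head-selection step can be sketched roughly as follows. This is a toy illustration only: the scoring rule (share of a head's attention mass that lands on known allosteric residues), the head names, and the tiny 3-residue attention maps are all assumptions made for the example, not the paper's actual procedure. In the described pipeline, features from the selected heads would then go to a random forest classifier, which is omitted here.

```python
def head_allosteric_score(attention, allosteric_sites):
    """Share of a head's total attention mass received by allosteric residues.

    attention: square list-of-lists, attention[i][j] = attention from residue i to j.
    allosteric_sites: set of residue indices labelled as allosteric.
    """
    total = sum(sum(row) for row in attention)
    on_sites = sum(row[j] for row in attention for j in allosteric_sites)
    return on_sites / total


def select_heads(attention_by_head, allosteric_sites, top_k):
    """Return the names of the top_k heads with the highest allosteric score."""
    scores = {
        name: head_allosteric_score(attention, allosteric_sites)
        for name, attention in attention_by_head.items()
    }
    return sorted(scores, key=scores.get, reverse=True)[:top_k]


# Hypothetical attention maps for a 3-residue protein; residue 2 is allosteric.
attention_by_head = {
    "head_a": [[0.4, 0.4, 0.2], [0.5, 0.3, 0.2], [0.3, 0.5, 0.2]],
    "head_b": [[0.1, 0.1, 0.8], [0.2, 0.1, 0.7], [0.1, 0.3, 0.6]],
}
allosteric_sites = {2}

selected = select_heads(attention_by_head, allosteric_sites, top_k=1)
print(selected)
```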
Binding Affinity Prediction: From Conventional to Machine Learning-Based Approaches
This paper reviews the essential process of protein-ligand binding, a critical interaction where small molecules bind to target proteins, altering their functions. Predicting the strength of these interactions, known as binding affinity, is a central focus in drug design. While early insights came from chemistry and quantum mechanics-based approaches, the rise of machine learning (ML) and deep learning (DL) has significantly advanced predictions. These modern models, fueled by growing protein-ligand datasets, offer better estimates of binding constants and poses. However, challenges persist due to the limitations of current datasets and the complexity of sub-problems like scoring, docking, and ranking. The paper traces the evolution of binding affinity prediction methods, from traditional to ML and DL-based techniques, providing a comprehensive look at key datasets and benchmarks like PDBbind and MOAD. One area of difficulty involves models that attempt to predict binding affinity without detailed interaction information. These so-called interaction-free models often lack the critical details needed to make accurate predictions, especially in capturing the long-range interactions between proteins and ligands. Graph neural networks (GNNs), for example, have shown promise but still struggle with this aspect, which is crucial for reliable predictions. Even with the advancements in machine learning, this remains a significant obstacle in the quest for more accurate and efficient binding predictions.
Generalized Protein Pocket Generation with Prior-Informed Flow Matching
This paper introduces PocketFlow, a generative model designed to address the complexities of creating ligand-binding protein pockets, a key challenge in bioengineering and protein biology. Traditional methods for generating these pockets often rely on slow physical simulations or template-based approaches, which can compromise speed and accuracy. PocketFlow offers a more efficient solution by integrating knowledge of protein-ligand interactions, such as hydrogen bonding, into a flow matching model. During training, the model learns the detailed interactions between proteins and ligands, while during sampling, it uses multi-granularity guidance to ensure that the generated pockets are both high-affinity and structurally valid. In PocketFlow’s process, the protein-ligand complex is parameterized by modeling the protein as a sequence of amino acid residues, and the ligand as a generalized atom-level structure. The approach focuses on defining the protein pocket based on the residues closest to the ligand, while carefully considering key protein-ligand interaction types like hydrogen bonding, salt bridges, and hydrophobic interactions. The model uses these parameters to ensure the generated pockets not only fit geometrically but also optimize for high binding affinity. This combination of structural precision and interaction-based guidance enables PocketFlow to generate highly accurate and functional protein pockets, making it a major advancement in the field.
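The idea of defining a pocket by the residues closest to the ligand can be sketched in a few lines. The 3.5 Å cutoff, the residue names, and the coordinates are assumptions chosen for illustration; the paper's exact definition may differ.

```python
import math


def pocket_residues(residue_atoms, ligand_atoms, cutoff=3.5):
    """Return IDs of residues with at least one atom within `cutoff` of any ligand atom.

    residue_atoms: dict mapping residue ID -> list of (x, y, z) atom coordinates.
    ligand_atoms: list of (x, y, z) ligand atom coordinates.
    cutoff: distance threshold in the same units as the coordinates (assumed value).
    """
    pocket = []
    for residue_id, atoms in residue_atoms.items():
        closest = min(math.dist(a, l) for a in atoms for l in ligand_atoms)
        if closest <= cutoff:
            pocket.append(residue_id)
    return pocket


# Hypothetical coordinates (in angstroms) for three residues and a two-atom ligand.
residue_atoms = {
    "ALA10": [(0.0, 0.0, 0.0), (1.5, 0.0, 0.0)],
    "GLY11": [(10.0, 0.0, 0.0)],
    "SER12": [(0.0, 3.0, 0.0)],
}
ligand_atoms = [(0.0, 0.0, 2.0), (1.0, 1.0, 2.0)]

pocket = pocket_residues(residue_atoms, ligand_atoms)
print(pocket)
```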
ML for Protein Structures and Sequences
Towards deep learning sequence-structure co-generation for protein design
This paper explores the potential of deep generative models in the field of protein design. These models, trained on the natural diversity of biomolecules, have the ability to generate new, functional proteins by leveraging patterns in protein sequences and structures. Proteins are composed of amino acid sequences that dictate their 3D structure and, in turn, their function. The paper outlines two primary approaches to protein design: structure-based models, which generate a structural backbone first, and sequence-based models, which directly generate protein sequences. The generative modeling process for protein design can be visualized in a few steps. First, natural proteins represent biased samples from the vast distribution of possible sequences, with each protein’s sequence determining its structure and function. Structure-based models focus on building the backbone atoms common to all amino acids, while sequence-based models predict the side chains that determine the identity of each residue. Both approaches have their advantages, but they also come with limitations. To address these, the paper highlights sequence-structure co-generation as the next evolution: models that generate protein sequences and their corresponding structures simultaneously, unifying both kinds of prediction into a more integrated and efficient approach, with the aim of producing more accurate and diverse protein designs.
EvoSeq-ML: Advancing Data-Centric Machine Learning with Evolutionary-Informed Protein Sequence Representation and Generation
This paper emphasizes the crucial role of high-quality data in protein engineering, particularly in the context of how data curation affects the performance of machine learning (ML) models. While advancements like AlphaFold have revolutionized the field, this paper introduces a novel, data-centric approach that integrates ancestral sequence reconstruction (ASR) into ML-driven protein design. ASR uses computational methods to infer ancient protein sequences, which are often more stable, diverse, and rich in evolutionary information compared to modern sequences found in typical public databases. The process of generating this high-quality evolutionary data is a key aspect. Starting with the phylogenetic tree for the protein family of interest, ancestral sequences are inferred through methods like IQ-TREE and BAli-Phy, which co-estimate phylogeny and alignment while providing an ensemble of possible ancestral sequences. These ancestral sequences, known for their stability and evolvability, serve as a rich training dataset for ML models focused on sequence generation and family-specific protein representation. This integration of evolutionary data leads to improved model performance, particularly in applications involving proteins like Lysozyme C and Endolysin, showcasing the potential of combining evolutionary insights with modern computational techniques.
Functional Protein Mining with Conformal Guarantees
This paper presents a novel method to enhance protein function prediction and homology detection by incorporating conformal prediction principles. Traditional protein search methods often lack statistical reliability, which complicates the selection of proteins for further study. The proposed framework offers a solution by providing statistical guarantees, allowing for calibrated probabilities that help users manage risks, such as false discovery rates, more effectively. The design of this method involves using a protein homology search model to compare a query sequence against a database, generating similarity scores. These scores are calibrated against a threshold, and only those above the threshold are included in the retrieval set. The process also involves computing similarity scores on calibration data to establish statistical guarantees, enhancing the interpretability and confidence in protein search results. For example, by assigning a hierarchical loss score when mismatches occur, the method can quantify how far a retrieved protein deviates from the true function, ensuring a more robust and informed search process. This approach enables the assignment of reliable functional probabilities to proteins with unknown functions and improves enzyme classification accuracy, all without requiring new models.
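As a rough illustration of how calibration data turns a similarity score into a statistical guarantee, here is a minimal split conformal sketch. It assumes we have, for a set of calibration queries, the similarity score of each query's known true match, and it picks a threshold so that a new query's true match clears it with probability at least 1 - alpha (assuming calibration and new queries are exchangeable). This is a simpler coverage guarantee than the risk control described in the paper, and all names and numbers are hypothetical.

```python
import math


def conformal_threshold(true_match_scores, alpha):
    """Similarity threshold from calibration scores of known true matches.

    Uses the floor(alpha * (n + 1))-th smallest calibration score, which is the
    standard finite-sample split conformal quantile expressed on similarities.
    """
    n = len(true_match_scores)
    k = math.floor(alpha * (n + 1))
    if k < 1:
        # Too few calibration points for this alpha: only retrieving everything is valid.
        return float("-inf")
    return sorted(true_match_scores)[k - 1]


def retrieve(hits, threshold):
    """Keep database hits whose similarity is at or above the threshold."""
    return [name for name, score in hits.items() if score >= threshold]


# Hypothetical calibration scores and search hits for one new query.
calibration_scores = [0.91, 0.55, 0.78, 0.62, 0.88, 0.70, 0.95, 0.81, 0.67]
threshold = conformal_threshold(calibration_scores, alpha=0.2)

hits = {"protA": 0.93, "protB": 0.60, "protC": 0.62}
retrieved = retrieve(hits, threshold)
print(threshold, retrieved)
```

A smaller alpha pushes the threshold down, so the retrieval set grows in exchange for a stronger guarantee that the true match is included.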
SymProFold: Structural prediction of symmetrical biological assemblies
This paper presents the SymProFold pipeline, a tool designed to predict symmetrical protein assemblies, such as bacterial S-layers and viral capsids, using AlphaFold-Multimer’s accurate predictions. Symmetry is a key feature in many biological systems, playing crucial roles in cell adhesion, immune evasion, and environmental protection. However, studying these structures experimentally can be difficult due to their self-assembling nature and high sequence variability. SymProFold addresses this by predicting two-dimensional S-layer arrays and spherical viral capsids, testing various symmetry operations (p1, p2, p3, p4, p6) to identify the most probable assembly. The predictions were validated against experimental data and confirmed through crystal structures, making SymProFold a powerful tool for understanding and leveraging protein symmetry.
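For readers less familiar with the symmetry labels, the sketch below shows only the rotational part of each listed plane group: p1, p2, p3, p4, and p6 correspond to 1-, 2-, 3-, 4-, and 6-fold rotation axes. It generates the rotated copies of a single 2D subunit position around an axis at the origin. Lattice translations and SymProFold's actual prediction and ranking steps are deliberately left out, and the function name and coordinates are hypothetical.

```python
import math

# Order of the rotation axis for each plane group mentioned above.
ROTATION_ORDER = {"p1": 1, "p2": 2, "p3": 3, "p4": 4, "p6": 6}


def symmetric_copies(point, group):
    """Rotate a 2D point about the origin to produce all copies for `group`."""
    n = ROTATION_ORDER[group]
    x, y = point
    copies = []
    for k in range(n):
        theta = 2 * math.pi * k / n
        copies.append((
            x * math.cos(theta) - y * math.sin(theta),
            x * math.sin(theta) + y * math.cos(theta),
        ))
    return copies


# Hypothetical subunit centre at (1, 0); list its copies under each symmetry.
for group in ROTATION_ORDER:
    copies = symmetric_copies((1.0, 0.0), group)
    print(group, [(round(cx, 3), round(cy, 3)) for cx, cy in copies])
```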
Open Source for Proteins
Technical Report of HelixFold3 for Biomolecular Structure Prediction
The AlphaFold series has revolutionized protein structure prediction by providing remarkably accurate models that often rival experimental methods. AlphaFold2 and AlphaFold-Multimer, which focus on predicting single protein chains and protein complexes, are open-sourced, making them widely accessible for research. However, AlphaFold3, the latest model, is only partially available through a limited online server and has not been open-sourced, which restricts further development and collaboration. To overcome these limitations, the PaddleHelix team has developed HelixFold3, an open-source alternative that aims to replicate AlphaFold3’s capabilities. By building on previous AlphaFold models and large datasets, HelixFold3 achieves similar accuracy in predicting the structures of proteins, nucleic acids, and ligands. Its initial release is available on GitHub, which will likely accelerate advances in biomolecular research.
Think we missed something? Join our community to discuss these topics further!
🎬 Latest Recordings
You can always catch up on previous recordings on our YouTube channel or the Portal Events page!
Portal is the home of the TechBio community. Join here and stay up to date on the latest research, expand your network, and ask questions. In Portal, you can access everything that the community has to offer in a single location:
M2D2 - a weekly reading group to discuss the latest research in AI for drug discovery
LoGG - a weekly reading group to discuss the latest research in graph learning
CARE - a weekly reading group to discuss the latest research in causality
Blogs - tutorials and blog posts written by and for the community
Discussions - jump into a topic of interest and share new ideas and perspectives
See you at the next issue! 👋