Portal Weekly #48: an ICLR 2024 recap of all things TechBio
Join Portal and stay up to date on research happening at the intersection of tech x bio 🧬
Hi everyone 👋
Welcome to another issue of the Portal newsletter where we provide weekly updates on everything happening in TechBio research! This week we’re changing our normal schedule and format. We’ll be doing a recap of ICLR 2024-associated papers (that we haven’t previously covered!).
💻 Latest Blogs
Peptides can serve as new antibiotics or anticancer drugs, but it can be hard to predict what function a given peptide will have. In the latest blog post, Raúl Fernández Díaz presents AutoPeptideML, a tool for automating key steps in the development cycle of peptide bioactivity predictors, and walks you through its use. Check it out here!
Interested in sharing your work with the community? Fill out our interest form to get started.
💬 Upcoming talks
Talks are paused in the lead-up to the NeurIPS deadline, but check back soon for more talk announcements!
If you enjoy this newsletter, we’d appreciate it if you could forward it to some of your friends! 🙏
Let’s jump right in!👇
📚 Community Reads
We chose a cross-section of papers and talks that stood out to us from ICLR and associated workshops, but can’t cover it all! Use this as a jumping-off point and check out the links to the workshops at the bottom of the section.
Proving Test Set Contamination in Black-Box Language Models
This paper studies the problem of detecting whether a test set is present in the pretraining data of a language model. The main idea behind their approach is that for test sets with some canonical ordering of individual instances (e.g. the order in which the dataset creators released the dataset), a contaminated model will assign significantly higher likelihood to the test set in that order than to random permutations of it.
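The core idea can be sketched as a simple permutation test. Here is a minimal illustrative version (not the paper's actual implementation): `log_likelihood` is a hypothetical callable that scores an ordered sequence of examples under the model, standing in for real LM scoring.

```python
import random

def contamination_p_value(examples, log_likelihood, n_perm=1000, seed=0):
    """Permutation-test sketch: if the model assigns markedly higher
    likelihood to the dataset's canonical ordering than to random
    shufflings, that ordering was plausibly seen during pretraining.

    `log_likelihood` is a hypothetical stand-in for scoring an ordered
    list of examples under the language model.
    """
    rng = random.Random(seed)
    canonical_score = log_likelihood(examples)
    shuffled = list(examples)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations scored at least as high as the canonical order
        if log_likelihood(shuffled) >= canonical_score:
            hits += 1
    # One-sided p-value with add-one smoothing; a small value suggests
    # the canonical ordering is suspiciously likely, i.e. contamination
    return (hits + 1) / (n_perm + 1)
```

An uncontaminated model should be roughly indifferent to ordering, giving a large p-value, while a contaminated one prefers the canonical order and yields a small one.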
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
This paper presents seven new datasets for pushing the boundaries of scale and diversity of supervised labels for molecular learning. It also presents the Graphium graph machine learning library, a useful resource for building and training molecular machine learning models for multi-task and multi-level molecular datasets.
Unified Generative Modeling of 3D Molecules via Bayesian Flow
Bayesian Flow Networks (BFNs) are a recently proposed class of generative models, similar to diffusion models, that can handle discrete and discretized variables as well as continuous ones. The authors suggest using BFNs to sample molecules, showing strong performance on several tasks.
RetroBridge: Modeling Retrosynthesis with Markov Bridges
You can design the best drugs in silico, but you still have to be able to make them. This paper introduces a template-free modeling approach that appears better suited than diffusion models when mapping between two intractable discrete distributions.
PROflow: An iterative refinement model for PROTAC-induced structure prediction
PROTACs are small molecules that trigger the breakdown of traditionally undruggable proteins by binding simultaneously to their targets and degradation-associated proteins. This study presents a pseudo-data generation scheme that enables a model to handle full PROTAC flexibility during constrained protein-protein docking.
Generative Active Learning for the Search of Small-molecule Protein Binders
This paper presents a fast deep reinforcement learning model for searching molecular space for desired properties. Applying the model to search for inhibitors of soluble Epoxide Hydrolase 2 (sEH), they find and synthesize molecules that show sub-micromolar activity in vitro.
Evaluating Representation Learning on the Protein Structure Universe
This paper introduces a benchmark suite for evaluating protein structure representation learning methods and their application in downstream tasks, incorporating pre-training methods, downstream tasks, and pre-training corpora.
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes
This paper improves upon DNABERT by replacing k-mer tokenization with byte pair encoding (BPE) and incorporating recent improvements in transformer architecture. The authors release a multi-species dataset for benchmarking and report that the model achieves results roughly comparable to HyenaDNA on inputs of up to 32k tokens.
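To see why the tokenization change matters, here is a toy comparison (an illustrative sketch, not the actual DNABERT-2 tokenizer): overlapping k-mers produce many redundant tokens that share k-1 characters, while BPE merges frequent pairs into variable-length, non-overlapping tokens.

```python
def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokenization (as in the original DNABERT):
    adjacent tokens share k-1 characters, which leaks information
    between neighbors and inflates sequence length."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokenize(seq, merges):
    """Minimal byte-pair-encoding sketch: apply a learned list of
    pair merges in order, yielding non-overlapping tokens. A real
    BPE tokenizer learns `merges` from corpus pair frequencies."""
    tokens = list(seq)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge this adjacent pair wherever it occurs
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For a genome-scale vocabulary, the non-overlapping BPE tokens cover the same sequence with far fewer, more informative tokens than overlapping k-mers.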
Many ICLR 2024 conference/workshop papers have associated blog posts or reading group talks on Portal! Check them out for a complementary perspective on the papers:
CellPLM: Pre-training of Cell Language Model Beyond Single Cells (paper here)
BEND - Benchmarking DNA Language Models on Biologically Meaningful Tasks (paper and code here)
SaProt: Protein Language Modeling with Structure-aware Vocabulary (paper here)
Removing Biases from Molecular Representations via Information Maximization (paper here)
Protein Discovery with Discrete Walk-Jump Sampling (paper here)
Re-evaluating Retrosynthesis Algorithms with Syntheseus (paper here)
SE(3)-Stochastic Flow Matching for Protein Backbone Generation (paper here)
On the Scalability of GNNs for Molecular Graphs (paper here)
There was plenty more interesting research being presented, so check out the following websites for more:
Generative and Experimental Perspectives for Biomolecular Design (GEM-2024) workshop (the papers are here)
Machine Learning for Genomics Explorations (MLGenX) workshop (the papers are here)
Think we missed something? Join our community to discuss these topics further!
🎬 Latest Recordings
LoGG
KAN: Kolmogorov-Arnold Networks by Ziming Liu
CARE
Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions by Anish Agarwal
You can always catch up on previous recordings on our YouTube channel or the Valence Portal Events page!
Portal is the home of the TechBio community. Join here and stay up to date on the latest research, expand your network, and ask questions. In Portal, you can access everything that the community has to offer in a single location:
M2D2 - a weekly reading group to discuss the latest research in AI for drug discovery
LoGG - a weekly reading group to discuss the latest research in graph learning
CARE - a weekly reading group to discuss the latest research in causality
Blogs - tutorials and blog posts written by and for the community
Discussions - jump into a topic of interest and share new ideas and perspectives
See you at the next issue! 👋