Portal Weekly #48: an ICLR 2024 recap of all things TechBio
Join Portal and stay up to date on research happening at the intersection of tech x bio 🧬
Hi everyone 👋
Welcome to another issue of the Portal newsletter where we provide weekly updates on everything happening in TechBio research! This week we’re changing our normal schedule and format. We’ll be doing a recap of ICLR 2024-associated papers (that we haven’t previously covered!).
💻 Latest Blogs
Peptides can serve as new antibiotics or anticancer drugs, but it can be hard to predict what function a given peptide will have. In the latest blog post, Raúl Fernández Díaz presents AutoPeptideML, a tool for automating key steps in the development cycle of peptide bioactivity predictors, and walks you through its use. Check it out here!
Interested in sharing your work with the community? Fill out our interest form to get started.
💬 Upcoming talks
Talks are paused in the lead-up to the NeurIPS deadline, but check back soon for more talk announcements!
If you enjoy this newsletter, we’d appreciate it if you could forward it to some of your friends! 🙏
Let’s jump right in!👇
📚 Community Reads
We chose a cross-section of papers and talks that stood out to us from ICLR and associated workshops, but can’t cover it all! Use this as a jumping-off point and check out the links to the workshops at the bottom of the section.
Proving Test Set Contamination in Black-Box Language Models
This paper studies the problem of detecting whether a test set is present in the pretraining data of a language model. The main idea behind their approach is that for test sets with some canonical ordering of individual instances (e.g. the order in which the dataset creators released the dataset), a contaminated model will assign significantly higher likelihood to the test set in that order than to random permutations of it.
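The core idea can be sketched as a simple permutation test. Here is a minimal illustrative version (not the paper's actual implementation): `log_likelihood` is a hypothetical callable that scores an ordered sequence of examples under the model, standing in for real LM scoring.

```python
import random

def contamination_p_value(examples, log_likelihood, n_perm=1000, seed=0):
    """Permutation-test sketch: if the model assigns markedly higher
    likelihood to the dataset's canonical ordering than to random
    shufflings, that ordering was plausibly seen during pretraining.

    `log_likelihood` is a hypothetical stand-in for scoring an ordered
    list of examples under the language model.
    """
    rng = random.Random(seed)
    canonical_score = log_likelihood(examples)
    shuffled = list(examples)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        # Count permutations scored at least as high as the canonical order
        if log_likelihood(shuffled) >= canonical_score:
            hits += 1
    # One-sided p-value with add-one smoothing; a small value suggests
    # the canonical ordering is suspiciously likely, i.e. contamination
    return (hits + 1) / (n_perm + 1)
```

An uncontaminated model should be roughly indifferent to ordering, giving a large p-value, while a contaminated one prefers the canonical order and yields a small one.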
Towards Foundational Models for Molecular Learning on Large-Scale Multi-Task Datasets
This paper presents seven new datasets for pushing the boundaries of scale and diversity of supervised labels for molecular learning. It also presents the Graphium graph machine learning library, a useful resource for building and training molecular machine learning models for multi-task and multi-level molecular datasets.
Unified Generative Modeling of 3D Molecules via Bayesian Flow
Bayesian Flow Networks (BFNs) are a recently proposed class of generative models, similar to diffusion models, that can handle discrete and discretized variables as well as continuous ones. The authors suggest using BFNs to sample molecules, showing strong performance on several tasks.
RetroBridge: Modeling Retrosynthesis with Markov Bridges
You can design the best drugs in silico, but you still have to be able to make them. This paper introduces a template-free modeling approach that appears better suited than diffusion models when mapping between two intractable discrete distributions.
PROflow: An iterative refinement model for PROTAC-induced structure prediction
PROTACs are small molecules that trigger the breakdown of traditionally undruggable proteins by binding simultaneously to their targets and degradation-associated proteins. This study presents a pseudo-data generation scheme that enables a model to handle full PROTAC flexibility during constrained protein-protein docking.
Generative Active Learning for the Search of Small-molecule Protein Binders
This paper presents a fast deep reinforcement learning model for searching molecular space for desired properties. Applying the model to search for inhibitors of soluble Epoxide Hydrolase 2 (sEH), they find and synthesize molecules that show sub-micromolar activity in vitro.
Evaluating Representation Learning on the Protein Structure Universe
This paper introduces a benchmark suite for evaluating protein structure representation learning methods and their application in downstream tasks, incorporating pre-training methods, downstream tasks, and pre-training corpora.
DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genomes
This paper improves upon DNABERT by replacing k-mer tokenization with byte pair encoding (BPE) and incorporating recent improvements in transformer architecture. The authors release a multi-species dataset for benchmarking and report that the model achieves results roughly comparable to HyenaDNA on inputs of up to 32k tokens.
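To see why the tokenization change matters, here is a toy comparison (an illustrative sketch, not the actual DNABERT-2 tokenizer): overlapping k-mers produce many redundant tokens that share k-1 characters, while BPE merges frequent pairs into variable-length, non-overlapping tokens.

```python
def kmer_tokenize(seq, k=6):
    """Overlapping k-mer tokenization (as in the original DNABERT):
    adjacent tokens share k-1 characters, which leaks information
    between neighbors and inflates sequence length."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]

def bpe_tokenize(seq, merges):
    """Minimal byte-pair-encoding sketch: apply a learned list of
    pair merges in order, yielding non-overlapping tokens. A real
    BPE tokenizer learns `merges` from corpus pair frequencies."""
    tokens = list(seq)
    for a, b in merges:
        merged, i = [], 0
        while i < len(tokens):
            # Merge this adjacent pair wherever it occurs
            if i + 1 < len(tokens) and (tokens[i], tokens[i + 1]) == (a, b):
                merged.append(a + b)
                i += 2
            else:
                merged.append(tokens[i])
                i += 1
        tokens = merged
    return tokens
```

For a genome-scale vocabulary, the non-overlapping BPE tokens cover the same sequence with far fewer, more informative tokens than overlapping k-mers.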
Many ICLR 2024 conference/workshop papers have associated blog posts or reading group talks on Portal! Check them out for a complementary perspective on the papers:
CellPLM: Pre-training of Cell Language Model Beyond Single Cells (paper here)
BEND - Benchmarking DNA Language Models on Biologically Meaningful Tasks (paper and code here)
SaProt: Protein Language Modeling with Structure-aware Vocabulary (paper here)
Removing Biases from Molecular Representations via Information Maximization (paper here)
Protein Discovery with Discrete Walk-Jump Sampling (paper here)
Re-evaluating Retrosynthesis Algorithms with Syntheseus (paper here)
SE(3)-Stochastic Flow Matching for Protein Backbone Generation (paper here)
On the Scalability of GNNs for Molecular Graphs (paper here)
There was plenty more interesting research being presented, so check out the following websites for more:
Generative and Experimental Perspectives for Biomolecular Design (GEM-2024) workshop (the papers are here)
Machine Learning for Genomics Explorations (MLGenX) workshop (the papers are here)
Think we missed something? Join our community to discuss these topics further!
🎬 Latest Recordings
LoGG
KAN: Kolmogorov-Arnold Networks by Ziming Liu
CARE
Synthetic Combinations: A Causal Inference Framework for Combinatorial Interventions by Anish Agarwal
You can always catch up on previous recordings on our YouTube channel or the Valence Portal Events page!
Portal is the home of the TechBio community. Join here and stay up to date on the latest research, expand your network, and ask questions. In Portal, you can access everything that the community has to offer in a single location:
M2D2 - a weekly reading group to discuss the latest research in AI for drug discovery
LoGG - a weekly reading group to discuss the latest research in graph learning
CARE - a weekly reading group to discuss the latest research in causality
Blogs - tutorials and blog posts written by and for the community
Discussions - jump into a topic of interest and share new ideas and perspectives
See you at the next issue! 👋