Are you struggling to keep up with so many things happening in the ML world? Do you want to know what was new this week? This is a (hopefully) concise summary of what happened in the ecosystem - to the point that you can get a high-level picture in under 10 minutes. If you have any feedback, send me a message on Twitter. Let’s go! 🚀
This week features Stable Diffusion on TPUs, Google’s UL2 paradigm for training LLMs, more applications of diffusion models for molecules, cool NeRF updates, and more!
Core Diffusion Models
The stream of diffusion model projects is nonstop! A team at Stanford and Google published a paper on distilling classifier-free guided diffusion models into students that generate samples in just 1-4 sampling steps, which can be used for image and video generation. Another interesting project is GENIE from NVIDIA, which distills higher-order score information to speed up sampling.
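For intuition, here is a minimal sketch of the first distillation stage (my own illustration, not the paper’s implementation; `teacher`, `student`, and their signatures are hypothetical stand-ins): the student learns to match, in one forward pass, the guided output that normally requires two teacher passes per step.

```python
import torch

def teacher_guided_eps(teacher, x_t, t, cond, w):
    # Classifier-free guidance: combine conditional and unconditional
    # teacher predictions with guidance weight w (two forward passes).
    eps_cond = teacher(x_t, t, cond)
    eps_uncond = teacher(x_t, t, None)
    return eps_uncond + w * (eps_cond - eps_uncond)

def distill_step(student, teacher, optimizer, x_t, t, cond, w):
    # The w-conditioned student learns to reproduce the guided teacher
    # output in a single forward pass, halving the per-step cost; the
    # paper then applies progressive distillation to reach 1-4 steps.
    with torch.no_grad():
        target = teacher_guided_eps(teacher, x_t, t, cond, w)
    loss = torch.mean((student(x_t, t, cond, w) - target) ** 2)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```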
The Hugging Face team released a new version of their diffusers library, including JAX and TPU support for Stable Diffusion, which can generate eight images in parallel on a Google Colab TPU. This setup was used in their demo, which generated over 50 million images in the last few weeks.
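Using the pipeline looks roughly like this (a sketch based on the diffusers documentation; the exact model ID, revision, and dtype may differ from your setup):

```python
import jax
import jax.numpy as jnp
from flax.jax_utils import replicate
from flax.training.common_utils import shard
from diffusers import FlaxStableDiffusionPipeline

pipeline, params = FlaxStableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", revision="bf16", dtype=jnp.bfloat16
)

# One prompt per TPU core: a Colab TPU v2-8 has 8 cores, hence 8 images.
prompts = ["a photo of an astronaut riding a horse"] * jax.device_count()
prompt_ids = pipeline.prepare_inputs(prompts)

params = replicate(params)       # copy the weights to every device
prompt_ids = shard(prompt_ids)   # split the batch across devices
rng = jax.random.split(jax.random.PRNGKey(0), jax.device_count())

images = pipeline(prompt_ids, params, rng, jit=True).images
```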
Last week we talked about speeding up Stable Diffusion. Using Flash Attention, a Stanford team achieved inference 3-4x faster than the original model (and 33% faster than the previous optimized version) (code).
Diffusion Applications
Lots of exciting updates! Let’s go through a series of interesting applications, tools, and research papers:
CycleDiffusion shows that text-to-image diffusion models can also be used as zero-shot image-to-image editors (code, more code, paper).
Run DreamBooth on an 8GB GPU using DeepSpeed and 🤗 Accelerate.
DreamSpace is a nice prompt diagramming tool.
DAAM…who comes up with these names? This nice paper shows how to create attribution maps to understand the impact of different tokens on different parts of the generated image (paper, demo).
Stanford fine-tuned Stable Diffusion to generate medical images such as chest X-rays (paper, website).
MotionDiffuse is a cool project about motion generation using diffusion models (code, paper, website, demo).
Stable Diffusion VR: a cool real-time immersive demo built with TouchDesigner.
Last week I shared a repo for making music videos with Stable Diffusion. Now there is a very nice blog post about it.
Large Language Models
The research on pre-training techniques for language models is hot these days! Google introduced UL2, a pre-training paradigm that uses a mixture of training objectives (blog post, checkpoints).
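As a rough illustration (a toy sketch, not the paper’s T5X implementation; the span-length distributions and rates here are simplified), each training example is corrupted by one of three denoiser families, signaled by a mode token:

```python
import random

# Illustrative settings only; UL2 mixes several variants of each denoiser,
# and the exact rates and span lengths in the paper differ.
DENOISERS = {
    "[R]": {"mean_span": 3,  "rate": 0.15},  # regular span corruption
    "[X]": {"mean_span": 32, "rate": 0.50},  # extreme corruption
    "[S]": {"prefix_lm": True},              # sequential (prefix-LM) denoising
}

def make_ul2_example(tokens):
    """Build one (inputs, targets) pair under a randomly chosen denoiser,
    prepending the mode token so the model knows which objective is active."""
    mode, cfg = random.choice(list(DENOISERS.items()))
    if cfg.get("prefix_lm"):
        # S-denoising: predict the suffix given a random-length prefix.
        cut = random.randint(1, len(tokens) - 1)
        return [mode] + tokens[:cut], tokens[cut:]
    # R/X-denoising: replace random spans with sentinels, reconstruct them.
    inputs, targets, i, sid = [], [], 0, 0
    budget = int(len(tokens) * cfg["rate"])
    while i < len(tokens):
        if budget > 0 and random.random() < cfg["rate"]:
            span = min(max(1, int(random.expovariate(1 / cfg["mean_span"]))), budget)
            sentinel = f"<extra_id_{sid}>"
            inputs.append(sentinel)
            targets += [sentinel] + tokens[i:i + span]
            i, budget, sid = i + span, budget - span, sid + 1
        else:
            inputs.append(tokens[i])
            i += 1
    return [mode] + inputs, targets
```

A model trained this way can then be prompted with the matching mode token at inference time to select the behavior best suited to the downstream task.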
AllenAI released RL4LM, a library to fine-tune language models using RL (website, paper). It also ships the GRUE benchmark of 6 NLP tasks, with 2000 experiments and multiple metrics that can be used as reward functions.
The BigScience team wrote a very nice blog post about achieving high-scale inference with a 5x latency reduction, covering pipeline parallelism, TPU execution, DeepSpeed, ONNX, and more! On the topic of fast inference, GLM-130B by THUDM (code, paper, demo) is a bilingual pre-trained model with 130 billion parameters designed for fast inference, with techniques that allow running it on consumer GPUs such as RTX 3090s!
Other updates:
Google proposed fine-tuning LLMs for HTML understanding capabilities, such as semantic classification and navigation (paper, website).
ReAct combines an LLM’s capabilities for reasoning and acting (paper); a toy sketch of the loop follows this list.
Mind’s Eye explores combining LLMs with physics engines to improve reasoning based on simulations (paper).
The Online Language Modeling (OLM) project aims to train LMs on the latest Common Crawl and Wikipedia snapshots.
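As promised above, here is a toy ReAct-style loop (all helpers are hypothetical stand-ins for a real LLM call and tool executor, not an official API): the model alternates free-text "Thought" reasoning with "Action" calls to external tools, and each "Observation" is fed back into the context.

```python
def react(llm_complete, run_tool, question, max_steps=6):
    # llm_complete: str -> str, generating one "Thought: ... Action: ..." pair.
    # run_tool: executes an action string such as "Search[Colorado orogeny]".
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm_complete(transcript)   # model reasons, then picks an action
        transcript += step
        action = step.split("Action:")[-1].strip()
        if action.startswith("Finish["):
            return action[len("Finish["):-1]   # the final answer
        # Feed the tool result back so the next thought can build on it.
        transcript += f"\nObservation: {run_tool(action)}\n"
    return None
```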
Molecules and Proteins
Did you think we were done with diffusion models? Here they come again!
DiffDock is a diffusion method for molecular docking (paper, demo).
DiffLinker is a diffusion model that generates molecular linkers between disconnected fragments in 3D (paper).
EquiFold predicts protein structure by iteratively refining a newly proposed structure representation that preserves atom-level resolution (paper).
SQUID is a generative model that generates molecules that fit a target shape (paper).
Reinforcement Learning and Simulations
One very exciting update from this week: Meta trained models that achieve expert-human-level play in Diplomacy and Hanabi, games that involve cooperation, which is usually a big challenge in RL (paper, more paper).
This research explores whether RL models like AlphaZero have scaling laws similar to LLMs. Spoiler alert: they do (paper).
Hora by UC Berkeley is a single policy for rotating objects with a robotic hand. It is trained entirely in simulation, without cameras or touch sensors - interesting! (code, paper, website)
Computer Vision
Lately, I have seen exciting updates in the NeRF space. For example, BAIR released nerfacc, an acceleration toolbox for fast differentiable volumetric rendering with a nice Python API (code, docs, paper). nerfacc is used under the hood in nerfstudio, another very exciting tool in the space! A third thing to share is NeRF2Real, a project from DeepMind that turns 5-minute videos into simulations where they train vision-guided robots.
Let’s share some high-level updates:
Diffusion models again! This project uses diffusion models to generate images from markup! (code, paper, demo).
Pix2Struct is a model by Google pretrained to parse screenshots into simplified HTML. It is then finetuned on tasks such as Visual Question Answering with documents, UIs, illustrations, and images (paper).
Not new, but LAION-5B, a large dataset of image-text pairs used for many models such as Stable Diffusion, has a cool paper that will be presented at NeurIPS.
Omni3D, a benchmark and model for 3D object detection, is open-sourced! (code, paper, website).
Self pose is a self-supervised technique for 6D pose estimation (website, paper).
Pix2Seq-D proposes a way to cast panoptic segmentation as a discrete data generation task conditioned on pixels. This is an interesting proposal, as it uses a generic architecture and loss function. Under the hood, it uses Bit Diffusion to generate the panoptic segmentation (see the toy sketch after this list).
EVA3D is a generative model that can generate 3D humans from 2D images (code, paper, website).
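As promised above, here is a toy sketch of Bit Diffusion’s "analog bits" idea (my own illustration, not the Pix2Seq-D code): discrete labels are unpacked into binary bits, treated as real numbers for continuous diffusion, and thresholded back into integers at the end.

```python
import numpy as np

def ints_to_analog_bits(x, num_bits=8, scale=1.0):
    # Unpack each integer into its binary bits, then map {0, 1} -> {-1, +1}
    # so a continuous diffusion model can treat them as real values.
    bits = (x[..., None] >> np.arange(num_bits)) & 1
    return (bits.astype(np.float32) * 2.0 - 1.0) * scale

def analog_bits_to_ints(bits):
    # Threshold the denoised real-valued bits and repack them into integers.
    hard = (bits > 0).astype(np.int64)
    return (hard << np.arange(bits.shape[-1])).sum(axis=-1)

labels = np.array([3, 200, 42])
assert (analog_bits_to_ints(ints_to_analog_bits(labels)) == labels).all()
```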
ML for Audio
The amazing team at NVIDIA released NeMo 1.12. NeMo is a very cool toolkit for audio tasks such as Automatic Speech Recognition and Text-to-Speech. Hugging Face released support for OpenAI Whisper, a very powerful model that can do Automatic Speech Recognition and even translate the transcription on top of it!
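Getting a transcription with transformers looks roughly like this (a minimal sketch following the documented API; the checkpoint size and sample dataset are my choices):

```python
from datasets import load_dataset
from transformers import WhisperProcessor, WhisperForConditionalGeneration

processor = WhisperProcessor.from_pretrained("openai/whisper-small")
model = WhisperForConditionalGeneration.from_pretrained("openai/whisper-small")

# A tiny public dataset with 16 kHz audio, handy for quick tests.
ds = load_dataset("hf-internal-testing/librispeech_asr_dummy", "clean", split="validation")
audio = ds[0]["audio"]

inputs = processor(audio["array"], sampling_rate=audio["sampling_rate"], return_tensors="pt")
predicted_ids = model.generate(inputs.input_features)
print(processor.batch_decode(predicted_ids, skip_special_tokens=True)[0])
```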
Misc Readings
swyx published a very interesting post about how open source is eating AI. It discusses open-source replication cycles (e.g., an open-source GPT-3 alternative came out ten months after the original release, while text-to-image took four months). The text-to-image section misses some viral projects such as DALL-E Mini (now Craiyon), which, although not on par quality-wise, brought text-to-image to a wider audience. I recommend reading it; it’s very well written and discusses the different issues and challenges faced in the ecosystem.
Nathan Benaich and Ian Hogarth published the State of AI Report 2022. It’s a comprehensive view of the state of the field, from research to industry and politics.
This week was the Google Workshop on Sparsity and Adaptive Computation, with many speakers from Google, industry labs, and universities talking about scaling language and vision models. All videos are online!
I hope you enjoyed reading this! Feel free to share any feedback on Twitter.