Are you struggling to keep up with everything happening in the ML world? Do you want to know what was new this week? This is a concise roundup of news from the ecosystem, so you can get a high-level picture in under 10 minutes! If you have any feedback, send me a message on Twitter. Let’s go! 🚀
This week features Stable Diffusion 1.5, several papers on editing images with text, and lots of exciting work in simulations.
ML and Creativity
Runway released Stable Diffusion 1.5, and later in the week, StabilityAI released a fine-tuned image decoder. You can combine both to get the highest-quality Stable Diffusion output so far!
Imagic was one of the most exciting updates of the week. Imagic uses diffusion models to edit an image based on a text prompt, without requiring any other input such as masks. The community got it working in a Colab. Check it out here! (paper).
DiffEdit: Another work for image editing based on text, this time by Meta. This technique automatically generates masks highlighting which regions need to be edited (paper).
Prompt-to-prompt is the third project on image editing (code, website, paper).
Runway released Stable Diffusion Inpainting, the latest/best inpainting model (model). Runway also announced a text-to-3D-texture project.
This week many people got roasted by CLIP Interrogator, a project in which you submit an image and it returns a prompt that might be good for creating similar images. Try the demo here.
Lots of exciting work in the music generation space! Check out the Audio and Speech section below.
Natural Language Processing
Google releases Flan-T5, a general-purpose model fine-tuned on more than 1,800 tasks phrased as instructions. For example, you can ask it to translate, do logical reasoning, answer scientific questions, and more! (models, paper, demo)
Google also releases U-PaLM, which proposes training with a new objective that improves the scaling curves and ends up saving a large amount of compute (paper).
Carper announces a partnership with ScaleAI, StabilityAI, Humanloop, Multi, and Hugging Face to create an open-source RL-based instruction-tuned language model (press release).
ScienceQA is a new benchmark of 21k multimodal multiple-choice questions across different science topics. The goal is to train models that learn to generate lectures and explanations as a chain of thought. Very cool! (code, website, paper)
MTEB is a benchmark for evaluating text embeddings across eight tasks covering 56 datasets (code, paper).
Computer Vision
Google publishes MUSIQ, an approach for image quality assessment, a task in which the model outputs a quality score. This task is often approached with CNNs, which impose constraints such as fixed-size input. MUSIQ bypasses this constraint with a transformer architecture and a special embedding system (paper, code, blog).
LION is a diffusion model for 3D shape generation (code, website, paper)
MoRig is a technique that automatically rigs character meshes. (website, paper)
GenSDF is a technique for learning neural signed distance functions that outperforms other methods and can do zero-shot inference on over a hundred unseen classes (code, website, paper).
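If signed distance functions are new to you, here is a minimal sketch of the idea. It uses an analytic sphere, not GenSDF's learned network, and it assumes the common convention of negative values inside the shape and positive values outside. The names `sphere_sdf` and `classify` are made up for illustration.

```python
import math


def sphere_sdf(point, center=(0.0, 0.0, 0.0), radius=1.0):
    """Signed distance from `point` to the surface of a sphere.

    Assumed convention: negative inside, zero on the surface, positive outside.
    """
    return math.dist(point, center) - radius


def classify(distance, tol=1e-9):
    """Turn a signed distance into a human-readable label."""
    if abs(distance) <= tol:
        return "on surface"
    return "inside" if distance < 0 else "outside"


if __name__ == "__main__":
    # Query a few points against a unit sphere centered at the origin.
    for p in [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]:
        d = sphere_sdf(p)
        print(p, d, classify(d))
```

A neural SDF replaces the closed-form formula with a network trained to predict this value for arbitrary shapes.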
IronDepth is a framework for iteratively refining a predicted depth map (code, paper).
Real-time fusion is a thing! By rendering to feature space, this project enables grouping and segmenting similar objects in real time (website, paper).
Reinforcement Learning and Simulations
CommonSim-1 is a neural simulation engine. It can dynamically create 3D environments, which is useful for interactive simulations for training RL agents, games, and much more!
A new research lab, Generally Intelligent, is open-sourcing Avalon, a fast 3D world simulator for RL agents (launch blog).
GriddlyJS is a system that allows designing, building, and debugging RL environments and running policies.
E3B is a new algorithm for exploration in varying environments, where the environment changes in each episode (code, paper, website).
UC Berkeley announces Intrinsic Reward Matching, a framework that unifies unsupervised pretraining and downstream learning (code, paper).
GoalsEye is a system that learns how to match behavior with simple supervised learning (paper, website, blog)
You Only Live Once - this paper from Stanford formalizes the problem of single-life Reinforcement Learning, where an agent must complete a task without interventions, relying only on its previous experience (paper).
Audio and Speech
Meta announced the Universal Speech Translator, a speech-to-speech system that can translate directly between spoken languages (code, blog, demo)
Try out Musika, a system that can generate stereo music. You can train or fine-tune the model on your own music dataset (code, paper, demo).
On the topic of music, Museformer is out! Museformer is a transformer that can generate music (website, paper).
Learning Resources
Jeremy Howard released the first part of the new fastai course, “From Deep Learning Foundations to Stable Diffusion”. Check it out here.
This math book tries to cover all the math needed for Machine Learning. It covers algebra, topology, calculus, and optimization theory.
Stanford's Deep Generative Models course opened up its notes and course materials!
I hope you enjoyed reading this! Feel free to share any feedback on Twitter. Have a great week!