Google Launches Muse, A New Text-to-Image Transformer Model

Over the past two years, Artificial Intelligence research has been reshaped by a wave of deep-learning text-to-image models such as DALL-E 2, Stable Diffusion, and Midjourney. Now Google’s Muse joins that list: a text-to-image Transformer model boasting remarkable image generation performance.

Given a text embedding extracted from a pre-trained large language model, Muse is trained to predict randomly masked image tokens in a discrete token space. This approach lets the model reconstruct the hidden tokens with striking accuracy.
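
To make that objective concrete, here is a minimal sketch of a Muse-style masked-token loss. The `model` interface, `mask_id`, and tensor shapes are assumptions for illustration, not Google’s actual code:

```python
import torch
import torch.nn.functional as F

def masked_token_loss(model, image_tokens, text_embedding, mask_id, mask_ratio=0.5):
    # image_tokens:   (batch, seq_len) discrete ids from a VQ tokenizer.
    # text_embedding: (batch, text_len, dim) from a frozen pre-trained LLM.
    batch, seq_len = image_tokens.shape
    # Randomly pick positions to hide, independently per sample.
    mask = torch.rand(batch, seq_len, device=image_tokens.device) < mask_ratio
    inputs = image_tokens.masked_fill(mask, mask_id)   # swap in the [MASK] id
    logits = model(inputs, text_embedding)             # (batch, seq_len, vocab)
    # Cross-entropy is computed only on the masked positions.
    return F.cross_entropy(logits[mask], image_tokens[mask])
```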

Unlike pixel-space diffusion models such as Imagen and DALL-E 2, Muse operates on discrete tokens and needs fewer sampling iterations, making it more efficient. Because generation works by iteratively resampling image tokens conditioned on the text prompt, the same mechanism also enables mask-free editing at no extra cost.
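
A rough sketch of what that iterative resampling loop could look like at inference time. The interface is hypothetical, and a simple linear reveal schedule stands in for the schedule tuned in the paper:

```python
import torch

@torch.no_grad()
def parallel_decode(model, text_embedding, seq_len, mask_id, steps=12):
    # Start from an all-[MASK] canvas and fill it in over a few passes,
    # instead of one forward pass per token as in autoregressive decoding.
    batch = text_embedding.shape[0]
    tokens = torch.full((batch, seq_len), mask_id, dtype=torch.long,
                        device=text_embedding.device)
    for step in range(steps):
        logits = model(tokens, text_embedding)            # (B, seq_len, vocab)
        confidence, prediction = logits.softmax(-1).max(-1)
        # Ignore positions that have already been filled in.
        confidence = confidence.masked_fill(tokens != mask_id, -1.0)
        # Linear schedule: reveal roughly seq_len / steps tokens per pass,
        # keeping the most confident predictions first.
        n_reveal = seq_len * (step + 1) // steps - seq_len * step // steps
        idx = confidence.topk(n_reveal, dim=-1).indices
        tokens.scatter_(1, idx, prediction.gather(1, idx))
    return tokens
```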

In contrast to autoregressive models such as Parti, Muse uses parallel decoding, predicting many tokens per forward pass. The frozen pre-trained LLM supplies fine-grained language understanding, which translates into high-quality image generation and a grasp of visual concepts including objects, their spatial relationships, pose, and cardinality, just to name a few. And because generation is simply iterative mask-and-predict, Muse supports inpainting, outpainting, and mask-free editing without fine-tuning or inverting the model, as the sketch below shows.
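
Editing falls out of the same machinery: tokenize the image, replace the tokens under the region you want to change with [MASK], and rerun a decoding loop like the one above. A hypothetical sketch, assuming a `vq` tokenizer with `encode`/`decode` methods:

```python
import torch

@torch.no_grad()
def inpaint(model, vq, image, region_mask, text_embedding, mask_id, steps=8):
    # region_mask: boolean (16, 16) grid marking which token cells to redraw.
    tokens = vq.encode(image)                          # (1, 256) token ids
    hole = region_mask.flatten().unsqueeze(0)          # (1, 256) bool
    tokens = tokens.masked_fill(hole, mask_id)         # hide the edit region
    n_hole = int(hole.sum())
    for step in range(steps):
        logits = model(tokens, text_embedding)
        confidence, prediction = logits.softmax(-1).max(-1)
        # Only positions inside the hole are still [MASK]; refill the most
        # confident ones first, leaving the rest of the image untouched.
        confidence = confidence.masked_fill(tokens != mask_id, -1.0)
        n = n_hole * (step + 1) // steps - n_hole * step // steps
        idx = confidence.topk(n, dim=-1).indices
        tokens.scatter_(1, idx, prediction.gather(1, idx))
    return vq.decode(tokens)
```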

Google reports that its 900M-parameter model sets a new State-of-the-Art (SOTA) on CC3M with an impressive FID score of 6.06. Additionally, the 3B-parameter Muse model reaches a remarkable zero-shot FID of 7.88 on COCO, along with an equally impressive CLIP score of 0.32!

In the base Transformer, the text encoder produces a text embedding that the image tokens cross-attend to, while a VQ tokenizer, trained on lower-resolution (256×256) imagery, encodes each image into a 16×16 grid of discrete tokens; the super-resolution stage uses a second tokenizer at higher resolution.
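
As an illustration of how the text conditioning enters the model, here is a toy pre-norm Transformer block in which image tokens first self-attend and then cross-attend to the text embedding. This is a generic block for intuition, not Google’s exact layer, and it assumes the text embedding shares the model dimension:

```python
import torch
from torch import nn

class ImageTextBlock(nn.Module):
    # Toy pre-norm Transformer block: image tokens attend to each other,
    # then use the text embedding as keys/values via cross-attention.
    def __init__(self, dim, heads=8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)
        self.norm3 = nn.LayerNorm(dim)
        self.self_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, img_tokens, text_embedding):
        x = self.norm1(img_tokens)
        img_tokens = img_tokens + self.self_attn(x, x, x)[0]
        # Queries come from image tokens, keys/values from the text embedding.
        x = self.norm2(img_tokens)
        img_tokens = img_tokens + self.cross_attn(x, text_embedding, text_embedding)[0]
        return img_tokens + self.mlp(self.norm3(img_tokens))
```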

During training, tokens are randomly masked per sample and a cross-entropy loss teaches the model to predict the hidden tokens. Once the base model is trained, the reconstructed lower-resolution tokens are fed, together with the text tokens, into a super-resolution model that takes the output up another notch, predicting the masked elements of the higher-resolution grid with greater precision.
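
A sketch of that second training stage: the super-resolution model conditions on the low-resolution tokens and the text, and is trained with the same masked cross-entropy objective on the high-resolution token grid. All names and shapes here are assumptions:

```python
import torch
import torch.nn.functional as F

def superres_loss(sr_model, lowres_tokens, highres_tokens, text_embedding,
                  mask_id, mask_ratio=0.5):
    # lowres_tokens:  (batch, 16*16) ids reconstructed by the base model.
    # highres_tokens: (batch, 64*64) ids from the higher-resolution tokenizer.
    mask = torch.rand(highres_tokens.shape,
                      device=highres_tokens.device) < mask_ratio
    inputs = highres_tokens.masked_fill(mask, mask_id)
    # Condition on both the low-res tokens and the text embedding.
    logits = sr_model(inputs, lowres_tokens, text_embedding)
    return F.cross_entropy(logits[mask], highres_tokens[mask])
```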

Muse’s performance compares favorably with other text-to-image models. What sets it apart is its ability to generate images in a fraction of the time, with fewer sampling iterations and high fidelity. We are excited to see what Muse can do next.

So, if you’re looking for a tool that generates photo-realistic images from text prompts quickly and accurately, Muse is the model you are looking for! Check it out and see how quickly it can meet your needs.

Don’t miss the chance to experience the future of image generation right now! Get started with Muse and get ready to be blown away.
