DESIGN OF AUGMENTED DIFFUSION MODEL FOR TEXT-TO-IMAGE REPRESENTATION


Subuhi Kashif Ansari
Rakesh Kumar

Abstract

Generating images from natural-language descriptions is an exciting but demanding challenge. This research presents a new architectural paradigm for Text-to-Image (T2I) generation and demonstrates that a well-designed neural architecture can reach state-of-the-art (SOTA) performance with a single generator and a single discriminator trained in one stage. The paper also serves as a call to action for researchers in T2I conversion, a field that has only begun to explore novel neural architectures. This work adopts a Contrastive Language-Image Pretraining (CLIP) + GAN approach to T2I generation, which optimises in the latent space of an off-the-shelf Generative Adversarial Network (GAN) to find images with the highest possible semantic relevance score for the input text, as determined by the CLIP model. In contrast to conventional approaches that train generative models from scratch to map text to images, the CLIP+GAN technique is zero-shot, flexible, and requires no training. Building on these methods, this research proposes an enhanced GAN that improves the CLIP+GAN approach and introduces a more robust CLIP score. Given varied input text, the method produces high-quality images with diverse objects, creative styles, and backdrops, as well as novel counterfactual concepts that do not occur in the training data of the GAN used here. On the constructed dataset, the images produced by the proposed methodology achieve top-level Inception Score (IS) and Frechet Inception Distance (FID) values quantitatively, without any further architectural design or training.
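To make the described procedure concrete, the following is a minimal sketch of CLIP-guided latent optimisation over a pretrained GAN, in the spirit of the CLIP+GAN approach summarised above. The names `generator`, `clip_model`, and `tokenize` are assumptions standing in for any pretrained GAN generator and any CLIP-style model exposing `encode_image` / `encode_text`; the actual models, hyperparameters, and the enhanced CLIP score used in this work are not shown.

```python
import torch
import torch.nn.functional as F

def clip_guided_search(generator, clip_model, tokenize, prompt,
                       latent_dim=512, steps=300, lr=0.05, device="cpu"):
    """Optimise a GAN latent code so that the generated image maximises
    CLIP similarity to the input text; neither model is retrained."""
    # Encode the text prompt once; it stays fixed during optimisation.
    text_tokens = tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)

    # The only trainable parameter is the latent code z.
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z)  # synthesise an image from the current latent
        # Resize to the CLIP input resolution (preprocessing simplified here).
        image_224 = F.interpolate(image, size=(224, 224),
                                  mode="bilinear", align_corners=False)
        image_feat = F.normalize(clip_model.encode_image(image_224), dim=-1)
        # Negative cosine similarity: lower loss means a better text-image match.
        loss = -(image_feat * text_feat).sum()
        loss.backward()
        optimizer.step()

    return generator(z).detach()  # the image found for the prompt
```

This illustrates why the approach is zero-shot: only the latent vector is updated, so any text prompt can be handled without retraining the generator or CLIP.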
