DESIGN OF AUGMENTED DIFFUSION MODEL FOR TEXT-TO-IMAGE REPRESENTATION


Subuhi Kashif Ansari
Rakesh Kumar

Abstract

Generating images from natural-language descriptions is an exciting but demanding challenge. This research presents a new architectural paradigm for Text-to-Image (T2I) generation and demonstrates that a well-designed neural architecture can reach state-of-the-art (SOTA) performance with a single generator and a single discriminator trained in one stage. The paper also serves as a call to action for researchers in T2I conversion, a field that has only begun to explore novel neural architectures. This work adopts a Contrastive Language-Image Pretraining (CLIP) + GAN approach to T2I generation, which optimises in the latent space of an off-the-shelf Generative Adversarial Network (GAN) to find images with the highest possible semantic relevance score for the input text, as determined by the CLIP model. In contrast to conventional approaches that train generative models from scratch to map text to images, the CLIP+GAN technique is zero-shot, flexible, and requires no training. Building on these methods, this research proposes an enhanced GAN that improves the CLIP+GAN approach and introduces a more robust CLIP score. Given varied input text, the method produces high-quality images with diverse objects, creative styles, and backdrops, as well as novel counterfactual concepts that do not occur in the training data of the GAN used here. On the constructed dataset, the images produced by the proposed methodology achieve top-level Inception Score (IS) and Frechet Inception Distance (FID) values quantitatively, without any further architectural design or training.
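To make the described procedure concrete, the following is a minimal sketch of CLIP-guided latent optimisation over a pretrained GAN, in the spirit of the CLIP+GAN approach summarised above. The names `generator`, `clip_model`, and `tokenize` are assumptions standing in for any pretrained GAN generator and any CLIP-style model exposing `encode_image` / `encode_text`; the actual models, hyperparameters, and the enhanced CLIP score used in this work are not shown.

```python
import torch
import torch.nn.functional as F

def clip_guided_search(generator, clip_model, tokenize, prompt,
                       latent_dim=512, steps=300, lr=0.05, device="cpu"):
    """Optimise a GAN latent code so that the generated image maximises
    CLIP similarity to the input text; neither model is retrained."""
    # Encode the text prompt once; it stays fixed during optimisation.
    text_tokens = tokenize([prompt]).to(device)
    with torch.no_grad():
        text_feat = F.normalize(clip_model.encode_text(text_tokens), dim=-1)

    # The only trainable parameter is the latent code z.
    z = torch.randn(1, latent_dim, device=device, requires_grad=True)
    optimizer = torch.optim.Adam([z], lr=lr)

    for _ in range(steps):
        optimizer.zero_grad()
        image = generator(z)  # synthesise an image from the current latent
        # Resize to the CLIP input resolution (preprocessing simplified here).
        image_224 = F.interpolate(image, size=(224, 224),
                                  mode="bilinear", align_corners=False)
        image_feat = F.normalize(clip_model.encode_image(image_224), dim=-1)
        # Negative cosine similarity: lower loss means a better text-image match.
        loss = -(image_feat * text_feat).sum()
        loss.backward()
        optimizer.step()

    return generator(z).detach()  # the image found for the prompt
```

This illustrates why the approach is zero-shot: only the latent vector is updated, so any text prompt can be handled without retraining the generator or CLIP.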
