An Introduction to VQGAN+CLIP: Versatile AI image generation based on text input

As I said last year in my tutorial on building your own AI models using Runway, I'm always on the lookout for new AI art tools that can be used to create unique art that feels like "mine," not just like a byproduct of a neural net learning to think. Tools like this are still few and far between, but time marches on, and new AI models, techniques, and model mash-ups constantly appear on the scene, only to be quickly refined, changed, improved, altered.
Many models don't have much wide appeal, are unwieldy to implement (especially for all of us non-coders), or are limited in their abilities, scope, and applications for art. Others, like DALL-E by OpenAI, generate a lot of hype (for good reason), but are not publicly available. VQGAN+CLIP is none of the above. All you need to operate it and start generating basically anything you can imagine is access to the Google Colab implementation created by @RiversHaveWings, which I'm providing you here. It's in Spanish, but if you're using Google Chrome, the automatic English translation works perfectly well.
Other people have described the background and history of this tool already, and there's a pretty good tutorial available for the Colab notebook I linked to above, so I won't delve into either topic here (although the rest of the post is written assuming you've read the linked posts). Instead, let's talk about what VQGAN+CLIP can do.

What Can I Make With VQGAN+CLIP?

Honestly, the better question is "is there anything I can't make with VQGAN+CLIP?" and the answer is yes, probably, but part of the fun of the tool is that you learn as you go what kinds of prompts work best, what kinds of things (aesthetics, shapes, colors) can be paired successfully, and how to coax the kinds of images you want out of the neural net. This process is broadly termed "steering," and there's plenty of discussion on Twitter and elsewhere about how to get higher-quality images out of VQGAN+CLIP.
Finding a steering phrase that reliably improves output has been described as similar to discovering a magic incantation, and I think that's an apt analogy. Given enough trials, it is possible to develop complex, modular prompts that consistently give you unique images in your desired style--a style it's quite possible to invent yourself through experimentation. In other words, this isn't just AI style transfer like DeepDream.
Below is a list of steering prompts that the nascent VQGAN+CLIP community on Twitter and Reddit recommends for generating more realistic images:

  • [your prompt here] vray unreal engine hyperrealistic
  • [your prompt here] trending on artstation
  • [your prompt here] photorealistic
  • [your prompt here] render
  • [your prompt here] in the style of disney
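
If you'd rather not retype these by hand, here's a minimal Python sketch of how the phrases above can be tacked onto a base prompt. To be clear, this is my own illustration and not part of the notebook: the names STEERING_PHRASES, build_prompt, and prompt_variants are made up for the example, and you'd still copy each resulting line into the notebook yourself.

```python
# Steering phrases copied from the list above.
STEERING_PHRASES = [
    "vray unreal engine hyperrealistic",
    "trending on artstation",
    "photorealistic",
    "render",
    "in the style of disney",
]


def build_prompt(base, phrases=()):
    """Append zero or more steering phrases to a base prompt (hypothetical helper)."""
    parts = [base.strip()] + [p.strip() for p in phrases if p.strip()]
    return " ".join(parts)


def prompt_variants(base, phrases=STEERING_PHRASES):
    """Return the bare prompt plus one variant per steering phrase."""
    return [build_prompt(base)] + [build_prompt(base, [p]) for p in phrases]


if __name__ == "__main__":
    # Print every variant so each one can be tried in turn.
    for prompt in prompt_variants("a hard-boiled utopian city"):
        print(prompt)
```

Running the same base prompt through each variant makes it easier to see what a given phrase actually contributes.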

These all work quite well for certain kinds of art, but if you want something, say, painterly, not pseudo-rendered, I recommend forgoing all or most of these steering prompts, using the large wikiart or imagenet model, and learning by trial and error and by seeing what other people do on Twitter and Reddit. Google Colab provides you with replenishable GPU power, so it doesn't cost you anything to play around with the tool and get a feel for the sheer potential of the thing. With a little time, you can easily create your own secret-sauce prompts to create mind-blowing art.
For example, one of my tailor-made prompts generates stuff like this almost every time:

while another generates images like this:

The tool is versatile as can be, and I'm only just scratching the surface of its potential.

Words have power!

Apart from being technically fascinating and artistically exciting, there's some humor in learning your way around VQGAN+CLIP, too. Because part of the functionality of the tool is based on the words that people use to tag images, it can get pretty literal in what it spits out for certain prompts. My favorite mishap by far was when I asked VQGAN+CLIP for "a military portrait of a Homo erectus rear-admiral." I didn't get anything like what I wanted, but I got exactly what I asked for--a Homo erectus admiral's rear.

Another enjoyable too-literal instance was when I asked for a "hard-boiled utopian city" and got an aggressively eggy wonderland for my trouble.

I really want to see what kinds of things people make with this tool, so if you create anything cool, please share! And if you have any questions about implementing the Colab notebook, please don't hesitate to ask.

Now, go make stuff!
