Evaluating GPT-4’s Image Generation Capabilities

Do the results really warrant the hype?

Devansh
3 min read · Oct 18, 2023

I've been experimenting with AI-generated pictures for an upcoming piece. I already suspected this, but hands-on experimentation really shows you how overrated GPT-4’s multi-modality is.

My prompt for the image below was: Draw a bunch of geometrically similar rectangles nested within each other. The Biggest Rectangle has the text “main problem”, the second biggest has “Sub Problem one” etc.

Here are 2 major flaws with the result:

  1. Clearly, these are not nested rectangles. This is nowhere close to what I described (and notice that my prompt is extremely simple).
  2. There are lots of typos in there.
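To make concrete how simple the request is, here is a minimal sketch of what the prompt describes, written as plain Python that emits an SVG string. Everything in it is my own illustration rather than anything GPT-4 produced: the function names `nested_rectangles` and `to_svg`, the 0.75 shrink factor, the canvas size, and the third label (my guess at what the prompt's "etc." would continue with) are all assumptions.

```python
def nested_rectangles(width, height, labels, scale=0.75):
    """Return (label, x, y, w, h) tuples for concentric rectangles.

    Every rectangle keeps the same width/height ratio (geometrically
    similar) and is centred on the canvas, so each one sits fully
    inside the previous one. The first label goes on the biggest.
    """
    cx, cy = width / 2, height / 2
    w, h = float(width), float(height)
    rects = []
    for label in labels:
        rects.append((label, cx - w / 2, cy - h / 2, w, h))
        w *= scale  # shrink both sides by the same factor
        h *= scale
    return rects


def to_svg(rects, width, height):
    """Render the rectangles as a flat 2-D SVG document (a string)."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">'
    ]
    for label, x, y, w, h in rects:
        parts.append(
            f'<rect x="{x}" y="{y}" width="{w}" height="{h}" '
            f'fill="none" stroke="black"/>'
        )
        # Put each caption just inside the top-left corner of its rectangle.
        parts.append(f'<text x="{x + 4}" y="{y + 14}" font-size="10">{label}</text>')
    parts.append("</svg>")
    return "\n".join(parts)


if __name__ == "__main__":
    labels = ["main problem", "Sub Problem one", "Sub Problem two"]
    print(to_svg(nested_rectangles(400, 200, labels), 400, 200))
```

Saving the printed output to a `.svg` file and opening it in a browser should show three flat, nested rectangles with the biggest one carrying "main problem".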

Once GPT-4 became multi-modal, the hype-cycle came back in full swing. However, after looking through the capabilities, it doesn’t seem to be nearly as good as advertised. Even extremely basic prompts trip it up, revealing how far things must go before it becomes useful at scale.

One of the most interesting mistakes: it flipped the captions for Sub-Problem and Main Problem. Note how "main problem" is in the inner rectangle. It made this mistake consistently across runs. I would love to know whether this is a common occurrence or just something associated with my seeding (I’m taking a shot in the dark here).

Another interesting phenomenon: GPT really struggles with generating 2-D pictures, even when it’s explicitly prompted to. I reran the prompts multiple times, and most of the results were 3-D images.

That being said, GPT looks like it has really improved its understanding capabilities. I ran a few basic tests to see if GPT could describe images and withstand adversarial attacks, and so far it did pretty well. I will post more details on that soon.

Given the current state of GPT, the most promising use-case for Gen-AI is data annotation. It might also have some promise in video compression: a multi-modal model picks out the frames that differ most from each other, only those frames are transmitted, and another model reconstructs the full video from them client-side. The dialogue/transcript can be used for additional context.
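The frame-selection half of that idea can be sketched without any model at all. The snippet below is a rough illustration under my own assumptions, not a real codec: frames are flat lists of grayscale pixel values, "most different" is approximated by mean absolute pixel difference against the last kept frame, and the client-side "model" is replaced by a naive stand-in that just repeats the last received frame. The names `mean_abs_diff`, `select_keyframes`, and `reconstruct_hold`, as well as the threshold value, are hypothetical.

```python
def mean_abs_diff(a, b):
    """Average absolute per-pixel difference between two equal-sized frames."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def select_keyframes(frames, threshold):
    """Return indices of frames worth transmitting.

    The first frame is always kept. A later frame is kept only if it
    differs from the most recently kept frame by more than `threshold`.
    """
    if not frames:
        return []
    kept = [0]
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], frames[kept[-1]]) > threshold:
            kept.append(i)
    return kept


def reconstruct_hold(keyframes, indices, total):
    """Naive client-side placeholder for the reconstruction model.

    Fills every position with the most recent keyframe at or before it.
    A learned model would synthesise the in-between frames instead.
    """
    out = []
    j = 0
    for i in range(total):
        # Advance to the latest keyframe whose index is <= i.
        while j + 1 < len(indices) and indices[j + 1] <= i:
            j += 1
        out.append(keyframes[j])
    return out


if __name__ == "__main__":
    video = [
        [0, 0, 0, 0],
        [1, 0, 0, 0],      # nearly identical to frame 0
        [10, 10, 10, 10],  # scene change
        [10, 10, 11, 10],  # nearly identical to frame 2
        [50, 50, 50, 50],  # scene change
    ]
    idx = select_keyframes(video, threshold=5)
    sent = [video[i] for i in idx]
    rebuilt = reconstruct_hold(sent, idx, len(video))
    print(idx, len(rebuilt))
```

In this toy run only 3 of 5 frames are sent. Whether a generative model can fill the gaps well enough to beat existing codecs is exactly the open question.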

What do you think? Does the idea sound feasible? How do you see Gen AI being useful? Drop your thoughts below.


If you find AI Made Simple useful and would like to support my writing- please consider becoming a premium member of my cult by subscribing below. Subscribing gives you access to a lot more content and enables me to continue writing. This will cost you 400 INR (5 USD) monthly or 4000 INR (50 USD) per year and comes with a 60-day, complete refund policy. Understand the newest developments and develop your understanding of the most important ideas, all for the price of a cup of coffee.


Reach out to me

Use the links below to check out my other content, learn more about tutoring, reach out to me about projects, or just to say hi.

Small Snippets about Tech, AI and Machine Learning over here

Free AI Newsletter- https://artificialintelligencemadesimple.substack.com/

My grandma’s favorite Tech Newsletter- https://codinginterviewsmadesimple.substack.com/

Check out my other articles on Medium: https://rb.gy/zn1aiu

My YouTube: https://rb.gy/88iwdd

Reach out to me on LinkedIn. Let’s connect: https://rb.gy/m5ok2y

My Instagram: https://rb.gy/gmvuy9

My Twitter: https://twitter.com/Machine01776819
