The text-to-image space is crowded. Instead of using the same model for every task, we decided to evaluate which of the current leading models, as rated on the Text To Image Leaderboard on Hugging Face, are best suited to specific outputs.
For this comparison, we looked at each model from both an everyday consumer's and a business perspective. All images were generated via public APIs or the vendors' websites, with no special fine-tuning. The contenders were:
- GPT-4o
- Recraft v3
- Imagen 4
- Reve 1.0
- Flux 1.1 [pro]
- Ideogram 3.0
Let's see how they stack up.
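For readers who want to repeat this kind of side-by-side run, here is a minimal sketch of a comparison harness. It is not the tooling used for this article. The `generate` callable is a hypothetical placeholder for whatever vendor API or website export you use, and `fake_generate` stands in for it only so the sketch runs offline. The two prompts shown are the short ones from the tests in this article.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

# Model names as listed in this article.
MODELS: List[str] = [
    "GPT-4o", "Recraft v3", "Imagen 4",
    "Reve 1.0", "Flux 1.1 [pro]", "Ideogram 3.0",
]

# Two of the prompts used in this article; add the others as needed.
PROMPTS: Dict[str, str] = {
    "world_understanding": "POV your dog wakes you up",
    "abstract_concept": "Fear",
}


def run_comparison(
    generate: Callable[[str, str], bytes],
    models: List[str],
    prompts: Dict[str, str],
) -> Dict[Tuple[str, str], bytes]:
    """Send every prompt to every model and collect the raw image bytes.

    `generate(model, prompt)` is an assumed interface, not a real vendor API.
    """
    results: Dict[Tuple[str, str], bytes] = {}
    for model, (test_name, prompt) in product(models, prompts.items()):
        results[(model, test_name)] = generate(model, prompt)
    return results


def fake_generate(model: str, prompt: str) -> bytes:
    """Offline stand-in: returns a label instead of calling any service."""
    return f"{model}|{prompt}".encode()


results = run_comparison(fake_generate, MODELS, PROMPTS)
```

Keeping the prompt text identical across models, as this loop does, is what makes the outputs comparable.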
Test 1: Stylized Text
We started with a complex prompt that tests stylistic understanding and text generation, a notorious weak spot for image models:
Alien landscape, in the style of classic science fiction book covers from the 1960s, the cover says “was he a developer from the past – or an AI from the future?” – in small letters and “THE RISE OF ZEDGE” as the title, by “Gediminas Vasiliauskas”
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
Based on these results, GPT-4o comes closest to the aesthetic of “science fiction book covers from the 1960s.” However, props to Reve 1.0 for respecting letter case, since the prompt specified that the word “was” should be lowercase.
Test 2: World Understanding
How do these models interpret everyday human situations? We tested a first-person perspective that requires "world understanding." The expectation was to see a dog from the perspective of someone lying in bed:
POV your dog wakes you up
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
Recraft v3 shows the best understanding of a first-person POV and of waking up in the morning. GPT-4o seems to think you are sleeping on the other side of the bed, while Reve 1.0 got things backward: the POV appears to be that of someone standing over the bed and finding the dog in it.
Test 3: Abstract Metaphors
This test checked each model's capacity to translate an abstract concept into a coherent visual that makes sense to a human:
Chaos vs order as a visual metaphor next to an ancient Egyptian temple ritual
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
While this is a subjective call, we think Imagen 4 represents “chaos vs order” the best while adhering to the Egyptian theme. Ideogram 3.0 took the idea very literally (though we’re not sure what’s so orderly about a black sun).
Test 4: Abstract Concept
To minimize bias from concrete objects, we prompted the models with a single, abstract emotion:
Fear
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
The “Fear” emotion was best depicted by GPT-4o. However, Recraft v3 and Ideogram 3.0 offered a fascinating take, attempting to showcase the emotion through the eyes of the person viewing the image.
Test 5: Anime Style
Next, we tested one of the most popular and well-defined styles, anime:
A wide shot of two anime high school students, a girl with long black hair and a boy with messy brown hair, gazing out the window of a moving train car.
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
It appears that Recraft v3 and Ideogram 3.0 adhere to the prompt best, as both subjects are looking out of the window. All models produced superb anime quality, albeit in different styles.
Test 6: Stress Test
Finally, we used a highly detailed prompt to stress-test the models' ability to handle multiple subjects, specific actions, and a distinct historical aesthetic:
Five female riveters perched on a narrow steel beam hundreds of feet above the bustling streets of 1930s Chicago. Wearing their standard workwear of denim overalls and leather gloves, the women take a break from their work on the construction of the first skyscraper, enjoying coffee from metal thermoses and sharing a laugh. The grayscale image, reminiscent of Depression-era photography, showcases their camaraderie and fearlessness against the backdrop of a rapidly modernizing city.
[Image grid: GPT-4o, Recraft v3, Imagen 4 (top row); Reve 1.0, Flux 1.1, Ideogram 3.0 (bottom row)]
The GPT-4o result feels AI-generated. People are floating in the Flux 1.1 image. There are only four women in the Recraft v3 generation. In almost every case, the women have "generic" facial features and similar builds; the Reve 1.0 image is the exception. Reve 1.0 also captures the laughter and camaraderie the best.
Unsurprisingly, perhaps, no single model “won”:
GPT-4o: Definitely a go-to model for overall quality and text generation. It consistently delivers stellar results.
Recraft v3: High quality for simple real-world scenes. Users give it high marks for nature and landscape images.
Imagen 4: Pretty good at producing visually appealing results, making it great for conceptual and digital art.
Reve 1.0: Has a similar vibe to Midjourney and can generate aesthetically pleasing images. It is also a good choice when you need variety among many similar objects.
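As a rough way to see why no single model "won," the per-test commentary above can be tallied. This is a sketch, not a scoring method used in the article: the dictionary records which model(s) each test's commentary singled out as best, and counting Test 6 for Reve 1.0 is our reading of that paragraph rather than an explicit verdict.

```python
from typing import Dict, List

# Models named as best in each test's commentary in this article.
# Test 6 ("Stress Test") is an interpretation: Reve 1.0 was praised most.
BEST_PER_TEST: Dict[str, List[str]] = {
    "Stylized Text": ["GPT-4o"],
    "World Understanding": ["Recraft v3"],
    "Abstract Metaphors": ["Imagen 4"],
    "Abstract Concept": ["GPT-4o"],
    "Anime Style": ["Recraft v3", "Ideogram 3.0"],
    "Stress Test": ["Reve 1.0"],
}

MODELS: List[str] = [
    "GPT-4o", "Recraft v3", "Imagen 4",
    "Reve 1.0", "Flux 1.1 [pro]", "Ideogram 3.0",
]


def tally(best_per_test: Dict[str, List[str]], models: List[str]) -> Dict[str, int]:
    """Count how many tests each model was singled out in (ties count for all)."""
    counts = {model: 0 for model in models}
    for winners in best_per_test.values():
        for model in winners:
            counts[model] += 1
    return counts


counts = tally(BEST_PER_TEST, MODELS)
for model, wins in sorted(counts.items(), key=lambda kv: -kv[1]):
    print(f"{model}: {wins}")
```

With six subjective tests, a count like this is only a summary of the commentary, not a ranking.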