Our Image-to-Image AI Journey: Scaling to Millions

By Gediminas Vasiliauskas · 5 min read

At Zedge, we faced a common but difficult engineering challenge: how do you build an image-to-image AI pipeline that can transform a user's photo into any artistic style, maintain their identity, and scale to millions of users with low latency?

Not long ago, we stressed over models consistently adding six fingers – nowadays, they're so advanced they could enter human art competitions (and win).

So, how did we go from disappointing results to ones that match or exceed expectations?

Rough Beginnings: Faces That Didn’t Fit

Our goal was to allow users to transform any photo into styles like anime, cyberpunk, or watercolor paintings. The key requirements were speed, quality, and preserving the user's likeness.

When generating AI images, we use specific parameters – settings that guide the AI's behavior. A critical parameter is the prompt, a descriptive instruction that tells the AI what to create. For text-to-image generation, the prompt is purely textual. In our image-to-image scenario, however, we provide two types of prompts: a text prompt describing the desired style and subject, and an image prompt, which is the user's uploaded reference photo.
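
As a rough illustration of these two inputs, here is a minimal sketch of an SDXL image-to-image call using the Hugging Face diffusers library. It is a simplified example, not our production pipeline, and the checkpoint and file names are placeholders:

```python
import torch
from diffusers import AutoPipelineForImage2Image
from diffusers.utils import load_image

# Load a generic SDXL image-to-image pipeline (placeholder checkpoint)
pipe = AutoPipelineForImage2Image.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# The image prompt: the user's uploaded reference photo
reference = load_image("user_photo.jpg").resize((1024, 1024))

# The text prompt: describes the desired style and subject
prompt = "anime artwork, highly detailed"

# strength controls how far the output may drift from the reference image
result = pipe(prompt=prompt, image=reference, strength=0.6).images[0]
result.save("stylized.png")
```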

Initially, we built on the open-source generative AI model Stable Diffusion XL (SDXL), combining it with ControlNets and hoping for magic. ControlNet is an add-on that guides the AI model with visual information extracted from the user's reference image, such as outlines (Canny edges) or depth maps. This guidance helps the model match the structure and composition of the reference image, as illustrated by the outline (Canny) map for the dog and the depth map for Keanu Reeves below. Despite this additional guidance, human faces still came out distorted and barely recognizable. Oddly, animal results looked fine – maybe because we’re less particular about pets’ faces?

1. Dog reference image

Dog Canny ControlNet map

Dog result in Anime style

2. Keanu Reeves reference image

Keanu Reeves Depth ControlNet map

Keanu Reeves in Vice style
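
To make the ControlNet step above concrete, here is a hedged sketch of that kind of setup using diffusers with a public Canny ControlNet checkpoint; the exact models, prompts, and parameters we run in production differ:

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import ControlNetModel, StableDiffusionXLControlNetPipeline
from diffusers.utils import load_image

# Public Canny ControlNet for SDXL (illustrative checkpoint choice)
controlnet = ControlNetModel.from_pretrained(
    "diffusers/controlnet-canny-sdxl-1.0", torch_dtype=torch.float16
)
pipe = StableDiffusionXLControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Extract an outline (Canny) map from the reference photo
reference = load_image("dog.jpg").resize((1024, 1024))
edges = cv2.Canny(np.array(reference), 100, 200)
canny_map = Image.fromarray(np.stack([edges] * 3, axis=-1))

# The Canny map guides structure and composition; the text prompt drives the style
image = pipe(
    prompt="anime artwork illustrating a dog, anime style, highly detailed",
    image=canny_map,
    controlnet_conditioning_scale=0.7,
).images[0]
```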

You might wonder how the model knows which colors or gender to generate. To bridge this gap, we tried the CLIP Interrogator, a tool that automatically generates descriptive captions from a reference photo, providing additional textual context for such details. For instance, the original generic prompt might look like this:

"""anime artwork illustrating {{CLIP_output}} anime style, highly detailed"""

After running CLIP Interrogator on the reference dog image, it becomes:

"""anime artwork illustrating a brown and white dog standing on a bench. anime style, highly detailed"""

Usually it worked great, but occasionally we’d get wild mistakes – such as generating a pizza next to Keanu Reeves’ face in the example above, because the proposed caption was "a man sitting at a table with a pizza in front of him."

So how exactly does the AI produce these misunderstandings? Generative AI models rely entirely on the textual and visual inputs they are given. For example, if the input image is too close-up, a depth ControlNet might produce an unclear depth map, leaving room for interpretation by the AI. Similarly, if the CLIP Interrogator generates an incorrect caption, the AI blindly follows the mistaken prompt.

Adapters: More Hype than Help

Next, we tested IP-adapters and T2I-adapters. IP-adapters, which feed visual details from the reference image to the AI model alongside the textual prompt, slightly improved faces but degraded overall compositions and introduced artifacts, such as the weird hands in the first image below. T2I-adapters, essentially a lighter-weight variation on ControlNets, did not offer anything significantly better, as shown in the second image:

Keanu Reeves in Vice style (IP-adapters)

Keanu Reeves in Vice style (T2I-adapters)
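
For reference, attaching an IP-adapter in diffusers looks roughly like the sketch below; the adapter weights and scale are illustrative, not the exact configuration we tested:

```python
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils import load_image

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Attach an IP-Adapter so visual features from the reference photo
# are injected alongside the text prompt
pipe.load_ip_adapter(
    "h94/IP-Adapter", subfolder="sdxl_models", weight_name="ip-adapter_sdxl.bin"
)
pipe.set_ip_adapter_scale(0.6)  # lower values give the text prompt more weight

reference = load_image("keanu.jpg")
image = pipe(
    prompt="vice style portrait, neon colors, highly detailed",
    ip_adapter_image=reference,
).images[0]
```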

LoRAs? Too Slow for Our Scale

Between exploring ControlNets and adapters, we briefly considered Low-Rank Adaptations (LoRAs). In simple terms, a LoRA is a small set of add-on weights fine-tuned on top of the original AI model using user-provided images. Although the results are significantly better, training a LoRA requires the user to upload 3-10 images, and the training process takes over a minute. Since we need to give millions of users instant, single-image results, LoRAs simply weren’t a viable solution for Zedge.
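
Applying an already-trained LoRA is fast – a hedged diffusers sketch is shown below, with hypothetical directory and file names – but the bottleneck is that each user would first need their own LoRA trained from their uploads:

```python
import torch
from diffusers import StableDiffusionXLPipeline

pipe = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Load a small set of add-on (LoRA) weights on top of the base model.
# The directory and file name here are hypothetical; in our scenario this
# LoRA would have to be trained per user, which takes over a minute.
pipe.load_lora_weights("loras/user_123", weight_name="identity_lora.safetensors")

image = pipe(prompt="anime artwork of the person, highly detailed").images[0]
```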

PhotoMaker-V1: Almost There

When PhotoMaker-V1 appeared, we were optimistic – it finally solved our core problem: faces looked great, and you could actually recognize who the person was without needing the reference image. But images lost their structure. Despite rigorous prompt engineering, we kept getting inconsistent scenes, especially with non-face subjects. It was hit or miss with the background – as you can see in the example below, it generated a kitchen out of nowhere:

PhotoMaker-V1 output of Keanu Reeves in Anime style

PhotoMaker-V1 output of Keanu Reeves in Vice style

InstantID: The True Game-Changer

Everything clicked once we obtained a license from the company InsightFace and adopted the InstantID add-on for facial detection and recognition. Faces? Perfect. Composition? Fully controllable, thanks to the ControlNets built into InstantID’s architecture. Each style – whether depth-heavy cyberpunk or pose-driven anime – became predictably stunning.

InstantID output of Keanu Reeves in Anime style

InstantID output of Keanu Reeves in Vice style
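
Under the hood, the identity signal comes from InsightFace: a face is detected in the reference photo and converted into an embedding that InstantID injects into the diffusion process. A minimal sketch of that detection step (the model pack name here is an illustrative choice) might look like this:

```python
import cv2
from insightface.app import FaceAnalysis

# Detect the face and extract its identity embedding with InsightFace
# (the "buffalo_l" model pack is an illustrative choice)
app = FaceAnalysis(name="buffalo_l", providers=["CPUExecutionProvider"])
app.prepare(ctx_id=0, det_size=(640, 640))

faces = app.get(cv2.imread("keanu.jpg"))
if not faces:
    raise ValueError("No face detected - fall back to a plain ControlNet pipeline")

# A 512-dimensional identity vector plus facial keypoints; InstantID feeds these
# into its adapter and ControlNet branches to keep the person recognizable
face_embedding = faces[0].normed_embedding
face_keypoints = faces[0].kps
```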

Expanding Versatility: Meet PuLID

For more stylized visuals such as animations, claymation, and pixel art, we introduced the Pure and Lightning ID method (PuLID), which is also powered by InsightFace’s facial detection and recognition models. Simply put, PuLID is very similar to InstantID, minus the built-in ControlNets. The result is creative, quick, and visually exciting.

PuLID output of Keanu Reeves in Toy style

Adaptive Pipelines: Smart Fallbacks

A production-ready system must handle edge cases, such as images with multiple faces or no face at all. For those, we seamlessly switch back to trusty ControlNets, avoiding errors and keeping every result coherent. However, group photos remain a challenge: ControlNets aren’t great at generating faces, and both InstantID and PuLID only capture the embedding of a single face. Solving group images is an area we’re actively working to improve.
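
Conceptually, the routing boils down to something like the sketch below – a simplified illustration of the fallback logic, not our actual service code:

```python
from typing import Sequence

STYLIZED_LOOKS = {"animation", "claymation", "pixel_art", "toy"}  # illustrative set

def choose_pipeline(face_embeddings: Sequence, style: str) -> str:
    """Pick a generation pipeline based on detected faces and the requested style."""
    if len(face_embeddings) != 1:
        # No face, or a group photo: identity methods handle only one face,
        # so fall back to the ControlNet pipeline
        return "controlnet"
    if style in STYLIZED_LOOKS:
        # Heavily stylized looks go to PuLID
        return "pulid"
    # Everything else uses InstantID with its built-in ControlNets
    return "instantid"
```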

Where We Stand: Unmatched Combination of Speed, Quality, and Flexibility

Today, InstantID creates incredible images in just 4.3 seconds, and PuLID is even faster at 2.2 seconds (mean active time), whereas the ControlNet and PhotoMaker-V1 pipelines averaged 11.7 seconds. Our custom pipelines give us speed, flexibility, and affordability – a far better deal than the costly per-image fees charged by competitors.