Achieving a 100% safe platform is impossible. No platform or product has managed it, and none ever will. A dedicated user with malicious intent will always find a way past guardrails. The goal, then, is to make such efforts so time-consuming that they give up and move on.
As a practical example at Zedge, users sometimes generate problematic images from text prompts with our AI tool. Roughly 7% of user-generated AI images are blocked as Not Safe For Work (NSFW). Users expect instant results, but we must first ensure they haven't asked for something we can't legally provide, such as sexual content involving minors.
The Detection Problem
Here's where it gets interesting: how do you detect something that seems innocent to AI models but is extremely sensitive to humans? Certain poses, angles, clothing choices, or facial expressions can carry implications that basic content filters miss.
There is no single solution.
In practice, we employ a two-layer security solution: 1) verify the output image, and 2) verify the user's prompt patterns.
Two-Layer Security Approach
Why do we need both layers? While people often assume NSFW refers exclusively to sexual content, we must account for multiple sensitive categories, including (non-exhaustively):
- Sexual content
- Child Sexual Abuse Material (CSAM)
- Terrorism
- Hate speech and harassment
- Content promoting suicide
- Sensitive political and historical symbols
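Conceptually, the two layers combine into a short-circuiting pipeline: a generation is blocked as soon as either layer flags it. The sketch below is illustrative only; the function names and placeholder logic are hypothetical stand-ins for the real checks described below.

```python
def check_output_image(image_bytes: bytes) -> bool:
    """Layer 1 stub: NSFW classifier + VLM check on the generated image."""
    return b"nsfw" not in image_bytes  # placeholder logic, not a real model

def check_prompt_history(prompts: list[str]) -> bool:
    """Layer 2 stub: LLM scan of the user's recent prompts for intent."""
    return not any("forbidden" in p.lower() for p in prompts)  # placeholder

def moderate(image_bytes: bytes, prompts: list[str]) -> str:
    # Short-circuit: block as soon as either layer flags the generation.
    if not check_output_image(image_bytes):
        return "blocked: output image"
    if not check_prompt_history(prompts):
        return "blocked: prompt pattern"
    return "allowed"
```

The ordering matters in practice: the image check runs on every generation, while the prompt-pattern check can run asynchronously across a user's history.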
Layer 1: verifying the output image. This layer consists of two checks: one for sexual content and another for other sensitive content.
For the first check, we investigated many leading open-source NSFW checkers, including Diffusers Safety Checker, FalconsAI Checker, NudeNet Checker 640M, Bumble’s Private Detector, and Yahoo OpenNSFW2.
In our testing, the Diffusers Safety Checker achieved the best balance, with an F1-score (the harmonic mean of precision and recall) of approximately 0.67 for NSFW detection. While other checkers such as FalconsAI showed higher precision (~0.8), they suffered from lower recall, missing many actual NSFW instances.
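To see why the F1-score punishes a lopsided checker, it helps to compute it directly. The numbers below are illustrative, not our benchmark data:

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# A high-precision, low-recall checker scores worse than its precision
# suggests: precision 0.8 with recall 0.5 yields an F1 of only ~0.62,
# because every missed NSFW image drags the harmonic mean down.
```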
These models can detect sexual content, but they struggle with other categories.
For the second check, a Vision Language Model (VLM) comes into play. We employ the 8-billion-parameter variant of InternVL 3 to ensure output images don't contain any prohibited content. We chose this model for its accuracy and speed (less than 1 second per image!). This VLM provides an additional safety net beyond traditional image classifiers. There are many other VLMs out there, as seen in the chart below:
However, the most critical part is prompt engineering to get the model to work as expected.
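One common prompt-engineering pattern is to constrain the VLM to a strict, machine-parseable verdict and to fail closed when the reply doesn't match. The prompt wording and JSON schema below are illustrative assumptions, not our production prompt:

```python
import json

# Hypothetical system prompt: force the VLM into a machine-parseable verdict.
SYSTEM_PROMPT = """You are an image-safety reviewer. Decide whether the image
contains any prohibited content (sexual content, CSAM, terrorism, hate speech,
content promoting suicide, sensitive political or historical symbols).
Answer ONLY with JSON: {"safe": true|false, "category": "<category or none>"}"""

def parse_verdict(raw: str) -> tuple[bool, str]:
    """Parse the model's JSON reply, failing closed on malformed output."""
    try:
        data = json.loads(raw)
        return bool(data["safe"]), str(data.get("category", "none"))
    except (json.JSONDecodeError, KeyError, TypeError):
        # If the model ignores the format, treat the image as unsafe.
        return False, "unparseable"
```

Failing closed matters here: a model that drifts off-format should never silently pass an image through.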
Worth noting: if avoiding sensitive content is your priority, choose your model carefully. As an example, HiDream-L1 is completely uncensored, whereas SDXL has inherent restrictions because its creators removed sensitive training data.
Unfortunately, there is one specific category that these models struggle with: detecting CSAM intent.
Layer 2: verifying user prompt intent. This is the final check for ensuring safety.
Since some users genuinely lack malicious intent, even when their prompts might lead to prohibited results, we analyze patterns across multiple generations.
A Large Language Model (LLM) proved to be the best fit, because traditional Natural Language Processing (NLP) solutions lack the general semantic understanding needed for CSAM detection across languages. They also struggle with the ambiguities of everyday language, where many people say "girl" when referring to an adult woman, for example.
Users who deliberately misuse our platform typically generate multiple problematic images, so it's easier to determine malicious intent for a given image if we can analyze multiple prompts from that particular user.
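A minimal sketch of that idea, assuming each prompt has already been assigned a per-prompt risk score (the window, threshold, and hit count below are made-up values, not our production tuning):

```python
def is_malicious(risk_scores: list[float],
                 window: int = 10,
                 threshold: float = 0.7,
                 min_hits: int = 3) -> bool:
    """Flag a user when several recent prompts look high-risk.

    A single borderline prompt is tolerated; a repeated pattern is not.
    """
    recent = risk_scores[-window:]            # only the most recent prompts
    hits = sum(1 for s in recent if s >= threshold)
    return hits >= min_hits
```

The point of the window is fairness: one ambiguous prompt from a long-standing user shouldn't carry the same weight as a burst of high-risk prompts in a row.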
After some experimenting, we chose the Gemini 2.0 Flash model to scan user prompts with carefully engineered system prompts. Google themselves recommend this approach for content filtering and moderation. To deploy it at scale, we use the Vertex AI Platform API.
It costs roughly $100 to scan 1M users with Gemini 2.0 Flash.
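That figure works out to a negligible per-user cost, as a quick back-of-the-envelope check shows:

```python
total_cost_usd = 100          # ~$100 per batch (estimate above)
users_scanned = 1_000_000     # 1M users per batch

cost_per_user = total_cost_usd / users_scanned
# $0.0001 per user, i.e. a hundredth of a cent per user scanned
```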
The Ongoing Battle
Even with these comprehensive guardrails, determined users will attempt jailbreaking and go exploit hunting. However, by implementing multiple detection steps, we ensure malicious actors must invest significant time and effort to achieve prohibited results. Most give up when the investment exceeds the reward.
The key insight from our journey: content safety isn't about building an impenetrable wall; it's about creating enough friction that bad actors choose easier targets. Safety in AI generation remains an evolving challenge. As models become more sophisticated, so do attempts to misuse them. Our multi-layered approach reflects the current state of the art, but we continue researching and adapting as the landscape changes, and so should you.