Why Testing More Creatives Leads to Better Performance Phiture

How many creative variants does it take to find a winner? For most mobile growth teams, the honest answer is: more than they’re currently testing.

The volume of creatives being produced and tested across the mobile industry has increased dramatically over the past two years. AI-powered tools have made it possible to generate hundreds of visual variants in hours rather than weeks, and the teams using them are pulling ahead. This shift has undoubtedly led to more creatives being produced, but it has also prompted teams to ask what happens when you test more: you find more winners, and those winners build on each other into a real and measurable performance advantage.

This is one of the core arguments our CEO Andy Carvell makes in his recent piece on integrated growth: when signal loss limits audience targeting, and AI makes experimentation radically cheaper, creative becomes the primary way to drive growth. Not creative quality alone, but creative volume, speed, and the systems that connect learnings across channels.

In this article, we’ll look at how creative testing volumes have changed, why the math favours teams that test at scale, and what this means in practice for how mobile growth teams should work. We also asked Stuart Miller, Phiture’s Creative Director, to share his perspective throughout the piece.

How many creatives does it take to find a winner?

The relationship between testing volume and winning creatives is probabilistic. If the success rate for any single creative experiment sits somewhere between 5% and 15% (a range that holds across most categories and channels), then the expected number of winners is directly proportional to the number of experiments you run.

In practice, it may look like this: a team running 10 experiments per month at a 10% hit rate will find roughly one winner on average, while a team running 50 will find about five. Those additional winners build on each other, because each winner generates data about what works, which sharpens the next round of hypotheses, which gradually improves the hit rate itself.
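
To make that arithmetic concrete, here is a minimal Python sketch. It assumes every experiment is independent and shares one fixed hit rate (a simple binomial model, which is a simplification on our part rather than something measured), and the function names are purely illustrative. Under those assumptions, 10 experiments at a 10% hit rate give one expected winner but only about a 65% chance of finding at least one, while 50 experiments give five expected winners and a better than 99% chance of at least one.

```python
def expected_winners(experiments: int, hit_rate: float) -> float:
    """Average number of winners if each experiment wins with probability hit_rate."""
    return experiments * hit_rate


def prob_at_least_one_winner(experiments: int, hit_rate: float) -> float:
    """Chance that at least one experiment wins, assuming independent experiments."""
    return 1.0 - (1.0 - hit_rate) ** experiments


# Compare the two teams from the example: 10 vs. 50 experiments per month.
for monthly_experiments in (10, 50):
    winners = expected_winners(monthly_experiments, 0.10)
    chance = prob_at_least_one_winner(monthly_experiments, 0.10)
    print(f"{monthly_experiments} experiments: "
          f"{winners:.1f} expected winners, "
          f"{chance:.1%} chance of at least one")
```

The sketch keeps the hit rate constant. If learnings from earlier winners raise the hit rate over time, as argued above, the gap between the two teams would be larger than these numbers suggest.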

This is why creative testing is fundamentally a volume game. The goal isn’t to produce one perfect ad or one perfect screenshot; rather, it is to run enough experiments that the probability of finding winners works consistently in your favour. Teams that understand this and build their operations around it will outperform teams that spend weeks perfecting a single variant before putting it live.

The implication is counterintuitive for many teams: you don’t need better creatives as much as you need more bets. A rough variant that gets tested is worth more than a polished variant that sits in a design queue.

Stuart sees it slightly differently, though. As he puts it: “Volume-first doesn’t mean volume-only. Ignore that and you’ll end up creating mediocre ads. More than ever, creative teams have to deeply understand the product’s value to the target market, not just how to create slick layouts and punchy sentences. Craft has moved to earlier in the process; it’s not about individual executions anymore, it’s about hypothesis quality and systems flexibility.”

Why producing more creatives isn’t enough

For years, the limiting factor in creative testing was production. That made sense: you couldn’t test what you couldn’t build, and building creative variants was slow and expensive. AI has largely removed that constraint, and most UA teams now produce 60 to 80 new creatives per month. In Q4 2025 alone, advertisers used Google’s Gemini to generate nearly 70 million creative assets inside Performance Max and AI Max campaigns, a threefold increase year over year. On the ASO side, Apple’s expansion of Custom Product Pages to 70 per app and Google’s native Store Listing Experiments have opened up far more room for testing than existed even a year ago.

Ironically, removing the production bottleneck has revealed a new one: getting creatives into testing. Each creative variant needs proper setup: naming conventions, ad set assignment, budget allocation, and conversion tracking. What might seem like five minutes per creative adds up fast: even at that pace, 60 to 80 assets per month means roughly five to seven hours of setup work just to get things live.
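
The setup cost is easy to estimate for your own team. The short Python sketch below multiplies monthly creative volume by minutes of setup per creative. The five-minute default mirrors the figure above, and the function name is hypothetical. Swap in your own numbers to see where manual setup stops being practical.

```python
def monthly_setup_hours(creatives_per_month: int,
                        minutes_per_creative: float = 5.0) -> float:
    """Hours spent per month on manual setup (naming, ad sets, budgets, tracking)."""
    return creatives_per_month * minutes_per_creative / 60.0


# Volumes taken from the 60-to-80 range above; the setup times are assumptions.
for volume in (60, 80):
    optimistic = monthly_setup_hours(volume)           # 5 minutes each
    realistic = monthly_setup_hours(volume, 15.0)      # assumed slower setup
    print(f"{volume} creatives: {optimistic:.1f} h at 5 min, "
          f"{realistic:.1f} h at 15 min")
```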

What can teams do here? Simply put, they have to prioritize, testing the obvious bets and leaving the rest on the shelf. The result is that many teams are now producing more than they’re testing, which means much of the investment in production is being wasted.

Automation can help here too. When the entire cycle from variant generation to deployment to analysis runs without manual intervention, teams can actually test at the volume the math demands. In practice, that looks like what happened when Wildlife Studios started automating their store listing experiments: we ran 168 tests simultaneously across their portfolio, with 322 variants live and collecting data. Results came back in weeks rather than months. The winning experiments drove conversion rate improvements of up to 24.1%, which added up to 12 million projected additional installs. None of that would have been possible if a human had to set up each experiment by hand. PressPlay, the tool behind that process, handles generation, deployment, monitoring, and early stopping of ineffective creatives automatically on Google Play.

Similar results have been seen with Lockwood Publishing’s Avakin Life (57% conversion rate increase) and Lion Studios by AppLovin (20% global install lift), both using PressPlay to test at volumes their teams couldn’t have managed manually.

On this point, Stuart offers a useful analogy: “In the same way that the printing press rendered the neatness of handwriting less important, modern automations make pixel perfection a far smaller factor than before. The most valuable skill that creative teams can hone is to quickly identify emerging patterns using data, whether that’s a type of hook, an aesthetic, an approach to defining value, and then rapidly fold those back into the testing plan.”

How privacy changes made creative testing the new targeting

There’s a more fundamental reason why creative volume matters more now than it did three years ago. Privacy changes, particularly Apple’s App Tracking Transparency framework, have significantly reduced the precision of audience targeting on paid channels. Depending on how it’s measured, ATT consent rates sit somewhere between 9% and 35% globally, and for many app categories the effective tracking rate is far lower. The granular audience segments that performance marketers relied on for years are largely gone.

As Andy puts it, when targeting signals disappear, the creative itself becomes the targeting mechanism. A video ad naturally attracts the people who care about that use case and filters out those who don’t. This means each creative variant is effectively a bet on who might respond. The more variants you test, the more audience segments you’re exploring.

This reframes creative testing from a conversion exercise (making the store page or the ad perform better) into a way of finding new groups of people who respond to your product when it’s presented in a specific way. That shift has enormous implications for how much testing is “enough.” In the old model, you tested until you found the best-performing version. In the new model, you test continuously because every new creative direction is a new audience opportunity.

Here, Stuart pushes the thinking further on what makes a variant genuinely worth testing: “A dangerous way to approach variants is to think of them as one-dimensional changes. Switch out the background colour, double the size of the CTA, add more lifestyle imagery. While all of those examples can be valid parts of testing, it’s infinitely more valuable to think about the ‘why’ behind any variant. Is your variant probing a different user motivation? A different use case or social context? A new creator type? Is the framing more utilitarian, or emotional? The differences can be fairly subtle, a single image can work with a huge range of narratives. Think less ‘how different is this?’ and more ‘how is this different, and why?’”

He adds: “A smart concept means one in which each variant can tell us something useful, something more than a surface-detail preference. This is also how we think about a good hypothesis: don’t just look to improve your conversion rate, look to learn something.”

Why creative testing fails when teams don’t share learnings

Most teams think the biggest waste in creative testing is producing variants that don’t win. But the real waste is producing winners whose learnings never leave the channel they were tested in.

This is a pattern we see constantly at Phiture. A classic example: the ASO team tests a new screenshot layout and finds that showing social proof in the first frame increases conversion by 8%. The paid team, running Meta ads for the same app, never hears about it and keeps testing its own hypotheses independently. Six weeks later, it stumbles onto the same insight. That’s six weeks of duplicated effort and missed learning.

In an integrated growth model, creative testing becomes a shared system. A winning store listing hook informs what the paid team tests in ad creatives. A high-performing ad angle tells the ASO team what messaging resonates with incoming traffic. Review sentiment data shapes the next round of creative hypotheses across both channels. The gains don’t just come from running more experiments. They come from connecting the experiments so that every channel makes every other channel smarter.

Andy gives a concrete example from our work: for an aircraft combat game, PressPlay’s automated testing discovered that icons with significantly more visual detail drove higher conversion. That finding didn’t come from a design team’s intuition. It emerged from running hundreds of experiments across markets and letting the data surface patterns that no individual could have spotted.

Stuart offers a nuanced take on how this actually works in practice: “The most crucial thing to get right is to transfer the insight, not the execution. The same underlying concept of social proof is going to look markedly different on TikTok than it is on an App Store screenshot. A winning insight should travel across channels, but it should adapt to the language and behaviour of each platform rather than being copied literally.”

He also flags a challenge that any growing organisation faces: “Getting insights to flow seamlessly across an organisation is a problem that businesses inevitably face once they pass a certain size, and this is compounded by the speed with which we’re expected to react in the age of vibe coding and agentic AI. Some teams might do that with centralised learnings documentation, others might do it with an intentional mix of disciplines. However, no matter how efficiently you share insights, it won’t matter if the insights themselves are trash, and that means designing better experiments.”

What this means for how teams should work

The practical takeaways are clear, but they require real changes in how most teams operate. We suggest the following:

Redefine what “enough” testing looks like. If you’re running fewer than 20 creative experiments per month across your store listings and paid campaigns, you’re almost certainly leaving winners undiscovered. The math is simple: more experiments mean more winners. Set a minimum experiment volume target the way you would set a spend target or a revenue goal.

Invest in the setup and deployment side, not just production. AI has solved the production bottleneck. The next constraint is getting creatives live and collecting data. Whether through internal automation or tools like PressPlay that handle the full cycle, the priority is reducing the time between “variant is ready” and “variant is live.”

Connect your channels. Creative learnings from ASO should inform paid, and paid should inform ASO. If your organic and paid teams aren’t sharing creative performance data at least weekly, you’re running experiments in isolation instead of building on each other’s results, and you’re missing the stacking effect that makes scale worthwhile.

Shift your team’s focus upstream. When AI handles generation, deployment, and analysis, your creative and growth teams should spend their time on strategy and hypotheses. Which user motivations haven’t we tested yet? Which competitor positioning leaves a gap we could fill? What did last month’s winners have in common? The teams that use AI as a thinking multiplier, rather than a thinking substitute, will be the ones that pull ahead.

Stuart closes with a thought on what this shift means for the future of creative work: “As the tools become faster, cheaper, and more accessible, we’ll see a drift towards creative convergence. AI learns from what is out there, and if we’re all cooking with the same ingredients all the time, the soup is going to start tasting pretty monotonous. The role of the creative team then becomes recognising and choosing what feels human, protecting distinctiveness, and finding ways of transfusing surprise back into the system.”

He adds: “AI can help teams produce and test massive amounts of assets, but it still can’t decide what’s culturally interesting or emotionally resonant. That remains the role of the creative team.”

As Andy also argues in his integrated growth framework, the performance gap is widening between teams that build AI-enabled growth systems and teams that remain stuck working channel by channel. Creative testing at scale is where that gap shows up first, and where it grows fastest.

What’s next

Creative testing is no longer a periodic activity. It’s a continuous system that should be running across every surface where your app meets a potential user, from the app store to paid channels to CRM. The teams that treat it this way, and that connect their learnings across channels, will find more winners, learn faster, and build a performance advantage that’s difficult for competitors to replicate.

At Phiture, we’ve built this thinking into our tools and our way of working. PressPlay handles automated creative testing on the app stores, and Catchbase optimizes paid acquisition spend on Apple Search Ads. Our integrated growth approach connects them so learnings flow across channels rather than dying where they started. If you want to explore what creative testing at scale could look like for your app, get in touch.

FAQ

How many creatives do you need to test to find a winner?

It depends on the hit rate, which typically sits between 5% and 15% across most categories. At a 10% success rate, you’d need to test roughly 10 variants on average to find one winner, and more than that if you want to be confident of finding one. The more you test, the more winners you find, and each winner generates insights that improve the success rate of future experiments.

What is creative testing at scale?

Creative testing at scale means running a high volume of creative experiments continuously across your app store listings and paid campaigns, rather than testing a handful of variants periodically. It typically involves using AI to generate variants and automation to handle deployment, monitoring, and analysis.

Why is creative testing more important after Apple’s ATT changes?

Apple’s App Tracking Transparency framework significantly limited audience targeting precision on paid channels. With fewer targeting signals available, the creative itself has become the main way to reach and attract specific audiences. More creative variants mean more opportunities to find what resonates with different user groups.

How does AI change creative testing for mobile apps?

AI has removed the production bottleneck by making it possible to generate hundreds of visual variants quickly. But the bigger change is in deployment and analysis. Tools like PressPlay automate the full testing cycle on the app stores, from generating variants to deploying experiments, monitoring results in real time, and stopping underperformers early. This makes it possible to test at volumes that manual teams cannot match.

What is the difference between testing more creatives and testing better creatives?

Both matter, but volume tends to have a bigger impact than individual quality. A rough variant that gets tested will generate useful data regardless of whether it wins. A polished variant that sits untested generates nothing. The most effective teams focus on producing enough variants to let the probabilities work, and they use the data from each round to make the next round sharper.

How should creative teams work with growth and paid teams on testing?

The most effective setup is one where creative learnings are shared across channels regularly. When the ASO team discovers that a certain visual approach improves store conversion, that insight should inform paid ad creative, and vice versa. Weekly creative performance reviews that include organic, paid, and creative team members help ensure that learnings stack up across channels rather than staying siloed.
