Fan Out Experiment: How ChatGPT, Gemini, and Perplexity Recommend B2B Alternatives Across 270 Naturalistic Queries

Understanding how LLMs fan out user queries.

Ayomide Joseph

June 19, 2026

Overview

Most of what gets said about AI search in B2B marketing is intuition that we’ve dressed up as ‘analysis.’

The strategies typically amount to writing for ChatGPT and optimizing for whatever new gimmick Google throws into the wild for us to chase.

But almost no one has run the queries, captured what the AI actually does, and looked at the patterns.

So to do just that, I ran 270 queries across ChatGPT, Gemini, and Perplexity, asking each one for alternatives to three B2B SaaS products: Kustomer, Okta, and LinearB.

Thirty different phrasings of the same intent per product, across three LLMs. Polite, casual, frustrated, openly profane. Role-led, problem-led, comparison-led. I opted for this range to mimic what a real buyer covers during a real evaluation cycle.

How the experiment worked

I chose three products for what their categories represent.

Kustomer sits in a mature customer support market with a settled competitive set.
Okta operates in identity and access management, where one structural incumbent (Microsoft Entra ID) dominates.
LinearB is in engineering analytics, a contested category with no clear winner yet.

The thirty phrasings per product were drawn from a ten-category taxonomy designed to test four dimensions of variation: formality, emotional valence, specificity, and directness. Six of the categories, with example phrasings:

Direct-polite (“What are the best alternatives to Okta?”).
Direct-casual (“Other tools like Okta?”).
Frustrated-mild (“Okta isn’t working for us, what else is there?”).
Role-led (“Best alternative to Okta for a small IT team”).
Comparison-framed (“Tools like Okta but cheaper”).
Reddit-style (“Anyone else hate Okta lol”).

🔖 Each query was captured in a fresh chat session in a logged-out, private browser. Single-shot, no prior conversation state.
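The capture grid described above can be sketched in a few lines. This is only an illustration of the arithmetic, not the tooling used for the experiment: the `Capture` record is a hypothetical name, and the split of three phrasings per category is an assumption inferred from thirty phrasings spread over ten categories.

```python
from collections import Counter
from dataclasses import dataclass
from itertools import product

PRODUCTS = ["Kustomer", "Okta", "LinearB"]
LLMS = ["ChatGPT", "Gemini", "Perplexity"]
NUM_CATEGORIES = 10          # ten-category taxonomy from the article
PHRASINGS_PER_CATEGORY = 3   # assumption: 30 phrasings / 10 categories


@dataclass(frozen=True)
class Capture:
    """One single-shot query in a fresh session (hypothetical record)."""
    product: str
    llm: str
    category: int   # 0..9, index into the taxonomy
    phrasing: int   # 0..2, index within the category


# Every product is asked about on every LLM with every phrasing.
captures = [
    Capture(p, llm, cat, phr)
    for p, llm, cat, phr in product(
        PRODUCTS, LLMS, range(NUM_CATEGORIES), range(PHRASINGS_PER_CATEGORY)
    )
]

per_product = Counter(c.product for c in captures)
per_product_category = Counter((c.product, c.category) for c in captures)
per_llm_category = Counter((c.llm, c.category) for c in captures)

print(len(captures))                       # total captures
print(per_product["Okta"])                 # responses per product
print(per_product_category[("Okta", 0)])   # queries per category for one product
print(per_llm_category[("ChatGPT", 0)])    # attempts per category for one LLM
```

Under these assumptions the grid yields 270 captures, 90 per product, and nine per category whether you slice by product or by LLM, which matches the denominators quoted in the findings.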

Finding 1: The brand pool converges

When buyers ask an AI search engine for alternatives to a B2B product, the recommendations come from a tight, repeating pool of brands.

Different phrasings produce different ranking orders, but they pull from the same set. As such, the phrasing matters less than the question itself.

The data is visible with Kustomer

For Kustomer, across 90 responses, four brands dominated the top five: Zendesk appeared 59 times, Gorgias 58, Freshdesk 57, and Intercom 50.

The fifth-most-frequent brand was Gladly at 26, a sharp drop. Only seven companies appeared more than 15 times across the 90 captures.

Okta shows the same shape

Microsoft Entra ID appeared in the top five of 63 responses. JumpCloud 58. OneLogin 45. Ping Identity 43. Auth0 34. Five brands, almost every time.

LinearB is the same story

Swarmia 63, Jellyfish 62, Waydev 44, Haystack 32, DX 28.
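To make the counts above easier to compare, they can be converted into shares of the 90 responses captured per product. The sketch below uses only the counts reported in this section; the function name `appearance_share` is a hypothetical helper, and rounding to whole percentages is a presentation choice.

```python
RESPONSES_PER_PRODUCT = 90  # 30 phrasings x 3 LLMs

# Top-five appearance counts as reported in the article.
top_five_counts = {
    "Kustomer": {"Zendesk": 59, "Gorgias": 58, "Freshdesk": 57,
                 "Intercom": 50, "Gladly": 26},
    "Okta": {"Microsoft Entra ID": 63, "JumpCloud": 58, "OneLogin": 45,
             "Ping Identity": 43, "Auth0": 34},
    "LinearB": {"Swarmia": 63, "Jellyfish": 62, "Waydev": 44,
                "Haystack": 32, "DX": 28},
}


def appearance_share(count: int, total: int = RESPONSES_PER_PRODUCT) -> int:
    """Share of responses (whole percent) in which a brand made the top five."""
    return round(100 * count / total)


shares = {
    product: {brand: appearance_share(n) for brand, n in brands.items()}
    for product, brands in top_five_counts.items()
}

for product, brands in shares.items():
    line = ", ".join(f"{brand} {pct}%" for brand, pct in brands.items())
    print(f"{product}: {line}")
```

Read this way, the Kustomer drop from Intercom to Gladly is a fall from roughly 56% of responses to roughly 29%.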

The concentration sharpens at the #1 position. Microsoft Entra ID took the top spot in 71% of Okta queries where a #1 was named. Zendesk took 55% of Kustomer #1 spots. LinearB is more contested: Jellyfish took 47% and Swarmia 29%, but the pattern still holds.

The brands inside the consensus pool also overlap heavily across LLMs.

For Okta, all three LLMs picked Microsoft Entra ID as their #1 most often. For Kustomer, all three picked Zendesk. Three different models, built on different training data, retrieving differently, arrived at the same answer.

The implication: AI search has an awareness layer, and brands either live inside it or they do not.

Brands inside the consensus pool show up across most phrasings, across most LLMs, across most query types. Brands outside the pool show up rarely, deep in responses, or only when the buyer's phrasing is unusually specific. No phrasing trick lifts a brand from outside the pool into the consensus position. The AI is drawing from training data and retrieved sources that already reflect a settled view of the category.

📋 Thesis

For brands inside the pool, the problem is defense. The position depends on continued accumulation of the signals that put the brand there. Slip on any of these and the consensus position erodes.
For brands outside the pool, the problem is harder. You can’t win this with “clever” optimization of existing buyer intent. AI search has already decided what to recommend for that intent. The path in is through building the underlying signal infrastructure that puts a brand into the pool to begin with. That takes longer than most teams budget for.

Finding 2: Specificity breaks convergence. Emotion does not.

I came into the experiment expecting emotional phrasings (frustrated, profane, Reddit-voiced) to surface different brands than calm, professional phrasings.

The conventional thinking says the AI matches the buyer’s emotional register and pulls forward different recommendations. But the data argues the opposite.

Okta is a good example of this. Across Direct-casual, Frustrated-mild, and Frustrated-profane phrasings (nine queries in each category), Microsoft Entra ID came back as the #1 recommendation 100% of the time. Casual, mildly annoyed, or openly profane, the answer was the same.

Kustomer behaved the same way. Zendesk took the top spot in 78% of Direct-polite phrasings, 78% of Frustrated-mild, and 78% of Frustrated-profane. The buyer’s tone is invisible to the recommendation engine.

There is one exception in the emotional band. Reddit-style phrasings (“Anyone else hate Kustomer lol”) produced almost no brand recommendations at all.

Across the Okta and LinearB Reddit-style queries, zero brands were recommended in any of the 18 responses. The LLMs read the question as venting rather than purchase intent and responded in kind. ChatGPT and Gemini, for example, validated the frustration and explained what users complain about, without ever recommending alternatives.

📋 Thesis: Reddit-voice content aimed at AI search visibility may earn cultural alignment with the buyer without ever positioning the brand inside a shortlist. The community-presence playbook still has value, but the value is brand affinity rather than recommendation visibility.

There’s also a contrast that appears the moment specificity enters the query. Phrasings that anchor to a use case, a role, or a comparison frame fragment the recommendation immediately.

Comparison-framed phrasings (“Tools like Kustomer but cheaper,” “Like Okta but with better UX”) produced seven different #1 brands across Kustomer, six across Okta, and seven across LinearB. The pool of competitors is the same as it was for the polite phrasings. The ranking inside the pool fractures.

Role-led phrasings show the same effect with different reshuffling. For Kustomer, the question “Best alternative to Kustomer for a small CX team” surfaced Help Scout as the top recommendation. The same question framed for enterprise compliance surfaced Salesforce Service Cloud.

Zendesk, the category leader at 55% of #1 spots overall, slipped to second or third in both. The buyer’s context (small team, enterprise scale, specific compliance posture) pulled different brands forward from the same underlying pool.

The strategic implication also splits two ways:

Content strategies that match the emotional register of the buyer optimize for an intent that AI search has already collapsed. The recommendation surfaces the same brand whether the buyer is calm or furious. The work has cultural value and brand-affinity value, but it carries no recommendation-level leverage.

Content strategies that match the contextual specificity of the buyer optimize for an intent that AI search has not yet decided, which is why the recommendation fragments. There is still room to be the brand that surfaces when the buyer says “for a small IT team,” even when a different brand surfaces for the broader phrasing.

📋 Thesis: Emotional resonance is a brand investment while contextual specificity is a recommendation investment. Most B2B SaaS teams under-invest in the second and over-invest in the first, because the first is easier to write and the second requires understanding buyer segments deeply enough to model them in content. The data here argues for rebalancing.

Finding 3: The three LLMs are not interchangeable

A lot of B2B marketing strategies treat “AI search” as a single optimization target. The assumption is that if you get good at it, the work pays off across ChatGPT, Gemini, Perplexity, and whatever comes next.

It’s a convenient simplification, but the data argues this framing is wrong.

The three LLMs studied behave like three different products, and the differences are big enough that a team optimizing for one will leave visibility on the table with the others.

Retrieval behavior

The first and largest gap is in how often each LLM actually searches the web before answering.

Perplexity searches on every single query (90 of 90 captures).
Gemini searches on 74% of queries.
ChatGPT searches on 44% of queries.
The remaining 56% of ChatGPT responses came from training data alone, with no retrieval step at all.

A ChatGPT answer drawn from training data is the model summarizing what it learned during training. That training data was scraped from the open web months or years before the user typed the question.

The brands ChatGPT will recommend from training data are the brands that had visibility during its training window. New entrants, recent rebrands, and emerging competitors are systematically underweighted in training-data responses.

The pattern of when ChatGPT decides to search the web is what sharpens the finding.

Direct-polite queries fired web search 100% of the time.
Frustrated-mild fired search 89% of the time.
Problem-led queries triggered search only 33% of the time.
Outcome-led 22%.
Comparison-framed 22%.
Role-led 11%.
Reddit-style queries did not trigger a single web search across nine attempts.
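The per-category rates above are easier to sanity-check as raw counts. The sketch below converts them back into searches out of nine ChatGPT captures per category. The nine-per-category denominator is an assumption (three phrasings across three products, consistent with the "nine attempts" quoted for Reddit-style), and the split for the three categories not listed here is inferred from the overall 44% figure rather than reported.

```python
CAPTURES_PER_CATEGORY = 9   # assumption: 3 phrasings x 3 products, ChatGPT only
CHATGPT_CAPTURES = 90
NUM_CATEGORIES = 10
OVERALL_SEARCH_RATE = 0.44  # ChatGPT's reported overall web-search rate

# Web-search rates per phrasing category, as reported for ChatGPT.
reported_search_rate = {
    "Direct-polite": 1.00,
    "Frustrated-mild": 0.89,
    "Problem-led": 0.33,
    "Outcome-led": 0.22,
    "Comparison-framed": 0.22,
    "Role-led": 0.11,
    "Reddit-style": 0.00,
}

# Convert each rate back into a whole number of captures that searched.
searched = {
    category: round(rate * CAPTURES_PER_CATEGORY)
    for category, rate in reported_search_rate.items()
}

listed_total = sum(searched.values())
overall_searched = round(OVERALL_SEARCH_RATE * CHATGPT_CAPTURES)

# Whatever is left must come from the categories not broken out above.
unlisted_categories = NUM_CATEGORIES - len(reported_search_rate)
unlisted_searched = overall_searched - listed_total

for category, n in searched.items():
    print(f"{category}: {n}/{CAPTURES_PER_CATEGORY}")
print(f"listed categories: {listed_total} searches")
print(f"implied for {unlisted_categories} unlisted categories: {unlisted_searched}")
```

On these assumptions the seven listed categories account for 25 searched captures, which leaves about 15 for the three categories not broken out here.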

📋 Thesis: I believe ChatGPT runs a routing decision before answering:

When the query is a clean retrieval task, it searches.
When the query requires interpretation, sentiment-matching, or contextual reasoning, the routing prefers training data.

I presume this is because the model believes its trained intuition is more reliable than fresh search results for that kind of question.

For ChatGPT visibility on direct alternative-seeking queries, fresh web content matters because ChatGPT will retrieve and cite it.
For ChatGPT visibility on contextual queries (the kind buyers actually use during real evaluation), the only path in is being part of the training data ChatGPT already has.

That makes long-term brand mentions across the web, persistent presence in established publications, and accumulated SEO signal far more important for ChatGPT than for the other two LLMs.

Gemini and Perplexity behave differently because they search more reliably. Gemini still has a 26% training-data fallback, with a milder version of the same intent-shape effect. Perplexity has no fallback to manage.

Citation density

The second gap is in how many sources each LLM cites per response. ChatGPT averaged 1.7 sources per query, Gemini 2.2, and Perplexity 4.2. The gap also widens within specific products. For Kustomer, ChatGPT cited 2.8 sources per query, Gemini 1.9, and Perplexity 4.9.

For B2B teams, the conversation here moves to leverage.

A single mention on a high-authority source—e.g., a Gartner peer review, or a citation in a frequently-pulled listicle—has outsized influence on what ChatGPT recommends, because ChatGPT cites so few sources per query that each one carries more weight.

On Perplexity, the same single mention is diluted across three or four other citations.
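The per-query averages can be reproduced from the citation totals reported in Finding 4 (155, 199, and 376 across 90 queries each). The sketch below also computes a rough "share per citation" figure. That last number is a deliberately naive assumption that every cited source in a response counts equally; it is meant only to illustrate the dilution argument, not to describe how any LLM weighs its sources.

```python
QUERIES_PER_LLM = 90

# Total citations per LLM, as reported in the source-mix section.
total_citations = {"ChatGPT": 155, "Gemini": 199, "Perplexity": 376}

# Average number of sources cited per query.
per_query = {
    llm: round(total / QUERIES_PER_LLM, 1)
    for llm, total in total_citations.items()
}

# How many times more sources Perplexity cites than ChatGPT.
perplexity_vs_chatgpt = round(
    total_citations["Perplexity"] / total_citations["ChatGPT"], 1
)

# Naive equal-weight assumption: one citation's share of an average response.
share_per_citation = {
    llm: round(100 / (total / QUERIES_PER_LLM))
    for llm, total in total_citations.items()
}

for llm in total_citations:
    print(f"{llm}: {per_query[llm]} sources/query, "
          f"~{share_per_citation[llm]}% of an average response per citation")
print(f"Perplexity cites {perplexity_vs_chatgpt}x as many sources as ChatGPT")
```

Under the equal-weight assumption, a single citation is a bit under 60% of an average ChatGPT response's sourcing and a bit under 25% of Perplexity's.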

Brand pool breadth

The third gap is more interesting because it goes against the intuitive read. The instinct is to assume the LLM with the most retrieval and the most citations would also surface the widest universe of brands. Again, the data does not support that.

For Kustomer, Perplexity surfaced the widest brand pool with 25 unique brands in the Top 5 alone, 32 when “Other brands mentioned” are counted. ChatGPT was tighter at 18/24. Gemini tighter still at 17/18.

For Okta and LinearB, the order flips. ChatGPT surfaced 35 unique brands for Okta when including the long tail, and 38 for LinearB—more than either of the others. Perplexity was the narrowest of the three for both products.

Key note: When ChatGPT falls back to training data, it produces longer, more discursive answers that name more competitors in passing. Perplexity, by contrast, produces a tighter, citation-anchored response with a clear top group and fewer also-rans. The shape of the answer changes the breadth of the brand surface.

📋 Thesis: For challenger brands trying to land mentions in any position, this is important. ChatGPT in training-data mode is more likely to name a long-tail competitor in passing — i.e., the brand earns awareness without earning a recommendation. Perplexity offers no equivalent. A brand either earns its way into the cited group or it does not appear at all.

The three LLMs require three optimization profiles.

For ChatGPT, the priorities are training-data presence and high-authority citations.
For Gemini, vendor-blog discoverability and review-aggregator presence.
For Perplexity, citation breadth across diverse source types.

A team that builds for only one of these will see uneven results across the others.

Finding 4: Source mix reveals LLM identity

A recommendation engine is the sum of the sources it pulls from. Two LLMs can converge on the same brand for the same query, but if the sources backing that recommendation are different, the strategic implications are different too.

The three LLMs draw from materially different source mixes, and the differences hold across every product in the study.

Across all 270 captures, the LLMs collectively cited 730 sources. G2 and Reddit were the two most-cited domains (81 and 70 citations respectively). After those two, the long tail diverges sharply by LLM.

ChatGPT is a listicle and forum engine

Of ChatGPT’s 155 citations, 33% came from listicle and affiliate content. 21% from review aggregators. 19% from forums (almost all Reddit). 16% from vendor sites.

The top ChatGPT-specific citations after G2 and Reddit were sites like Ringly, RankRed, and Pandev, which are listicle-format affiliate publications.

Gartner appeared, but with only 7 citations across all 90 queries. The Tier 1 analyst presence was thin.

For B2B teams optimizing for ChatGPT visibility, placement in the top-ranking listicles in the category is the most direct path to citation. Listicle publishers accept inclusion requests, sponsored placements, and earned coverage.

Gemini is a vendor blog engine

Of Gemini’s 199 citations, 51% came from vendor and brand sites. Review aggregators were 8%. Forums 4%. Listicles 16%. Half of everything Gemini cites sits on a vendor’s own domain.

What a competitor’s blog says about its category, its differentiators, and its comparison ecosystem becomes the basis for what Gemini will tell a buyer about every competitor in the category, including yours. The competitive blog has become a direct participant in Gemini’s recommendation logic.

A team writing strong “X vs Y” content on its own domain is feeding the model that will get asked about both X and Y. A team that has not invested in this layer is letting competitors write the model’s training signal.

Perplexity is the breadth engine

Perplexity cited 376 sources across 90 queries, more than ChatGPT and Gemini combined. Its mix was the most balanced of the three: 41% vendor sites, 18% review aggregators, 10% forums, 10% listicles, 4% social and video.

Perplexity cited G2 51 times across the dataset, more than twice ChatGPT’s 24 and three times Gemini’s 16. It cited Reddit 33 times. It was the only LLM in the study that consistently cited YouTube. Its source mix is closer to a search engine results page than to a curated authority shortlist.

Strategically, Perplexity rewards visibility breadth. Because it cites roughly two and a half times as many sources per query as ChatGPT (4.2 versus 1.7), no single source carries the same weight.

📋 Thesis: The path to influencing Perplexity recommendations runs through being present across more places.

Teams that have invested in citation breadth across the category ecosystem see disproportionate returns on Perplexity.
Teams that concentrated their PR strategy on a small handful of authority placements see less.
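The three source mixes reported in this finding can be placed side by side. The sketch below uses only the percentages and totals quoted above. The approximate counts are back-calculated from rounded percentages, so they may be off by one or two, and the "unlisted" remainder is simply whatever the quoted percentages do not add up to, not a category the article defines.

```python
# Citation totals and category percentages as quoted for each LLM.
source_mix = {
    "ChatGPT": {
        "total": 155,
        "pct": {"listicle": 33, "review aggregator": 21,
                "forum": 19, "vendor": 16},
    },
    "Gemini": {
        "total": 199,
        "pct": {"vendor": 51, "listicle": 16,
                "review aggregator": 8, "forum": 4},
    },
    "Perplexity": {
        "total": 376,
        "pct": {"vendor": 41, "review aggregator": 18, "forum": 10,
                "listicle": 10, "social and video": 4},
    },
}


def summarize(mix: dict) -> dict:
    """Approximate counts, the top category, and the share not broken out."""
    pct = mix["pct"]
    return {
        "approx_counts": {c: round(p * mix["total"] / 100) for c, p in pct.items()},
        "top_category": max(pct, key=pct.get),
        "unlisted_pct": 100 - sum(pct.values()),
    }


summary = {llm: summarize(mix) for llm, mix in source_mix.items()}

for llm, s in summary.items():
    print(f"{llm}: top = {s['top_category']}, "
          f"unlisted = {s['unlisted_pct']}%, counts = {s['approx_counts']}")
```

The comparison makes the contrast concrete: the largest category is listicles for ChatGPT and vendor sites for both Gemini and Perplexity.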

The uncomfortable read across all three LLMs is that for Gemini and Perplexity, one of the most influential source categories is the competitor’s own content. For most B2B SaaS teams, competitors are influencing the AI’s view of your brand far more than the team has accounted for.

The only way to win is to participate at the same level: build own-domain editorial depth, competitive comparison content, and category-defining writing on the topics buyers will eventually ask the AI about.

Finding 5: Category structure shapes how locked-in AI search becomes

The first four findings apply broadly. The pool converges, specificity breaks ranking, LLMs behave differently, source mix matters. These patterns held across all three products.

The fifth finding is about what differed.

The three verticals produced three distinct lock-in profiles:

Kustomer: 43-brand pool, Zendesk holds 55% of #1 spots
Okta: 55-brand pool, Microsoft Entra ID holds 71% of #1 spots
LinearB: 61-brand pool, Jellyfish holds 47% of #1 spots

The assumption that a bigger pool means more fragmentation is wrong.

Okta has the second-largest pool and the highest concentration. LinearB has the largest pool and the lowest concentration. Kustomer has the smallest pool and sits in between on concentration. Pool size alone fails to explain lock-in, but category structure does.
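The mismatch between pool size and concentration can be checked directly from the three profiles. The sketch below ranks the products both ways. If a bigger pool reliably meant more fragmentation, ranking by pool size (largest first) would match ranking by #1 concentration (lowest first). The `Profile` type is a hypothetical container for the reported numbers.

```python
from typing import NamedTuple


class Profile(NamedTuple):
    pool_size: int       # unique brands surfaced for the product
    top_share_pct: int   # share of #1 spots held by the leading brand


# Lock-in profiles as reported in the article.
profiles = {
    "Kustomer": Profile(pool_size=43, top_share_pct=55),
    "Okta": Profile(pool_size=55, top_share_pct=71),
    "LinearB": Profile(pool_size=61, top_share_pct=47),
}

# Largest pool first.
by_pool_desc = sorted(profiles, key=lambda p: profiles[p].pool_size, reverse=True)

# Least concentrated (most fragmented) first.
by_concentration_asc = sorted(profiles, key=lambda p: profiles[p].top_share_pct)

print("largest pool first:      ", by_pool_desc)
print("least concentrated first:", by_concentration_asc)
print("rankings agree:", by_pool_desc == by_concentration_asc)
```

The two orderings agree only on LinearB. Okta and Kustomer swap places, which is the gap that category structure has to explain.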

Okta

Okta sits in a category with a structurally dominant incumbent. Microsoft Entra ID is the default IAM solution for the largest segment of buyers — i.e., enterprise organizations already running Microsoft 365 or Azure. That dominance is reflected in every signal the LLMs draw on.

As such, any time a buyer asks about Okta alternatives, all three LLMs converge on the same answer with high confidence. There is room in the consensus pool for JumpCloud, Auth0, Ping Identity, and others, but there is no room at the top.

Kustomer

Kustomer sits in a category with a clear leader and no incumbent of Microsoft Entra ID’s structural weight. Zendesk takes #1 55% of the time. The remaining 45% spreads across thirteen different brands.

The customer support category has been mature long enough that multiple credible vendors have accumulated their own visibility signals. Buyers get Zendesk most of the time, but specific framings like “for ecommerce,” “for SMBs,” “for enterprise” pull different brands forward. The category is led but not locked.

LinearB

LinearB sits in a category that has not yet decided who wins. Jellyfish takes #1 47% of the time, Swarmia takes 29%, and the choice between them depends heavily on which LLM the buyer asks. ChatGPT favors Swarmia. Perplexity favors Jellyfish. Gemini splits roughly evenly.

Beyond the top two, six more brands held the #1 spot at least once.

📋 Thesis: What next?

For incumbents in Okta-shaped categories, the strategic priority is defense at the top spot. The lock-in is high but not necessarily permanent.
For incumbents in Kustomer-shaped categories, the priority is contextual defense. Zendesk holds the overall lead, but specific buyer contexts pull different brands to the top. The path to losing share is the steady accumulation of contextual leadership by smaller vendors in segments the incumbent has not defended well, e.g., Gorgias being the go-to brand for ecommerce or Shopify stores.
For challengers in LinearB-shaped categories, the opportunity is the largest of the three. The category is still unsettled and contested. The brands that build the strongest signal infrastructure over the next 18 to 24 months will be the ones the LLMs converge on. The gap between the current leaders and the long tail is narrow enough that coordinated investment could move a brand from the long tail to the consensus pool within a planning horizon that teams actually have.

The framework matters because there is no universal playbook for winning AI search. There are three playbooks. Where a brand sits in its category determines which one applies, and the work required to move position is different in each.

What this means if you work in B2B SaaS

Five findings, taken together, argue for a shift in how B2B SaaS teams think about content and category positioning.

AI search has an awareness layer that brands either live inside or do not. The first question for any B2B team is whether the brand is in the consensus pool for its category. Everything else is downstream of that question.

Within the pool, the recommendation is moved by contextual content. The teams that build deep use-case, role-led, and comparison-framed content for the segments where the AI has not yet converged will earn ranking visibility that emotional content cannot reach.

The three LLMs are not interchangeable:
- ChatGPT requires training-data presence and high-authority citations.
- Gemini requires owned-domain depth and competitive comparison content.
- Perplexity requires breadth across the citation ecosystem.

A single “AI SEO” budget that treats them as one channel will produce uneven results.

Key note: The source layer is more vendor-driven than the industry has acknowledged. In two of the three LLMs studied, one of the most influential source categories is the competitor’s own content.

Category structure determines which playbook applies:

Locked categories require sustained, multi-year campaigns.
Led categories require contextual defense.
Contested categories present the largest opportunity for challengers, but the window is narrowing as the underlying signals consolidate.

Get the full report

The full report includes the complete methodology, the phrasing matrix for all three products, the source-mix analysis broken down by LLM, the category-structure framework, and a chapter on the strategic implications for B2B SaaS teams.

If the patterns above match what you are seeing in your own work, or if they contradict it sharply enough to be useful, the full report gives you the underlying data and the framework to act on it.

Download the Fan-Out Experiment Report →