Same prompt. Same model. Left panel runs through the SwarmAI framework. Right panel is a plain chat-completion call. Every number shown comes from a real recorded execution — scrub, pause, compare.
CFA-style equity research on a real ticker. The senior analyst agent uses web-search and calculator tools; the writer agent produces a BUY/HOLD/SELL report. The baseline has no tools and hedges with generic commentary.
A grounded recommendation with live metrics vs. a hedged response that can't fetch or compute anything.
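The analyst-to-writer handoff follows a familiar pattern: one step gathers grounded evidence with tools, the next turns it into a rated report. A minimal sketch in plain Python; the tool functions, numbers, and prompts below are stand-ins for illustration, not SwarmAI's actual API:

```python
# Stub tools standing in for the demo's real web-search and calculator tools.
def web_search(query: str) -> str:
    # In the demo this hits a live search API; here it returns a canned fact.
    return "TICKER trailing P/E: 18.4; sector median: 22.1"

def calculator(expression: str) -> float:
    # A real agent would route this through a sandboxed evaluator.
    return eval(expression, {"__builtins__": {}})  # toy only, never use on untrusted input

def senior_analyst(ticker: str) -> dict:
    """Gather grounded metrics with tools, then summarise findings."""
    evidence = web_search(f"{ticker} valuation metrics")
    discount = calculator("(22.1 - 18.4) / 22.1")  # P/E discount vs. sector
    return {"ticker": ticker, "evidence": evidence,
            "pe_discount_vs_sector": round(discount, 3)}

def writer(findings: dict) -> str:
    """Turn the analyst's findings into a rated report."""
    rating = "BUY" if findings["pe_discount_vs_sector"] > 0.1 else "HOLD"
    return (f"{rating}: {findings['ticker']} trades at a "
            f"{findings['pe_discount_vs_sector']:.1%} P/E discount to sector.")

report = writer(senior_analyst("TICKER"))
```

The baseline panel skips both tool calls, which is exactly why it can only hedge: it has no evidence to cite and no arithmetic to lean on.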
The same question answered with and without retrieval. Left: RagPipeline ingests a curated corpus, retrieves the top-K hybrid (vector + BM25) passages, and synthesises a cited answer with the eval-winning defaults. Right: a raw LLM call with no retrieval, fluent but ungrounded.
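The hybrid retrieval step can be sketched as simple score fusion: lexical BM25 scores and dense vector similarities are each normalised, then blended before the top-K cut. A minimal illustration under invented data, not the RagPipeline implementation; the corpus, vector scores, and blend weight are made up for the example:

```python
import math
from collections import Counter

def bm25_scores(query: str, docs: list[str], k1=1.5, b=0.75) -> list[float]:
    """Classic BM25 lexical score of each doc against the query."""
    tokenized = [d.lower().split() for d in docs]
    avgdl = sum(len(t) for t in tokenized) / len(tokenized)
    n = len(docs)
    q_terms = query.lower().split()
    df = {t: sum(1 for doc in tokenized if t in doc) for t in q_terms}
    scores = []
    for doc in tokenized:
        tf = Counter(doc)
        s = 0.0
        for t in q_terms:
            if df[t] == 0:
                continue
            idf = math.log(1 + (n - df[t] + 0.5) / (df[t] + 0.5))
            denom = tf[t] + k1 * (1 - b + b * len(doc) / avgdl)
            s += idf * tf[t] * (k1 + 1) / denom
        scores.append(s)
    return scores

def fuse(vector_scores: list[float], bm25: list[float], alpha=0.5) -> list[float]:
    """Min-max normalise each signal, then blend with weight alpha."""
    def norm(xs):
        lo, hi = min(xs), max(xs)
        return [(x - lo) / (hi - lo) if hi > lo else 0.0 for x in xs]
    return [alpha * v + (1 - alpha) * k
            for v, k in zip(norm(vector_scores), norm(bm25))]

docs = [
    "the default temperature is 0.2 for the writer agent",
    "hybrid retrieval combines vector and bm25 signals",
    "recordings pin the model version and seed",
]
query = "hybrid vector bm25"
vector_scores = [0.1, 0.9, 0.2]  # pretend embedding similarities
bm25 = bm25_scores(query, docs)
fused = fuse(vector_scores, bm25)
top_k = sorted(range(len(docs)), key=lambda i: fused[i], reverse=True)[:2]
```

The blend weight and top-K are exactly the kind of defaults the demo's cited answer can quote from the corpus while the baseline cannot.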
A cited answer pinned to the corpus vs. a confident summary that can't reproduce a single specific default value.
Every trace is reproducible. Recordings pin the model version, framework git SHA, temperature, and seed; see swarm-ai for the recorder.
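Pinning a run comes down to recording every input that affects the output, then fingerprinting the bundle so identical pins yield identical trace identifiers. A hedged sketch of what such a recording envelope might look like; the field names and hash scheme are illustrative, not the recorder's actual schema:

```python
import hashlib
import json

def pin_run(model: str, git_sha: str, temperature: float,
            seed: int, prompt: str) -> dict:
    """Bundle everything needed to replay a trace, plus a content hash."""
    record = {
        "model": model,            # exact model version string
        "framework_sha": git_sha,  # framework git SHA at run time
        "temperature": temperature,
        "seed": seed,
        "prompt": prompt,
    }
    # Deterministic fingerprint: same inputs always give the same trace id.
    payload = json.dumps(record, sort_keys=True).encode()
    record["trace_id"] = hashlib.sha256(payload).hexdigest()[:12]
    return record

a = pin_run("model-x-2024-06", "abc123", 0.2, 7, "rate TICKER")
b = pin_run("model-x-2024-06", "abc123", 0.2, 7, "rate TICKER")
# identical pins produce identical trace ids; changing any field changes the id
```

Hashing the sorted JSON means a scrubbed recording can be checked against a re-run: if any pinned input drifted, the trace id no longer matches.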