From Prompts to Partners: The Limits of Unlimited Agentic AI

By Gediminas Sadaunykas 5 min read
From Prompts to Partners: The Limits of Unlimited Agentic AI

In Setting Destination, we pictured Zedge‑style “smart sidekicks” co‑creating, curating, and enriching.(Zedge Blog) This follow‑up is the pragmatic sequel: how agentic systems are actually built, where they shine, where they stall, and how they’re evolving - without losing the curiosity that got us here.

Agentic AI = systems that plan, act with tools, remember, and adapt. The ecosystem now ships this out of the box: OpenAI’s Responses API + Agents SDK add first‑party tools (web search, file search, computer use), session state, handoffs, and tracing; Anthropic’s Computer Use lets Claude operate the desktop when APIs don’t exist. These aren’t just chatbots; they’re action systems with audit trails.

The Evolution of Agents (2023 → 2025)

That arc echoes the design framing in Stop Prompting, Start Designing: 5 Agentic AI Patterns That Actually Work - but below we go deeper on how to ship each pattern. DEV Community

Patterns that actually work (flows + build tips)

1) ReAct — Interleave reasoning with tool‑use

Default for retrieval/browsing/short, multi‑step tasks; reduces hallucinations by forcing evidence.

Implementation tips

  • Use typed tool schemas (reject malformed URLs/IDs); allow‑list tools and params.
  • Log every think/act/observe step; keep tool I/O in your trace for audit/RCA.
  • ReAct remains a strong baseline for knowledge tasks and simple web flows. arXiv

2) Reflexion - Self‑critique, then retry

Cheap quality lift when you can validate an answer (tests, heuristics, rules).

Implementation tips

  • Cap to 1-2 retries; require a pass/fail predicate (unit test, regex, score).
  • Store reflections as episodic notes (short‑term memory), distinct from long‑term.
  • Expose reflections in traces so humans learn too. arXivGitHub

3) Plan‑then‑Act (ReWOO) - Reason first, fetch later

Cuts token cost/latency by separating “thinking” from “gathering.”

Implementation tips

  • Treat evidence as a typed artifact (JSON/Markdown) with size caps; cache by (tool, query, params).
  • Fail fast if a planned tool isn’t reachable; re‑plan with cheaper backoff.
  • Reported 5× token efficiency on HotpotQA in the paper’s evaluations. arXivOpenReview

4) Tree‑of‑Thoughts / LATS - Explore multiple paths

For thorny planning/coding/web tasks: generate candidate “thoughts,” prune, expand; pick the best path.

Implementation tips

  • Bound breadth/depth; use a value function (LLM self‑rating or heuristics).
  • Persist each branch’s artifacts to tracing for RCA.
  • LATS unifies reasoning + acting + planning via MCTS; effective for programming/web. arXiv+1

5) Multi‑agent “teams” - Router → specialists → critic/approver

Compose skills, isolate risk, keep approvals where it matters.

Implementation tips

  • Keep each specialist’s tool surface area tiny; typed contracts only.
  • Use handoffs for clean delegation; approvals appear in the trace by default.
  • Track per‑role SLAs (latency, quality, intervention rate). OpenAI PlatformOpenAI GitHub

Toolbox (single glance)

  • PydanticAI - “FastAPI‑for‑agents” DX: typed tools & strict I/O validation, generic type‑safe agents, and pydantic‑graph under the hood. Great for contracts and guardrails.
  • OpenAI Responses API + Agents SDK - stateful runs, first‑party tools (web/file search, computer use), handoffs, guardrails, and built‑in tracing.
  • Anthropic Computer Use - desktop control via screenshot/mouse/keyboard for legacy UI flows (beta). Anthropic

Where it shines (public wins to learn from)

  • Consumer support at scale - Klarna: assistant handled ~two‑thirds of chats in month one (≈2.3M conversations), with major time/cost gains. Narrow scope + approvals = speed + trust. PR Newswire
  • Enterprise coding - GitHub Copilot Coding Agent: assign issues, open PRs, and tag you for review; available across Business/Enterprise plans; runs in a secure, cloud‑hosted dev environment. GitHub DocsThe GitHub Bl
  • E‑commerce copilot - Shopify Sidekick: merchant assistant launched via early access, now sharing production lessons on evals and GRPO training. The VergeShopify
  • Cloud IDE autonomy - Amazon Q Developer: agentic coding inside VS Code (actions on your behalf), with iterative updates across code review, tests, and docs. Amazon Web Services, Inc.+1

Pattern fit: ReAct/ReWOO for support & analytics; multi‑agent + approvals for coding & enterprise workflows; computer‑use when you must cross legacy UI.

Reality checks (a/k/a the limits)

Benchmarks that simulate “real work” remain sobering:

  • WebArena & VisualWebArena: realistic websites + visually grounded tasks expose brittleness in multi‑step UI flows and visual grounding. arXiv+1
  • AgentBench: across 8 environments, failures cluster in long‑term reasoning/decision‑making - agents need tighter scaffolding and evaluation. arXiv
  • SWE‑bench: real GitHub issues remain hard; naive autonomy disappoints without tests, sandboxes, and review gates. arXiv

Design consequences

  • Stay narrow: crisp goals, bounded steps, explicit success checks.
  • Prefer short horizons: many simple plans >> one heroic chain.
  • Trace everything: if you can’t inspect plan/tool calls, you can’t improve them.

Memory that lasts (without monstrous prompts)

Long‑lived assistants need OS‑style virtual memory: short‑term working context ↔ recall store ↔ long‑term archive. MemGPT formalizes this “paging” model so agents remember across sessions—without stuffing the prompt. arXiv

Security & governance (practical, not preachy)

  • Least‑privilege tools, time‑boxed creds; allow‑list domains/APIs for browsing.
  • Typed tool I/O and schema validation on every hop (your easiest win).
  • Approval gates for state‑changing actions; immutable traces for audit/RCA.

So… what does “limits of unlimited” really mean?

The promise is huge - but unbounded autonomy is a myth (for now). Agents don’t win because they do everything. They win because they do some things extremely well:

  • Tight scopes with observable steps.
  • Patterns that balance planning with evidence.
  • Typed contracts that make tool‑use safe and predictable.
  • Human approvals where stakes are high.

Where we go next (optimistic, and specific)

  • UI‑native agents become normal: “computer‑use” grows up, pairing with design systems so agents reliably click/type/verify across apps. Anthropic
  • Memory becomes skill: OS‑like paging (MemGPT) merges with reusable “procedures” learned from past runs-less prompting, more doing. arXiv
  • Evaluation goes CI: web/visual/coding benchmarks turn into continuous gates; we ship behind tests, not vibes. arXiv+2arXiv+2
  • Governance productizes: tracing dashboards + NIST profiles + OWASP checks feel like linting for autonomy - limits as a feature, not a bug. OpenAI GitHub

We can still be dreamers. The trick is dreaming with guardrails - so the horizon keeps getting closer, one reliable handoff at a time.