How does an AI agent choose what to do under token, latency, and tool-call budget constraints?

In this tutorial, we build a cost-aware planning agent that deliberately balances output quality against real-world constraints such as token usage, latency, and tool-call budgets. We design the agent to generate multiple candidate steps, estimate their expected costs and benefits, and then select an execution plan that maximizes value while staying within a strict budget. Along the way, we show how agentic systems can move beyond “always use the LLM” behavior and instead reason explicitly about trade-offs, efficiency, and resource awareness, all of which are critical to deploying agents reliably in constrained environments.

import os, time, math, json, random
from dataclasses import dataclass, field
from typing import List, Dict, Optional, Tuple, Any
from getpass import getpass


USE_OPENAI = True


if USE_OPENAI:
   if not os.getenv("OPENAI_API_KEY"):
       os.environ["OPENAI_API_KEY"] = getpass("Enter OPENAI_API_KEY (hidden): ").strip()
   try:
       from openai import OpenAI
       client = OpenAI()
   except Exception as e:
       print("OpenAI SDK import failed. Falling back to offline mode.\nError:", e)
       USE_OPENAI = False

We set up the execution environment and securely load the OpenAI API key at runtime without hardcoding it. We also initialize the client so that the agent falls back to offline mode if the OpenAI SDK is unavailable.

def approx_tokens(text: str) -> int:
   return max(1, math.ceil(len(text) / 4))


@dataclass
class Budget:
   max_tokens: int
   max_latency_ms: int
   max_tool_calls: int


@dataclass
class Spend:
   tokens: int = 0
   latency_ms: int = 0
   tool_calls: int = 0


   def within(self, b: Budget) -> bool:
       return (self.tokens <= b.max_tokens and
               self.latency_ms <= b.max_latency_ms and
               self.tool_calls <= b.max_tool_calls)


   def add(self, other: "Spend") -> "Spend":
       return Spend(
           tokens=self.tokens + other.tokens,
           latency_ms=self.latency_ms + other.latency_ms,
           tool_calls=self.tool_calls + other.tool_calls
       )

We define the core budgeting abstractions that let the agent reason explicitly about cost. The approx_tokens helper estimates token counts at roughly four characters per token. We model token usage, latency, and tool calls as first-class quantities and provide utility methods to accumulate and validate spend. This gives us a clean foundation for enforcing constraints during planning and execution.
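To make the accumulate-and-validate pattern concrete, here is a minimal sketch that repeats the Budget and Spend definitions so it runs on its own. The budget limits and the two step costs are illustrative numbers chosen for this example, not values the tutorial prescribes.

```python
from dataclasses import dataclass


@dataclass
class Budget:
    max_tokens: int
    max_latency_ms: int
    max_tool_calls: int


@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0

    def within(self, b: Budget) -> bool:
        # Every dimension must fit; one overrun makes the whole spend invalid.
        return (self.tokens <= b.max_tokens
                and self.latency_ms <= b.max_latency_ms
                and self.tool_calls <= b.max_tool_calls)

    def add(self, other: "Spend") -> "Spend":
        # Returns a new Spend, leaving both operands unchanged.
        return Spend(self.tokens + other.tokens,
                     self.latency_ms + other.latency_ms,
                     self.tool_calls + other.tool_calls)


# Illustrative numbers (assumptions for this sketch).
budget = Budget(max_tokens=1000, max_latency_ms=2000, max_tool_calls=1)
local_step = Spend(tokens=120, latency_ms=40, tool_calls=0)
llm_step = Spend(tokens=600, latency_ms=1200, tool_calls=1)

one_llm = local_step.add(llm_step)  # 720 tokens, 1240 ms, 1 tool call
two_llm = one_llm.add(llm_step)     # 1320 tokens, 2440 ms, 2 tool calls

print(one_llm.within(budget))  # True
print(two_llm.within(budget))  # False
```

Because add returns a fresh Spend, a planner can evaluate many hypothetical plans from the same starting point without any of them mutating shared state.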

@dataclass
class StepOption:
   name: str
   description: str
   est_spend: Spend
   est_value: float
   executor: str
   payload: Dict[str, Any] = field(default_factory=dict)


@dataclass
class PlanCandidate:
   steps: List[StepOption]
   spend: Spend
   value: float
   rationale: str = ""


def llm_text(prompt: str, *, model: str = "gpt-5", effort: str = "low") -> str:
   if not USE_OPENAI:
       return ""
   t0 = time.time()
   resp = client.responses.create(
       model=model,
       reasoning={"effort": effort},
       input=prompt,
   )
   _ = (time.time() - t0)
   return resp.output_text or ""

We introduce data structures that represent individual step options and complete plan candidates. We also define a lightweight LLM wrapper that standardizes how text is generated; it times each call, although the measurement is currently discarded. This separation lets the planner reason abstractly about steps without being tightly coupled to execution details.
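If the measured latency should reach the caller, one possible approach is a small timing helper that returns the elapsed time alongside the result. The sketch below is an assumption about how that could look, not part of the tutorial's code: timed_call and fake_llm are hypothetical names, and fake_llm stands in for llm_text so the example runs offline.

```python
import time
from typing import Callable, Tuple, TypeVar

T = TypeVar("T")


def timed_call(fn: Callable[..., T], *args, **kwargs) -> Tuple[T, int]:
    """Run fn and return (result, elapsed time in whole milliseconds)."""
    t0 = time.perf_counter()  # monotonic clock, suited to measuring durations
    result = fn(*args, **kwargs)
    elapsed_ms = int((time.perf_counter() - t0) * 1000)
    return result, elapsed_ms


def fake_llm(prompt: str) -> str:
    # Hypothetical stand-in for llm_text; sleeps briefly to simulate a network call.
    time.sleep(0.02)
    return f"echo: {prompt}"


text, ms = timed_call(fake_llm, "outline the proposal")
print(text, ms)
```

Returning the duration instead of discarding it would let the executor record per-call latency directly rather than timing the whole step from the outside.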

def generate_step_options(task: str) -> List[StepOption]:
    # Baseline menu: most deliverables have a cheap local variant and a costlier LLM variant.
    base = [
        StepOption(
            name="Clarify deliverables (local)",
            description="Extract deliverable checklist + acceptance criteria from the task.",
            est_spend=Spend(tokens=60, latency_ms=20, tool_calls=0),
            est_value=6.0,
            executor="local",
        ),
        StepOption(
            name="Outline plan (LLM)",
            description="Create a structured outline with sections, constraints, and assumptions.",
            est_spend=Spend(tokens=600, latency_ms=1200, tool_calls=1),
            est_value=10.0,
            executor="llm",
            payload={"prompt_kind": "outline"},
        ),
        StepOption(
            name="Outline plan (local)",
            description="Create a rough outline using templates (no LLM).",
            est_spend=Spend(tokens=120, latency_ms=40, tool_calls=0),
            est_value=5.5,
            executor="local",
        ),
        StepOption(
            name="Risk register (LLM)",
            description="Generate risks, mitigations, owners, and severity.",
            est_spend=Spend(tokens=700, latency_ms=1400, tool_calls=1),
            est_value=9.0,
            executor="llm",
            payload={"prompt_kind": "risks"},
        ),
        StepOption(
            name="Risk register (local)",
            description="Generate a standard risk register from a reusable template.",
            est_spend=Spend(tokens=160, latency_ms=60, tool_calls=0),
            est_value=5.0,
            executor="local",
        ),
        StepOption(
            name="Timeline (LLM)",
            description="Draft a realistic milestone timeline with dependencies.",
            est_spend=Spend(tokens=650, latency_ms=1300, tool_calls=1),
            est_value=8.5,
            executor="llm",
            payload={"prompt_kind": "timeline"},
        ),
        StepOption(
            name="Timeline (local)",
            description="Draft a simple timeline from a generic milestone template.",
            est_spend=Spend(tokens=150, latency_ms=60, tool_calls=0),
            est_value=4.8,
            executor="local",
        ),
        StepOption(
            name="Quality pass (LLM)",
            description="Rewrite for clarity, consistency, and formatting.",
            est_spend=Spend(tokens=900, latency_ms=1600, tool_calls=1),
            est_value=8.0,
            executor="llm",
            payload={"prompt_kind": "polish"},
        ),
        StepOption(
            name="Quality pass (local)",
            description="Light formatting + consistency checks without LLM.",
            est_spend=Spend(tokens=120, latency_ms=50, tool_calls=0),
            est_value=3.5,
            executor="local",
        ),
    ]

    # Optionally ask the model for extra steps; each gets a fixed, small local cost estimate.
    if USE_OPENAI:
        meta_prompt = f"""
You are a planning assistant. For the task below, propose 3-5 OPTIONAL extra steps that improve quality,
like checks, validations, or stakeholder tailoring. Keep each step short.

TASK:
{task}

Return JSON list with fields: name, description, est_value(1-10).
"""
        txt = llm_text(meta_prompt, model="gpt-5", effort="low")
        try:
            items = json.loads(txt.strip())
            for it in items[:5]:
                base.append(
                    StepOption(
                        name=str(it.get("name", "Extra step (local)"))[:60],
                        description=str(it.get("description", ""))[:200],
                        est_spend=Spend(tokens=120, latency_ms=60, tool_calls=0),
                        est_value=float(it.get("est_value", 5.0)),
                        executor="local",
                    )
                )
        except Exception:
            # Malformed or non-JSON model output: keep the baseline options only.
            pass

    return base

We focus on generating a diverse set of candidate steps, including both LLM-based and local options with different cost-quality trade-offs. We optionally use the model itself to suggest additional low-cost improvements, assigning each a fixed, small estimated spend so that their impact on the budget stays controlled. This enriches the search space without sacrificing efficiency.
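Model output is not guaranteed to be clean JSON, so the parsing of suggested steps could be made more defensive. The following is a sketch under assumptions: parse_extra_steps is a hypothetical helper, not the tutorial's code, and the handling of a Markdown code fence around the JSON is a guess at one common failure mode. It also clamps est_value to the 1-10 range that the prompt requests.

```python
import json
from typing import Any, Dict, List

FENCE = "`" * 3  # Markdown code-fence marker, built indirectly to keep this listing readable


def parse_extra_steps(raw: str, limit: int = 5) -> List[Dict[str, Any]]:
    """Parse a JSON list of step suggestions, dropping malformed entries."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Remove the surrounding backticks and an optional "json" language tag.
        text = text.strip("`").strip()
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        items = json.loads(text)
    except json.JSONDecodeError:
        return []
    if not isinstance(items, list):
        return []

    steps: List[Dict[str, Any]] = []
    for it in items[:limit]:
        if not isinstance(it, dict) or not it.get("name"):
            continue  # skip entries that are not objects or lack a name
        try:
            value = float(it.get("est_value", 5.0))
        except (TypeError, ValueError):
            value = 5.0
        steps.append({
            "name": str(it["name"])[:60],
            "description": str(it.get("description", ""))[:200],
            "est_value": min(10.0, max(1.0, value)),  # clamp to the requested 1-10 range
        })
    return steps


raw = FENCE + 'json\n[{"name": "Check units", "est_value": 14}, {"description": "no name"}, "junk"]\n' + FENCE
steps = parse_extra_steps(raw)
print(steps)
```

Clamping matters here because est_value feeds directly into the planner's objective; an unbounded value from the model could otherwise dominate every hand-tuned estimate.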

def plan_under_budget(
    options: List[StepOption],
    budget: Budget,
    *,
    max_steps: int = 6,
    beam_width: int = 12,
    diversity_penalty: float = 0.2
) -> PlanCandidate:
    def redundancy_cost(chosen: List[StepOption], new: StepOption) -> float:
        # Steps share a key when their names match up to the "(...)" suffix,
        # e.g. "Outline plan (LLM)" and "Outline plan (local)".
        key_new = new.name.split("(")[0].strip().lower()
        overlap = 0
        for s in chosen:
            key_s = s.name.split("(")[0].strip().lower()
            if key_s == key_new:
                overlap += 1
        return overlap * diversity_penalty

    # Start from a single empty plan and grow it one step per iteration.
    beams: List[PlanCandidate] = [PlanCandidate(steps=[], spend=Spend(), value=0.0, rationale="")]

    for _ in range(max_steps):
        expanded: List[PlanCandidate] = []
        for cand in beams:
            for opt in options:
                if opt in cand.steps:
                    continue
                new_spend = cand.spend.add(opt.est_spend)
                if not new_spend.within(budget):
                    continue  # prune any extension that would exceed the budget
                new_value = cand.value + opt.est_value - redundancy_cost(cand.steps, opt)
                expanded.append(
                    PlanCandidate(
                        steps=cand.steps + [opt],
                        spend=new_spend,
                        value=new_value,
                        rationale=cand.rationale
                    )
                )
        if not expanded:
            break
        # Keep only the beam_width highest-value partial plans.
        expanded.sort(key=lambda c: c.value, reverse=True)
        beams = expanded[:beam_width]

    best = max(beams, key=lambda c: c.value)
    return best

We implement the budget-constrained planning logic, which searches for the highest-value combination of steps under strict constraints. We use a beam-style search with a redundancy penalty to discourage plans that include overlapping steps, such as both the LLM and local versions of the same task. This is where the agent actually becomes cost-aware: it maximizes estimated value subject to the budget.
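Beam search is a heuristic, so on a small option set it can be useful to compare its result against an exhaustive baseline. The sketch below is not the tutorial's planner: Opt, family, plan_value, and best_subset are hypothetical names, the budget limits are illustrative, and the option costs are copied from a subset of the candidate steps above. It enumerates every subset, applies the same style of redundancy penalty, and returns the best plan that fits.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class Opt(NamedTuple):
    name: str
    tokens: int
    latency_ms: int
    tool_calls: int
    value: float


def family(name: str) -> str:
    # "Outline plan (LLM)" and "Outline plan (local)" share the family "outline plan".
    return name.split("(")[0].strip().lower()


def plan_value(steps: Tuple[Opt, ...], penalty: float = 0.2) -> float:
    counts: Dict[str, int] = {}
    for s in steps:
        counts[family(s.name)] = counts.get(family(s.name), 0) + 1
    # k steps from one family incur 0 + 1 + ... + (k - 1) penalties in total.
    overlaps = sum(k * (k - 1) // 2 for k in counts.values())
    return sum(s.value for s in steps) - penalty * overlaps


def best_subset(options: List[Opt], max_tokens: int, max_latency_ms: int,
                max_tool_calls: int, max_steps: int = 6) -> Tuple[Tuple[Opt, ...], float]:
    best: Tuple[Opt, ...] = ()
    best_val = 0.0
    for r in range(1, min(max_steps, len(options)) + 1):
        for combo in combinations(options, r):
            fits = (sum(s.tokens for s in combo) <= max_tokens
                    and sum(s.latency_ms for s in combo) <= max_latency_ms
                    and sum(s.tool_calls for s in combo) <= max_tool_calls)
            if fits and plan_value(combo) > best_val:
                best, best_val = combo, plan_value(combo)
    return best, best_val


OPTIONS = [
    Opt("Outline plan (LLM)", 600, 1200, 1, 10.0),
    Opt("Outline plan (local)", 120, 40, 0, 5.5),
    Opt("Risk register (LLM)", 700, 1400, 1, 9.0),
    Opt("Risk register (local)", 160, 60, 0, 5.0),
    Opt("Timeline (LLM)", 650, 1300, 1, 8.5),
]

best, value = best_subset(OPTIONS, max_tokens=1500, max_latency_ms=3000, max_tool_calls=2)
print([s.name for s in best], round(value, 2))
```

Exhaustive enumeration grows exponentially with the number of options, so it only serves as a sanity check on small inputs; the beam search remains the practical choice as the option list grows.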

def run_local_step(task: str, step: StepOption, working: Dict[str, Any]) -> str:
    # Template-based outputs, selected by matching on the step name.
    name = step.name.lower()
    if "clarify deliverables" in name:
        return (
            "Deliverables checklist:\n"
            "- Executive summary\n- Scope & assumptions\n- Workplan + milestones\n"
            "- Risk register (risk, impact, likelihood, mitigation, owner)\n"
            "- Next steps + data needed\n"
        )
    if "outline plan" in name:
        return (
            "Outline:\n1) Context & objective\n2) Scope\n3) Approach\n4) Timeline\n5) Risks\n6) Next steps\n"
        )
    if "risk register" in name:
        return (
            "Risk register (template):\n"
            "1) Data access delays | High | Mitigation: agree data list + owners\n"
            "2) Stakeholder alignment | Med | Mitigation: weekly review\n"
            "3) Tooling constraints | Med | Mitigation: phased rollout\n"
        )
    if "timeline" in name:
        return (
            "Timeline (template):\n"
            "Week 1: discovery + requirements\nWeek 2: prototype + feedback\n"
            "Week 3: pilot + metrics\nWeek 4: rollout + handover\n"
        )
    if "quality pass" in name:
        draft = working.get("draft", "")
        return "Light quality pass done (headings normalized, bullets aligned).\n" + draft
    return f"Completed: {step.name}\n"


def run_llm_step(task: str, step: StepOption, working: Dict[str, Any]) -> str:
    kind = step.payload.get("prompt_kind", "generic")
    context = working.get("draft", "")
    prompts = {
        "outline": f"Create a crisp, structured outline for the task below.\nTASK:\n{task}\nReturn a numbered outline.",
        "risks": f"Create a risk register for the task below. Include: Risk | Impact | Likelihood | Mitigation | Owner.\nTASK:\n{task}",
        "timeline": f"Create a realistic milestone timeline with dependencies for the task below.\nTASK:\n{task}",
        "polish": f"Rewrite and polish the following draft for clarity and consistency.\nDRAFT:\n{context}",
        "generic": f"Help with this step: {step.description}\nTASK:\n{task}\nCURRENT:\n{context}",
    }
    # Unknown prompt kinds fall back to the generic prompt.
    return llm_text(prompts.get(kind, prompts["generic"]), model="gpt-5", effort="low")


def execute_plan(task: str, plan: PlanCandidate) -> Tuple[str, Spend]:
    working = {"draft": ""}
    actual = Spend()

    for i, step in enumerate(plan.steps, 1):
        t0 = time.time()
        # LLM steps run locally when the API is unavailable (offline mode).
        if step.executor == "llm" and USE_OPENAI:
            out = run_llm_step(task, step, working)
            tool_calls = 1
        else:
            out = run_local_step(task, step, working)
            tool_calls = 0

        dt_ms = int((time.time() - t0) * 1000)
        tok = approx_tokens(out)  # approximate: counts output text only

        actual = actual.add(Spend(tokens=tok, latency_ms=dt_ms, tool_calls=tool_calls))
        working["draft"] += f"\n\n### Step {i}: {step.name}\n{out}\n"

    return working["draft"].strip(), actual


TASK = "Draft a 1-page project proposal for a logistics dashboard + fleet optimization pilot, including scope, timeline, and risks."
BUDGET = Budget(
   max_tokens=2200,
   max_latency_ms=3500,
   max_tool_calls=2
)


options = generate_step_options(TASK)
best_plan = plan_under_budget(options, BUDGET, max_steps=6, beam_width=14)


print("=== SELECTED PLAN (budget-aware) ===")
for s in best_plan.steps:
    print(f"- {s.name} | est_spend={s.est_spend} | est_value={s.est_value}")
print("\nEstimated spend:", best_plan.spend)
print("Budget:", BUDGET)


print("\n=== EXECUTING PLAN ===")
draft, actual = execute_plan(TASK, best_plan)


print("\n=== OUTPUT DRAFT ===\n")
print(draft[:6000])


print("\n=== ACTUAL SPEND (approx) ===")
print(actual)
print("\nWithin budget?", actual.within(BUDGET))

We execute the selected plan and track actual resource usage step by step. We dynamically choose between local and LLM execution paths and aggregate the final output into a coherent draft. By comparing estimated and actual spending, we demonstrate how planning assumptions can be validated and refined in practice.
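One way the comparison between estimated and actual spend could feed back into planning is to compute per-dimension overrun ratios and nudge future estimates toward what was observed. The sketch below is an assumption, not something the tutorial implements: overrun_ratios and update_estimate are hypothetical helpers, the smoothing factor alpha is arbitrary, and the numbers are illustrative.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class Spend:
    tokens: int = 0
    latency_ms: int = 0
    tool_calls: int = 0


def overrun_ratios(estimated: Spend, actual: Spend) -> Dict[str, float]:
    """Return actual / estimated per dimension; above 1.0 means the estimate was too low."""
    def ratio(a: int, e: int) -> float:
        if e > 0:
            return a / e
        return 1.0 if a == 0 else float("inf")

    return {
        "tokens": ratio(actual.tokens, estimated.tokens),
        "latency_ms": ratio(actual.latency_ms, estimated.latency_ms),
        "tool_calls": ratio(actual.tool_calls, estimated.tool_calls),
    }


def update_estimate(old: int, observed: int, alpha: float = 0.3) -> int:
    # Exponential moving average: move the estimate part of the way toward the observation.
    return int(round((1 - alpha) * old + alpha * observed))


# Illustrative numbers (assumptions for this sketch).
estimated = Spend(tokens=1000, latency_ms=2000, tool_calls=2)
actual = Spend(tokens=1250, latency_ms=1500, tool_calls=2)

ratios = overrun_ratios(estimated, actual)
print(ratios)                     # tokens ran over, latency came in under
print(update_estimate(600, 900))  # 690
```

A small alpha keeps estimates stable against a single noisy run, while a larger one adapts faster; which is preferable depends on how variable the underlying calls are.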

In conclusion, we demonstrated how a cost-aware planning agent can reason about its resource consumption and adapt its plan accordingly. We only executed steps that fit within the predefined budget and tracked actual spend to validate planning assumptions, closing the loop between estimation and execution. We also highlighted how agentic AI systems can become more practical, controllable, and scalable by treating cost, latency, and tool usage as first-class decision variables rather than afterthoughts.

The post How does an AI agent choose what to do under token, latency, and tool-call budget constraints? appeared first on MarketTechPost.
