Fitra's Drop

Agentic Engineering: The New Era of Engineers Building Software Factories

fitrakacamarga — Fri, 29 May 2026 12:32:42 GMT

One engineer spends $2,000/month on AI tokens. The result is useful, but inconsistent. Sometimes the code works. Sometimes it does not. Sometimes bugs are caught. Sometimes new ones appear. The engineer feels productive, but cannot prove it.

Another engineer spends the same amount. The result is different: pipelines that run by themselves, code that is automatically tested, and more time to think about the next architecture, the next product, or the next problem that has not been touched yet.

What separates them? Not the model — both use Claude. Not the token budget — both spend the same. Not raw talent — both are experienced engineers.

The difference is whether they are building a system around AI, or merely chatting with it.

The Problem: Everyone Has AI, But Not Everyone Has Leverage

In 2026, every engineer can access powerful models. Claude, GPT, Gemini, DeepSeek — the frontier keeps moving, context windows keep growing, tool integrations keep improving, and coding agents keep getting better.

But something interesting happens when capability becomes widely available: the gap between engineers who “use AI” and engineers who truly leverage AI gets wider, not smaller.

Access to the same model does not create the same outcome. Two engineers can sit in front of the same coding agent with the same token budget and walk away with completely different results.

It is not about who prompts more. More prompting without a system creates more noise.

It is not about who spends more. Tokens burned without a harness just become a larger API bill.

The real difference is this:

Who is building the system around AI, and who is treating AI as a smarter chatbot?

IndyDevDan calls this agentic engineering. I think the term is useful because it names a real shift: engineering is moving from “use an AI tool” to “design the environment where AI agents can produce reliable work.”

Agentic engineering is building the system that builds — safely, repeatedly, and cache-stably.

It is not about using AI more. It is about building the infrastructure that lets AI work consistently in production.

The Five Pillars of Agentic Engineering

1. Own the Harness — Whoever Controls the System Controls the Result

Claude Code, Codex, Cursor, OpenCode, and similar tools are powerful. But they are the floor, not the ceiling. They provide a strong baseline, but the real advantage appears when engineers begin to own the harness.

The model is the engine. The harness is the production system.

A harness is the infrastructure around the model that defines how the model works, when it may act, what must be checked before it acts, and how its results are audited and verified.

Concretely, a harness includes:

Tool access and permission boundaries — which tools are available, and what are their limits?
Verification gates before high-impact actions — before an agent performs something difficult to undo, what checker validates that the action is safe and aligned with the spec?
Loop detection — if an agent repeats the same action without progress, what stops it?
Event traces for audit and debugging — every action, decision, and output should be reviewable.
Reasoning budgets — how much thinking, cost, and latency should a task be allowed to consume?
Compaction strategy — when context gets long, how do we summarize without losing tool history, errors, and safe next actions?
Subagent handoff — if one task needs a large model for planning and a smaller model for execution, how does the handoff happen without breaking context or cache?

Without a harness, AI is a chatbot with coding ability. Helpful, but not production-trustworthy. There is no reliable guarantee that the output is consistent, safe, or correct.

With a harness, AI becomes part of a production system. Its outputs are verified. Its actions are controlled. Its cost is measurable. And when something goes wrong, we can trace it, debug it, and improve the system.

A good harness does not make agents weaker. It gives agents a safer, clearer, and more powerful workspace. Agents move faster when boundaries are explicit.

This is why “plan mode” in coding agents is such an important design pattern. Instead of changing the tool set when an agent enters planning mode — which can break prompt caching — the mode can be represented as a tool or event, such as EnterPlanMode and ExitPlanMode. The available tool schema stays stable, while the policy changes through messages or events.

This leads to a key design rule:

Policy changes belong in messages and events, not in tool schema changes.

2. Build Software Factories, Not One-Off Features

The old workflow is simple: ask AI to build a feature, review the result, deploy it, then repeat. Every feature starts from scratch. Every time, the engineer explains the context, gives the requirements, and checks whether the output is acceptable.

That helps, but it does not scale.

The agentic workflow is different: build a factory that produces features consistently.

A software factory is a repeatable production system made of agents, prompts, workflows, scripts, tests, reviewers, and automation. A feature no longer starts from zero. It moves through a known pipeline with known quality gates.

The core of a software factory is reproducibility.

If a workflow produces a good feature once, the workflow should be repeatable, auditable, improvable, and reusable. Prompts stop being disposable instructions. They become part of the production system.

A useful factory can:

read the requirement,
extract business context and technical constraints,
produce a technical plan,
critique the plan,
split the work into tasks,
implement changes,
run tests,
perform self-review,
generate a changelog,
prepare a pull request with evidence.

With a factory like this, the engineer is no longer the bottleneck for every small step. The engineer’s role moves up one level: design the process, set quality standards, and make sure every output passes the right validation.

This is also where reusable assets matter. When a workflow repeatedly works, it should become a skill, a playbook, a subagent, or an automation. The loop is: find repeated work, evaluate whether it deserves to become an asset, then package it in the smallest useful form.

If every feature still starts from zero, we do not have a software factory. We only have an agent helping manual work. Manual work with AI assistance is still manual work.

3. Build Extensible Software — Open for Extension, Closed for Modification

Agentic engineering lives in a world that changes fast. Models change every month, sometimes every week. Tool ecosystems shift. Today’s best prompt may not be tomorrow’s best prompt. APIs move. Runtime patterns evolve.

In this environment, software that cannot be extended falls behind.

A codebase full of cascading conditionals, tight coupling, implicit behavior, and poor documentation makes agents slow and error-prone. Every change becomes risky because the agent cannot clearly see the boundaries. Every new feature becomes expensive because the system must be modified instead of extended.

The old principle “open for extension, closed for modification” becomes even more important in the age of agents. It is not only an architecture best practice. It is a survival strategy.

Agent-friendly software should be:

Composable — features can be assembled from existing components.
Pluggable — models, tools, and skills can be swapped without changing the core.
Observable — actions and decisions can be traced.
Testable — components can be validated independently.
Explicit about contracts — interfaces are documented and stable.
Clear about boundaries — each module has a responsibility.
Stable at the interface level — existing APIs do not change without a migration path.
Easy to roll back — failures do not become irreversible drama.

The clearer the boundaries, contracts, and tests, the less likely an agent is to make wild changes. Agents work best when the codebase gives them rails.

Extensible software is easier for humans to maintain. It is also easier for agents to understand, modify, and verify.

In the age of agents, extensibility is not just an architecture principle. Extensibility is a safety feature.

4. Always-On Agents — But With Token Economics

It is easy to imagine the future as always-on agents: agents coding while we sleep, monitoring production issues, reading logs, writing reports, fixing bugs, or running research workflows continuously.

That vision is real. But it has a trap.

An always-on agent without clear economics is just a money-burning machine.

Healthy token economics has three levels:

Level 1: Spend tokens. Start using agents more. Do not be afraid to spend — but only if the next two levels are also true.

Level 2: Make tokens useful. Tokens must produce real work: bugs fixed, PRs merged, reports used, insights discovered, tests passed.

Level 3: Capture value. Agent output must produce measurable value: time saved, revenue gained, risk reduced, quality improved, onboarding accelerated, or research progress made.

Tokens must become useful before they are allowed to scale.

A rising API bill is a productivity KPI only when outcomes rise with it. If token spend grows 3x but outcomes grow 1.5x, something is wrong. Usually the agent does not have a clear task, strong evaluator, adequate access, or proper stop condition.

Always-on agents need SLOs like production systems:

completed outcomes per day or week,
cost per accepted change,
cache hit rate,
high-risk actions blocked,
human time saved,
false positive rate from safety gates,
time-to-safe-completion.

Do not scale agents first. Prove the useful-token loop first. Before that, optimization means clarifying the task, improving access, strengthening evaluators, and defining stop conditions.

5. Give Agents Access — With Deterministic Safety Gates

An agent that cannot reach APIs, CLIs, dashboards, or databases will keep asking humans to do things that should be automatic. That is a token tax: we pay tokens for waiting instead of completion.

If an agent needs database status but has no database access, it asks a human. The human checks, replies, and the agent continues. One loop can take minutes. With safe direct access, it could take seconds.

But the answer is not “give agents everything.” Unlimited access to production databases, deployment pipelines, or customer data is dangerous.

The right principle is:

Least privilege for maximum useful autonomy.

Or even more practically:

Give agents reachable tools with deterministic safety gates.

Not everything. Not nothing.

Access should be routed by risk:

Read-only → allow — read files, query with SELECT, inspect logs, read documentation.
Write/reversible → verify — edit code, open PRs, update config, create drafts.
High-impact/ambiguous → escalate — merge to main, modify production data, change pricing, send external communication.
Destructive/irreversible → block — drop databases, delete production resources, revoke credentials, force-push to main.

With this routing, agents can move quickly on safe actions while risky actions require evidence, verification, or human approval. Audit trails record what happened. Verifier routes ensure mutations pass through the correct gate.

Agentic speed comes from access. Production trust comes from verification. We need both.

For systems like ClaimMind, this is critical. An LLM may propose a diagnosis, identify a likely coding error, or recommend a correction. But the final decision must come from deterministic verification.

Narrative can explain. Evidence must authorize.

Two Invariants for Production-Grade Agents

Across the work on harness design, prompt caching, benchmark results, and verifier-gated execution, two invariants keep showing up.

Invariant 1: Verification-Aware Control Flow

High-impact actions must be routed by evidence, not by the agent’s intent text.

This is not about saying AI cannot be trusted. It is about engineering discipline. An agent may propose anything. A proposal becomes an action only after the system verifies that:

the context is sufficient and current,
the action matches policy,
supporting evidence exists,
the risk level is classified correctly,
the route is appropriate: allow, verify, escalate, or block.

The LLM proposes. Deterministic verifiers dispose.

In an agent runtime, completion text should not be enough. An agent saying “I’m done” is only a signal. Completion should be accepted only after tests, evidence, validators, and typed decisions say it is safe.

Ralph reads signals. The verifier authorizes exit.

Invariant 2: Cache-Aware Execution Architecture

Runtime adaptation must preserve stable prefixes, stable tool contracts, and cache-safe forks.

The biggest lesson from Claude Code’s prompt caching architecture is this: prompt caching is not a minor optimization. It is architecture.

Long-running agents become economically feasible because prompt caching reduces latency and cost. But prompt caching is fragile. It works through prefix matching. Changes near the beginning of the request — system prompt, tool schema, tool order, project context — can invalidate the cache.

That means:

do not add or remove tools mid-session,
do not update the system prompt for small dynamic changes,
inject changing context through messages or reminders,
avoid switching models mid-session unless using subagent handoff,
make compaction cache-safe,
use stubs and deferred loading for tool search.

The design rule is simple:

Tool policy may change at runtime; the tool prefix must not.

This changes how we design safety gates. A risk gate should not enforce policy by changing the tool set. It should keep the tool contract stable and route runtime decisions through messages, events, wrappers, or typed decisions.

A risk gate must work. It must also avoid breaking the prefix.

Safe and stable. Not one or the other.

From Vibe Coding to Harness Engineering

Vibe coding captured the first phase of AI-assisted development: fast, exploratory, chaotic, and often useful. It is great for prototypes and idea exploration.

But production software cannot stop at vibes.

Once agent output touches serious codebases, customer workflows, regulated data, financial systems, or production infrastructure, we need stronger discipline:

clear requirements,
stable context,
tool boundaries,
verification gates,
test harnesses,
audit trails,
rollback paths,
human approval for risky decisions.

That is the shift from vibe coding to harness engineering.

Harness engineering does not weaken agents. It gives them a safer, clearer, and more powerful workspace.

Vibe coding increases exploration speed. Harness engineering increases production speed.

Agentic Engineering as a New Discipline

When we combine the five pillars and two invariants, agentic engineering becomes much more than prompting.

It includes:

harness design,
workflow orchestration,
software factory design,
prompt and context architecture,
tool contract design,
verification and approval gates,
cache-aware execution architecture,
evaluation and testing strategy,
observability,
memory and compaction design,
subagent orchestration,
token-to-value economics,
security and access control.

Agentic engineering sits at the intersection of software architecture, platform engineering, security, DevOps, and AI systems.

The engineers who win here are not necessarily the ones with the fanciest prompts. They are the ones who understand systems, tradeoffs, reliability, testing, architecture, and production risk.

Anyone can move faster with AI coding tools. But engineers who understand agentic engineering can build systems that make many agents move fast, safely, and consistently.

That is the real leverage.

Conclusion: Engineers as Designers of Agent Systems

Agentic engineering will become a core engineering skill.

Not because every engineer must become a prompt engineer, but because software engineering is shifting from writing every line of code manually to designing systems where agents can produce code, tests, reviews, documentation, and operations safely.

The engineers who stand out will become designers of agent systems. They will design harnesses, define contracts, grant the right access, build factories, install verifiers, measure token-to-value, and keep the system reliable in production.

Models will change. Tools will change. The ability to build systems where agents work safely, repeatedly, and measurably will remain a moat.

The future of software engineering is not humans being replaced by agents. The future is engineers building work systems where many agents can operate like a small coordinated organization.

Agent is leverage.

Harness is control.

Factory is scale.

Extensibility is durability.

Access is speed.

Verification is trust.

Cache stability is economics.

Token-to-value is proof.

That is agentic engineering.

Inspired by IndyDevDan’s “Top #1 Opportunity for Senior Engineers: Agentic Engineering,” Thariq’s “Lessons from Building Claude Code: Prompt Caching Is Everything,” and Vaibhav Srivastav’s Codex skillify prompt.

When Data Is Cheap, Insight Is Everything

fitrakacamarga — Fri, 29 May 2026 12:28:31 GMT

In 1865, an English economist named William Stanley Jevons noticed something strange. James Watt’s improved steam engine was designed to use less coal. It did — per engine. But total British coal consumption had risen tenfold. Efficiency hadn’t reduced demand. It had created it.

Cheaper steam power made railroads viable. Iron smelting became affordable enough to industrialize. Ocean shipping reorganized around coal-fired engines. The savings per engine were real, and they were completely swamped by the proliferation of engines.

Jevons had stumbled on one of the most durable findings in economics: when the cost of an input collapses, demand for that input explodes. Not shrinks. Explodes. The pattern repeats everywhere. Cheaper lighting lengthened the working day. Cheaper computation built an information economy that now consumes more electricity than most countries.

We are about to learn this lesson again. The input this time is data access. And the place we’re about to learn it is every company that runs on dashboards.

The Age of Cheap Queries

For most of the history of business analytics, access to data was rationed by labor. You needed someone who could write SQL. Someone who knew which table was the right table. Someone who could clean the data, join it, validate it, and render it into a chart a human could actually read. The cost of asking a question was high, so questions were rationed.

AI agents are collapsing that cost. Natural language to SQL. Automated cleaning. Instant summaries. Charts generated on demand. A product manager can now ask, “What’s our churn rate by cohort for the last six months?” and get an answer in seconds — not days.

A reasonable observer might predict that this makes analysts obsolete. If anyone can query the data, who needs the person who used to write the queries?

This is the wrong prediction. And Jevons explains why.

What Jevons Did to Analytics

When the cost of querying data falls, two things happen at once. More people ask more questions, more often, about more things. That part is obvious. The less obvious part is that the bottleneck migrates.

When data was expensive to access, the binding constraint was access. You needed SQL skill, dashboard permissions, ETL pipelines, and time. When AI removes that friction, the constraint doesn’t disappear — it moves to the next layer.

It moves to: which insights are real, which are artifacts of bad data, which are actionable, and which are worth acting on.

In other words, the bottleneck moves from access to judgment.

This is the Jevons paradox applied to analytics. Cheaper queries don’t reduce the need for analytical thinking. They increase it — because now the organization is flooded with answers, and the scarce resource is knowing which answers matter.

Wheat and Bread

There’s a concept from the rabbinic tradition that captures this perfectly. The Talmud distinguishes between two kinds of expertise: the master of wheat — someone who has read everything, who knows every source — and the baker — someone who takes raw material and transforms it into something that sustains people.

The tradition’s verdict is unambiguous: wheat alone is not bread. Information was never the goal. The tradition lives in what gets made from it.

AI is the new master of wheat. It can retrieve, summarize, translate, and connect information at a speed no human can match. But retrieval is not insight. Summary is not judgment. A chart is not a decision.

The products that win — in analytics, in operations, in any domain where data meets action — are not the ones that give you more wheat. They are the ones that have a baking layer: validation, judgment, workflow, audit trails, human approval gates, and learning loops that get smarter over time.

What “Baking” Looks Like in Practice

Consider two contexts: one vertical and highly regulated, one horizontal and cross-functional.

ClaimMind: A Vertical System of Intelligence (Healthcare RCM)

When a hospital in Indonesia processes a BPJS insurance claim, the raw material is already there: diagnosis codes, procedure codes, patient history, coverage rules, and reimbursement policies.

An AI can retrieve all of this in milliseconds. It can cross-reference thousands of regulations. It can even generate a suggested claim submission.

But what makes the claim trusted — what makes the hospital confident it will be approved and paid — is not retrieval. It is the layer on top:

Validation: Do diagnosis and procedure codes actually match?
Cross-reference: Are there contradictions or duplicates?
Judgment: Is this claim borderline and in need of human review?
Audit trail: Can we prove every decision path to auditors?
Learning loop: When a claim is rejected, does the system learn why?

The wheat is data. The bread is a trusted reimbursement decision.

That is what ClaimMind is built for: not faster querying, but decision-grade claim intelligence.

Seeknal: A Horizontal System of Intelligence (Data/ML/Analytics Ops)

Now zoom out to organizational operations.

Most companies run on fragmented systems: CRM, ERP, finance, inventory, support, internal tools. AI can query each one quickly. But speed across fragmented systems often scales inconsistency, not clarity.

Seeknal exists to be the baking layer across that environment:

Pipeline validation: Is data fresh, complete, and structurally sound?
Cross-system reconciliation: Do operational metrics reconcile with financial reality?
Policy-aware workflows: What can be auto-executed, and what must be escalated?
Decision traceability: Can every recommendation be explained and audited?
Outcome learning: Do failed actions improve future decisions?

Again, the point is not access. The point is trustworthy action.

Seeknal and ClaimMind are different products in different domains, but they share the same architecture: a harness that turns cheap data access into reliable decisions.

The Shift: System of Record to System of Intelligence

This is where the “system of intelligence” perspective becomes critical.

A system of record stores facts. A system of intelligence decides what to do with those facts.

This distinction is now strategic.

As Steph Zhang has argued, value is moving from storage layers to reasoning-and-action layers. The moat is no longer who stores the most information. The moat is who can repeatedly convert information into high-quality decisions under real constraints.

Ben Lang’s lens complements this: the winners won’t be thin AI wrappers that generate output. They will be products that own workflow, context, and execution from end to end.

Cheap retrieval gives everyone wheat. Systems of intelligence are what bake bread.

Why This Matters Now

We are in the middle of a great flattening. AI is making data access, query generation, report creation, and summarization incredibly cheap. This is real progress. Millions of people who were previously excluded from data-driven decision-making are now included.

But Jevons tells us what happens next. Cheap access creates appetite for more. More queries, more reports, more dashboards, more “insights.” Organizations don’t become smarter automatically. They become noisier. The signal-to-noise ratio drops.

And the thing everyone actually wants — a clear, trustworthy answer they can act on — becomes harder to find, not easier.

That is why the next generation of data products won’t be better dashboards. They won’t be faster chatbots. They will be agentic workflows that don’t just fetch data, but bake it: validate it, contextualize it, flag what needs human judgment, execute what doesn’t, and leave an audit trail behind.

The workflow looks like this:

ask → inspect data → generate query → validate result → explain meaning → approve/escalate → act → learn

Everything after “generate query” is the real product. Everything after “generate query” is where value compounds.

The Bottleneck Has Moved

Every era of technology has a bottleneck. For a long time, analytics bottlenecks were access: SQL expertise, ETL pipelines, dashboard licenses, data engineering headcount.

AI is removing that bottleneck. And in doing so, it is revealing the next one:

The capacity to turn cheap data into trusted decisions.

This is not solved by “more model.” It is solved by systems:

systems with validation layers
systems with approval gates
systems with audit trails
systems that learn from outcomes

In that sense, Seeknal and ClaimMind are not two unrelated products. They are two expressions of the same thesis:

Seeknal is a system of intelligence for organizational data operations.
ClaimMind is a system of intelligence for healthcare reimbursement operations.

Different domains. Same architecture of trust.

When data is cheap, insight is everything. And insight doesn’t come from the model alone. It comes from the harness around it.

References

Steph Zhang:

Ben Lang:

Zohar Atkins (inspiration):

Token-Maxing Is Not Waste. It Is How Builders Buy the Future.

fitrakacamarga — Wed, 06 May 2026 13:48:12 GMT

There is a phase in using AI coding agents where it looks like you are burning money.

The context window is full. Claude Code runs multiple times. Cursor regenerates again. An agent reads hundreds of files, writes a patch, fails, retries, refactors, and eventually produces something you still need to review.

From the outside, this looks wasteful.

I am increasingly convinced this is the wrong way to see it.

For founders, engineers, and builders working on things whose final shape is not yet obvious, token-maxing is not just an operational cost. Token-maxing is R&D investment.

You are not only buying output.

You are buying a map.

Token-maxing is discovery budget

The biggest mistake in calculating AI cost is treating tokens only as units of work.

If 10 million tokens produce one exploration sprint, people ask:

“Was that sprint worth it?”

The better question is:

“After those 10 million tokens, do we now understand a new way of working that was invisible before?”

Because in practice, token-maxing often produces things that do not immediately show up in the diff:

we discover architectures we were previously afraid to try
we test three to five approaches in one afternoon
we see failure modes earlier
we learn prompt, review, evaluation, and workflow patterns that can be reused
we build confidence to take on more ambitious scope
we absorb work that normally requires several people: engineer, analyst, QA, tech writer, researcher

The tokens that were “burned” are often not waste.

They are exploration cost.

And in software, exploration that expands the ambition frontier is often far more valuable than small efficiency gains on old tasks.

The old way of calculating AI ROI is too narrow

Many people calculate AI like this:

“I paid this many dollars for tokens. How many engineering hours did I get back?”

That is useful, but too narrow.

The biggest ROI from AI agents is not just replacing engineering hours. The biggest ROI appears when a small team can build something that previously did not make sense to attempt.

Without AI, founders often think:

“This needs a backend engineer, frontend engineer, data engineer, QA, DevOps, and a domain expert. Later.”

With AI agents, the question changes:

“What is the first version I can prove this week?”

That change in question is valuable.

Once the ambition frontier moves upward, the roadmap changes with it. A product that used to feel like a six-month project can become a two-week prototype. An experiment that used to require a small team can be handled by a founder plus an agentic workflow. Documentation, benchmarks, tests, demos, and internal tools can run in parallel.

This is where token-maxing becomes a long-term investment.

The Uber story is the tension in one headline

A recent Uber example captures the exact tension.

In April 2026, India Today reported, citing The Information, that Uber CTO Praveen Neppalli Naga said the company had already exceeded its AI spending assumptions because adoption of AI coding tools such as Claude Code had grown faster than expected. His quote is the kind of line that makes token-maxing look irresponsible:

“I’m back to the drawing board, because the budget I thought I would need is blown away already.”

Other summaries of the same story framed it even more dramatically: Uber had burned through its full-year AI budget in roughly four months, driven by Claude Code and Cursor usage. SmarterX described the episode as part of a larger enterprise problem: AI agents are arriving faster than budgets, governance, permissions, and architecture can absorb.

At first glance, this sounds like waste.

But the more interesting read is not “Uber wasted money on AI coding.” The more interesting read is that Uber discovered a budgeting model mismatch.

AI coding agents do not behave like ordinary SaaS seats. A per-seat tool can be forecast from headcount. A token-consuming agent workflow compounds with usage intensity: longer contexts, parallel agents, repeated retries, test generation, code review, and background tasks. Once a large engineering organization actually finds the tools useful, spend can rise much faster than the original plan.

That does not automatically make the spend good. It also does not automatically make it bad.

It means the unit of planning changed.

The lesson from Uber is not “stop using AI coding tools.” It is: if token-maxing is becoming a serious workflow, finance, engineering, and product leadership need a new operating model. Track outcomes, not just tokens. Separate waste from exploration. Put observability around loops and context decay. Budget for learning, not only for seats.

This is exactly why I prefer to call it R&D budget rather than AI usage cost.

A simple model: 10 million tokens as learning sprint

Sometimes the right unit is not one prompt, one PR, or one session. It is a 10-million-token learning sprint - enough to load a full codebase context, run multiple agent passes, generate tests, write docs, retry failed approaches, and produce a reviewable artifact.

Let us use rough numbers.

Two to four such sprints per week means roughly:

80-160 million tokens per month, or about 5 million per day during intensive phases
around US$750-3,000 per month depending on model mix and subscription vs API
around Rp13-53 million per month

At 150 million tokens per month, token-maxing stops looking like casual AI usage and starts looking like an R&D line item.

In Indonesia, Rp13-53 million per month is comparable to 0.5-2 fully-loaded engineers, depending on seniority and pricing model.

So the bar cannot be:

“Did this token spend generate more code?”

That is too narrow. The better bar is:

“Did this spending reduce our uncertainty faster than normal working methods?”

Because proper token-maxing does not buy lines of code. It buys learning velocity.

With 150 million tokens per month, the outcomes worth seeking are not just patches, but:

discovery cycles that normally take 4 weeks compressed to 1-2 weeks
1-2 wrong implementation paths eliminated before they become expensive
prototypes real enough to test with users or prospective customers
test suites, evals, benchmarks, and harness rules that can be reused
architecture notes and decision memos that clarify tradeoffs
founders knowing faster whether a product direction is worth pursuing
hiring delayed until the problem shape is clearer, not because AI replaces humans permanently

This is the critical difference.

If tokens only produce untested code, unused documents, or agent loops that never changed a decision, that is waste.

But if tokens produce clarity, artifacts, and sharper decisions, the spending starts to look like R&D.

Three ROI scenarios in Indonesia

This is not final accounting. It is a mental model for seeing the scale of impact.

Conservative case

Say token-maxing costs around Rp13-18 million per month.

The realistic outcome is not “replacing two engineers.” The more defensible outcome is:

avoiding one wrong build path
accelerating one small prototype
producing initial tests and evals
helping a founder decide whether a feature is worth pursuing

If one wrong decision typically costs 2 weeks of engineering time, then avoiding a single false start can already justify the token cost.

This is not “millions of dollars” yet. But it is already huge for an early-stage Indonesian startup.

Base case

Say the cost is around Rp26-36 million per month.

At this level, token-maxing should produce more than ad-hoc output. It should produce a working system:

reusable prompts
regression tests
benchmark scenarios
architecture notes
internal tools
review checklists
deployment or validation workflows

The defensible outcome: one discovery cycle that normally takes a month compressed to 1-2 weeks, and one specialist hire delayed until scope is genuinely clear.

Not because AI replaces that hire forever, but because we are no longer hiring in the fog.

Here, token-maxing starts to look like serious leverage. Not because tokens are cheap, but because uncertainty is expensive, and tokens reduce it.

Ambitious case

Say the cost approaches or exceeds Rp53 million per month.

At this point, token-maxing should be treated like a real R&D budget. There should be clear exploration targets:

a new product wedge
a faster pilot
a claim review workflow that can be tested
an eval harness that makes quality measurable
a revenue hypothesis that can be validated earlier

If spending this much only produces more raw output, that is waste.

But if it accelerates revenue validation, opens a new product direction, or prevents the team from building the wrong system for 2-3 months, the return can far exceed the token cost.

This is where token-maxing can enter million-dollar territory.

Bottom line

Token-maxing only pays off when the output is not just code.

It has to reduce uncertainty.

It has to produce reusable artifacts.

It has to make the next decision sharper.

At 150 million tokens per month, we are no longer talking about casual AI usage. We are talking about a deliberate R&D budget.

And like any R&D budget, the question is not whether every experiment succeeds.

The question is whether the portfolio of experiments moves the company toward a future it could not reach otherwise.

For individuals: US$200/month is not an AI splurge

This argument is not only for companies.

For individuals - engineers, founders, operators, analysts, researchers, or anyone trying to level up in the AI era - a maximum-tier AI subscription, for example US$200 per month, should also be seen as a long-term tactical tool.

In Indonesia, US$200 per month is roughly Rp3.5 million per month. In one year, around Rp42 million.

That sounds expensive compared with normal consumer subscriptions. But compared with career investment, the number starts to make sense.

An engineer earning Rp20-30 million per month makes Rp240-360 million per year. An AI subscription costing Rp42 million per year is about 12-18% of annual income to buy daily leverage.

The question is not:

“Is US$200 expensive?”

The question is:

“Can US$200/month raise my skill, output, and ambition level faster than the alternatives?”

If used properly, the answer is often yes.

Because a maximum-tier AI subscription is not just a tool for answering questions. It can become:

a coding partner for reading codebases and producing patches
a private tutor for learning new frameworks
a reviewer for writing, proposals, and technical decisions
a research assistant for papers, docs, and competitor landscapes
a product thinking partner for turning ideas into experiments
a QA assistant for test cases and edge-case checklists
a tactical career coach for portfolios, CVs, interview prep, and technical communication

For individuals, token-maxing is a way to buy compound learning.

If a US$200/month subscription helps someone increase their salary from Rp25 million to Rp35 million per month, the additional income is Rp10 million/month, or Rp120 million/year.

The AI cost is Rp42 million/year.

Rough net benefit: Rp78 million/year.

That does not include side projects, consulting, or the ability to build your own product. Even one small micro-SaaS that makes US$1,000 MRR becomes around US$12,000 ARR - roughly 5x the annual AI subscription cost of US$2,400.

But there is an important caveat: an expensive subscription does not automatically make someone more productive.

US$200/month only becomes an investment if it is actively used to build assets:

repositories you can show
technical writing that strengthens your personal brand
automation for daily work
reusable templates and workflows
deeper domain understanding
new skills that are actually practiced

If it is only used for random chatting, article summaries, or one-off generated answers, then yes, it can be waste.

But if it is used as a daily gym for thinking, building, writing, evaluating, and shipping, US$200/month is a tactical tool for a long-term strategy.

At that point, token-maxing is not a lifestyle.

It is a career strategy.

The compound effect is confidence

The biggest saving from AI does not always show up in this month’s P&L.

The biggest saving appears when the team becomes more courageous.

Before AI agents, many ideas die inside the founder’s head because they feel too expensive:

“We do not have a data engineer yet.”
“Backend bandwidth is not available.”
“Nobody is handling QA.”
“Maybe after we hire.”
“This is too complex for now.”

After a few months of token-maxing, the thinking changes:

“We can prototype it first.”
“We can ask the agent to read the codebase and propose a patch.”
“We can generate the test harness.”
“We can build an internal validation tool.”
“We can explore a new workflow without waiting for headcount.”

This confidence compounds.

One internal tool makes the next experiment faster. One benchmark makes the next refactor safer. One agent workflow makes the team brave enough to take on a more complex use case. One successful product wedge opens a new revenue stream.

Viewed per session, token-maxing looks messy.

Viewed over a year, token-maxing looks like an organization building a new muscle.

Case study: ClaimMind

ClaimMind is a useful example because the problem is not simply “use AI to answer claim questions”.

The real problem is deeper: how do you build workflow intelligence for claim review that is evidence-grounded, policy-bound, and auditable?

In the context of BPJS and hospital claim operations, AI cannot merely produce an answer that sounds correct. The system must be able to:

read claim documents and metadata
check evidence completeness
understand ICD, procedure, and tariff paths
detect conflicts with policy
choose the right branch: auto-pass, clarification, escalate, reject/block
preserve an audit trail
explain why a decision was made

This is not a chatbot.

This is a workflow system.

And to build a system like this, the token-maxing phase matters.

At the beginning, we do not know every branch. We do not know which edge cases appear most often. We do not know the right harness shape. We do not know which prompts are robust under user pressure. We do not know which parts should be verified by a rule engine, which parts can be assisted by an LLM, and which parts must remain human-in-the-loop.

Token-maxing lets ClaimMind explore that possibility space faster:

agents can help map BPJS exception workflow branches
agents can generate adversarial scenarios: override pressure, shortcut requests, misleading summaries, state confusion
agents can create a minimum audit schema
agents can propose scoring axes: branch accuracy, evidence sufficiency, policy compliance, escalation precision
agents can write test scenarios and regression suites
agents can help design a harness so the AI does not merely “answer”, but follows a policy-bound workflow

If all of this is done manually by a small team, you need several roles at once: domain analyst, product engineer, backend engineer, QA, technical writer, and applied AI engineer.

With token-maxing, those roles do not disappear. But the exploration phase compresses. The founder can see the system map faster, make sharper decisions, and hire after the shape of the problem becomes clearer.

That is the difference between burning tokens and buying clarity.

Observability matters, but do not become cheap with tokens

Interestingly, a recent 30-day scan showed that token-maxing is already real enough to create dedicated observability tooling.

Projects like tokscale, codeburn, claude-usage, and token-optimizer help teams see token usage, cost, session history, ghost tokens, and context decay.

This is an important signal.

Whenever a new behavior creates observability tooling, it usually means the behavior has become a serious workflow. People do not build dashboards for things that do not matter.

But observability is not an excuse to become cheap with tokens.

Observability should help us distinguish two things:

Waste - tokens consumed because of loops, poor prompts, broken context, or unclear task definition.
Investment - tokens used for exploration, building, testing, learning, and creating reusable workflows.

The first should be reduced.

The second should be managed like an R&D portfolio.

Good token-maxing has discipline

I am not saying every use of tokens is automatically good.

Token-maxing without discipline can absolutely become waste. If an agent loops without acceptance criteria, reads irrelevant files, or keeps generating without tests, that is not investment. That is noise.

Healthy token-maxing has a few rules:

There is a learning objective. We know what we are trying to discover: architecture, failure mode, benchmark, product wedge, or implementation path.
There is an artifact. A large session should produce something reusable: a document, patch, test, eval, prompt pattern, harness rule, or decision memo.
There is a verification gate. AI output must be tested, reviewed, or at least checked against a rubric. Without verification, tokens only generate false confidence.
There is compaction discipline. Large context must be managed. Once context decays, the agent can become more expensive and worse at the same time.
There is a lightweight postmortem. After an expensive session, ask: what did we learn, what is reusable, and what should we avoid tomorrow?

With this discipline, token-maxing is not “high AI usage”.

It becomes a structured learning system.

Do not optimize too early

In the early phase of AI-native building, I am more afraid of founders saving tokens too early than using too many tokens.

Premature optimization in the AI era can take a new form:

“Do not use too much context.” “Do not run the agent too often.” “Do not try that approach, it is expensive.” “Do not explore too far.”

But in the early phase, our job is precisely to discover what is possible.

Correct token-maxing does not mean spending without limits. Correct token-maxing means being willing to buy a faster learning loop, then converting that learning into software, workflow, revenue streams, and organizational leverage.

For a company like ClaimMind, this can be the difference between building a generic claim chatbot and building infrastructure for auditable claim intelligence.

One saves cost.

The other opens a new category.

And sometimes, 10 million tokens are not a cost.

They are a down payment on a more ambitious future.

References

India Today - “Uber CTO says AI spending plans fall short as tools like Claude Code drive costs up” (Apr 15, 2026). Reports Uber CTO Praveen Neppalli Naga’s quote about going “back to the drawing board” after AI spending assumptions were exceeded. https://www.indiatoday.in/technology/story/uber-cto-says-ai-spending-plans-fall-short-as-tools-like-claude-code-drive-costs-up-2896621-2026-04-15
The Information - “Uber CTO Shows How Claude Code Can Blow Up AI Budgets.” Original cited report on Uber’s AI coding budget pressure. https://www.theinformation.com/newsletters/applied-ai/uber-cto-shows-claude-code-can-blow-ai-budgets
SmarterX - “Why No One Has Enterprise AI Agents Figured Out Yet” (Apr 21, 2026). Connects the Uber budget story to broader enterprise agent budgeting, governance, and architecture issues. https://smarterx.ai/smarterxblog/ai-agents-enterprise-budgets
Wall Street Journal - “Why Some Companies Say AI ‘Tokenmaxxing’ Is Key to Survival.” Covers the broader tokenmaxxing debate and Meta’s token usage leaderboard. https://www.wsj.com/cio-journal/why-some-companies-say-ai-tokenmaxxing-is-key-to-survival-e699a128
OpenAI - “Codex for (almost) everything.” Describes expanding coding-agent workflows, including background computer use and parallel agents. https://openai.com/index/codex-for-almost-everything/
Business Insider - “Microsoft exec suggests AI agents will need to buy software licenses, just like employees.” Useful context for why agent workflows may break old per-seat SaaS planning assumptions. https://www.businessinsider.com/microsoft-executive-suggests-ai-agents-buy-software-licenses-seats-2026-4

The Last Mile Is the Job in the AI Era

fitrakacamarga — Tue, 28 Apr 2026 06:05:24 GMT

Most conversations about AI agents focus on how much work the agent can do. Can it write the code, generate the deck, analyze the dataset, triage the support inbox, or automate the outreach sequence?

That framing is directionally useful, but it misses the deeper shift. The problem is not that AI agents are failing. The bigger change is that in the AI era, the highest-value part of many jobs increasingly lives in the last mile: judgment, correction, integration, accountability, and the ambition to raise the bar once basic execution becomes cheaper.

The most important question is not how much of the task an agent can start. It is whether the system can reliably finish the job.

In practice, many agents can already do 80 to 90 percent of a workflow. That is real progress. But the remaining 10 percent often contains a disproportionate amount of the value and risk. This is the stretch where someone still has to review the output, catch subtle mistakes, decide whether the work is actually good enough, fit it into a larger system, and take accountability for shipping it.

That final stretch is the last mile. And I increasingly think it is not just a leftover problem. It is becoming the job.

The dangerous illusion of 90 percent automation

A lot of AI demos look incredible because they show the part of the workflow that is easiest to appreciate visually. The agent reasons through a problem, opens tools, generates an answer, and appears to complete the task end to end. In a demo, this feels close enough to autonomy.

But production is not a demo.

A coding agent can write a lot of code and still leave the hardest part undone: code review, security review, production readiness, observability, rollback planning, and ownership after deployment.

A research agent can summarize papers and draft a literature review, but someone still has to verify whether the citations are real, whether the novelty claim holds up, whether the framing is honest, and whether the argument survives external scrutiny.

A sales agent can automate outreach and qualification, but closing the loop still requires customer judgment, trust building, negotiation, and context that rarely fits neatly into a prompt.

This is why partial automation gets overvalued. Doing 90 percent of a task is irrelevant if the remaining 10 percent determines whether the result can actually be used.

Why the last mile is the hardest part

The last mile is not just the leftover work. It is usually the highest-consequence work.

This is where judgment shows up. This is where taste matters. This is where the system encounters the edge cases that were invisible in the happy path. This is where a low-quality output can quietly become an expensive problem.

In other words, the last mile is where teams have to answer the questions that models are still bad at answering with confidence:

Is this actually correct?
Is it safe enough to ship?
Does it fit the broader context?
Did we miss a failure mode?
Is this good, or does it just look finished?

The people who can answer those questions create the real value. Their job shifts from manual execution toward adjudication: reviewing, correcting, integrating, escalating, and deciding when something is done.

That is also why so many arguments about AI replacing jobs miss the point. The issue is not whether a model can do most of the visible work. The issue is whether anyone can trust the system to complete the entire job at the level that reality demands.

Better AI does not eliminate the last mile. It creates a new one.

There is a common assumption hiding underneath a lot of automation discourse: once agents go from 80 percent to 95 percent to 99 percent task completion, the remaining human work will disappear.

I do not think that is how this plays out in real organizations.

When tools get better, expectations rise with them.

If an engineering team can produce more code, the bar for product quality, reliability, testing, and iteration speed goes up. If an analytics team can generate dashboards faster, the organization asks for deeper analysis, tighter decision loops, and more customized insights. If a design workflow gets partially automated, customers do not reward you for doing less design work. They expect better products.

This has been true across waves of software tooling. Better tools rarely make the work simpler in aggregate. They expand the frontier of what counts as acceptable work.

So today’s 99 percent solution often becomes tomorrow’s 50 percent solution, because the target moved.

That is why AI should not only be understood as an efficiency engine. It is also an ambition engine. When execution gets cheaper, the rational response is not just to defend the old scope of work. It is to ask what becomes possible now that used to feel out of reach.

AI raises the ambition frontier

This is where Garry Tan’s “boil the ocean” framing feels relevant.

In normal times, ambition gets constrained by labor, coordination cost, and execution bottlenecks. Teams are told not to boil the ocean because the ocean is simply too large for the available bandwidth.

But AI changes that equation.

When a team can explore ten product directions instead of two, inspect far more customer feedback, automate large parts of research and implementation, or compress weeks of execution into days, the right response is not to do the same work a little more cheaply. The right response is to raise ambition.

That is the deeper meaning behind the idea that today’s 99 percent solution becomes tomorrow’s 50 percent solution. Once the floor rises, the ceiling moves too. Organizations no longer compete on whether they can produce the old output more efficiently. They compete on whether they can use these new tools to build something significantly better.

This is the part that many cost-cutting narratives miss. AI does not just shrink the labor required for known work. It expands the set of projects, products, and standards that become rational to pursue. It pushes teams toward more ambitious builds, more advanced systems, richer customer experiences, and faster loops between idea and execution.

And that makes the last mile more important, not less. The more ambitious the system, the more important judgment becomes. Bigger scope, faster execution, and more powerful agents increase the need for strong review, strong governance, strong taste, and strong mechanisms for deciding what deserves to ship.

The strategic implication: the moat moves from generation to adjudication

If this is true, then the next wave of AI agent infrastructure will not be defined only by who has the most capable model or the longest autonomous workflow demo.

It will be defined by who is best at closing the gap between generated work and accepted outcomes.

That gap is where the real infrastructure lives:

verification systems that check whether outputs are trustworthy
routing systems that decide when to pass, retry, escalate, or stop
evidence capture that explains why the system believes an output is acceptable
audit trails that make the process legible after the fact
deployment and integration gates that turn artifacts into operational outcomes

This is why I keep coming back to harness engineering.

The harness is not just a wrapper around the model. It is the operational layer that determines whether an agent can be trusted in real work. It is where completion gets governed.

A strong model can generate a plausible answer. A strong harness determines whether that answer becomes production reality.

Why this matters even more for headless agents

This is especially important for headless agents in data and ML workflows.

A chat assistant has a natural escape hatch. If the answer looks suspicious, the human can ask a follow-up question, redirect the conversation, or simply ignore the output.

A headless agent does not have that luxury. It often operates against live systems, background jobs, production pipelines, schemas, dashboards, data contracts, or model evaluation workflows. There is no conversational cushion between generation and consequence.

That means the last mile gets sharper.

A data agent can produce a SQL query that looks plausible but is subtly wrong. An ML agent can recommend an evaluation change that seems reasonable but quietly breaks comparability across experiments. An analytics agent can generate an insight that sounds compelling but rests on a flawed join or missing cohort assumption.

We see the same pattern in ClaimMind. AI can help structure claim data, suggest coding directions, and reduce a large amount of manual review work, but the output still needs to be matched against hospital rules, payer constraints, and operational policy. The system can narrow the search space and surface likely ICD paths, but a human reviewer still makes the final decision on which claim interpretation and ICD code should actually be used. That final judgment is exactly where organizational accountability lives.

What is interesting is that this boundary is not fixed forever. As the system learns from organizational behavior, review patterns, escalation history, and accepted outcomes, more of that final layer may become automatable. But it only becomes safe to automate because the organization first created the review loop, the rules, and the evidence trail.

In all of those cases, the value of the system depends less on whether it generated something and more on whether the surrounding system can verify, constrain, and appropriately route the result before it is accepted.

This is why headless AI systems should be evaluated on accepted, auditable completion, not just task coverage.

What teams should build now

If the last mile is the real bottleneck, then teams building AI agents should care less about theatrical autonomy and more about completion systems.

A few things become much more important:

1. Verification before celebration

Do not confuse a fluent output with a correct one. Cheap deterministic checks should run wherever possible, and stronger semantic verification should be layered where risk is high.

2. Explicit escalation paths

Not every task should be forced through autonomy. Teams need clear rules for when an agent should stop, ask for review, or hand off to a human.

3. Evidence, not vibes

If an output is accepted, the system should be able to say why. Confidence without evidence is not enough when real workflows are at stake.

4. Completion-oriented metrics

Measure what actually matters: accepted outcomes, retry success, escalation quality, false accepts, audit readiness, and downstream usability.

5. Ambition-aware workflows

Do not use AI only to do the old work faster. Use it to explore a larger design space, attempt more valuable outcomes, and raise the standard of what the team is trying to ship.

6. Domain-specific judgment surfaces

The last mile is rarely generic. It shows up differently in engineering, finance, support, healthcare, and analytics. The harness has to reflect the domain, not just the model.

My take

AI shifts human labor from execution to adjudication.

But that is only half the story. AI also shifts the organization from local optimization toward frontier expansion, if leadership is willing to use the capability that way.

That is why I think the highest-leverage teams will not just be the ones that can get an agent to produce an artifact. They will be the ones that can raise ambition and still decide, reliably and at scale, whether an artifact deserves to become an outcome.

That is a different problem from prompting. It is a different problem from model benchmarking. And it is a much more interesting problem than most of the current discourse admits.

The biggest mistake teams can make right now is optimizing for the appearance of autonomy instead of the reality of completion, or optimizing only for efficiency instead of ambition.

The real battle is in the last mile.

The last mile is not a bug in AI agents. It is increasingly the job in the AI era.

Sources

Aaron Levie, “The never-ending last mile of work” (X thread)
Garry Tan, “Boil the Ocean”

Not Every AI Agent Should Be a Coding Agent

fitrakacamarga — Tue, 28 Apr 2026 05:18:46 GMT

One thing I’ve learned from building AI agents is this: not every AI agent should look like a coding agent.

Yes, coding agents are probably the most versatile kind of agent we have today. Give them a shell, filesystem access, web search, and enough autonomy, and they can do a surprising amount of work.

But versatility is not the same as effectiveness.

In practice, the best AI agents are not the most general ones. They are the ones designed around a specific objective — with the right tools, system prompt, skills, memory model, and operating environment.

That distinction becomes obvious when you build more than one kind of agent.Two Agents, Two Different Worlds

Over the past year, I’ve been building two AI agents that couldn’t be more different.

Agent 1: Seeknal — A Data-Native Agent

Seeknal is an all-in-one CLI for data and AI/ML engineering. Its AI agent (seeknal ask) needs to answer questions about your data, build and validate pipelines, profile datasets, and generate reports.

For this agent, the “native language” is data operations. So we designed it so that:

sql_query is a native tool — the agent queries PostgreSQL, DuckDB, and Iceberg directly
Seeknal’s own operations (organize, expose, action) are native tools, not invoked through bash
The agent has full context of your pipeline schemas, lineage, and entity definitions before you ask anything
It understands the draft → dry-run → apply workflow natively

The result: when you ask “what’s our revenue trend by region this quarter?”, the agent doesn’t spawn a Python script, write SQL to a file, execute it via bash, then parse the output. It queries the database directly through its native tool and returns the answer in seconds.

No bash. No grep. No pip install. Just the right tool for the job.

Agent 2: A Personal Agent — Simple by Design

In a different setting, I also built a simpler personal agent for daily tasks. Its job: search for information, inspect files, summarize context, and help with general workflows.

For this agent, native tools like filesystem and web_search are enough. It doesn’t need SQL-native tools, pipeline operations, or ML model management. And it works better because it’s not distracted by capabilities it doesn’t need.The Mistake I See Too Often

A lot of people start with the tools they already have — shell, browser, filesystem, search — then ask, “What can my agent do with this?”

I think that is backwards.

The better question is: what problem am I trying to solve, and what is the best environment for an agent to solve it in?

Because once you answer that, the rest becomes clearer: what tools should be native, what abstractions the agent should operate on, what skills should exist, and what kind of verification and recovery the harness should implement.

What Harness Engineering Taught Me

From our harness engineering research, we’ve been studying how to make AI agents reliable. One pattern keeps showing up: more tools doesn’t mean better performance. In fact, it often means worse.

A recent study (GraSP, arXiv:2604.17870) found that giving agents flat lists of skills actually hurts reliability. When you compile those same skills into a structured graph with typed dependencies, performance jumps — up to +19 reward and 41% fewer wasted steps.

Why? Because every tool you give an agent is a decision it has to make. More tools = more decisions = more chances to pick the wrong one.

This is why harness design matters so much:

The environment shapes the task
The task shapes the tools
The tools shape the agent’s behavior
And the harness determines whether that behavior stays useful or drifts into noise

A coding agent needs loop detection, patch verification, and file-level reasoning. A data-native agent needs schema grounding, query validation, and lineage awareness. A personal assistant needs lightweight context management and low-friction interaction patterns.

Same foundation. Different environment. Different harness.The Design Framework

When I think about building an AI agent now, I start with five questions:

What is the real objective? Not “use AI” — but what job should this agent do repeatedly and well?
What environment does that work actually live in? Shell? Database? Browser? Internal tools? Messaging layer?
What should be native tools vs. wrapped tools? Native tools shape agent fluency. If SQL is 80% of what the agent does, make it native — not something it scripts through bash.
What failure modes matter most? Wrong SQL is different from wrong code. Wrong personal summary is different from wrong production patch.
What harness makes this agent reliable in that environment? Detection, routing, guardrails, and recovery should be designed around the real task.

When to Use What

I’m not saying coding agents are bad. They’re essential for general-purpose tasks. But they shouldn’t be your default.

Use a coding agent when:

The task is unpredictable and requires exploration
You need to prototype something quickly
The environment is unknown or constantly changing

Use a domain-specific agent when:

The task is well-defined and repetitive
You have clear success criteria
The environment is structured and knowable
Reliability matters more than flexibility

The Uncomfortable Tradeoff

Building a domain-specific agent is more work upfront than building a general-purpose one. It’s easier to give an agent bash access and say “figure it out.”

But that agent will be slower, less reliable, and more expensive to run. Seeknal’s data agent can answer an analytics question in one tool call. A coding agent would need 5-10 tool calls, a temporary Python script, error handling, and output parsing to do the same thing.

The best agent is not the most general one. It is the one designed for the work you actually need done.

Start from the problem, not from the toolchain. Because sometimes bash, grep, and ls are enough. And sometimes they are exactly the wrong abstraction.

Why I Built My Own Data Pipeline Tool After 10+ Years as Data Scientist Then CTO

fitrakacamarga — Tue, 28 Apr 2026 05:18:04 GMT

I started my career as a data scientist at Indosat Ooredoo, then moved to Eureka AI in Singapore where I spent 7+ years — from writing ML models to leading engineering as Head of Engineering & ML. Eventually I became CTO, responsible for the whole data stack from ingestion to insight.

And at every single company, the same thing happened.

The Pattern

It always starts simple. You need to move data from A to B. You write some SQL. You build a pipeline. Maybe you adopt dbt because everyone says you should. Then you need ML features, so you add Feast. Then you need metrics that everyone agrees on, so you build a semantic layer. Then someone wants “self-serve analytics,” so you add a BI tool. Then someone wants to ask questions in natural language, so you wire up an LLM to your database.

Suddenly your data team is maintaining 5-7 tools that barely talk to each other.

The Breaking Point

The last straw was watching a data team spend three days trying to answer a simple business question: “What’s our revenue by region this quarter?”

Three days. Not because the data wasn’t there. Not because the team wasn’t smart. But because:

The pipeline that calculated revenue was in dbt
The region mapping was in a different system
The metric definition existed in two places with slightly different logic
Nobody could agree which numbers were “right”
And the AI chatbot they’d bolted on top had no context about any of this

Three days for a question that should take three minutes.

What I Built

Seeknal is a CLI tool that handles the full loop: define pipelines, serve features, manage metrics, and analyze data with an AI agent — all from one unified graph.

pip install seeknal

seeknal init my-project

seeknal draft   # scaffold your pipeline

seeknal dry-run   # compile, preview, check data quality

seeknal apply   # execute with incremental awareness

seeknal ask “show me revenue by region last quarter”

The workflow is inspired by Terraform and kubectl. Data engineers love dry-runs — you see exactly what will happen before it happens. No more “oops, I just overwrote the production table.”

The AI Agent That Knows Your Data

seeknal ask is an AI agent that has full context of your pipelines, schemas, and lineage. It can:

Answer questions about your data in natural language
Profile datasets and find anomalies
Build new pipelines from a description
Generate reports
Ingest files (even Excel) directly into your pipeline

It supports Google Gemini, OpenAI, Anthropic, or runs fully local with Ollama. Because sometimes your data shouldn’t leave your machine.

Why Open Source

I’ve been the person who couldn’t afford expensive data tools. I’ve been the team of one trying to build a data stack with zero budget. Open source leveled the playing field for me, so I’m doing the same.

Seeknal is Apache 2.0. No lock-in. Install it, use it, modify it, deploy it however you want.

Where It’s At

23+ releases. Production-grade pipeline execution. Incremental processing. Column-level lineage. Data quality checks. Working AI agent with tools and built-in skills. PostgreSQL and Apache Iceberg support.

If any of this resonates — if you’ve ever felt the pain of managing too many data tools that don’t work together — I’d love to hear your story. Chances are, I’ve lived it too.