April 2026

Pruning capability overhang with architecture


OpenAI defines capability overhang as the gap between what AI tools can do and how typical users are using them. In document-heavy industries (legal, financial services, healthcare, government), the overhang is substantial: not because the models are not ready, but because the architecture required to deploy them well is still maturing.

A useful approach is to start with the data as-is and build around it. Use hybrid retrieval, fuse it with Reciprocal Rank Fusion (RRF), and wrap it in deterministic infrastructure: indexing, orchestration, storage, and audit. Then place the model inside. The model reasons within a structure it did not have to build. The system supplies context, tools, and constraints; the model supplies judgment. We build the road, the vehicle, and the guardrails, and supply the data; the model does the driving.
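RRF itself is simple: each retriever contributes a score of 1/(k + rank) per document, and the fused ranking sorts by the sum. A minimal sketch (function and document names are illustrative, not part of any real system here):

```python
# Reciprocal Rank Fusion over best-first ranked lists of doc IDs.
# k dampens the influence of top ranks; 60 is a common default.

def rrf_fuse(ranked_lists, k=60):
    scores = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort doc IDs by fused score, highest first.
    return sorted(scores, key=scores.get, reverse=True)

dense = ["d3", "d1", "d7"]   # e.g. embedding retriever
sparse = ["d1", "d5", "d3"]  # e.g. BM25
exact = ["d1"]               # e.g. exact-match lookup
print(rrf_fuse([dense, sparse, exact]))  # → ['d1', 'd3', 'd5', 'd7']
```

A document that appears high in several lists (here `d1`) outranks one that is first in only a single list, which is why RRF is a robust default for fusing heterogeneous retrievers without score calibration.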

In practice, the pipeline might look something like this:

job definition (YAML) → runner (deterministic orchestration) → reasoning model (reads, decides, delegates) → retrieval engine (ColQwen + E5 + BM25 + exact, RRF-fused) → visual sub-agents (pdf/html → png, fan-out page readers) → structured output (verified, cited, traceable)
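To make the first stage concrete, a job definition in this shape might look like the following. This is a hedged sketch; every field name is hypothetical, not a real schema:

```yaml
# Hypothetical job definition; field names are illustrative only.
job: extract_obligations
inputs:
  - contracts/*.pdf
retrieval:
  engines: [colqwen, e5, bm25, exact]
  fusion: rrf
reasoning:
  model: claude          # swappable: gpt, or several in parallel
  tools: [search, read_page, cite]
output:
  format: structured
  require_citations: true
```

The point of keeping this layer declarative is that the runner can execute it deterministically, and the model never has to reason about orchestration.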

The reasoning model steers within its boundaries. It asks for what it needs in simple terms, and the system executes. To reduce cognitive load, we give the model a menu of tools, not a manual. Visual sub-agents fan out across candidate pages; with N sub-agents operating in parallel, latency scales closer to the critical path than linearly with page count. Every claim traces back to a cited page. Perception is separated from judgment. The result is structured, auditable, and reviewable.
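The fan-out pattern can be sketched in a few lines. `read_page` is a stand-in for a visual sub-agent call (an assumption, not the actual interface); with a pool of workers, wall-clock latency tracks the slowest page rather than the sum over pages:

```python
# Fan out page reads across a thread pool so I/O-bound sub-agent calls
# run concurrently instead of sequentially.
from concurrent.futures import ThreadPoolExecutor

def read_page(page_id):
    # Placeholder for a visual sub-agent reading one rendered page
    # and returning cited, structured claims.
    return {"page": page_id, "claims": [f"claim from page {page_id}"]}

def fan_out(page_ids, max_workers=8):
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # map preserves input order, so results stay traceable to pages.
        return list(pool.map(read_page, page_ids))

results = fan_out([1, 2, 3, 4])
```

Because each result carries its page ID, the downstream verification step can check every claim against the page it cites.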

If designed well, the architecture can provide properties the model does not have: parallelism, deterministic retrieval, verification by citation, and the separation of perception from judgment. None of these disappear when models get smarter, because they are orthogonal to reasoning. The model stops looking like the foundation and starts looking like a parameter. Claude today, GPT tomorrow, several in parallel when the task demands it. The architecture remains.

There are open problems here. I have not yet figured out what to do when the underlying data conflicts across documents. But so far, this approach is working well, and I think it points in the right direction.