Priya Raman is two weeks into the build when she notices the latency anomaly. She is the lead orchestration engineer on the BlueMirror build team, and the dashboard she has open shows that one path through the system, the medication side-effect query, is averaging 470 milliseconds end to end. Another path, the routine schedule check, is averaging 180. A third, a cross-domain question that touches health, finance, and family, is averaging 720. None of the numbers are out of budget. All of them are different. The pattern she is looking at is not a bug. It is the shape of the architecture.
The orchestration layer is built around a single decision: one slow, deliberate brain directs many fast, specialized hands. The brain holds the full picture of the person and decides what to do. The hands execute one task each and return a result. The shape of the latency graph reflects the shape of the work. A simple request touches few hands. A complex request touches many. The brain spends time reasoning about which hands to call, in what order, with what context. That reasoning has a cost. The cost is bounded.
This separation is the architectural decision that makes BlueMirror possible. Without it, every request would need to load the full context into every processing path. Latency would balloon. Memory would saturate. Concurrent users would collapse to a handful. With the separation, the system stays inside its budget while serving 150 to 500 concurrent subscribers from a single regional Community Pane node, with privacy-critical inference delegated to each subscriber’s home device. The number is not theoretical. It is the deployment specification that the three-zone compute architecture (BMT-06.03) is designed to deliver.
The tension
Two architectures are obvious. One general model that handles everything. Many independent models that each handle a domain. Both fail, in opposite ways, for the same underlying reason.
One general model is coherent. It sees the person whole. It can reason across health, finance, family, home, and the rest because it holds the full context. The cost is speed. A general model large enough to reason well about medication interactions and Social Security optimization and family scheduling is also a model too large to fit on edge hardware, too slow to meet sub-200-millisecond safety requirements, and too expensive to update incrementally. When the nutrition logic needs to change, you retrain the whole thing. The general model becomes a bottleneck precisely because it knows everything.
Many independent models are fast. Each one is small, focused, and easy to update. The cost is coherence. Margaret tells her health agent that her dizziness is getting worse. Her financial agent, which has no access to the health agent’s context, suggests she look at the cardiologist bill she has not paid. Her social agent, which has no access to either, schedules a video call with her grandson. The recommendations do not contradict one another. The agents behind them simply do not know about each other. The person experiences a committee of specialists passing notes. The whole point of the architecture, that it serves the whole person, is lost.
The hybrid model resolves the tension. One brain maintains coherence. Many hands execute fast. The brain decides; the hands act. The brain is one instance per user, holding the full context. The hands are stateless, distributed, and shared across users. The brain takes the latency hit on the reasoning. The hands deliver the speed on the execution. The combined budget closes within the perceptual threshold the system needs to feel responsive.
The H-layer in detail
The H-layer is the brain. One instance per user. Runs at the regional compute tier (Zone 2, the Community Pane described in BMT-06.03). Holds full access to the person’s Mixture of Context, the layered memory hierarchy that contains everything the system has learned about her. Applies her Personalized Reinforcement Learning from Human Feedback, the preference model that tells the system how she likes to be addressed, what data she wants to see first, what level of recommendation she trusts. The H-layer thinks.
The H-layer does five things and only five things. It performs cross-domain reasoning. The dizziness complaint, the recent diuretic adjustment, and the upcoming cardiology appointment are three signals from three domains. The H-layer relates them. No L-layer skill sees all three because no L-layer skill needs to. The H-layer holds the relationship.
It makes delegation decisions. The dizziness complaint activates the medication manager, the symptom monitor, and the cognitive state assessor in a specific priority sequence. The H-layer decides the sequence based on the request, the person’s history, and the current context. A different person with a different history would get a different sequence for the same words.
It manages multi-step planning. A care transition from hospital to home is a multi-day workflow with dependencies that span the medication manager (new prescriptions arrive), the buying agent (medical supplies arrive), the family coordination agent (the daughter is told what to expect), and the home environment agent (the bed is moved). The H-layer tracks the workflow. The L-layer skills do not know about each other.
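One way to picture the workflow tracking is as a dependency graph that the H-layer walks in order. The sketch below uses Python’s standard `graphlib` module. The step names and the dependency edges are hypothetical, chosen only to mirror the care-transition example. The source does not specify the actual workflow format.

```python
from graphlib import TopologicalSorter

# Hypothetical care-transition workflow. Each step maps to the set of steps
# that must finish before it can start. The edges are illustrative only.
CARE_TRANSITION = {
    "medication_manager.load_new_prescriptions": set(),
    "home_environment_agent.move_bed": set(),
    "buying_agent.order_medical_supplies": {
        "medication_manager.load_new_prescriptions",
    },
    "family_coordination_agent.brief_daughter": {
        "buying_agent.order_medical_supplies",
        "home_environment_agent.move_bed",
    },
}


def plan_batches(workflow):
    """Group steps into batches whose dependencies are already satisfied.

    Steps inside one batch do not depend on each other, so a planner could
    delegate them in parallel. Only the planner sees the whole graph.
    """
    sorter = TopologicalSorter(workflow)
    sorter.prepare()  # raises graphlib.CycleError on circular dependencies
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        batches.append(ready)
        sorter.done(*ready)
    return batches


batches = plan_batches(CARE_TRANSITION)
for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

With these assumed edges the plan has three batches: the prescriptions and the bed move first, the supply order second, the family briefing last. No individual step needs to know that the others exist.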
It applies P-RLHF preferences to every response. Margaret prefers data first, recommendation second. Dorothy prefers recommendation first, explanation only if she asks. James prefers bullet points; Evelyn prefers conversation. These are learned, not configured. The H-layer reads the preference vector before generating the response and shapes the synthesis accordingly. The L-layer skills produce raw output. The H-layer styles it.
It evaluates against the Human Agency Scale to decide when the person needs to be asked before an action proceeds. Margaret’s healthcare autonomy is set at 0.6, which means most observational responses can proceed autonomously, but recommendations involving irreversible action require her approval. The H-layer makes that determination on every turn. It does not delegate the decision to a skill because the decision requires the full picture.
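The approval check can be sketched as a small gate. The source gives Margaret’s healthcare autonomy as 0.6 but does not define how that number maps to decisions, so the rule below is an assumption: irreversible actions always require approval, and everything else is compared against a hypothetical per-action `risk` score. The class and function names are also invented for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProposedAction:
    description: str
    irreversible: bool
    risk: float  # hypothetical 0.0-1.0 score; not defined in the source


def needs_approval(action: ProposedAction, autonomy: float) -> bool:
    """Assumed rule, not the specified one.

    Irreversible actions always go back to the person. Other actions
    proceed on their own unless their risk exceeds the autonomy level.
    """
    if action.irreversible:
        return True
    return action.risk > autonomy


MARGARET_HEALTH_AUTONOMY = 0.6  # value taken from the text

observation = ProposedAction("Summarize this week's readings", False, 0.2)
cancellation = ProposedAction("Cancel the cardiology appointment", True, 0.3)

print(needs_approval(observation, MARGARET_HEALTH_AUTONOMY))
print(needs_approval(cancellation, MARGARET_HEALTH_AUTONOMY))
```

Under this assumed rule the observation proceeds and the cancellation waits for her. The gate sits in the H-layer because both inputs, the autonomy level and the judgment of what counts as irreversible, depend on the full picture.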
What the H-layer does not do is also worth naming. It does not run inference on user-facing language. It does not check medication databases. It does not assess vital signs. It does not categorize intent. Each of those is delegated to a skill that does it faster. The H-layer is slow because it thinks, and it thinks because the thinking has to be coherent.
At launch (Phase 1), the H-layer orchestration logic runs against the Zone 3 cloud reasoning layer for every subscriber. Zone 1 and Zone 2 have not yet deployed. The decomposition is identical: the H-layer holds the full picture and decides; the L-layer skills execute. The orchestration substrate is single-zone at launch. As Phase 2 brings Zone 1 online for subscribers who acquire a Local Pane, and as Phase 3 brings Zone 2 online for subscribers in served regions, the orchestration transitions to multi-zone for those subscribers without changing the H-layer or L-layer architecture. The Zone 3 layer continues throughout. For Zone 3-only subscribers, the orchestration remains single-zone, executing entirely against Zone 3, indefinitely. The code paths that the H-layer executes are the same in every phase and along every deployment path. The routing table is what changes.
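The claim that only the routing table changes can be illustrated with a minimal sketch. The skill placements, the function names, and the fallback-to-Zone-3 rule are assumptions made for the example. The source states only that Zone 3-only subscribers execute entirely against Zone 3 and that the H-layer code paths are the same in every phase.

```python
from enum import Enum


class Zone(Enum):
    LOCAL_PANE = 1      # Zone 1, in the subscriber's home
    COMMUNITY_PANE = 2  # Zone 2, the regional node
    CLOUD = 3           # Zone 3, the cloud reasoning layer


# Hypothetical preferred placement for three skills (illustrative only).
PREFERRED_ZONE = {
    "safety_filter": Zone.LOCAL_PANE,
    "intent_classifier": Zone.COMMUNITY_PANE,
    "response_generator": Zone.COMMUNITY_PANE,
}


def build_routing_table(has_local_pane: bool, in_served_region: bool) -> dict:
    """Map each skill to a zone for one subscriber.

    Assumption: a skill falls back to Zone 3 whenever its preferred zone
    is not deployed for that subscriber.
    """
    available = {Zone.CLOUD}
    if has_local_pane:
        available.add(Zone.LOCAL_PANE)
    if in_served_region:
        available.add(Zone.COMMUNITY_PANE)
    return {
        skill: (zone if zone in available else Zone.CLOUD)
        for skill, zone in PREFERRED_ZONE.items()
    }


def delegate(skill: str, routing_table: dict) -> str:
    """The delegation call path. It is identical for every subscriber."""
    return f"{skill} -> Zone {routing_table[skill].value}"


launch = build_routing_table(has_local_pane=False, in_served_region=False)
multi_zone = build_routing_table(has_local_pane=True, in_served_region=True)

print(delegate("safety_filter", launch))
print(delegate("safety_filter", multi_zone))
```

The `delegate` function never branches on phase. A subscriber moving from single-zone to multi-zone gets a new table, and the same call lands in a different place.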
The L-layer in detail
The L-layer is the hands. Stateless. Distributed across Zone 1 (Local Pane, in the person’s home) and Zone 2 (Community Pane, the regional node) based on each skill’s privacy sensitivity and latency budget. Reusable across users. The L-layer skills do not know the full person. They know what they need to know for the task in front of them.
A skill receives a context package from the Mixture of Context router. The package contains the minimum information the skill needs to produce its output. It executes one domain-specific task. It returns a structured result to the H-layer for synthesis. It has no memory between invocations. Each call is independent.
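A rough sketch of that contract follows. The real skill interface is defined in the technical appendix, so every name here (`ContextPackage`, `SkillResult`, `Skill`, the refill rule and its seven-dose threshold) is a hypothetical stand-in. What the sketch is meant to show is the shape: package in, structured result out, nothing retained between calls.

```python
from dataclasses import dataclass
from typing import Any, Mapping, Protocol


@dataclass(frozen=True)
class ContextPackage:
    skill: str
    fields: Mapping[str, Any]  # the minimum context for this one call


@dataclass(frozen=True)
class SkillResult:
    skill: str
    status: str
    payload: Mapping[str, Any]


class Skill(Protocol):
    """Hypothetical interface: one task, no state between invocations."""

    name: str

    def run(self, package: ContextPackage) -> SkillResult: ...


class RefillPrescription:
    """Illustrative skill. The seven-dose threshold is made up."""

    name = "refill_prescription"

    def run(self, package: ContextPackage) -> SkillResult:
        # Everything the skill uses comes from the package. It stores nothing.
        remaining = package.fields["doses_remaining"]
        payload = {
            "medication": package.fields["medication"],
            "refill_due": remaining <= 7,
        }
        return SkillResult(self.name, "ok", payload)


package = ContextPackage(
    "refill_prescription",
    {"medication": "example-medication", "doses_remaining": 5},
)
first = RefillPrescription().run(package)
second = RefillPrescription().run(package)
print(first == second)
```

Because the skill holds no state, two separate instances given the same package return equal results. That property is what lets one deployed skill serve many users.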
The granularity principle is what makes this work. Skills map to user-recognizable actions. “Refill prescription” is a skill. “Make HTTP POST to pharmacy API” is too low. “Handle health” is too high. Skills cut too low waste the H-layer’s coordination capacity on glue work. Skills cut too high defeat the purpose of decomposition by collapsing many decisions into one undifferentiated mass. The middle is where the architecture earns its keep.
The skills are also independently updatable. When the Medication Advisor SLM improves, no other component changes. When a new infrastructure agent is added for a new domain, no existing skill is touched. The H-layer’s delegation logic is updated to know about the new skill. Everything else stays the same. This is what allows the system to grow capability over time without rewriting the existing system every quarter.
The shared infrastructure is the second economic argument. One Cognitive State Estimator serves the health concierge, the cognitive concierge, and the earning concierge. One Safety Filter gates every output across all thirteen concierge agents. One Intent Classifier routes every incoming request. The L-layer is shared in a way the H-layer cannot be, because every person needs her own H-layer state but everyone can share the same Cognitive State Estimator weights.
How context flows
The Mixture of Context router sits between the H-layer and the L-layer. When the H-layer delegates to a skill, the router builds a context package. The package contains what the skill needs and nothing else.
The router does four things in roughly twenty-five milliseconds. First, it analyzes the skill’s declared context requirements. The Medication Advisor declares that it needs the medication list, current symptoms, allergies, and recent lab values. It does not need the family schedule or the financial history. Second, it selects the minimum necessary MoC layers. Layer 0 is the core identity, always loaded. Layers 1 through 4 are loaded selectively based on the query type and the skill requirements. Third, it applies token budget constraints. Different skills get different budgets. The Safety Filter operates under a tight budget because it must respond in twenty-five milliseconds. The Response Generator gets a larger budget because it produces the user-facing language. Fourth, it packages the result and delivers it with the skill invocation.
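The four router steps can be sketched in a few lines. This is an illustration under stated assumptions, not the router’s actual logic: the assignment of fields to layers is invented (the real layer contents are specified in BMT-05.01), the word-count tokenizer is a stand-in, and the budget of 20 is arbitrary.

```python
from dataclasses import dataclass
from typing import Dict, Tuple


@dataclass(frozen=True)
class SkillSpec:
    name: str
    required_fields: Tuple[str, ...]
    token_budget: int


def count_tokens(text: str) -> int:
    # Stand-in for a real tokenizer: counts whitespace-separated words.
    return len(text.split())


def build_package(spec: SkillSpec, moc: Dict[int, Dict[str, str]]) -> dict:
    # Step 1: read the declared context requirements.
    wanted = set(spec.required_fields)
    # Step 2: Layer 0 always, plus any layer holding a required field.
    layers = {0} | {n for n, fields in moc.items() if wanted.intersection(fields)}
    # Step 3: add fields in layer order while the token budget allows.
    context, used = {}, 0
    for n in sorted(layers):
        for name, text in moc[n].items():
            if n != 0 and name not in wanted:
                continue
            cost = count_tokens(text)
            if used + cost <= spec.token_budget:
                context[name] = text
                used += cost
    # Step 4: package the result for delivery with the invocation.
    return {"skill": spec.name, "layers": sorted(layers),
            "tokens": used, "context": context}


# Hypothetical layer contents with placeholder text.
MOC = {
    0: {"identity": "core identity record"},
    1: {"medication_list": "placeholder medication list",
        "allergies": "placeholder allergy list"},
    2: {"current_symptoms": "placeholder symptom notes",
        "recent_labs": "placeholder lab values"},
    3: {"financial_history": "placeholder financial history"},
    4: {"family_schedule": "placeholder family schedule"},
}

advisor = SkillSpec(
    "medication_advisor",
    ("medication_list", "current_symptoms", "allergies", "recent_labs"),
    token_budget=20,
)
package = build_package(advisor, MOC)
print(package["layers"], package["tokens"], sorted(package["context"]))
```

In this toy data the Medication Advisor receives Layers 0 through 2 and never sees the financial history or the family schedule, which matches the declaration in the text.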
This is the mechanism that achieves the eighty-five percent token reduction the system targets. A naive approach would load the full context, around five thousand tokens for a developed user, into every skill call. The router approach loads around eight hundred tokens for a typical query, with relevance maintained at the ninety-five percent level. The reduction matters not just for cost but for latency. The skill processes a smaller context faster. The router processes its decision faster than the skill would have processed the unnecessary context. The savings compound.
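The arithmetic behind those figures is worth making explicit. The snippet below uses only the two approximate token counts given above. The variable names are illustrative.

```python
FULL_CONTEXT_TOKENS = 5000   # "around five thousand" for a developed user
ROUTED_TOKENS = 800          # "around eight hundred" for a typical query
TARGET_REDUCTION_PCT = 85

# Fraction of tokens saved by routing instead of loading everything.
reduction = 1 - ROUTED_TOKENS / FULL_CONTEXT_TOKENS

# Tokens per call that would hit the target exactly (integer arithmetic).
tokens_at_target = FULL_CONTEXT_TOKENS * (100 - TARGET_REDUCTION_PCT) // 100

print(f"reduction at {ROUTED_TOKENS} tokens: {reduction:.0%}")
print(f"tokens per call at exactly {TARGET_REDUCTION_PCT}%: {tokens_at_target}")
```

This prints 84% for the typical query and 750 tokens for an exact eighty-five percent reduction. The round numbers in the text therefore land just under the target, which is consistent with both figures being stated as approximations.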
The eighty-five percent reduction is a target, not a guarantee. Cross-domain queries that touch four or five domains pull more layers and run higher. Highly specialized queries that need only Layer 0 and a fragment of Layer 3 run lower. The target is the average. It is what the budget closes on at scale.
The consistency problem
The hardest engineering challenge in the orchestration layer is not speed. It is consistency. When the person tells the health concierge to stop reminding her about blood pressure, the financial concierge must not mention blood pressure medication costs five minutes later. The two concierges share a memory model. The memory model has to update fast enough that the second concierge sees the first concierge’s change before its next action.
The architectural decision is split. Strong consistency for preference changes. Eventual consistency for context updates that do not affect user-facing behavior.
Strong consistency means the change is visible to every component before any component proceeds. It is slower because it requires synchronization. It is necessary because the cost of an inconsistency is a contradiction the person sees. When Margaret tells the system to stop reminding her about blood pressure, every concierge that might mention blood pressure has to know before its next interaction. The latency hit is paid because the alternative, a financial concierge that brings up blood pressure medication after the health concierge was told to stop, breaks trust.
Eventual consistency means the change propagates over time. It is faster because no synchronization is required. It is acceptable for updates that do not produce visible contradictions. When the symptom monitor logs a new vital sign reading, the data eventually becomes available to every other component, but no concierge will produce a wrong response in the interim because no concierge needs the new reading immediately.
The split is a tradeoff. Strong consistency is more expensive but safer for user-facing state. Eventual consistency is cheaper but tolerable only where the staleness window is invisible. Getting the boundary right between the two is what makes the system feel responsive without producing the contradictions that would expose the multi-agent architecture to the user.
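The split can be shown with a toy in-process model. This is not the synchronization protocol, which sits in the technical appendix. It is a single-threaded sketch, with hypothetical class and key names, of the one property that matters: a preference write is visible everywhere before the call returns, while a context write becomes visible only after a later propagation step.

```python
from collections import deque


class SharedMemory:
    """Toy model: one local view per concierge, two write paths."""

    def __init__(self, concierges):
        self.views = {name: {} for name in concierges}
        self.pending = deque()

    def set_preference(self, key, value):
        # Strong path: update every view before returning to the caller.
        for view in self.views.values():
            view[key] = value

    def log_context(self, key, value):
        # Eventual path: queue the update and return immediately.
        self.pending.append((key, value))

    def propagate(self):
        # Background step: drain queued context updates into every view.
        while self.pending:
            key, value = self.pending.popleft()
            for view in self.views.values():
                view[key] = value

    def read(self, concierge, key):
        return self.views[concierge].get(key)


memory = SharedMemory(["health", "financial"])
memory.set_preference("blood_pressure_reminders", "off")
memory.log_context("latest_vital_reading", "placeholder reading")

print(memory.read("financial", "blood_pressure_reminders"))
print(memory.read("financial", "latest_vital_reading"))
```

Immediately after the two writes, the financial concierge already sees that reminders are off, and the new reading is not yet visible to it. That gap is the staleness window, and the boundary decision is about which keys are allowed to have one.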
Why this matters for partners and investors
The H-layer and L-layer separation creates three properties that show up in technical due diligence. Modularity, because new capabilities can be added as new L-layer skills with no change to the H-layer beyond an update to its delegation logic. Testability, because each skill can be tested in isolation against a synthetic context package without standing up the full system. Scalability, because the L-layer skills are shared across users while only the H-layer state is per-user, which means adding a user adds one H-layer state rather than a new copy of the models behind all thirteen concierge agents.
For partner architects, the integration point is the L-layer. A partner does not build into the H-layer. A partner builds a skill, registers it with the router, declares its context requirements, and receives invocations when the H-layer determines the skill is the right hand for the job. The partner’s skill is treated like any other skill. The H-layer does not care whether the skill was built by BlueMirror or by a partner. It cares whether the skill returns a structured result inside the latency budget.
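From a partner’s side, the registration step might look roughly like the sketch below. The actual registration API is not given in this article, so `SkillRegistration`, `SkillRegistry`, the `owner` field, and the example skill are all hypothetical. The sketch records a latency budget but does not enforce it. A real router would have to.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Tuple


@dataclass(frozen=True)
class SkillRegistration:
    name: str
    owner: str  # informational only; never consulted at invocation time
    context_requirements: Tuple[str, ...]
    latency_budget_ms: int  # recorded here, not enforced in this sketch
    handler: Callable[[dict], dict]


class SkillRegistry:
    def __init__(self) -> None:
        self._skills: Dict[str, SkillRegistration] = {}

    def register(self, registration: SkillRegistration) -> None:
        if registration.name in self._skills:
            raise ValueError(f"skill already registered: {registration.name}")
        if not registration.context_requirements:
            raise ValueError("a skill must declare its context requirements")
        self._skills[registration.name] = registration

    def invoke(self, name: str, context: dict) -> dict:
        registration = self._skills[name]
        # Pass only the declared fields, whoever built the skill.
        package = {key: context[key]
                   for key in registration.context_requirements
                   if key in context}
        return registration.handler(package)


def price_check(package: dict) -> dict:
    # Hypothetical partner skill: reports which fields it was given.
    return {"status": "ok", "fields_seen": sorted(package)}


registry = SkillRegistry()
registry.register(SkillRegistration(
    "pharmacy_price_check", "partner-example",
    ("medication_list",), 150, price_check,
))
result = registry.invoke(
    "pharmacy_price_check",
    {"medication_list": "placeholder", "family_schedule": "placeholder"},
)
print(result)
```

The partner skill receives the medication list and nothing else, even though the caller held more context. The invocation path has no branch on `owner`, which is the sense in which a partner skill is treated like any other.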
This is the architecture that ships in the next twelve months. The H-layer state machine is specified, the L-layer skill interface contract is defined, and the LangGraph DAG configurations for the common workflows are drafted. The deeper specifications, including the strong-consistency synchronization protocol, the context package format details, and the consistency boundary documentation, sit in the technical appendix.
Cross-references
The Thirteen (BMT-01.01). The concierge agents that compose from the orchestration described here. Each concierge is a coherent product of multiple L-layer skills coordinated by the H-layer.
The Five Layers (BMT-05.01). The Mixture of Context hierarchy that feeds the router described in this article. The router selects from these layers; that article specifies what each layer contains.
Why Thirty Models (BMT-06.01). The SLM portfolio that powers the L-layer skills. This article describes the skill abstraction; that article describes the models that execute under it.
The Architecture of Permission (BMT-04.SYN). The ethical framework that the H-layer enforces when it evaluates against the Human Agency Scale. The autonomy tiers governing delegation decisions are defined there.
Technical Appendix BMT-02.01-A is available to partners and investors at partners.bluemirror.tech.
