Wei Chen has spent eleven years building production ML systems for healthcare companies that are now mostly defunct. She is on a due diligence call with the BlueMirror technical team because the fund she advises is considering a position. The question she has come to ask is not whether the architecture is interesting. It is whether the architecture is real. Specifically, whether the thirty small language models the BlueMirror specification describes are the system that runs today, or the system the company wishes it had.
The answer she gets is concrete. Thirty models is the target portfolio, the engineering destination the system is being built toward over twenty-four to thirty-six months. At launch, the system runs entirely on a commercial cloud reasoning layer (Zone 3 in the three-zone architecture described in BMT-06.03). No proprietary models run yet, in any zone. Zone 3 handles every query for every subscriber. Over time, proprietary models trained on real subscriber interaction data deploy first to the Local Panes of subscribers who acquire one (Zone 1, in Phase 2), then to regional Community Pane nodes in the markets where those nodes have been deployed (Zone 2, in Phase 3). Zone 3 continues throughout, handling deep reasoning that exceeds Zone 1 and Zone 2 capacity and serving subscribers who never acquire a Local Pane or whose region never gets a Community Pane.
Wei appreciates the specificity. She has seen enough AI companies describe their target architecture as their current architecture. BlueMirror describes both, distinguishes them, and explains the pipeline that connects them.
Why not one model
Five constraints force decomposition, and they compound.
Latency is the first. The Safety Filter must respond in fifteen milliseconds because it gates every output the system produces. A large general-purpose model cannot meet that latency at the edge. A 120-million-parameter model optimized for safety classification can. For subscribers with a Local Pane, the Safety Filter runs in Zone 1 and meets the budget by eliminating the network hop entirely. For subscribers without a Local Pane, the Safety Filter runs in Zone 2 or Zone 3, where the network hop leaves a tighter compute budget that the system absorbs through parallelism and caching. Either way, the Safety Filter is its own model rather than a function of a monolithic one.
Privacy is the second. The Cognitive State Estimator processes the most sensitive data the system handles: behavioral signals that reveal cognitive trajectory. For subscribers with a Local Pane, this model runs in Zone 1 and the underlying behavioral data never leaves the home. A monolithic model could not provide this guarantee because it could not be split into a privacy-critical local component and a cross-domain remote component. For Zone 3-only subscribers, the Cognitive State Estimator runs in Zone 3 under the data processing agreement that governs the cloud reasoning layer. The privacy posture is weaker than the Local Pane provides, but the architectural property that this is a discrete, audited, replaceable model rather than an opaque part of a monolithic system matters for both deployment paths.
Incrementality is the third. When the Nutrition Advisor needs improvement because new dietary research changes recommendations, the team retrains a single small model on a focused dataset. With one general model, the same improvement requires retraining the entire system. The capacity to update one component without disturbing the rest is what allows the system to evolve continuously rather than in disruptive quarterly releases.
Cost is the fourth. The total development cost for the SLM portfolio is approximately $600,000 to $1 million over twenty-four months, executed through university research partnerships in India (BMT-06.04). A general-purpose model trained from scratch on healthcare-specific data would cost several million dollars and would still need per-task fine-tuning.
Deployability is the fifth. The thirty models distribute across a three-zone compute architecture (BMT-06.03). Privacy-critical models target Zone 1 for subscribers who have a Local Pane. Heavy inference models target Zone 2 where a Community Pane has deployed. Models that exceed Zone 2 capacity, novel queries, and the full inference workload for Zone 3-only subscribers run in Zone 3. The decomposition allows each model to be deployed where its task requirements dictate, across multiple deployment paths. A monolithic model cannot be split across zones with different privacy boundaries and cannot be selectively deployed to different subscribers based on their hardware situation. A decomposed portfolio can.
The target portfolio
The thirty models organize into five categories of roughly six models each. The categories track the functional layers of the system.
The Core Interaction category handles real-time user-facing language. The Response Generator produces conversational output. The Intent Classifier categorizes incoming requests by domain and sub-domain. The Emotion Detector recognizes emotional state from text and voice. The Empathy Responder generates emotionally calibrated responses. The Clarification Generator produces follow-up questions when requests are ambiguous. These models range from 100 to 400 million parameters, with inference latency targets under 100 milliseconds.
The Memory Care category specializes in cognitive support. The Orientation Assessor performs time, place, and person checks. The Cognitive State Estimator detects lucidity and cognitive fluctuation from behavioral signals. The Confusion Detector identifies disorientation patterns from conversation flow. The Reminiscence Prompter generates life-story engagement prompts. The Simplification Engine adjusts language complexity based on cognitive state. These models range from 70 to 200 million parameters, with the Cognitive State Estimator processing 30-second behavioral windows rather than real-time per-token inference.
The Domain Expert category provides specialized knowledge. The Medication Advisor handles drug interaction checking. The Nutrition Advisor generates dietary recommendations. The Exercise Coach suggests mobility activities. The Sleep Pattern Analyzer assesses rest quality from temporal data. The Financial Advisor and Legal Advisor handle their respective domains. These models range from 100 to 200 million parameters.
The Routing and Safety category gates the system’s behavior. The MoC Router selects context layers per query. The Safety Filter validates outputs for harmful content. The Privacy Filter detects personally identifiable information before any outbound transmission. The Escalation Classifier decides when human intervention is needed. The Trust Evaluator scores external agents in the Blue Pane membrane context. These models range from 80 to 150 million parameters, with the Safety and Privacy Filters targeting sub-15-millisecond inference because they gate every output.
The Specialized Function category handles sensor and analytical tasks. The Speech-to-Intent model converts voice commands to structured intents. The Voice Tone Analyzer extracts emotional tone from speech. The Temporal Pattern Detector finds patterns in time-series behavior. The Anomaly Detector flags deviations from established baselines. The Summary Generator produces conversation and event summaries.
Total target portfolio: approximately 2 billion parameters across thirty models. After INT4 quantization, total storage footprint is approximately 1 gigabyte.
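The storage figure follows from simple arithmetic: INT4 stores each weight in four bits, or half a byte. The sketch below reproduces that calculation. It is a rough estimate that assumes every weight is quantized and ignores quantization scales, zero points, and any layers kept at higher precision, none of which the text quantifies. The function name is illustrative, not part of the specification.

```python
def quantized_footprint_bytes(num_params: int, bits_per_weight: int = 4) -> float:
    """Approximate weight storage in bytes, ignoring scale/zero-point overhead."""
    return num_params * bits_per_weight / 8


total_params = 2_000_000_000  # approximate portfolio total stated in the text
footprint_gb = quantized_footprint_bytes(total_params) / 1e9
print(f"{footprint_gb:.2f} GB")  # prints "1.00 GB"
```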
What runs at launch
The launch portfolio is smaller than the target. Describing it accurately matters more than describing the target.
At launch (Phase 1), no proprietary models run in any zone. Zero. Zone 1 has not deployed for any subscriber. Zone 2 has not deployed in any region. The system runs entirely on Zone 3, the cloud reasoning layer, which is fulfilled by a commercial cloud inference provider operating under a healthcare data processing agreement.
The reason for this is pragmatic, not aspirational. Training thirty domain-specific SLMs from scratch before serving a single subscriber would require eighteen to twenty-four months of pre-revenue development, with models trained on synthetic data alone and no real interaction signal to validate them. Launching on Zone 3 inverts this. The system deploys within six months. Subscribers interact with it. Every interaction generates training signal that no amount of synthetic data can replicate. The interaction data is the raw material for the proprietary SLMs that will eventually deploy to Zone 1 and Zone 2.
The orchestration logic at launch is identical to the orchestration logic at maturity. The H-layer decomposes the task, delegates to L-layer skills, and synthesizes the response (BMT-02.01). The substrate that fulfills each skill differs across phases. At launch, every skill resolves to a Zone 3 inference call. At Phase 2, the privacy-critical skills (for subscribers with a Local Pane) resolve to Zone 1 inference. At Phase 3, the routine skills (for subscribers in regions with a Community Pane) resolve to Zone 2 inference. The code paths do not change. The endpoints do.
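One way to picture "the code paths do not change, the endpoints do" is a resolver that maps each skill to a zone based on the current phase and the subscriber's hardware. The following is a minimal sketch under stated assumptions: the skill labels, the `Subscriber` fields, and the `resolve_zone` function are hypothetical names chosen for illustration, and the real routing rules in BMT-02.01 may differ.

```python
from dataclasses import dataclass
from enum import Enum


class Zone(Enum):
    LOCAL = 1      # Zone 1: Local Pane
    COMMUNITY = 2  # Zone 2: Community Pane
    CLOUD = 3      # Zone 3: cloud reasoning layer


@dataclass(frozen=True)
class Subscriber:
    has_local_pane: bool
    region_has_community_pane: bool


# Hypothetical skill labels; the actual skill registry is not given in the text.
PRIVACY_CRITICAL = {"safety_filter", "privacy_filter", "cognitive_state_estimator"}
ROUTINE = {"medication_reminder", "appointment_scheduling"}


def resolve_zone(skill: str, phase: int, sub: Subscriber) -> Zone:
    """Pick the zone that fulfills a skill; callers use the same code path in every phase."""
    if phase >= 2 and skill in PRIVACY_CRITICAL and sub.has_local_pane:
        return Zone.LOCAL
    if phase >= 3 and skill in ROUTINE and sub.region_has_community_pane:
        return Zone.COMMUNITY
    return Zone.CLOUD  # Zone 3 is always present


# The same call resolves differently as phases advance and hardware appears.
full = Subscriber(has_local_pane=True, region_has_community_pane=True)
cloud_only = Subscriber(has_local_pane=False, region_has_community_pane=False)
for phase in (1, 2, 3):
    print(phase, resolve_zone("safety_filter", phase, full),
          resolve_zone("safety_filter", phase, cloud_only))
```

In this sketch a Zone 3-only subscriber resolves to the cloud layer in every phase, which matches the text's point that the launch architecture never changes for them.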
The migration path
Over twenty-four to thirty-six months, proprietary SLMs trained on real subscriber interaction data deploy first to Zone 1 (for subscribers with a Local Pane) and then to Zone 2 (in regions with a deployed Community Pane). The process is described in full in BMT-06.04, but the portfolio-level view matters for this article.
Months 0 to 12 (Phase 1 maturity): The system runs entirely on Zone 3. Real interaction data accumulates. The India university teams (IIIT Hyderabad, IIT Madras) pretrain V0.5 SLMs on synthetic data generated through the Zone 3 inference layer. No models have deployed to any subscriber yet.
Months 12 to 18 (Phase 2 begins): Subscribers who acquire a Local Pane gain Zone 1. The V0.5 Tiny LMs deploy to those subscribers’ devices: Safety Filter, Privacy Filter, Cognitive State Estimator, Emotion Detector, Speech-to-Intent, plus the remaining Zone 1 portfolio as it becomes ready. For subscribers with a Local Pane, the privacy-critical workload shifts from Zone 3 to Zone 1. Everything else still routes through Zone 3. For subscribers without a Local Pane, the system continues to run entirely on Zone 3. The launch architecture has not changed for them.
Months 18 to 30 (Phase 3 begins): Zone 2 regional nodes deploy in the first markets. V1.0 SLMs for routine query classes (medication reminders, appointment scheduling, simple benefits questions) pass A/B quality validation against Zone 3 and deploy to Zone 2. For subscribers in those markets, routine queries shift from Zone 3 to Zone 2. Zone 3 continues to handle complex queries and to serve subscribers outside the deployed regions or without a Local Pane.
Months 30 to 36 (Phase 3 maturity): The full thirty-model portfolio is deployed to the zones the architecture targets. For a subscriber with all three zones (Zone 1 + Zone 2 + Zone 3), inference distributes roughly 15 to 20 percent in Zone 1, 55 to 60 percent in Zone 2, and the balance in Zone 3 for queries that exceed regional capacity. For a Zone 2 + Zone 3 subscriber, the Zone 1 fraction shifts to Zone 2 or Zone 3 based on the privacy classification of each query. For a Zone 3-only subscriber, 100 percent of inference runs in Zone 3 throughout all phases.
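The Zone 3 "balance" for a three-zone subscriber is implied rather than stated. A small calculation using only the ranges given above suggests what it works out to. The function name is an illustrative assumption, and the result is an arithmetic consequence of the stated ranges, not a figure from the specification.

```python
def zone3_balance(zone1_pct: float, zone2_pct: float) -> float:
    """Share of inference left for Zone 3 once Zone 1 and Zone 2 take theirs."""
    balance = 100.0 - zone1_pct - zone2_pct
    if balance < 0:
        raise ValueError("Zone 1 and Zone 2 shares exceed 100 percent")
    return balance


low = zone3_balance(20, 60)   # upper ends of the Zone 1 and Zone 2 ranges
high = zone3_balance(15, 55)  # lower ends of the Zone 1 and Zone 2 ranges
print(f"Zone 3 share: {low:.0f} to {high:.0f} percent")
# prints "Zone 3 share: 20 to 30 percent"
```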
The portfolio is not static at month thirty-six. The training pipeline continues. The models improve. New models are added as the platform expands to new domains. Zone 3 continues to do deep reasoning that exceeds Zone 2 capacity and to serve every subscriber whose path includes Zone 3 (which is all of them, since Zone 3 is always present). The architecture grows. It does not retire Zone 3.
The right architecture for the right task
The portfolio uses four architecture types. The choice is per-model and justified by the task.
State space models handle temporal pattern recognition with linear computational complexity. The Anomaly Detector, the Temporal Pattern Detector, the Sleep Pattern Analyzer, and the Health Monitor process time-series data where linear scaling matters. A transformer’s quadratic attention overhead would dominate inference cost on long sequences.
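The scaling argument can be made concrete with a rough operation count. The sketch below compares the asymptotic cost of self-attention, which grows with the square of sequence length, against a state space scan, which grows linearly. The model width and state dimension are hypothetical values chosen only for illustration, and constant factors and hardware effects are ignored, so the numbers indicate a trend rather than measured performance.

```python
def attention_cost(seq_len: int, d_model: int) -> int:
    """Rough cost of computing self-attention scores: O(n^2 * d)."""
    return seq_len * seq_len * d_model


def ssm_cost(seq_len: int, d_model: int, state_dim: int) -> int:
    """Rough cost of a state space scan: O(n * d * state_dim)."""
    return seq_len * d_model * state_dim


d_model, state_dim = 512, 16  # hypothetical sizes, not from the specification
for n in (1_000, 10_000, 100_000):
    ratio = attention_cost(n, d_model) / ssm_cost(n, d_model, state_dim)
    print(f"n={n:>7}: attention/SSM cost ratio ~ {ratio:,.1f}")
```

Under these assumptions the ratio equals the sequence length divided by the state dimension, so every tenfold increase in sequence length widens the gap tenfold.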
Mixture of experts provides parameter efficiency for classification and routing. The Intent Classifier, the Safety Filter, and the MoC Router need broad knowledge but activate only relevant expert sub-networks per query. Most parameters are dormant during any single inference.
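To illustrate why most parameters stay dormant, consider a model whose router activates only a few experts per input. The sketch below counts stored versus active parameters. The split into shared layers and eight experts is a hypothetical breakdown of a 120-million-parameter model, chosen for illustration; the specification does not give expert counts or sizes.

```python
def total_params(shared: int, per_expert: int, num_experts: int) -> int:
    """All stored parameters: shared layers plus every expert."""
    return shared + num_experts * per_expert


def active_params(shared: int, per_expert: int, top_k: int) -> int:
    """Parameters used for one input when the router selects top_k experts."""
    return shared + top_k * per_expert


# Hypothetical split of a 120M-parameter model; not taken from the specification.
shared, per_expert, num_experts, top_k = 24_000_000, 12_000_000, 8, 2

total = total_params(shared, per_expert, num_experts)
active = active_params(shared, per_expert, top_k)
print(f"total={total:,} active={active:,} fraction={active / total:.0%}")
# prints "total=120,000,000 active=48,000,000 fraction=40%"
```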
Transformers deliver attention quality for generation. The Response Generator, the Empathy Responder, and the Summary Generator need the full attention mechanism to produce coherent, contextually appropriate text.
Hybrids combine architectures for tasks that need multiple capabilities. The Cognitive State Estimator combines temporal pattern recognition with discrete state classification because cognitive assessment needs both continuous monitoring and categorical output.
Each choice is a tradeoff. The tradeoff is documented per model in the technical appendix with measured performance comparisons against alternatives that were considered and rejected.
The deployment distribution
At Phase 3 maturity, the thirty models distribute across the three zones based on privacy sensitivity, latency requirements, and computational demands. The distribution is the target architecture; the actual placement for any given subscriber depends on which zones she has.
Zone 1 (Local Pane) targets approximately 850 million parameters across eight to ten models: Safety Filter, Privacy Filter, Cognitive State Estimator, Emotion Detector, Speech-to-Intent, Voice Tone Analyzer, Orientation Assessor, Confusion Detector, and the remaining privacy-critical models. For subscribers with a Local Pane, these models run locally. For subscribers without, the same task functions run in Zone 2 or Zone 3.
Zone 2 (Community Pane) targets approximately 1.15 billion parameters across the remaining twenty to twenty-two models: Core Interaction models, Domain Expert models, the MoC Router, and the remaining Routing, Safety, and Specialized Function models. For subscribers in regions with a deployed Community Pane, these models run regionally. For subscribers in regions without, the same task functions run in Zone 3.
Zone 3 (cloud reasoning layer) hosts the cloud-native inference layer that always serves the queries Zone 2 cannot handle and that serves the Zone 3-only subscribers end-to-end. Zone 3 also hosts the cross-cutting infrastructure that operates independently of subscriber-specific inference: FSSVA coordination, model update distribution, anonymized analytics, and BGO marketplace metadata. The Zone 3 inference layer is permanent. The subscriber paths that depend on it are first-class deployments, not degraded fallbacks.
The total portfolio of approximately 2 billion parameters, quantized to roughly 1 gigabyte, distributes across Zone 1 and Zone 2 for subscribers who have both, with Zone 3 always available for queries that exceed regional capacity and for subscribers whose path includes only Zone 3. The architecture is not designed to run any zone at capacity. Headroom allows for model size increases as research advances, additional models as new domains are added, and growth in concurrent subscribers beyond the initial design point.
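The per-zone parameter targets imply a per-zone storage split. As a rough sketch, assuming every weight is stored at four bits and ignoring quantization metadata and runtime memory, the split can be estimated as follows. The dictionary and constant names are illustrative only.

```python
# Parameter targets per zone, as stated in the text (approximate).
ZONE_PARAMS = {"Zone 1": 850_000_000, "Zone 2": 1_150_000_000}
BYTES_PER_INT4_WEIGHT = 0.5  # four bits per weight

footprints_mb = {
    zone: params * BYTES_PER_INT4_WEIGHT / 1e6
    for zone, params in ZONE_PARAMS.items()
}
for zone, mb in footprints_mb.items():
    print(f"{zone}: ~{mb:.0f} MB")
print(f"Total: ~{sum(footprints_mb.values()) / 1e3:.2f} GB")
```

On these assumptions the Local Pane would hold roughly 425 MB of weights and a Community Pane node roughly 575 MB, which together account for the 1 gigabyte total.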
Cross-references
BMT-02.02 The Thirty-One. The infrastructure agents that invoke these models. Each agent’s deployment preference drives which models it calls and from which zone.
BMT-06.01 Why Thirty Models, Not One. The strategic rationale for the portfolio approach at full depth.
BMT-06.03 Edge Intelligence. The three-zone compute architecture that defines where each model runs and how the edge intelligence envelope expands over time.
BMT-06.04 The Training Philosophy. The synthetic-to-proprietary pipeline that produces the models described here, including the India university partnerships and the API-to-SLM migration timeline.
Technical Appendix BMT-02.03-A is available to partners and investors at partners.bluemirror.tech.
