The Training Philosophy

The question the ML engineer expected to hear in the due diligence review was “how much will it cost to train thirty models?” The question she actually heard was “how much has it cost?” The distinction matters. Training thirty small models is not a hypothetical research program. It is a concrete engineering plan with a concrete budget, and the budget is smaller than most PE evaluators expect because the strategy is pragmatic rather than ambitious.

The core insight is that SSMs are harder to train but easier to deploy. Transformers are easier to train but harder to deploy on edge. The tooling maturity gap is significant: Transformers have seven-plus years of community knowledge, thousands of pretrained models, and well-documented fine-tuning recipes. SSMs have roughly two years of community development, a handful of pretrained bases, and training recipes that are still being discovered. Rather than solving the hard training problem first and the easy deployment problem second, the strategy reverses the order: start with proven Transformer fine-tuning for fast time-to-market, then progressively distill to SSMs for edge efficiency. The person gets a working product in months, not years. The engineering team gets time to solve the harder training challenges while the product is already deployed and generating revenue.

The training strategy is not “build the perfect architecture and then ship.” It is “ship with good-enough architecture and improve continuously.”

The four-phase plan

Phase 1 fine-tunes small Transformers on domain-specific data. The base models are proven, well-documented, and available: Phi-3-mini (3.8 billion parameters) for response generation, Gemma-2B for classification and routing, TinyLlama (1.1 billion parameters) for lightweight tasks, Whisper-tiny (39 million parameters) for voice processing. Fine-tuning uses LoRA and QLoRA, which freeze the base model weights and train small adapter layers that capture domain-specific behavior without modifying the base model’s general capabilities. The training recipes are well-known. The hardware requirements are modest: a single machine with four A100 GPUs can fine-tune a model in two to five days. The risk is low. The outcome is a working product deployed on quantized Transformers within six months. The Safety Monitor, the Intent Classifier, and the Emotion Detector ship first because they are the foundation on which all other interactions depend.
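To make "small adapter layers" concrete, the sketch below counts frozen and trainable parameters for a single LoRA-adapted weight matrix. The matrix size and rank are illustrative assumptions, not the configuration used for any of the models named above.

```python
def lora_param_counts(d_out: int, d_in: int, rank: int) -> tuple[int, int]:
    """Return (frozen_params, trainable_params) for one LoRA-adapted matrix.

    LoRA keeps the d_out x d_in weight W frozen and learns a low-rank
    update B @ A, where B is d_out x rank and A is rank x d_in.
    """
    frozen = d_out * d_in
    trainable = rank * (d_out + d_in)
    return frozen, trainable


# Illustrative sizes only (assumed): a 3072 x 3072 projection with rank 16.
frozen, trainable = lora_param_counts(d_out=3072, d_in=3072, rank=16)
trainable_fraction = trainable / frozen
print(f"frozen={frozen:,} trainable={trainable:,} fraction={trainable_fraction:.4f}")
```

Under these assumed sizes the adapter trains about 1% as many parameters as the frozen matrix holds, which is why the hardware requirement stays modest.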

Phase 2 distills the working Transformers into SSMs through knowledge distillation. The fine-tuned Transformer is the teacher. The SSM is the student. The process has four steps: logit distillation, where the student learns to match the teacher’s output distribution; hidden state alignment, where Transformer hidden states are mapped to SSM state vectors; progressive length training, where the student trains on increasingly long sequences; and task-specific tuning, where the distilled SSM is fine-tuned on the target task’s validation set. Expected capability retention: 85 to 95% of Transformer performance at 50% of the inference cost. The 40% latency reduction that SSMs provide makes the difference between “runs on edge with acceptable speed” and “runs on edge but feels sluggish.” If distillation underperforms for a specific model, the quantized Transformer version remains the production deployment. There is a floor, not a cliff.
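The first distillation step, logit distillation, can be sketched as follows. This uses one common formulation (KL divergence between temperature-softened distributions, scaled by the square of the temperature); the source does not specify the exact loss, so the function names, the temperature, and the example logits are all assumptions.

```python
import math


def softmax(logits: list[float], temperature: float = 1.0) -> list[float]:
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(
    teacher_logits: list[float],
    student_logits: list[float],
    temperature: float = 2.0,
) -> float:
    """KL(teacher || student) on softened distributions, scaled by T^2."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)
    return kl * temperature ** 2


# Made-up logits for a three-class output (assumed for illustration).
teacher = [4.0, 1.0, 0.5]
close_student = [3.8, 1.1, 0.4]
far_student = [0.5, 4.0, 1.0]

loss_match = distillation_loss(teacher, teacher)
loss_close = distillation_loss(teacher, close_student)
loss_far = distillation_loss(teacher, far_student)
print(loss_match, loss_close, loss_far)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is the signal the student is trained to minimize.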

Phase 3 trains native SSMs from scratch for tasks where no good Transformer pretrained model exists. Sensor-domain processing is the primary target: the Mamba-Sensor base model processes physiological signals, health metrics, sleep patterns, and behavioral time-series data. No existing Transformer pretrained model covers this domain well because it has not been a focus of the large language model research community. Native SSM training is harder: hyperparameter sensitivity, custom CUDA kernels, longer training cycles, and a larger compute budget per model. But it is also where the architecture delivers the largest inference advantage, because continuous sensor monitoring is the use case where linear complexity matters most. The Health Monitor, Sleep Analyzer, Agitation Detector, and Exercise Coach are the four models that benefit most from native SSM training because they process continuous physiological and behavioral streams where Transformer attention overhead is most wasteful.

Phase 4 refines and unifies. Benchmark native SSMs against distilled SSMs to identify quality gaps and determine which approach produces better production models per task. Tune inference kernels for mobile NPU execution, targeting the specific neural processing units in the primary deployment hardware. Develop a unified serving infrastructure that handles all four architecture types through a single deployment pipeline, so the build team manages one system rather than four. Create the over-the-air model update protocol that pushes improvements to edge devices with staged rollout and automatic rollback if quality metrics decline. Phase 4 is where the system matures from working product to production-quality platform, and it runs concurrently with production operation rather than delaying it.
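The staged rollout with automatic rollback can be illustrated with a minimal sketch. The stage fractions, the quality metric, and the tolerated drop are hypothetical values chosen for the example; the actual over-the-air protocol is not described in this section.

```python
from typing import Callable, Sequence


def staged_rollout(
    stages: Sequence[float],
    measure_quality: Callable[[float], float],
    baseline: float,
    max_drop: float,
) -> tuple[str, float]:
    """Advance through rollout stages and roll back on a quality decline.

    stages: increasing fractions of the device fleet receiving the update.
    measure_quality: returns the observed quality metric at a given stage.
    Returns (status, stage), where stage is the fraction at which the
    rollout completed or was rolled back.
    """
    stage = 0.0
    for stage in stages:
        quality = measure_quality(stage)
        if quality < baseline - max_drop:
            # Quality declined beyond tolerance: revert to the previous model.
            return "rolled_back", stage
    return "completed", stage


stages = [0.01, 0.10, 0.50, 1.0]  # assumed stage sizes

# A healthy update holds its quality at every stage.
healthy = staged_rollout(stages, lambda f: 0.92, baseline=0.91, max_drop=0.02)

# A regressing update degrades once it reaches half the fleet.
regressed = staged_rollout(
    stages, lambda f: 0.92 if f < 0.5 else 0.85, baseline=0.91, max_drop=0.02
)
print(healthy, regressed)
```

Checking quality before each expansion limits the number of devices exposed to a bad model to the size of the stage at which the regression first appears.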

Synthetic data generation

Training domain-specific models requires domain-specific data. Medical data is expensive, regulated, and scarce. Financial data is proprietary. Cognitive assessment data is sparse and sensitive. Synthetic data generation addresses the data constraint without violating the privacy architecture.

Two synthetic data pipelines serve different roles. Nemotron 3 Nano runs locally and generates synthetic training examples from existing labeled data through paraphrase, augmentation, and scenario variation. The local generation preserves privacy: no real patient data leaves the training environment. Nemotron 340B runs in cloud burst mode for large-scale generation when volume matters more than privacy sensitivity: generating diverse conversation examples, expanding the coverage of rare medication interactions, and creating scenario variations for edge cases.

Synthetic data works for the specialized domains BlueMirror targets because the domains are well-defined. Medication interaction checking has clear correct and incorrect answers that can be validated against pharmacological databases. Cognitive assessment has established clinical scales that define the output space. Dietary guidance has nutritional facts that constrain the recommendation space. The domains are structured enough that synthetic examples can be validated automatically against domain knowledge, and invalid examples can be filtered before they enter the training set.
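A minimal sketch of this automatic validation step, using medication interaction checking as the example. The reference table, the placeholder drug names, and the record layout are hypothetical stand-ins for a real pharmacological database and the real data schema.

```python
# Hypothetical reference table standing in for a pharmacological database.
# Keys are unordered drug pairs; values say whether the pair interacts.
KNOWN_INTERACTIONS = {
    frozenset({"drug_a", "drug_b"}): True,
    frozenset({"drug_a", "drug_c"}): False,
}


def is_valid(example: dict) -> bool:
    """Accept a synthetic example only if the reference table confirms it."""
    pair = frozenset(example["drugs"])
    expected = KNOWN_INTERACTIONS.get(pair)
    # Pairs missing from the reference cannot be verified, so they are
    # rejected rather than trusted.
    return expected is not None and expected == example["interacts"]


synthetic = [
    {"drugs": ("drug_a", "drug_b"), "interacts": True},   # matches reference
    {"drugs": ("drug_b", "drug_a"), "interacts": False},  # contradicts reference
    {"drugs": ("drug_a", "drug_d"), "interacts": True},   # not in reference
    {"drugs": ("drug_c", "drug_a"), "interacts": False},  # matches reference
]

training_set = [ex for ex in synthetic if is_valid(ex)]
print(len(training_set), "of", len(synthetic), "examples kept")
```

Only the two examples that agree with the reference table survive; the contradicting and unverifiable ones are filtered out before they reach the training set.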

The honest limitation: synthetic data does not capture the full complexity of real human interaction. Margaret’s actual conversational patterns, her specific combination of conditions, her idiosyncratic communication style, cannot be synthesized from population-level data. Synthetic data trains the base capability. P-RLHF (BMT-05.02) personalizes that capability to the individual. The two are complements, not substitutes.

The risk mitigation strategy maps every training risk to a fallback. If distillation quality loss exceeds 15%, quantized Transformers remain viable for all deployment tiers. If SSM training proves unstable for a specific model, extended hyperparameter search budget is available, and a hybrid Transformer-SSM architecture serves as an alternative. If sensor-domain native SSM training underperforms, the system can fall back to cloud-processed sensor data during the research timeline. Every architectural bet in the training strategy has a working alternative. The strategy aims for the best outcome without depending on it.
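The first fallback rule can be written as a simple decision function. This is a sketch under stated assumptions: the scores are any task metric where higher is better, quality loss is measured relative to the teacher, and the function and label names are hypothetical.

```python
def choose_deployment(
    teacher_score: float, student_score: float, max_loss: float = 0.15
) -> str:
    """Pick the production model for one task.

    Quality loss is the relative drop of the distilled SSM (student)
    against the fine-tuned Transformer (teacher). Above the threshold,
    the quantized Transformer stays in production.
    """
    if teacher_score <= 0:
        raise ValueError("teacher_score must be positive")
    quality_loss = 1.0 - student_score / teacher_score
    return "distilled_ssm" if quality_loss <= max_loss else "quantized_transformer"


# Assumed scores for illustration: a 10% drop passes, a 20% drop does not.
within_budget = choose_deployment(teacher_score=0.90, student_score=0.81)
over_budget = choose_deployment(teacher_score=0.90, student_score=0.72)
print(within_budget, over_budget)
```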

University partnerships

Two university partnerships provide research-grade capability at startup-compatible cost. IIIT Hyderabad brings novel SSM architectures, specifically Mamba derivatives and custom state space designs optimized for the sensor and cognitive domains. Their team includes two faculty advisors in machine learning and embedded systems, four to six PhD students on core research, and four to six master's students on implementation. Their research deliverables include the distillation methodology, a novel Mamba-Sensor architecture for physiological signals, quantization-aware training methods for ultra-low-precision (2-bit) deployment on wearable devices, and an open-source distillation toolkit.

Purdue University brings clinical validation frameworks, healthcare domain expertise, IRB access for working with real healthcare data, and US-based regulatory understanding for FDA pathway preparation. Their team includes faculty advisors in healthcare AI and systems, with PhD and master's students focused on applied research and engineering. Their deliverables include healthcare-validated models with clinical efficacy studies, deployment best-practices documentation, and quantization-aware training that achieves less than 5% accuracy loss at 4-bit precision.

The partnership model is designed to avoid single-point-of-failure dependencies. IIIT Hyderabad leads the distillation methodology research and native SSM architecture development. If their specific SSM architecture underperforms, the distilled Transformer models remain viable. Purdue leads clinical validation and healthcare model training. If their clinical data access is delayed, synthetic data pipelines keep training moving. Neither partnership blocks the critical path. Both accelerate it.

The research publication strategy is part of the partnership value. IIIT Hyderabad and Purdue together target 11 to 16 published papers across the three-year partnership at venues including NeurIPS, ICML, CHI, and the Journal of the American Medical Informatics Association. The papers validate the architecture in peer-reviewed venues, which serves as credibility capital for the system’s technical claims. The BlueMirror architecture is not a pitch deck that exists only in a presentation. It is a research program with published, peer-reviewed results.

The cost structure

The total training budget across all four phases: approximately $150,000 in compute plus $750,000 in personnel, with university cost-sharing reducing the personnel cost to roughly half what commercial-rate development would require.

Phase 1 Transformer fine-tuning: approximately $15,000 in compute across twenty models, using 4x A100 GPUs at cloud rates. Each model takes 2 to 5 days of training. LoRA fine-tuning on consumer-grade hardware is the baseline; cloud GPUs are used for speed, not necessity.

Phase 2 distillation: approximately $45,000 in compute across fifteen models, using 8x A100 GPUs. Each distillation takes 5 to 10 days because the student must train against the teacher’s full output distribution.

Phase 3 native SSM training: approximately $60,000 in compute across six models, using 8x H100 GPUs. Native SSM training is the most expensive per model because of hyperparameter sensitivity and the need for extended search.

Phase 4 optimization: approximately $30,000 in compute for benchmarking, kernel optimization, and deployment pipeline development.

The total compute budget is approximately $150,000. Not $100 million. Not $10 million. The models are small. The training data is synthetic and domain-specific. The fine-tuning approaches are parameter-efficient. The result is a training strategy that a startup can execute without needing the compute budget of a foundation model lab.
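The arithmetic behind the total is easy to check. The sketch below sums the per-phase compute figures quoted above and derives an average cost per model for the three phases that state a model count; the dictionary keys are names chosen for this example.

```python
# Per-phase compute figures as quoted in this section (USD).
phase_compute_usd = {
    "phase_1_finetuning": 15_000,
    "phase_2_distillation": 45_000,
    "phase_3_native_ssm": 60_000,
    "phase_4_optimization": 30_000,
}

# Model counts as quoted; Phase 4 is not broken down per model.
models_per_phase = {
    "phase_1_finetuning": 20,
    "phase_2_distillation": 15,
    "phase_3_native_ssm": 6,
}

total_compute_usd = sum(phase_compute_usd.values())
per_model_usd = {
    phase: phase_compute_usd[phase] / count
    for phase, count in models_per_phase.items()
}

print(f"total compute: ${total_compute_usd:,}")
for phase, cost in per_model_usd.items():
    print(f"{phase}: ${cost:,.0f} per model")
```

The per-model averages come to $750 for fine-tuning, $3,000 for distillation, and $10,000 for native SSM training, consistent with native SSM training being the most expensive per model.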

The timeline reinforces the cost structure. Each model moves from base model selection to deployed SLM in approximately eight weeks during Phase 1. The critical path to a minimum viable product is six months from the start of Phase 1: the Safety Monitor, Intent Classifier, Emotion Detector, Conversation Manager, and Response Generator are the first five models deployed, providing the core interaction capability. The remaining models follow in priority order over the subsequent six months. By month twelve, the full Transformer-based portfolio is deployed and generating revenue. Distillation to SSMs runs in parallel starting at month six, delivering edge-optimized models progressively through month eighteen. The revenue from the deployed Transformer product funds the distillation and native SSM phases. The training strategy finances itself.

Cross-References

BMT-06.01 Why Thirty-Seven Models, Not One. The decomposition that defines what needs to be trained, including the incrementality argument for updating individual models.

BMT-06.02 The Right Architecture for the Right Task. Architecture selection per model, which determines the training approach: fine-tuning for Transformers, distillation for SSMs, forced routing for MoE.

BMT-12.01 The Universal Personalization Framework. UPF abstraction of training for new domains, showing how the four-phase strategy extends to domains beyond the initial fifteen.

Technical Appendix BMT-06.04-A is available to partners and investors at partners.bluemirror.tech.
