The Right Architecture for the Right Task

The question the ML engineer asked after reading the thirty-model decomposition was the right one: why not use the same architecture for all of them? If Transformers work well for language tasks and these are all language-adjacent tasks, why introduce SSMs, MoE routing, and hybrid architectures? The complexity cost is real. Multiple architecture types mean multiple training pipelines, multiple deployment configurations, multiple monitoring systems. The answer is in the computational profiles. Different tasks have fundamentally different requirements, and forcing one architecture onto all of them wastes parameters, increases latency, or sacrifices quality. Sometimes all three.

The BlueMirror portfolio uses four architecture types: State Space Models for temporal pattern recognition, Mixture of Experts for classification and routing, Transformers for generation, and hybrids for tasks that blend requirements. The selection is not aesthetic. It follows from the computational characteristics of each task mapped against the deployment constraints described in BMT-06.01. Each architecture type excels in a specific computational regime and underperforms in others. Using the right tool for each job produces a portfolio that is collectively more capable and more efficient than any single-architecture approach.

The architecture serves the task. The task does not bend to the architecture.

State Space Models: fourteen models

SSMs process a sequence of length n with O(n) computational complexity. Transformers need O(n²) because self-attention compares every position in the sequence with every other position. For short sequences, the difference is negligible. For the continuous monitoring tasks that BlueMirror runs persistently, the difference is the gap between feasible and infeasible on edge hardware.
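To make the scaling concrete, the sketch below counts sequence-mixing steps for a linear scan and for full pairwise attention. It deliberately ignores constant factors, hidden width, and hardware effects, so the ratios illustrate growth rates only and are not latency predictions. The function name and the sequence lengths are illustrative assumptions.

```python
def sequence_mixing_steps(n: int) -> dict:
    """Count sequence-mixing steps for a sequence of length n.

    Constants and hidden width are ignored on purpose: a linear scan
    touches each position once, full attention touches each pair once.
    """
    return {"ssm": n, "attention": n * n}


# Illustrative sequence lengths, from a short exchange to a long stream.
for n in (128, 4_096, 131_072):
    steps = sequence_mixing_steps(n)
    ratio = steps["attention"] // steps["ssm"]
    print(f"n={n:>7}: attention takes {ratio}x the steps of a linear scan")
```

The ratio is simply n, which is why the gap is invisible for short inputs and decisive for streams that never end.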

Fourteen models in the portfolio use SSM architectures, built on three shared base models. The Mamba-2 base (150 million parameters) serves language and conversation pattern tasks: the Conversation Manager, the Clarification Agent, the Orientation Assistant, the Cognitive State Estimator, the Repetition Handler, the Sundowning Specialist, the Preference Learner, the Pattern Detector, and the Temporal Reasoner. Each adds a specialized task head of 10 to 25 million parameters to the shared base. The Mamba-Sensor base (80 million parameters) serves physiological and behavioral signal tasks: the Agitation Detector, the Health Monitor, the Sleep Analyzer, and the Exercise Coach. The Mamba-Audio base (80 million parameters) serves the Voice Tone Analyzer.

The shared-base architecture is the key to parameter efficiency. Nine models share the same 150-million-parameter Mamba-2 base and differentiate only through their task heads. The base encodes general sequential reasoning capabilities. The task heads encode domain-specific pattern recognition. Total stored parameters for all fourteen SSM models: approximately 500 million, down from 830 million if each model had its own base.

The SSM architecture makes specific tasks possible on edge hardware that would be impractical with Transformers. The Health Monitor processes a continuous stream of vital sign data from connected sensors. The Sleep Analyzer processes hours of overnight data. The Agitation Detector monitors behavioral signals throughout the day. These are streaming tasks: data arrives continuously, and the model must maintain state across the stream. SSMs maintain state in a fixed-size state vector that updates incrementally with each new input. Transformers must attend to the full history, with computational cost growing quadratically as the stream lengthens. For a model that runs continuously on a device with limited power and compute, linear complexity is not an optimization. It is a requirement.
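The fixed-size state update can be sketched in a few lines. This is a toy diagonal linear state-space layer with constant parameters, not the Mamba-2 or Mamba-Sensor implementation; real Mamba layers use learned, input-dependent parameters and far larger states. The class name, parameter values, and sensor readings are all illustrative assumptions.

```python
from dataclasses import dataclass, field


@dataclass
class DiagonalSSM:
    """Toy linear state-space layer: h <- a*h + b*x, then y = c . h."""
    a: list  # per-channel decay; keep |a_i| < 1 so the state stays bounded
    b: list  # per-channel input weight
    c: list  # per-channel readout weight
    state: list = field(init=False)

    def __post_init__(self):
        # The state has one entry per channel and never grows.
        self.state = [0.0] * len(self.a)

    def step(self, x: float) -> float:
        """Consume one input sample and return one output sample."""
        self.state = [a * h + b * x for a, h, b in zip(self.a, self.state, self.b)]
        return sum(c * h for c, h in zip(self.c, self.state))


# Hypothetical stream of sensor readings; the values are illustrative only.
ssm = DiagonalSSM(a=[0.9, 0.5], b=[1.0, 1.0], c=[0.5, 0.5])
outputs = [ssm.step(x) for x in [72.0, 74.0, 71.0, 90.0]]

# Memory use is set by the state size, not by how long the stream has run.
print(len(outputs), len(ssm.state))
```

Each call to `step` costs the same amount of work whether it is the fourth reading or the four-millionth, which is the property the streaming tasks depend on.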

The honest trade-off with SSMs is training difficulty. SSMs are sensitive to hyperparameters, require custom CUDA kernels for efficient training, and have a fraction of the tooling maturity that Transformers enjoy. The pretrained SSM base model selection is thin compared to the thousands of available pretrained Transformers. This is why the training strategy (BMT-06.04) starts with Transformers and progressively distills to SSMs rather than training SSMs from scratch. The inference advantage is real. The training cost of reaching that advantage is also real.

Mixture of Experts: eleven models

MoE architectures store many parameters but activate only a subset per query. An eight-expert MoE with top-2 routing stores eight experts but activates only two for any given input, so the active expert parameter count is roughly 25% of the stored expert parameter count (two of eight), before counting shared layers such as embeddings and the gating network. This makes MoE models efficient for tasks that require broad knowledge but narrow application per query.
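A minimal sketch of top-2 routing, assuming the gating network has already produced one score per expert. The function name and the example scores are hypothetical; a production router would operate on batched tensors and typically adds load-balancing terms during training.

```python
import math


def top_k_route(gate_logits, k=2):
    """Return (expert_index, weight) pairs for the k highest-scoring experts.

    Weights are a softmax over the selected experts only, so they sum to 1.
    """
    top = sorted(range(len(gate_logits)), key=lambda i: gate_logits[i], reverse=True)[:k]
    peak = max(gate_logits[i] for i in top)  # subtract the max for numerical stability
    exps = {i: math.exp(gate_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return [(i, exps[i] / total) for i in top]


# Hypothetical gate scores for eight experts.
logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]
routes = top_k_route(logits, k=2)
active_fraction = len(routes) / len(logits)

print([index for index, _ in routes], active_fraction)
```

Only the two selected experts run their forward pass; the other six stay in memory but cost nothing at inference time for this input.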

Eleven models use MoE architectures, sharing a 50-million-parameter embedding layer and an 80-million-parameter gating network. The Intent Classifier activates the expert most relevant to the query type. The Safety Monitor forces activation of its expert for every interaction, regardless of routing, because safety screening cannot be conditional. The Privacy Filter operates the same way: always on, always screening.

The domain expert MoE models handle tasks where the knowledge base is broad but any given query touches only a fraction of it. The Medication Assistant stores knowledge about thousands of medications but activates only the experts relevant to the specific drug interaction being checked. The Nutrition Guide stores dietary knowledge across cultural traditions, medical conditions, and budget ranges, but a query about Margaret’s dinner options activates only the experts relevant to her conditions, preferences, and budget. The Emotion Detector and the Empathy Responder use MoE routing to activate the affective response patterns most appropriate to the detected emotional context.

The MoE routing decisions themselves are learned, not programmed. The gating network observes which experts produce the best outputs for which input types and learns the routing policy through training. The Safety Monitor and Privacy Filter bypass routing entirely: their experts are forced active on every query because the cost of missing a safety or privacy violation exceeds any computational savings from conditional activation.
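The forced-activation rule amounts to a union between what the router chose and what must always run. The sketch below shows that logic only; the expert identifiers and the function name are hypothetical and do not come from the BlueMirror codebase.

```python
# Hypothetical identifiers for the experts that bypass learned routing.
ALWAYS_ON = frozenset({"safety_monitor", "privacy_filter"})


def experts_to_run(routed, always_on=ALWAYS_ON):
    """Union the learned routing choice with experts that run on every query."""
    return sorted(set(routed) | always_on)


print(experts_to_run({"nutrition_guide"}))
```

Because the always-on set is applied after routing, a gating network that assigns a low score to the safety expert cannot suppress it.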

Total stored MoE parameters: approximately 625 million. Active per inference: approximately 120 million. The ratio of stored to active parameters is what makes MoE efficient for classification and knowledge tasks where breadth of coverage matters but depth of activation per query does not.

Transformers: three models

Three models use full Transformer architectures because their tasks require the attention mechanism’s ability to capture long-range dependencies within the input.

The Response Generator (150 million parameters) produces the natural language output that the person reads or hears. Generation quality depends on attending to the full input context: the person’s query, the MoC context package, the domain model’s output, and the interaction history. Truncating attention or linearizing it produces measurably worse text. For the model that the person directly experiences as “how the system talks,” generation quality is worth the compute cost.

The Memory Anchor (75 million parameters) uses retrieval-augmented generation to find and present relevant personal history. Retrieval requires comparing the current context against stored memories, which is fundamentally an attention operation: which stored memory is most relevant to what is happening right now? The Transformer architecture’s attention mechanism performs this comparison naturally.
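The comparison described above can be sketched as scaled dot-product attention over stored memory embeddings. This is an illustration of the operation, not the Memory Anchor's retrieval pipeline: the three-dimensional vectors, the memory labels, and the function name are toy assumptions, and a real system would use learned embeddings with hundreds of dimensions.

```python
import math


def attention_weights(query, memory_keys):
    """Softmax over scaled dot products between one query and each memory key."""
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in memory_keys]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Toy embeddings; labels and values are illustrative only.
memories = {
    "memory_a": [0.9, 0.1, 0.0],
    "memory_b": [0.1, 0.8, 0.2],
    "memory_c": [0.0, 0.2, 0.9],
}
current_context = [0.2, 0.9, 0.1]

weights = attention_weights(current_context, list(memories.values()))
best_label = max(zip(weights, memories))[1]
print(best_label)
```

The weights answer the question posed in the text directly: each one is the share of relevance assigned to a stored memory given what is happening right now.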

The Context Compressor (75 million parameters) summarizes context packages for cross-device synchronization. Encoder-decoder Transformers remain the strongest architecture for abstractive summarization, where the output must capture the meaning of the input in substantially fewer tokens. The compression ratios the MoC system requires (85% token reduction at 95% relevance preservation) demand the full attention mechanism. An SSM-based compressor was tested during architecture selection and achieved 78% relevance at the same compression ratio. The 17-point quality gap was unacceptable for a component that determines how much context every other model receives.

Total Transformer parameters: approximately 300 million. These three models are the most compute-intensive per inference and the most likely to run in the cloud for lower-tier devices. But for the primary deployment target, they run locally with acceptable latency because the models are small by Transformer standards: a 150-million-parameter model generates text faster than the person can read it.

The strategic implication is that the Transformer models are the fallback architecture for the entire portfolio. If SSM distillation underperforms for a specific model, the quantized Transformer version remains viable. The SSM target is better inference efficiency. The Transformer baseline is acceptable quality. The architecture strategy has a floor, not a cliff.

Hybrids: two models

Two models combine architectures because their tasks have mixed computational profiles.

The Speech-to-Intent model (100 million parameters) uses a Conformer-style architecture: SSM layers for processing the temporal audio stream combined with local attention layers for capturing acoustic patterns within short windows. Audio processing is inherently sequential (SSM territory), but phoneme and word recognition benefit from attention over local windows where the relationships between adjacent sounds determine meaning. Neither architecture alone is optimal. The combination outperforms either in benchmark testing on speech from the target population, where voice quality varies significantly due to age-related changes in vocal production.

The Relationship Mapper (30 million parameters) uses a graph neural network with cross-attention. Social networks are graph structures: nodes are people, edges are relationships, and the properties of both change over time. GNNs process graphs natively, propagating information along relationship edges to build representations that capture network structure. Cross-attention allows the model to compare the current interaction context against the social graph to determine which relationships are relevant right now. When Margaret mentions “my daughter,” the Relationship Mapper identifies which daughter, surfaces the relationship context, and provides it to the concierge agent handling the interaction. A pure sequence model cannot process graphs efficiently. A pure graph model cannot integrate conversational context. The hybrid does both.
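The graph half of the hybrid can be illustrated with one round of message passing. The sketch below uses plain mean aggregation with a fixed 50/50 mix, has no learned weights, and omits the cross-attention step entirely, so it shows the propagation idea rather than the Relationship Mapper's actual layers. Node names and feature values are hypothetical.

```python
def message_pass(features, edges):
    """One round of mean-aggregation message passing on an undirected graph.

    Each node's new features are an equal mix of its own features and the
    mean of its neighbors' features. A trained GNN would use learned weights.
    """
    neighbors = {node: [] for node in features}
    for u, v in edges:
        neighbors[u].append(v)
        neighbors[v].append(u)

    updated = {}
    for node, own in features.items():
        # Isolated nodes fall back to their own features.
        messages = [features[n] for n in neighbors[node]] or [own]
        mean = [sum(column) / len(messages) for column in zip(*messages)]
        updated[node] = [0.5 * f + 0.5 * m for f, m in zip(own, mean)]
    return updated


# Hypothetical social graph with two-dimensional node features.
features = {
    "margaret": [1.0, 0.0],
    "daughter_a": [0.0, 1.0],
    "daughter_b": [0.0, 0.0],
}
edges = [("margaret", "daughter_a"), ("margaret", "daughter_b")]

print(message_pass(features, edges))
```

After one round, each person's representation already carries information about the people they are connected to, which is the structure a pure sequence model has no native way to express.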

The selection framework

Given a new task, the architecture selection follows the computational characteristics. If the task processes continuous or streaming data where sequence length is variable and potentially long, SSM. If the task requires classification or routing across a broad knowledge base with sparse activation per query, MoE. If the task requires generation or retrieval where output quality depends on attending to the full input, Transformer. If the task has mixed requirements that no single architecture handles well, hybrid.
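The decision rule can be written down as a small function. This is a sketch of the rule as stated in the text, not a tool the team is known to use: the `TaskProfile` fields and the function name are hypothetical, and the handling of a task that matches none of the three profiles is an assumption, since the framework does not cover that case.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskProfile:
    """Hypothetical summary of a task's computational characteristics."""
    streaming_input: bool         # continuous data, variable and possibly long
    sparse_broad_knowledge: bool  # broad knowledge base, sparse use per query
    needs_full_attention: bool    # output quality depends on the full input


def select_architecture(task: TaskProfile) -> str:
    """Map a task profile to one of the four architecture types."""
    candidates = (
        ("SSM", task.streaming_input),
        ("MoE", task.sparse_broad_knowledge),
        ("Transformer", task.needs_full_attention),
    )
    matches = [name for name, applies in candidates if applies]
    if not matches:
        # Assumption: the text gives no rule here, so refuse to guess.
        raise ValueError("task matches no profile; evaluate it manually")
    # Exactly one match picks that architecture; several means mixed requirements.
    return matches[0] if len(matches) == 1 else "hybrid"


print(select_architecture(TaskProfile(True, False, False)))
print(select_architecture(TaskProfile(True, False, True)))
```

Encoding the rule this way makes the hybrid case explicit: it is what remains when a task lands in more than one regime at once.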

The framework is not theoretical. It produced the current portfolio through systematic evaluation. Early prototypes used Transformers for everything. The latency and power consumption on edge devices forced the decomposition. Relative to those Transformer prototypes, SSMs reduced latency by 40% for monitoring tasks and MoE reduced active parameters by 75% for classification tasks. The architecture selection was driven by measurement, not preference.

The total portfolio across all four architecture types: approximately 1.55 billion stored parameters, approximately 450 million active per inference. The decomposition by architecture type is SSMs at 500 million, MoE at 625 million, Transformers at 300 million, and hybrids at 130 million. The distribution reflects the task distribution: most tasks in the system are monitoring or classification (SSM and MoE territory), with generation and retrieval (Transformer territory) as the minority. The architecture mix matches the work the system actually does, not the work that AI demonstrations typically showcase.
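The totals can be checked with a few lines of arithmetic. The figures below are the approximate values quoted in this article, in millions of parameters; the variable names are illustrative.

```python
# Approximate stored parameters per architecture type, in millions (from the text).
STORED_M = {"SSM": 500, "MoE": 625, "Transformer": 300, "Hybrid": 130}
# Model counts per architecture type (from the section headings).
MODEL_COUNT = {"SSM": 14, "MoE": 11, "Transformer": 3, "Hybrid": 2}
# Approximate active parameters per inference, in millions (from the text).
ACTIVE_M = 450

total_stored = sum(STORED_M.values())
total_models = sum(MODEL_COUNT.values())
active_fraction = ACTIVE_M / total_stored

print(f"models: {total_models}, stored: {total_stored}M")
for name, stored in STORED_M.items():
    print(f"{name:<12}{stored:>5}M  {stored / total_stored:6.1%} of stored")
print(f"active per inference: {active_fraction:.0%} of stored")
```

The four figures sum to 1,555 million stored parameters across thirty models, which rounds to the 1.55 billion stated above, and the 450 million active parameters come to a little under a third of that.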

Cross-References

BMT-06.01 Why Thirty-Seven Models, Not One. The decomposition rationale that produces the portfolio this article maps to architectures.

BMT-02.03 The Thirty-Seven Models. The orchestration-level view of how models are coordinated, compared to this article’s focus on why each model uses its specific architecture.

BMT-06.04 The Training Philosophy. Training strategy per architecture type, including the progressive distillation from Transformers to SSMs.

Technical Appendix BMT-06.02-A is available to partners and investors at partners.bluemirror.tech.