The Five Layers

Priya Narayan had been evaluating AI platforms for nine months when she opened the BlueMirror architecture document. She was the lead technical analyst on a PE due diligence team, and she had seen the same slide in every pitch deck: “deep personalization powered by AI.” What she had never seen was a concrete answer to a simple question: how does the system actually know the person it claims to serve?

The usual answer was a user profile. A form the person fills out during onboarding. Name, age, conditions, preferences. Maybe some behavioral tracking layered on top. The profile lives in a database row. Every query ships the whole profile into the prompt. The model reads 10,000 tokens of context to answer a question that needs 200.

Priya had done the math on one platform. Their user profiles averaged 8,000 tokens. At 50 queries per day per user, that was 400,000 tokens of context per user per day. At 100 concurrent users, 40 million tokens daily just for context loading. The inference cost alone made the unit economics impossible at any price point a consumer could afford. And the quality was poor: the model attended to everything, which meant it attended to nothing well. A blood pressure question pulled the person’s grocery preferences, financial history, and family relationships into context alongside the medication list. Irrelevant context degrades response quality. Every attention mechanism researcher knows this. Every platform she evaluated ignored it.

The BlueMirror architecture document described something different. A five-layer hierarchy called the Mixture of Contexts, where not every interaction loads everything the system knows about the person. Where the system activates the minimum context each query requires and leaves the rest dormant. Where the token budget per query drops from 10,000 to 1,800 while relevance holds at 95 percent.

Priya read the section twice. Then she asked for access to the partner site.

The architecture that scales personalization is the architecture that refuses to load everything it knows.

The Mixture of Contexts solves a problem that most AI platforms have not acknowledged: prompt engineering cannot carry personalization at scale. A rich context profile for an aging adult includes demographics, medical history, a full medication list with dosages and interactions, allergy records, family relationships with per-member trust preferences, communication style learned over hundreds of interactions, cognitive patterns that shift over time, financial accounts and benefit structures, housing details and maintenance history, daily routines, dietary restrictions, social connections, earning history, cultural background, and values. Represented as structured text, this runs 10,000 to 15,000 tokens. For a person with complex health conditions, multiple family relationships, and years of interaction history, it can exceed 20,000.

Loading all of it for every query is wasteful, expensive, and counterproductive. The model that reads 15,000 tokens of context to answer “What time is my appointment tomorrow?” is spending compute on irrelevant data and risking attention dilution across context that has nothing to do with the question. The MoC architecture eliminates this waste by organizing the person’s context into five layers with increasing depth and decreasing activation frequency.

Layer 0: Core identity

Layer 0 is the irreducible minimum. It loads for every interaction because every response must be calibrated to the person’s basic identity. The contents: name, age, gender, pronouns, primary language, location, communication preferences (text, voice, or visual), a cognitive baseline indicator, and emergency contact information. Roughly 100 tokens.

This layer answers the questions the Response Generator needs settled before it can produce appropriate output: Is this person 78 or 28? Does she speak English or Spanish? Does she prefer formal or casual language? Has her cognitive baseline shifted since the last assessment? Layer 0 is never deactivated. It is the foundation on which every other layer sits.

The cognitive baseline indicator deserves attention. It is not a diagnosis. It is a single normalized score that tells the Response Generator whether to adjust language complexity, repetition tolerance, and explanation depth. A person whose cognitive baseline has declined since the last assessment gets simpler sentence structures and more frequent confirmation checks. This adjustment happens in the Response Generator, not in the concierge agent. The concierge agent does not know about the adjustment. The person does not know about the adjustment. The language simply fits.
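
As an illustration only, the Layer 0 record and the baseline-driven adjustment might be sketched in Python as follows. The field types, the `LanguageStyle` settings, the assumption that the score lies in [0, 1], and the `tolerance` threshold are all hypothetical; the text specifies only the field list and the direction of the adjustment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CoreIdentity:
    """Layer 0 fields as listed in the text; the types are assumptions."""
    name: str
    age: int
    gender: str
    pronouns: str
    primary_language: str
    location: str
    communication_preference: str  # "text", "voice", or "visual"
    cognitive_baseline: float      # normalized score, assumed to lie in [0, 1]
    emergency_contact: str


@dataclass(frozen=True)
class LanguageStyle:
    """Hypothetical knobs the Response Generator could read."""
    sentence_complexity: str   # "standard" or "simple"
    confirmation_checks: str   # "occasional" or "frequent"


def style_for(current: float, previous: float, tolerance: float = 0.05) -> LanguageStyle:
    """Simplify language when the baseline has declined since the last assessment.

    The tolerance is an illustrative value, not a BlueMirror parameter.
    """
    if previous - current > tolerance:
        return LanguageStyle("simple", "frequent")
    return LanguageStyle("standard", "occasional")


# Name, age, language, and location come from the article; the rest are placeholders.
margaret = CoreIdentity(
    name="Margaret", age=78, gender="female", pronouns="she/her",
    primary_language="English", location="Gary, Indiana",
    communication_preference="voice", cognitive_baseline=0.8,
    emergency_contact="daughter (placeholder)",
)
print(style_for(current=margaret.cognitive_baseline, previous=0.8))
```

Keeping the adjustment in a small pure function mirrors the point made above: the concierge agent never sees it, and only the Response Generator needs to call it.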

Layer 1: Session context

Layer 1 captures what is happening right now. Current time and day. Recent interaction history, the last three to five exchanges. Current mood indicator from the Emotion Detector. Current cognitive state from the Cognitive State Estimator. Active tasks. Pending notifications. Roughly 200 tokens.

The response to “What should I do this afternoon?” depends on whether it is 1pm or 6pm, whether Margaret is energetic or tired, whether she has a video call in an hour, whether the buying agent is waiting for her approval on a grocery substitution. Layer 1 provides the temporal grounding that prevents responses from being generically correct but contextually wrong.

Layer 1 loads for most interactions. Only purely informational queries with no temporal dependency skip it. “What is the phone number for my pharmacy?” does not need to know what time it is or how Margaret is feeling. The MoC Router makes this distinction in under 25 milliseconds.
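
A minimal sketch of a Layer 1 record, assuming a simple in-memory structure. The class name, field names, and types are illustrative; the only details taken from the text are the list of contents and the "three to five exchanges" window, modeled here with a bounded deque that drops the oldest exchange automatically.

```python
from collections import deque
from dataclasses import dataclass, field
from datetime import datetime

MAX_RECENT_EXCHANGES = 5  # upper end of "the last three to five exchanges"


@dataclass
class SessionContext:
    """Layer 1 contents as listed in the text; names and types are assumptions."""
    now: datetime
    mood: str                      # from the Emotion Detector
    cognitive_state: str           # from the Cognitive State Estimator
    active_tasks: list[str] = field(default_factory=list)
    pending_notifications: list[str] = field(default_factory=list)
    recent_exchanges: deque = field(
        default_factory=lambda: deque(maxlen=MAX_RECENT_EXCHANGES)
    )

    def record_exchange(self, user_text: str, reply: str) -> None:
        """Append an exchange; the deque silently evicts the oldest one when full."""
        self.recent_exchanges.append((user_text, reply))
```

Bounding the history at the data-structure level is one way to keep the layer near its roughly 200-token size without a separate trimming step.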

Layer 2: Historical patterns

Layer 2 is the preference model materialized as context. What has the system learned about this person over time? Communication preferences: Margaret prefers data first, then recommendations. Decision-making patterns: she is deliberate, wants time to consider, dislikes being rushed. Domain-specific preferences: she prefers morning appointments, dislikes phone calls, trusts her daughter for medical decisions, manages her own finances. Behavioral patterns: she exercises before breakfast, reads after lunch, calls her friend Ruth on Thursdays. Roughly 500 tokens.

This is the P-RLHF preference model (described in BMT-05.02) rendered as context tokens. It loads when the response needs to be personalized beyond basic identity. A scheduling question loads Layer 2 because the answer depends on whether Margaret prefers mornings or afternoons. A simple factual lookup does not.

The distinction matters because Layer 2 is where personalization lives. Without it, every response is calibrated to identity but not to preference. The system knows it is talking to a 78-year-old English-speaking woman in Gary, Indiana. It does not know that she wants the data before the recommendation, that she prefers to make her own decisions, that she trusts her daughter on health matters but not on finances. Layer 2 is what transforms a generic assistant into a personal one.

Layer 3: Deep knowledge

Layer 3 holds the full domain-specific detail. Complete medication list with dosages, interaction risks, and adherence patterns. Financial portfolio with account balances, income sources, and benefit structures. Family relationship details with trust levels and communication preferences per family member. Home property profile with maintenance history. Full cognitive assessment history with trend data. Roughly 1,000 tokens.

This layer loads only when the query requires deep domain knowledge. A question about a blood pressure medication activates the health section of Layer 3. A question about Medicare Part D activates the financial section. A question about the roof leak activates the home maintenance section. The MoC Router does not load the entire layer. It loads the relevant sections.

The section-level activation is what makes MoC practical. Layer 3 at full load is 1,000 tokens. But a medication question needs only the medication and allergy sections, roughly 300 tokens. A financial question needs only the accounts and benefits sections, roughly 250 tokens. The router’s query analysis determines which sections are relevant, and the irrelevant sections stay dormant.
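
The arithmetic behind section-level activation can be sketched as below. The token figures are the approximate ones quoted above; the section grouping and names are hypothetical, and the `everything_else` entry is simply the remainder of the 1,000-token layer, not a number from the text.

```python
LAYER3_FULL_TOKENS = 1_000  # Layer 3 at full load, per the text

# Approximate section sizes quoted in the text; the grouping is illustrative.
SECTION_TOKENS = {
    "medication_and_allergy": 300,
    "accounts_and_benefits": 250,
}
# Whatever the two quoted groups do not account for.
SECTION_TOKENS["everything_else"] = LAYER3_FULL_TOKENS - sum(SECTION_TOKENS.values())


def tokens_loaded(active_sections: list[str]) -> int:
    """Total Layer 3 tokens loaded for the sections the router activated."""
    return sum(SECTION_TOKENS[name] for name in active_sections)


def tokens_dormant(active_sections: list[str]) -> int:
    """Layer 3 tokens left out of the prompt for this query."""
    return LAYER3_FULL_TOKENS - tokens_loaded(active_sections)


medication_query = ["medication_and_allergy"]
print(tokens_loaded(medication_query), "loaded,", tokens_dormant(medication_query), "dormant")
```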

The domain sections within Layer 3 are structured, not free-text. The medication section is a typed schema: drug name, dosage, frequency, prescribing physician, start date, known interactions, adherence rate. This structure allows the router to pull exactly the fields a query requires. A question about medication timing does not need the interaction field. A question about switching medications needs all of them. The schema design is in the partner appendix, but the principle is visible here: structured context enables surgical activation that free-text context cannot support.
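
To make the idea of field-level activation concrete, here is a hedged sketch of the medication schema as a Python dataclass. The field list follows the text; the types, the `project` helper, the `TIMING_FIELDS` selection, and every example value are assumptions, and the actual schema is the one in the partner appendix.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class MedicationEntry:
    """Fields named in the text; the types are assumptions."""
    drug_name: str
    dosage: str
    frequency: str
    prescribing_physician: str
    start_date: str                      # ISO date string, for simplicity
    known_interactions: tuple[str, ...]
    adherence_rate: float                # assumed to be a fraction in [0, 1]


# Hypothetical field selection for a "when do I take this?" question.
TIMING_FIELDS = ("drug_name", "dosage", "frequency")


def project(entry: MedicationEntry, fields: tuple[str, ...]) -> dict:
    """Return only the requested fields, leaving the rest out of the context."""
    record = asdict(entry)
    return {name: record[name] for name in fields}


# Placeholder values only; none of these come from the article.
example = MedicationEntry(
    drug_name="example-drug", dosage="10 mg", frequency="once daily",
    prescribing_physician="Dr. Patel", start_date="2024-01-01",
    known_interactions=("example-interaction",), adherence_rate=0.9,
)
print(project(example, TIMING_FIELDS))
```

A free-text medication note would have to be loaded whole; the typed record lets a timing question skip the interaction field while a medication switch can request every field.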

Layer 4: RAG retrieval

Layer 4 is not a stored layer. It is a retrieval mechanism. External and historical documents pulled on demand: medical records, insurance policy details, legal documents, previous conversation summaries, published health guidelines relevant to the person’s conditions. Variable token count, typically 500 to 2,000 when activated.

Most interactions never touch Layer 4. When Margaret asks about her afternoon schedule, the answer lives in Layers 0 through 2. When Margaret asks “What did Dr. Patel say about my potassium levels at my last appointment?” the system retrieves the relevant conversation summary or clinical note. The retrieval is targeted: the system pulls the specific document section relevant to the query, not the full document.

Layer 4 activation is expensive relative to the other layers because retrieval adds latency. The MoC Router activates it only when the query references or requires specific documents that are not captured in the structured layers above.

Where the MoC lives

The full MoC (all five layers) resides at Zone 2 (the Community Pane regional node) for each subscriber. Layers 0 and 1 are cached at Zone 1 (the Local Pane in the home) for offline access and immediate context during low-latency interactions. The Zone 1 cache is a derivative of the Zone 2 store, not a parallel write target: updates flow from Zone 2 to Zone 1 after each authoritative change. Layer 4 (RAG retrieval) is not stored in either zone; documents are pulled from their authoritative sources on demand and discarded after the query completes.
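
The one-way relationship between the two zones can be sketched as follows. This is a toy in-memory model under stated assumptions: the class and method names are hypothetical, and a real deployment would replicate over a network rather than through a direct method call. What it does illustrate is that writes land only in the Zone 2 store and the Zone 1 cache is rebuilt from it.

```python
from copy import deepcopy

CACHED_LAYERS = (0, 1)  # the text caches Layers 0 and 1 at Zone 1


class Zone1Cache:
    """Derived copy of Layers 0 and 1; only the Zone 2 store refreshes it."""

    def __init__(self) -> None:
        self._layers: dict[int, dict] = {}

    def refresh(self, authoritative: dict[int, dict]) -> None:
        """Replace the cache with a fresh copy of the cached layers."""
        self._layers = {n: deepcopy(authoritative[n]) for n in CACHED_LAYERS}

    def read(self, layer: int):
        """Return a copy of a cached layer, or None if the layer is not cached."""
        cached = self._layers.get(layer)
        return deepcopy(cached) if cached is not None else None


class Zone2Store:
    """Authoritative store for Layers 0-3; Layer 4 is retrieved, not stored."""

    def __init__(self, cache: Zone1Cache) -> None:
        self._layers: dict[int, dict] = {n: {} for n in range(4)}
        self._cache = cache

    def write(self, layer: int, key: str, value) -> None:
        """Apply an authoritative change, then push the update down to Zone 1."""
        self._layers[layer][key] = value
        self._cache.refresh(self._layers)
```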

During Phase 1 (launch), no Zone 1 or Zone 2 deployments exist for any subscriber. The full MoC resides in BlueMirror’s platform infrastructure that wraps Zone 3 (the cloud reasoning layer), and inference runs through Zone 3 under a healthcare data processing agreement (DPA). For transient queries, Zone 3 processes the MoC and does not retain it beyond the inference request lifecycle. For subscribers on the Zone 3-only path, the persistent MoC resides in Zone 3 storage under the DPA, with the same retention and use restrictions extended to persistent data. That path covers every subscriber during Phase 1, and it remains the path for subscribers who never acquire a Local Pane and never have a Community Pane in their region. As Zone 1 deploys for subscribers who acquire a Local Pane (Phase 2) and Zone 2 deploys for served regions (Phase 3), MoC residency for those subscribers shifts toward the target architecture. Zone 3 residency continues for the Zone 3-only deployment path.

The MoC Router

The five-layer hierarchy is static architecture. The intelligence is in the routing. The MoC Router is a 150-million-parameter small language model (SLM) that receives the raw query, analyzes the domain, the complexity, and the context requirements, and produces an activation plan: which layers to load, which sub-sections within each layer, and the token budget per layer. The router has to run in under 25 milliseconds because it gates everything that follows. A slow router means a slow system.

The router performs three analyses on every query. First, domain classification: is this a health question, a financial question, a scheduling question, a social question, or a cross-domain question? Second, complexity assessment: does this require deep knowledge (Layer 3) or can it be answered from identity and preferences (Layers 0-2)? Third, document dependency: does the query reference a specific document or record that requires RAG retrieval (Layer 4)?

These three analyses produce the activation plan in a single forward pass. The router does not make three sequential decisions. It produces a structured output that specifies layer activation, section selection, and token budget simultaneously. The structured output format is fixed: a binary activation flag per layer, a section mask for Layer 3, and a token ceiling per activated layer. This fixed structure means parsing the router’s output adds negligible latency.
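
One way to picture the fixed output format is the sketch below. It is not the router itself, which is a learned model; it only shows a possible shape for the plan. The `Section` names, the `build_plan` helper, and the choice of ceilings (the approximate layer sizes from the text, with the upper end of the Layer 4 range) are assumptions.

```python
from dataclasses import dataclass
from enum import IntFlag


class Section(IntFlag):
    """Bit mask over Layer 3 sections; the section names are illustrative."""
    NONE = 0
    HEALTH = 1
    FINANCIAL = 2
    FAMILY = 4
    HOME = 8
    COGNITIVE = 16


@dataclass(frozen=True)
class ActivationPlan:
    layer_active: tuple[bool, ...]   # one flag per layer, Layers 0-4
    layer3_sections: Section         # section mask for Layer 3
    token_ceiling: tuple[int, ...]   # per-layer ceiling, 0 when inactive


# Approximate layer sizes from the text; Layer 4 uses the top of its range.
LAYER_CEILINGS = (100, 200, 500, 1_000, 2_000)


def build_plan(sections: Section, needs_session: bool,
               needs_preferences: bool, needs_documents: bool) -> ActivationPlan:
    """Assemble a plan from the router's decisions; Layer 0 is always active."""
    active = (True, needs_session, needs_preferences, bool(sections), needs_documents)
    ceilings = tuple(c if on else 0 for c, on in zip(LAYER_CEILINGS, active))
    return ActivationPlan(active, sections, ceilings)


plan = build_plan(Section.HEALTH | Section.FINANCIAL,
                  needs_session=True, needs_preferences=True, needs_documents=False)
print(plan)
```

Because every field has a fixed position and type, a consumer can read the plan without any free-text parsing.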

The router’s accuracy is the single most important quality metric in the personalization architecture. A router that loads too much wastes tokens and increases latency. A router that loads too little produces generic or irrelevant responses. The target is 85 percent token reduction against naive full-context loading with 95 percent relevance maintenance. In testing, the router achieves this consistently for single-domain queries and drops to roughly 80 percent token reduction for cross-domain queries that require sections from multiple Layer 3 domains.
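
The reduction targets translate into token budgets as in this small sketch. It takes 12,000 tokens as the naive full-context size, the figure this article uses in its worked examples; the function names are illustrative.

```python
NAIVE_CONTEXT_TOKENS = 12_000  # naive full-context size used in the article's examples


def token_budget(naive_tokens: int, reduction_percent: int) -> int:
    """Tokens that may still be loaded after cutting reduction_percent of the naive context."""
    return naive_tokens * (100 - reduction_percent) // 100


def achieved_reduction(loaded_tokens: int, naive_tokens: int) -> float:
    """Percentage of the naive context that was not loaded."""
    return 100 * (1 - loaded_tokens / naive_tokens)


print(token_budget(NAIVE_CONTEXT_TOKENS, 85))  # budget at the 85 percent target
print(token_budget(NAIVE_CONTEXT_TOKENS, 80))  # budget at the 80 percent cross-domain figure
```

At the 85 percent target the budget is 1,800 tokens; at 80 percent it is 2,400.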

The cross-domain case is instructive. When Margaret asks “Can I afford the low-sodium meal delivery service Dr. Patel recommended?” the router activates: Layer 0 (always), Layer 1 (current session), the health section of Layer 2 (dietary preferences), the health section of Layer 3 (the specific recommendation from Dr. Patel, the sodium restriction), and the financial section of Layer 3 (income, budget, benefit coverage for nutrition services). Total activation: roughly 1,200 tokens. Naive full context: 12,000 tokens. The router identified the two relevant domains, loaded the relevant sections from each, and skipped everything else.

What this means for the person

Margaret does not know any of this. She asks a question. She gets an answer that knows her name, her medication list, her preferences, and her financial situation, in the right combination for the question she asked. The answer arrives in under two seconds. The system used 1,800 tokens of context instead of 12,000.

What this means for the evaluator is different. The token reduction translates directly to cost. At current API pricing, the difference between 12,000 tokens per query and 1,800 tokens per query is the difference between a system that costs $15 per user per month in inference alone and a system that costs $2.25. At scale, that difference is the unit economics.
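
The cost figures follow from a simple proportionality, sketched below. The model assumes inference cost scales linearly with context tokens and ignores output tokens and fixed overhead; that simplification is an assumption of this sketch, and the function name is illustrative.

```python
def scaled_monthly_cost(full_context_cost: float, loaded_tokens: int,
                        naive_tokens: int) -> float:
    """Scale a per-user monthly cost by the share of context actually loaded."""
    return full_context_cost * loaded_tokens / naive_tokens


cost = scaled_monthly_cost(full_context_cost=15.00, loaded_tokens=1_800, naive_tokens=12_000)
print(f"${cost:.2f} per user per month")
```

With the article's numbers, 1,800 of 12,000 tokens is 15 percent of the context, and 15 percent of $15 is $2.25.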

The token reduction also translates to quality. Attention mechanisms degrade with irrelevant context. The model that reads 12,000 tokens of which 1,800 are relevant is attending to 10,200 tokens of noise. The model that reads only the 1,800 relevant tokens attends to signal. The response quality improvement is measurable in A/B testing, particularly for cross-domain queries where irrelevant context from one domain can confuse reasoning in another.

Limitations

The MoC Router is a model, and models make mistakes. The router occasionally underloads, producing a response that lacks context the person expected the system to have. “You should have known about my potassium restriction” is a failure mode that occurs when the router classified a nutrition query as simple and skipped the health section of Layer 3. The current underload rate in testing is approximately 3 percent of queries. Each underload is logged, and the router’s training data is augmented with the missed case.

The router also occasionally overloads, pulling context that the query does not need. Overloading wastes tokens and adds latency but does not produce wrong answers. It is the less dangerous failure mode, and the system tolerates a higher overload rate (roughly 8 percent) than underload rate.
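
The two failure modes can be told apart by comparing what the router loaded with what the query turned out to need. The sketch below assumes the needed sections are known after the fact, for example from review of a logged complaint; the function names and labels are hypothetical.

```python
from collections import Counter


def classify(loaded: set[str], needed: set[str]) -> str:
    """Label one routing decision.

    Underload takes priority because it is the more dangerous failure:
    a decision that both misses and over-pulls context counts as an underload.
    """
    if needed - loaded:
        return "underload"
    if loaded - needed:
        return "overload"
    return "exact"


def failure_rates(decisions: list[tuple[set[str], set[str]]]) -> dict[str, float]:
    """Fraction of decisions in each category, given (loaded, needed) pairs."""
    if not decisions:
        raise ValueError("no decisions to evaluate")
    counts = Counter(classify(loaded, needed) for loaded, needed in decisions)
    return {label: counts[label] / len(decisions)
            for label in ("underload", "overload", "exact")}


# The potassium example: a nutrition query routed without the health section.
print(classify(loaded={"dietary_preferences"}, needed={"dietary_preferences", "health"}))
```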

Cold start is a real constraint. A new user has no Layer 2 (no learned preferences) and a sparse Layer 3 (only what onboarding captures). The system fills these layers over time through interaction, but the first 50 interactions are served primarily from Layers 0 and 1 with starter template defaults standing in for learned preferences. The cold start problem and its mitigations are the subject of BMT-05.02.
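
A minimal sketch of the cold-start fallback, assuming preferences are simple key-value pairs. The 50-interaction window comes from the text; the template keys and values, the function name, and the rule that learned values override defaults during the window are assumptions. The actual mitigations are described in BMT-05.02.

```python
COLD_START_INTERACTIONS = 50  # "the first 50 interactions", per the text

# Hypothetical starter defaults; the real templates are not described here.
STARTER_TEMPLATE = {
    "presentation_order": "recommendation_first",
    "appointment_time": "no_preference",
}


def layer2_preferences(learned: dict[str, str], interaction_count: int) -> dict[str, str]:
    """Return the preferences to render as Layer 2 context.

    During cold start, defaults fill the gaps and anything already learned wins.
    After cold start, only learned preferences are used.
    """
    if interaction_count < COLD_START_INTERACTIONS:
        return {**STARTER_TEMPLATE, **learned}
    return dict(learned)


early = layer2_preferences({"presentation_order": "data_first"}, interaction_count=10)
print(early)
```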

The five-layer hierarchy itself is a design decision, not a law of nature. Five layers may not be the right decomposition for every population or every use case. The current hierarchy was designed for aging adults whose context is dominated by health, financial, and family complexity. A system serving a different population might need a different layer structure. The architecture supports reconfiguration, but the current implementation is tuned for the population BlueMirror serves.

A note on status: the five-layer hierarchy and the MoC Router are designed and specified, and the router is in active development. The sparse activation algorithms described in the partner appendix represent the current architectural specification, not production benchmarks. The 85 percent token reduction and 95 percent relevance targets are from controlled testing, not deployment. Production performance will be validated during the first deployment phase over the next twelve months.

Cross-References

BMT-02.01 The Brain and the Hands. The H-layer/L-layer orchestration architecture that consumes MoC context packages for strategic reasoning and task execution.

BMT-02.04 How a Request Becomes an Action. MoC routing as Step 2 of the full request trace, showing where context activation fits in the orchestration pipeline.

BMT-05.02 How the System Learns You. P-RLHF as the mechanism that populates Layer 2 with learned preferences, the layer where personalization lives.

BMT-06.01 Why Thirty Models, Not One. The MoC Router as one of the 30 SLMs in the intelligence portfolio, with its architecture selection rationale.

Technical Appendix BMT-05.01-A is available to partners and investors at partners.bluemirror.tech.