BMT-02.05 Executive Summary
BlueMirror.tech | May 2026
Aiyana Whitehorse runs an applied research group preparing a clinical study of personalized AI in geriatric care. Her institution will not approve a deployment without understanding how BlueMirror models the people it serves. She is not asking whether personalization works. She is asking what BlueMirror calls personalization, because the word covers a wide range of implementations in the literature, and the implementation details are what clinical review committees evaluate.
The implementation is Personalized Reinforcement Learning from Human Feedback. Standard RLHF learns what humans in aggregate prefer. It produces responses that an average user finds acceptable. P-RLHF learns what a specific person prefers and maintains a separate preference model for each subscriber. Margaret prefers data first, recommendation second. Dorothy prefers the recommendation first, explanation only if she asks. Both receive responses shaped to their own preferences. The difference is the entire product. A personal AI that averages across users is a chatbot. A personal AI that maintains individual preference models is something structurally different.
The preference vector is bounded and specific. It does not contain Margaret’s medical history, financial situation, or family relationships. Those belong to the Mixture of Context. The preference vector contains how she likes to be addressed, what level of detail she expects, how much hedging she tolerates, what tone fits her, and how she responds to different framings of the same content. Each vector is approximately 50 kilobytes. For 100,000 subscribers, that comes to about 5 gigabytes of preference data, well inside the system’s storage allocation.
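The contents and the storage arithmetic above can be sketched in a few lines. This is a minimal illustration, not BlueMirror's schema: the class name, field names, and value ranges are assumptions, and decimal units (1 KB = 1,000 bytes) are assumed for the size estimate.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class PreferenceVector:
    """Hypothetical container for the presentation preferences named in the text."""
    form_of_address: str                 # how she likes to be addressed
    detail_level: float                  # 0.0 = headline only, 1.0 = full detail
    hedging_tolerance: float             # 0.0 = none tolerated, 1.0 = heavy hedging accepted
    tone: str                            # label for the tone that fits her
    framing_response: Dict[str, float] = field(default_factory=dict)  # framing -> response score


BYTES_PER_SUBSCRIBER = 50_000  # "approximately 50 kilobytes", decimal units assumed


def total_storage_gb(subscribers: int, bytes_each: int = BYTES_PER_SUBSCRIBER) -> float:
    """Total preference storage in decimal gigabytes."""
    return subscribers * bytes_each / 1_000_000_000


margaret = PreferenceVector("Margaret", detail_level=0.9, hedging_tolerance=0.2, tone="direct")
print(total_storage_gb(100_000))  # 5.0
```

Nothing in the sketch holds medical, financial, or family data, which mirrors the boundary between the preference vector and the Mixture of Context.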
The learning cycle has six steps and runs continuously. A request arrives. The system delivers a response shaped to the current preference estimate. The outcome is observed. Explicit feedback, the thumbs-up or thumbs-down, is one signal, but it is not the primary one because most users do not provide it. Behavioral signals are more frequent and more honest: engagement length (did she read the full response or skim it), follow-up questions (did she ask for more or change the subject), action taken (did she accept the appointment preparation offer), time to next interaction (did she return shortly or stay quiet). These signals demonstrate preferences whether or not the person explicitly states them. The model updates after each interaction, writing to Zone 2 and propagating a lightweight cache to Zone 1 for subscribers with a Local Pane. The change takes effect on the next interaction.
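One way to picture the update step is a small reward-and-nudge loop. The sketch below is an illustration under stated assumptions: the signal weights, the learning rate, and the blending of explicit feedback are all hypothetical, and the actual behavioral signal weighting methodology is described in the full article, not here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Outcome:
    read_fraction: float             # engagement length: 0.0 skimmed, 1.0 read in full
    asked_for_more: bool             # follow-up question rather than a change of subject
    accepted_action: bool            # e.g. accepted the appointment preparation offer
    returned_soon: bool              # short time to next interaction
    explicit: Optional[int] = None   # +1 thumbs-up, -1 thumbs-down, None when absent


# Hypothetical weights, chosen only for illustration; they sum to 1.0.
WEIGHTS = {"read": 0.4, "more": 0.2, "action": 0.3, "returned": 0.1}


def reward(o: Outcome) -> float:
    """Combine behavioral signals into a score in [-1, 1]."""
    behavioral = (
        WEIGHTS["read"] * o.read_fraction
        + WEIGHTS["more"] * o.asked_for_more
        + WEIGHTS["action"] * o.accepted_action
        + WEIGHTS["returned"] * o.returned_soon
    )
    score = 2.0 * behavioral - 1.0   # map [0, 1] onto [-1, 1]
    if o.explicit is not None:
        # Explicit feedback is one signal among several, so it is blended, not decisive.
        score = 0.5 * score + 0.5 * o.explicit
    return score


def update(estimate: float, shown: float, r: float, lr: float = 0.1) -> float:
    """Move the estimate toward the style that was shown when r > 0, away when r < 0."""
    new = estimate + lr * r * (shown - estimate)
    return min(1.0, max(0.0, new))


# A detailed response (0.9) was read in full and acted on: the detail estimate rises.
detail = update(0.5, shown=0.9, r=reward(Outcome(1.0, True, True, True)))
```

Because the update runs once per interaction, the new estimate is what shapes the next response, which matches the description of the change taking effect on the next interaction.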
The zone distinction matters during Phase 1. The P-RLHF learning mechanism is identical regardless of the inference substrate. Every Zone 3 interaction during Phase 1 generates training signal as well as a response. The labeled interaction record produced by each Zone 3 exchange feeds the proprietary SLM training pipeline. Individual learning and population-level model improvement share the same data source. The consent architecture, described in Series 05, gives the subscriber independent control over both: she can opt out of contributing to model improvement without affecting her P-RLHF personalization, and she can adjust her personalization without changing her contribution to model training.
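The independence of the two controls can be illustrated with two separate flags that route the same interaction record to different destinations. The flag names and sink labels below are hypothetical; the actual consent architecture is the one described in Series 05.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Consent:
    personalization: bool = True     # update her own P-RLHF preference model
    model_improvement: bool = True   # contribute records to SLM training


def route(consent: Consent) -> List[str]:
    """Return the destinations an interaction record may be sent to."""
    sinks = []
    if consent.personalization:
        sinks.append("preference_update")
    if consent.model_improvement:
        sinks.append("slm_training")
    return sinks


# Opting out of model improvement leaves personalization untouched.
print(route(Consent(model_improvement=False)))  # ['preference_update']
```

Keeping the flags separate, instead of deriving one from the other, is what lets either setting change without side effects on the other.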
Learned preferences transfer across domains with calibrated conservatism. Margaret’s preference for data-first responses, learned from medication discussions, transfers to Social Security optimization discussions. Her preference for morning appointments does not transfer to grocery delivery timing, because those are different kinds of preferences. The first reflects how her cognitive energy distributes across the day for attentive activities. The second reflects her household routine and storage logistics. The system maintains a domain similarity matrix, learned from population patterns and refined per individual, that tracks which preferences transfer and with what confidence. Each transferred preference is tried once. Positive signal reinforces the transfer. Negative or no signal weakens it. The calibration runs in the background and is not surfaced to the person.
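The transfer rule can be sketched as a lookup of transfer confidence followed by a small adjustment after the single trial. The matrix layout, the threshold, and the step size are assumptions made for illustration, as are the example confidence values.

```python
from typing import Dict, Optional, Tuple

# Hypothetical entries: (preference, source domain, target domain) -> transfer confidence.
Key = Tuple[str, str, str]
similarity: Dict[Key, float] = {
    ("data_first", "medication", "social_security"): 0.8,
    ("morning_slot", "appointments", "grocery_delivery"): 0.2,
}

TRANSFER_THRESHOLD = 0.5  # assumed cutoff for trying a transfer


def should_transfer(key: Key) -> bool:
    """Try a transfer only when confidence clears the threshold; unknown pairs do not transfer."""
    return similarity.get(key, 0.0) >= TRANSFER_THRESHOLD


def record_trial(key: Key, signal: Optional[str], step: float = 0.1) -> float:
    """Positive signal reinforces the transfer; negative or no signal weakens it."""
    current = similarity.get(key, 0.0)
    if signal == "positive":
        current = min(1.0, current + step)
    else:
        current = max(0.0, current - step)
    similarity[key] = current
    return current
```

Defaulting unknown pairs to zero is one way to express the calibrated conservatism described above: a preference stays in its own domain until there is evidence that it carries over.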
The cold start problem for new users is addressed through three mechanisms. Starter templates draw on population-level patterns for similar demographics and self-reported preferences from onboarding. Rapid override produces enough signal from the first fifty interactions to begin replacing template defaults. By interaction 100, individual preferences dominate. By interaction 500, the system knows Margaret’s communication style, risk tolerance, and decision-making patterns better than most family members. Explicit preference setting provides a third path: Margaret can state a preference in plain language, and the system incorporates it immediately. If her stated preference contradicts her behavioral pattern consistently, the system surfaces the contradiction and lets her resolve it.
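The handoff from template defaults to individual preferences can be pictured as a weight that grows with interaction count. The weighting function and the constant below are assumptions, chosen so that the template and the learned value are weighted equally at 50 interactions and the learned value dominates by 100; the source does not specify the actual schedule.

```python
from typing import Optional

K = 50  # assumed half-weight point, in interactions


def individual_weight(n: int, k: int = K) -> float:
    """Share of the estimate taken from the individual's own signal after n interactions."""
    return n / (n + k)


def effective_preference(
    template: float, learned: float, n: int, stated: Optional[float] = None
) -> float:
    """Blend template and learned values; an explicitly stated preference applies immediately."""
    if stated is not None:
        return stated
    w = individual_weight(n)
    return (1.0 - w) * template + w * learned


for n in (0, 50, 100, 500):
    print(n, round(individual_weight(n), 3))
```

Under these assumptions the individual share is 0 at onboarding, 0.5 at 50 interactions, about 0.667 at 100, and about 0.909 at 500. A stated preference bypasses the blend entirely, which corresponds to the third mechanism in the text.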
The retention economics follow directly from the learning model. The preference vector is non-transferable. A competitor entering the market today can build a comparable architecture in eighteen months. The competitor cannot build three years of Margaret’s accumulated preference signal in eighteen months, because that signal takes three years of actual interaction with Margaret to accumulate. The time-based moat is structural rather than aspirational: it is a property of how preference learning works, not a claimed competitive advantage.
The full article, including the federated learning protocol and the behavioral signal weighting methodology, is at BlueMirror.tech.
