How the System Learns You

Tomoko Sato spent six years building recommendation engines at a streaming platform before joining a healthcare AI company. She understood the difference between population preferences and individual preferences at a mathematical level, and it frustrated her that every system she worked on was tuned for the former. The recommendation engine learned what humans prefer. Not what this human prefers. The distinction sounded subtle. It was the entire product.

At the streaming platform, the population model predicted that viewers who watched documentary X would enjoy documentary Y. The prediction was right 60 percent of the time across the population. For any specific viewer, the accuracy was lower. Tomoko’s mother, who watched the same documentaries Tomoko did, wanted something completely different afterward. The population model could not distinguish between them because they occupied the same cluster. Both were Japanese-American women in their sixties who watched nature documentaries. The model saw the cluster. It did not see the person.

When Tomoko evaluated BlueMirror’s personalization architecture, she found something she had not seen before: a preference learning system that starts with the population and then replaces it, individual by individual, until the population model is mostly irrelevant. BlueMirror calls it P-RLHF: Personalized Reinforcement Learning from Human Feedback. Standard RLHF asks what humans prefer. P-RLHF asks what Margaret prefers. The difference is not a tuning parameter. It is a different reward model, a different update loop, and a different convergence target.

Standard RLHF trains a model that is good for the average person and wrong for every specific person. P-RLHF trains a model that is wrong for the average person and right for one.

Population RLHF produces a model tuned to aggregate preferences. Humans generally prefer shorter responses, so the model defaults to brevity. Humans generally prefer recommendations before analysis, so the model leads with “You should call Dr. Patel.” Humans generally prefer positive framing, so the model softens bad news. These defaults are reasonable for a population. They are wrong for Margaret, who wants detailed explanations with data, who wants to see the blood pressure trend before anyone tells her what to do, and who wants the bad news stated plainly because she spent forty years as a teacher and does not need to be managed.

P-RLHF maintains a per-person preference model that learns from this person’s interactions, not from the population. The preference model starts with population defaults, the starter template, and rapidly overrides them with individual signals. By interaction 50, the starter template is mostly replaced. By interaction 500, the system knows Margaret’s communication style, risk tolerance, decision-making patterns, and domain-specific preferences better than most of her family.

The learning loop

P-RLHF operates on a six-step loop per interaction. Step one: the context query arrives and the MoC Router selects the relevant context layers, as described in BMT-05.01. Step two: the system provides the selected context plus predicted preferences from the current individual model to the response generator. The predicted preferences tell the generator how to shape the response: detailed or brief, data-first or recommendation-first, formal or casual, with quantified confidence per preference dimension.

Step three: the concierge agent personalizes the response using both context and predicted preferences. The personalization is not cosmetic. It shapes structure, ordering, depth, and framing. A response to “What are the side effects of my new medication?” looks fundamentally different for a person who wants clinical data first versus a person who wants reassurance first. Same medical content. Different architecture of presentation. Step four: the outcome is observed through two signal types. Explicit feedback includes thumbs up or down indicators, verbal approval (“That is exactly what I needed”), or verbal correction (“No, I want the actual numbers, not a summary”). Behavioral signals include engagement length, follow-up questions, actions taken after the interaction, and conversations abandoned.

Step five: the individual preference model updates. The update is immediate, local, and applies to this person only. It does not affect any other user’s model. The update is also proportional: a single signal does not swing a preference dimension. The Bayesian posterior shifts gradually unless the signal is a high-confidence verbal correction, in which case the shift is larger. Step six: optionally, an anonymized signal from the interaction contributes to the population model through federated learning. This contribution is delayed, de-identified, and privacy-preserving. The population model improves slowly from aggregate patterns. The individual model improves quickly from personal patterns.
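The six steps can be sketched as a single function. This is a minimal illustration under stated assumptions, not BlueMirror's implementation: `IndividualModel`, `Signal`, `run_interaction`, the score-per-dimension representation, and the `step_size` values are all hypothetical names and choices, and the router and response generator are reduced to string stand-ins.

```python
from dataclasses import dataclass, field


@dataclass
class IndividualModel:
    # Hypothetical representation: one score in [0, 1] per preference dimension.
    prefs: dict = field(default_factory=lambda: {"detail": 0.5, "data_first": 0.5})


@dataclass
class Signal:
    dimension: str    # which preference dimension the signal speaks to
    direction: float  # 1.0 = wants more of it, 0.0 = wants less
    step_size: float  # small for passive behavior, larger for a verbal correction


models: dict[str, IndividualModel] = {}  # one model per person, never shared
population_queue: list[dict] = []        # delayed, de-identified contributions


def run_interaction(user_id, query, observe, share_with_population=False):
    model = models.setdefault(user_id, IndividualModel())

    context = f"context for: {query}"   # step 1: stand-in for context selection
    predicted = dict(model.prefs)       # step 2: predicted preferences
    style = "detailed" if predicted["detail"] >= 0.5 else "brief"
    response = f"[{style}] {context}"   # step 3: personalized response

    signal = observe(response)          # step 4: explicit or behavioral outcome
    if signal is not None:
        # Step 5: immediate, local, proportional update for this person only.
        old = model.prefs[signal.dimension]
        model.prefs[signal.dimension] = old + signal.step_size * (signal.direction - old)
        if share_with_population:
            # Step 6: optional contribution that carries no user identifier.
            population_queue.append(
                {"dimension": signal.dimension, "direction": signal.direction}
            )
    return response
```

Calling `run_interaction("margaret", "side effects?", lambda r: Signal("detail", 1.0, 0.1))` nudges only Margaret's `detail` score. Every other entry in `models` is untouched, which is the isolation property step five describes.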

The six-step loop runs on every interaction. At 50 interactions per day, that is 50 preference model updates per day, 350 per week, 1,500 per month. By month three, the system has processed roughly 4,500 signals. The preference model at that point is dense with observation, not sparse with assumption.

Behavioral signals over explicit feedback

People rarely click feedback buttons, but they show their preferences through behavior constantly. The signal taxonomy therefore leans on what the person does in an interaction rather than on what she reports preferring in the abstract, because the two frequently diverge.

Engagement depth: Margaret asked two follow-up questions after a brief response. The signal says she wanted more detail. The preference model adjusts toward longer, more detailed responses in this domain. Conversation termination: Margaret ended the conversation immediately after a detailed response. The signal says she wanted less. Action taken: Margaret called Dr. Patel within an hour of the system’s recommendation. The signal says the recommendation format worked for her. Action not taken: Margaret ignored the suggestion to switch pharmacies for two weeks. The signal says the recommendation was either unwanted or poorly timed, but the system cannot determine which without more data, so the confidence adjustment is small.

Correction behavior: Margaret said “No, I want to see the actual numbers.” This is the highest-confidence signal type because it is an explicit statement of preference. The preference model assigns maximum weight to verbal corrections. Repetition: Margaret asked the same question phrased differently. The signal says the first response did not match her intent. The system logs the miss and adjusts the intent classification for future similar queries.

Each signal type carries a confidence weight. Explicit verbal feedback has the highest weight. Passive behavioral signals have lower weight but accumulate through volume. Twenty instances of Margaret asking follow-up questions after brief responses carry the same weight as one verbal correction saying “give me more detail.” The accumulation model is Bayesian: each signal updates a posterior probability for each preference dimension.
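One way to picture the weighted accumulation is a Beta posterior per preference dimension, where each signal adds a weighted pseudo-count. This is a sketch under assumptions: the source says only that the model is Bayesian, so the Beta form, the `BetaPreference` class, the signal type names, and the numeric weights are all illustrative. The weights are chosen so that twenty follow-up questions equal one verbal correction, as described above.

```python
from dataclasses import dataclass

# Hypothetical confidence weights per signal type (assumed values).
SIGNAL_WEIGHTS = {
    "verbal_correction": 1.0,    # explicit statement of preference
    "follow_up_question": 0.05,  # passive signal; 20 of these equal one correction
    "action_not_taken": 0.02,    # ambiguous cause, so a small adjustment
}


@dataclass
class BetaPreference:
    """Posterior belief that the person prefers, for example, detailed responses."""
    alpha: float = 1.0  # pseudo-count of evidence for the preference
    beta: float = 1.0   # pseudo-count of evidence against it

    def update(self, signal_type: str, supports: bool) -> None:
        weight = SIGNAL_WEIGHTS[signal_type]
        if supports:
            self.alpha += weight
        else:
            self.beta += weight

    @property
    def mean(self) -> float:
        # Posterior mean of a Beta(alpha, beta) distribution.
        return self.alpha / (self.alpha + self.beta)


# Twenty passive signals and one verbal correction reach the same posterior.
passive = BetaPreference()
for _ in range(20):
    passive.update("follow_up_question", supports=True)

explicit = BetaPreference()
explicit.update("verbal_correction", supports=True)
```

Under these assumed weights both `passive.mean` and `explicit.mean` move from 0.5 to about 0.67, while a single follow-up question alone barely shifts the estimate.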

Cross-domain transfer

Preferences learned in one domain can inform another because the same person is being served. Margaret’s preference for “data first, recommendation second” learned in health interactions transfers to financial discussions. Her trust calibration (“show me the source”) transfers from health to legal. Her decision-making tempo (deliberate, dislikes being rushed) transfers everywhere.

But not everything transfers. Margaret’s tolerance for uncertainty in grocery substitutions (high: she will try a different brand) does not predict her tolerance for uncertainty in medication changes (low: she wants to know every side effect before switching). The transfer model uses a domain similarity matrix that scores how likely a preference is to be shared across domains. Health to financial: high similarity for communication style, moderate for decision-making patterns. Health to entertainment: low similarity for most preferences. The matrix is learned from population-level patterns and refined individually as the system observes where Margaret’s cross-domain preferences diverge from the population.

A concrete example clarifies. The health concierge learned over 40 interactions that Margaret processes bad news better when it is presented with context and data rather than softened with reassurance. When the financial concierge needed to inform her that her Medicare Advantage plan was increasing premiums by $47 per month, the transfer model applied the “data-first, unsoftened” communication style learned in health. The financial concierge presented the dollar amount, the effective date, the comparison to alternative plans, and the enrollment window. It did not open with “I know this is difficult.” Margaret’s response confirmed the transfer was correct: she asked a follow-up question about the alternative plans. If the transfer had been wrong, if Margaret wanted softer framing for financial news than for health news, her behavioral signal (terminating the conversation, expressing frustration) would have updated the domain similarity matrix and reduced future health-to-financial communication style transfer.

The transfer model prevents the system from learning Margaret’s preferences independently in each domain, which would take fifteen times as long. It also prevents the system from assuming that preferences transfer when they do not, which would produce surprising and frustrating mispersonalization. The balance between these two failure modes is the design challenge, and the domain similarity matrix is the mechanism that navigates it.
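The role of the domain similarity matrix can be sketched as a weighted blend plus a feedback rule. The matrix values, the function names `transferred_estimate` and `refine_similarity`, the linear blend, and the refinement rate are assumptions made for illustration; the source specifies only that similarity is scored per preference and refined per individual.

```python
# Hypothetical similarity scores in [0, 1], keyed by (source, target) domain.
SIMILARITY = {
    ("health", "financial"): {"communication_style": 0.8, "decision_making": 0.5},
    ("health", "entertainment"): {"communication_style": 0.2, "decision_making": 0.2},
}


def transferred_estimate(source_value: float, target_default: float, similarity: float) -> float:
    """Blend a preference learned in the source domain with the target default.

    High similarity leans on what was learned elsewhere; low similarity
    falls back to the target domain's own default.
    """
    return similarity * source_value + (1.0 - similarity) * target_default


def refine_similarity(similarity: float, confirmed: bool, rate: float = 0.1) -> float:
    """Move the individual's similarity score toward 1 when a transfer is
    confirmed by behavior and toward 0 when it is contradicted."""
    target = 1.0 if confirmed else 0.0
    return similarity + rate * (target - similarity)


# "Data-first, unsoftened" was learned strongly in health (0.9 on a 0-1 scale).
sim = SIMILARITY[("health", "financial")]["communication_style"]
financial_style = transferred_estimate(source_value=0.9, target_default=0.5, similarity=sim)

# A follow-up question about alternative plans confirms the transfer.
sim_after_confirmation = refine_similarity(sim, confirmed=True)
```

With these assumed numbers the financial concierge starts at roughly 0.82 rather than the 0.5 default. A contradicted transfer would lower the health-to-financial score and weaken later transfers of communication style.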

The cold start problem

A new user has no individual preference model. The system must serve her from the first interaction using only what onboarding captured (Layer 0) and population defaults. Three mechanisms address this gap.

Starter templates are not personality types. They are practical defaults segmented by a narrow set of onboarding questions. “Do you prefer detailed explanations or quick summaries?” “Do you like to make your own decisions or hear recommendations?” Three questions during onboarding, not thirty. The onboarding is deliberately brief because long intake questionnaires create a barrier to adoption, and the answers to hypothetical preference questions are less reliable than observed behavior. The selected template provides reasonable defaults for communication style, response depth, and decision-making framing across all fifteen concierge domains.

Rapid override means the first 50 interactions generate enough behavioral signal to begin replacing starter defaults. The system tracks how quickly each preference dimension stabilizes, and it tracks the gap between template prediction and observed behavior. Communication style (brief vs. detailed, formal vs. casual) stabilizes in roughly 20 interactions. Decision-making patterns (autonomous vs. collaborative, fast vs. deliberate) take roughly 100 interactions. Domain-specific preferences (which information matters in health vs. finance, how much risk tolerance varies by domain) take 200 or more interactions in each domain.
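Rapid override can be illustrated by treating the starter template as a fixed number of pseudo-observations that real observations gradually outweigh. This is a sketch, not the documented mechanism: `prior_strength`, its value of 10, and the function names are assumptions, picked so that the template's influence falls below 20 percent by interaction 50.

```python
def template_weight(n_observed: int, prior_strength: float = 10.0) -> float:
    """Share of the estimate still coming from the starter template.

    prior_strength is a hypothetical knob: the number of observations
    the template is treated as being worth.
    """
    return prior_strength / (prior_strength + n_observed)


def blended_preference(template_value: float, observed_mean: float,
                       n_observed: int, prior_strength: float = 10.0) -> float:
    """Combine the template default with the mean of observed signals."""
    w = template_weight(n_observed, prior_strength)
    return w * template_value + (1.0 - w) * observed_mean


def template_gap(template_value: float, observed_mean: float) -> float:
    """Gap between what the template predicted and what behavior shows."""
    return abs(template_value - observed_mean)


# The template guessed "brief" (0.2), but 50 interactions average 0.9 toward "detailed".
estimate = blended_preference(template_value=0.2, observed_mean=0.9, n_observed=50)
```

At zero interactions the template carries all the weight. At 50 it carries about one sixth under this assumed prior strength, and the estimate sits near 0.78, close to the observed behavior.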

Explicit preference setting allows the person to tell the system her preferences directly. “I always want to see the numbers.” These explicit settings carry maximum confidence weight, but the system can challenge them if behavioral signals consistently contradict them. “You told me you prefer brief answers, but you have asked for more detail eight times this week. Would you like me to give more detailed answers by default?” The challenge is gentle and infrequent. It happens only when the behavioral evidence is strong enough to justify it, and the person can override the challenge with a single confirmation. Her stated preference wins over observed behavior if she insists. The system learns. It does not overrule.
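The challenge rule can be sketched as a threshold check followed by a resolution that lets the stated preference win. The names `ExplicitSetting`, `should_challenge`, and `resolve_challenge` are hypothetical. The threshold of eight simply echoes the example above, and the choice never to re-challenge a reaffirmed setting is an assumption about what "gentle and infrequent" could mean.

```python
from dataclasses import dataclass

CHALLENGE_THRESHOLD = 8  # assumption, echoing the "eight times this week" example


@dataclass
class ExplicitSetting:
    dimension: str
    stated_value: str
    reaffirmed: bool = False  # True once the person has insisted on the setting


def should_challenge(setting: ExplicitSetting, contradicting_signals: int) -> bool:
    """Challenge only when the behavioral evidence is strong, and never
    after the person has already reaffirmed the stated preference."""
    if setting.reaffirmed:
        return False
    return contradicting_signals >= CHALLENGE_THRESHOLD


def resolve_challenge(setting: ExplicitSetting, keep_stated: bool,
                      observed_value: str) -> ExplicitSetting:
    """Apply the person's answer: her stated preference wins if she insists."""
    if keep_stated:
        setting.reaffirmed = True
    else:
        setting.stated_value = observed_value
    return setting


setting = ExplicitSetting(dimension="response_depth", stated_value="brief")
if should_challenge(setting, contradicting_signals=8):
    # The person answers: "No, keep it brief."
    resolve_challenge(setting, keep_stated=True, observed_value="detailed")
```

After this exchange the setting stays at `brief` and the system stops asking, however many contradicting signals accumulate later.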

What the system cannot learn

Honesty about limitations matters more in personalization than in most architectural claims, because the promise of personalization is intimacy with the user, and intimacy that overclaims erodes trust permanently.

The system cannot learn preferences the person has never expressed or demonstrated. It does not infer that Margaret would enjoy jazz because her demographic profile correlates with jazz listeners. It does not predict that she would prefer a particular brand of tea because women of her age in her region tend to buy it. Preferences are observed, not inferred from demographics. This is a constraint, not a weakness. Demographic inference is the mechanism that produces stereotyping, and the system refuses to use it.

The system cannot learn preferences that change without behavioral signal. Margaret decided last night that she no longer wants exercise recommendations. Until she tells the system or demonstrates the change through behavior (dismissing exercise suggestions, ending conversations when exercise is mentioned), the system does not know.

The system cannot learn preferences shaped by structural barriers. Margaret never asks about patient assistance programs because she does not know they exist. The system cannot learn a preference that was never given the opportunity to form. This connects to the equity architecture described in Series 11: the system has a responsibility to present options the person may not know about, particularly when structural barriers have prevented her from discovering them independently.

Cross-References

BMT-05.01 The Five Layers. Layer 2 as the materialized preference model that P-RLHF populates and continuously refines.

BMT-02.05 What the System Learns. The orchestration-level treatment of P-RLHF, showing how preference learning integrates with the H-layer/L-layer architecture.

BMT-10.04 The Retention Flywheel. P-RLHF as the mechanism that drives retention economics: the system that knows you better over time is the system you do not leave.

BMT-04.02 Earned Autonomy. Preference learning as the foundation for autonomy progression, where demonstrated trust in the system earns broader delegation authority.

Technical Appendix BMT-05.02-A is available to partners and investors at partners.bluemirror.tech.
