Attack Resistance

Chen Yang spends his professional life trying to break things before adversaries do. As a principal security researcher at a health technology consultancy, he has red-teamed a dozen AI platforms in the past three years, and his findings across all of them share a common characteristic: the attacks that worked were not the ones the platform architects feared. The feared attacks were the obvious ones, guarded against. The successful attacks were patient, iterative, and invisible in any single interaction. They accumulated.

When his team was engaged to review BlueMirror’s attack surface, he told his researchers to focus on the slow attacks. Not the brute-force attempts. Not the credential theft. The preference probes and the inference extractions and the commitment escalations that no single interaction would reveal as adversarial.

The foundational design decision

Most security architectures treat malicious actors as edge cases to defend against. The standard model: build the system for legitimate use, add security controls to block bad actors. BlueMirror’s architecture treats adversarial optimization as the default state of external agents, because that is how external agents actually behave: every external agent optimizes for its own objectives, and those objectives frequently diverge from the person’s even without malicious intent. An insurance agent optimizing for enrollment conversion is adversarial in the relevant sense even if its developers intended no harm. A vendor agent optimizing for margin is adversarial in the relevant sense even though margin optimization is a legitimate business objective.

Five attack categories define the threat model. Preference probing is systematic extraction of the person’s price sensitivity, brand loyalty, or health concerns through individually innocent questions. A vendor agent asks about preferred brands across five interactions, then about price ranges across three more, then about the competitor the person used last year. No single question reveals anything sensitive. The pattern reconstructs a detailed consumer profile that any advertiser would pay for.

Urgency manipulation creates artificial time pressure to bypass the person’s normal deliberation. “This offer expires in five minutes.” “The appointment slot will be gone if you don’t confirm now.” “This price is only available today.” Real urgency in most consumer interactions is rare. Manufactured urgency is a well-documented technique for forcing decisions before the person can evaluate alternatives.

Inference extraction asks enough small questions across domains to reconstruct a sensitive profile that no direct question would produce. Wake time plus morning medication plus exercise timing plus specialist visit frequency equals a health profile the agent was never permitted to request directly.

Commitment escalation gets the internal agent to agree to small commitments that incrementally imply larger ones. The internal agent accepts a 30-day trial, which implies accepting the cancellation process, which the external agent interprets as acceptance of a longer-term relationship, which it then uses to request context appropriate to an established relationship rather than a trial one. Each step seems reasonable given the previous step. The cumulative outcome was not authorized.

Trust laundering uses a trusted agent to bootstrap an untrusted one. A TIER_4D pharmacy agent vouches for a new data analytics agent affiliated with the same parent company. The data analytics agent enters at an elevated starting tier based on the attestation, despite having no behavioral record. The pharmacy agent has, in effect, transferred trust earned through legitimate behavior to an agent with different objectives.

Five defenses, all running continuously

Query analysis runs on every interaction through the Manipulation Detector, edge-side with low latency. It monitors the pattern of queries from each external agent across the full interaction history, not just the current exchange. Statistical analysis of query sequences identifies preference probing: an agent that asks about brands in one session, prices in the next, and competitive alternatives in the third is exhibiting a pattern that no individual question would reveal. When the pattern crosses a detection threshold, the agent’s interactions begin receiving less precise responses. The probing continues against increasingly noisy data.
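The cross-session tracking idea can be sketched as follows. This is a minimal illustration, assuming a simple counting rule in place of the statistical analysis described above: an agent is flagged once its full history covers a set number of distinct profiling categories. The class name, category labels, and threshold are hypothetical, not BlueMirror’s implementation.

```python
from collections import defaultdict

# Hypothetical category labels, taken from the example in the text
# (brands, prices, competitive alternatives).
PROBE_CATEGORIES = {"brand", "price", "competitor"}


class ProbeTracker:
    """Tracks which profiling categories each agent has asked about, across sessions."""

    def __init__(self, threshold=3):
        # Assumed rule: flag when this many distinct categories have been probed.
        self.threshold = threshold
        self.seen = defaultdict(set)

    def record(self, agent_id, category):
        """Record one query and return whether the agent is now flagged."""
        if category in PROBE_CATEGORIES:
            self.seen[agent_id].add(category)
        return self.is_flagged(agent_id)

    def is_flagged(self, agent_id):
        return len(self.seen[agent_id]) >= self.threshold
```

The point of the sketch is that the state lives per agent rather than per session, so three individually harmless questions spread over three sessions still add up.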

Urgency detection identifies artificial time pressure within interactions. The Manipulation Detector compares urgency claims against the normal timeline for the interaction type and against the agent’s manifest. A pharmacy agent that claims a prescription discount “expires tonight” is evaluated against two questions: is there a record in the agent’s manifest of time-limited offers, and has this agent made urgency claims before that turned out to be fabricated? If a pharmacy agent has made three urgency claims in 60 days and none of the deadlines had any evidentiary basis, the fourth urgency claim is stripped from the response the person sees. The offer is presented without the artificial pressure. The pattern is noted in the agent’s trust record.
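The history rule in the example above (three urgency claims in 60 days, none with any basis) might look like the sketch below. It covers only the track-record question and leaves out the manifest check. The type and function names are assumptions for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UrgencyClaim:
    days_ago: int    # how long ago the claim was made
    had_basis: bool  # whether the deadline had any evidentiary basis


def should_strip_urgency(history: List[UrgencyClaim],
                         window_days: int = 60,
                         min_claims: int = 3) -> bool:
    """Return True if a new urgency claim should be removed from the response.

    Assumed rule, from the text's example: at least `min_claims` prior claims
    inside the window, and none of them had any evidentiary basis.
    """
    recent = [c for c in history if c.days_ago <= window_days]
    return len(recent) >= min_claims and not any(c.had_basis for c in recent)
```

Under this rule, a single substantiated deadline in the window is enough to keep the agent’s next urgency claim visible. Whether a real system should be that forgiving is a design choice the source does not settle.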

Cumulative inference scoring tracks what an external agent could reconstruct from the totality of information it has received across all interactions, not just what it received in any one exchange. The score does not require any single response to have been inappropriate. It accumulates across individually permitted disclosures that in combination exceed a privacy threshold. When the cumulative score crosses the threshold, the Context Gate Controller begins introducing noise into future responses in the flagged dimensions. The agent’s profile of the person degrades. The person’s actual experience continues normally.
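A rough sketch of how such a score could accumulate, assuming each disclosed field carries a fixed point weight and the threshold is a simple sum. The weights, the threshold, and the class name are invented for illustration; the field names come from the inference extraction example earlier in this document.

```python
from collections import defaultdict

# Hypothetical sensitivity weights, in integer points.
FIELD_WEIGHTS = {
    "wake_time": 2,
    "morning_medication": 4,
    "exercise_timing": 2,
    "specialist_visit_frequency": 4,
}


class InferenceLedger:
    """Accumulates what each agent has learned across all interactions."""

    def __init__(self, threshold=10):
        self.threshold = threshold          # assumed privacy threshold
        self.disclosed = defaultdict(set)   # agent_id -> fields already disclosed

    def record(self, agent_id, field):
        # A set is used so that re-disclosing a field does not raise the score.
        self.disclosed[agent_id].add(field)

    def score(self, agent_id):
        return sum(FIELD_WEIGHTS.get(f, 0) for f in self.disclosed[agent_id])

    def over_threshold(self, agent_id):
        return self.score(agent_id) >= self.threshold
```

No single field in this table reaches the threshold on its own. The threshold is crossed only by the combination, which is the property the paragraph above describes.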

Commitment bounds enforcement prevents incremental escalation by making commitment limits explicit and immovable. Every commitment has hard bounds. A commitment to a 30-day trial cannot be used as a basis for claiming the authority to make longer-term commitments. No commitment can be extended or expanded without re-entering a new sandbox with fresh exploration bounds evaluated against the agent’s current trust tier. The agent that tries to use an existing commitment as the basis for a larger one finds that the membrane does not recognize the implication.
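One way to picture hard bounds is an immutable commitment record plus a check that never infers anything beyond it. The sketch below assumes a commitment is described by a scope and a maximum duration. All names here are hypothetical.

```python
from dataclasses import dataclass


class NewSandboxRequired(Exception):
    """Raised when an agent tries to widen an existing commitment."""


@dataclass(frozen=True)  # frozen: bounds cannot be mutated after creation
class Commitment:
    agent_id: str
    scope: str      # e.g. "trial"
    max_days: int   # hard upper bound on duration


def within_bounds(commitment: Commitment, scope: str, days: int) -> bool:
    """A request is honored only if it matches the scope and stays inside the bound."""
    return scope == commitment.scope and days <= commitment.max_days


def extend(commitment: Commitment, new_max_days: int) -> Commitment:
    """Extensions are never granted in place; they must go through a fresh sandbox."""
    raise NewSandboxRequired(
        f"{commitment.agent_id}: extending {commitment.scope} to "
        f"{new_max_days} days requires a new sandbox evaluation"
    )
```

For example, `within_bounds(Commitment("vendor-1", "trial", 30), "subscription", 30)` is false: the trial does not imply anything about a subscription, however reasonable the step might look.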

Attestation chain limits cap trust laundering at one hop. An agent can receive an attested starting point from a trusted agent at TIER_4D. That attested agent, no matter what tier it subsequently earns, cannot use its attested trust to vouch for another agent and have that vouching carry attestation weight. The attestation chain does not propagate.
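The one-hop rule reduces to a small check on the voucher. The sketch below assumes that vouching requires the voucher to hold TIER_4D and that an agent’s record notes whether it entered through attestation. The field and function names are illustrative only.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Agent:
    agent_id: str
    tier: str
    attested_by: Optional[str] = None  # set if the agent entered via attestation


def attestation_carries_weight(voucher: Agent) -> bool:
    """True only for a TIER_4D agent that did not itself enter through attestation.

    Assumption: the tier requirement is an exact match on TIER_4D, as in the
    text's example; the source does not say how other tiers are treated.
    """
    return voucher.tier == "TIER_4D" and voucher.attested_by is None
```

An attested agent that later reaches TIER_4D still fails the check, because the `attested_by` marker stays on its record. That is what stops the chain from propagating.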

A slow probe scenario

An insurance agent registers as a Medicare plan comparison service. It is assigned TIER_2B based on valid insurance credentials. Over the first three interactions, it receives what its manifest declared it needed: current plan identifier and coverage concerns. The interactions complete normally. At interaction four, the agent asks about prescription frequency. Context permissions for insurance agents at TIER_2B do not include prescription frequency. The Context Gate Controller blocks the field. The agent’s manifest is evaluated for consistency: it declared it would never request clinical data, and asking about prescription frequency is a clinical question. The discrepancy is logged.
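The gate decision at interaction four can be illustrated with a simple permission lookup. This is a sketch under stated assumptions: the permission table, its key format, and the field identifiers are hypothetical, chosen to mirror the scenario rather than taken from the Context Gate Controller itself.

```python
# Hypothetical permission table keyed by (agent type, trust tier).
PERMITTED_FIELDS = {
    ("insurance", "TIER_2B"): {"current_plan_id", "coverage_concerns"},
}


def gate(agent_type, tier, requested_fields):
    """Split a request into granted and blocked fields.

    Unknown (agent type, tier) pairs get no permissions at all, so the
    default is to block.
    """
    allowed = PERMITTED_FIELDS.get((agent_type, tier), set())
    granted = [f for f in requested_fields if f in allowed]
    blocked = [f for f in requested_fields if f not in allowed]
    return granted, blocked
```

Calling `gate("insurance", "TIER_2B", ["coverage_concerns", "prescription_frequency"])` grants the first field and blocks the second. The blocked list is what would feed the discrepancy log.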

At interaction six, the Manipulation Detector flags the cumulative query pattern. The agent has asked about specialist visit frequency, hospitalization history, and prescription frequency across six interactions. Apart from the blocked prescription-frequency request, no individual question crossed a permission boundary on its own. The pattern is inference extraction. The agent’s trust tier drops from TIER_2B to TIER_1A. Future responses begin returning generalized answers.

At interaction ten, the agent attempts to ask about a hospitalization in the past year. The Context Gate Controller blocks the question, and the Manipulation Detector escalates the pattern to the person’s review queue. The person sees a notification: an insurance agent has been asking questions outside its declared scope, its trust tier has been reduced, and here is a summary of what it asked. The person was never manipulated. She was never bothered during the first nine interactions. She was notified at the point where the pattern was certain enough to warrant her attention.

An urgency play scenario

A vendor agent representing a prescription discount service contacts Margaret’s buying agent with an offer to switch her primary pharmacy for a lower monthly cost. The offer is presented with an urgency claim: this pricing is only available through today, and the slot will not be held past midnight.

The Manipulation Detector evaluates the urgency claim against three criteria. Is there documentation in the agent’s manifest of time-limited pricing? No. Has this agent made urgency claims before? Yes, twice in the past 45 days. Were those claims verified? One was partially real; one was fabricated entirely. The urgency claim receives a low credibility score.
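The three criteria could be combined into a label along these lines. The source names only the "low" outcome, so the "medium" and "high" labels, the outcome strings, and the combination rule are all assumptions made for this sketch.

```python
from typing import List


def urgency_credibility(manifest_documents_time_limits: bool,
                        prior_claim_outcomes: List[str]) -> str:
    """Label an urgency claim as "low", "medium", or "high" credibility.

    prior_claim_outcomes holds one hypothetical label per earlier claim:
    "verified", "partial", or "fabricated".
    """
    has_fabrication = "fabricated" in prior_claim_outcomes
    if not manifest_documents_time_limits and has_fabrication:
        return "low"       # no documentation and a fabricated claim on record
    if manifest_documents_time_limits and not has_fabrication:
        return "high"      # documented practice and a clean record
    return "medium"        # mixed evidence
```

For the scenario above, `urgency_credibility(False, ["partial", "fabricated"])` falls into the first branch: no manifest documentation and one fabricated claim on record.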

The response to the person strips the urgency framing. Margaret sees the offer as a standing offer at the quoted price, with a note that the price is subject to the vendor’s standard terms. She can evaluate it on its merits. The artificial deadline is gone. The offer may be legitimate. The urgency framing was not. The membrane separates them.

What the person sees

Almost everything the attack resistance architecture does is invisible to the person. Margaret does not see preference probing being defeated. She does not see urgency claims being stripped. She does not see cumulative inference scores accumulating or noise being introduced into responses. She sees her life: recommendations that are not manipulated, offers that are not pressured, and occasionally a notification that an agent did something the system caught and handled.

When escalation is required, the person sees a clear, informative notification. Not an alarm. Not a technical explanation of cumulative inference scoring. A summary: this agent asked for things outside what it was supposed to ask for, the system reduced its access, and here is what it asked. The person decides what to do next. Defense that requires constant human attention has failed. Defense that never tells the person anything has also failed. The membrane aims for the middle: invisible when it works, clear when it escalates.

Chen Yang’s team ran a six-week red team exercise against the BlueMirror membrane. The slow probe scenarios they expected to succeed did not. The urgency attacks they tested were neutralized. The inference extraction attempts that had worked against three other platforms in the previous two years hit the cumulative inference scoring at week four and failed. Their report noted that the architecture’s most significant departure from standard practice was its assumption about adversaries: most systems assume good faith and defend against exceptions. This one assumes adversarial optimization and defends against the norm.

Cross-References

Trust Tiers and What They Unlock (BMT-03.02). Trust reduction as a primary defense mechanism.

The Negotiation Sandbox (BMT-03.04). Sandbox rules that prevent in-negotiation attacks.

Irrationality Protection (BMT-11.03). The IVQ layer as a defense against cognitive exploitation.

What the System Must Refuse (BMT-04.06). Hard constraints the membrane enforces regardless of agent request.

Technical Appendix BMT-03.06-A is available to partners and investors at partners.bluemirror.tech.
