When Things Break

Renata Volkov is a systems reliability engineer. She has spent nine years building failure-tolerant distributed systems for healthcare companies, and she knows that architecture diagrams describe what works. Reliability engineering describes what breaks. When she reviewed the BlueMirror three-zone architecture, she did not ask how it works when everything is connected. She asked what happens when the internet goes down at 2:00 AM in a seventy-four-year-old’s apartment.

The answer depends on the subscriber’s deployment path. That dependency is the architectural reality: a system with more zones has more failure surfaces, but it also has more graceful degradation paths.

Network outage: internet down

For Path A and Path B subscribers (Zone 1-Dedicated), the Local Pane device continues operating offline. It runs the Zone 1 model portfolio on local compute: the Safety Filter, Privacy Filter, Cognitive State Estimator, Emotion Detector, Orientation Assessor, Confusion Detector, Speech-to-Intent, and Voice Tone Analyzer. Safety monitoring continues. Medication reminders continue (the medication schedule is cached locally). Basic conversation through the voice interface continues, powered by the locally running models. The system cannot answer complex questions that require Zone 2 or Zone 3 reasoning. It cannot consult the Domain Expert models. It cannot perform cross-domain analysis. But it is present, it is monitoring, and it can respond to the subscriber’s immediate needs.

For Path C and Path D subscribers (Zone 1-Phone), the phone continues running the local Tiny LMs. The same core functions are available as on the dedicated device: safety monitoring, medication reminders, and basic interaction. The degradation is bounded by the phone’s battery and the phone’s own connectivity. If the subscriber’s home internet is down but cellular is available, the phone can route upstream queries over cellular. If cellular is also down, the phone operates fully offline with the same capabilities as the dedicated device, minus the always-on sensor hub, for as long as its battery lasts.

For Path E and Path F subscribers (No Zone 1), the system is unavailable for the duration of the outage. Every inference requires network connectivity to Zone 2 or Zone 3. When the network is down, the subscriber cannot interact with the platform. Safety monitoring stops. Medication reminders stop. The subscriber is offline.

This is the architectural cost of the No Zone 1 path. The architecture does not hide it. The subscriber who enrolls on Path E or F is informed during onboarding that the system requires an internet connection to operate and that network outages will interrupt service. The equity argument (BMT-09.04) holds because the product capability is the same during connected operation. The resilience posture is different.
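The per-path behavior during a full connectivity loss can be summarized in a few lines. The sketch below is illustrative only: the path-to-Zone-1 mapping follows the text above, but the function name, the capability labels, and the data layout are assumptions, not BlueMirror’s implementation.

```python
from enum import Enum
from typing import Set


class Zone1(Enum):
    DEDICATED = "dedicated"   # Local Pane device (Paths A and B)
    PHONE = "phone"           # Tiny LMs on the phone (Paths C and D)
    NONE = "none"             # No Zone 1 (Paths E and F)


# Mapping taken from the article; the dict itself is a hypothetical layout.
PATH_ZONE1 = {
    "A": Zone1.DEDICATED, "B": Zone1.DEDICATED,
    "C": Zone1.PHONE, "D": Zone1.PHONE,
    "E": Zone1.NONE, "F": Zone1.NONE,
}

# Functions the text says continue on local compute (labels are illustrative).
LOCAL_FUNCTIONS = {"safety_monitoring", "medication_reminders", "basic_conversation"}


def offline_functions(path: str) -> Set[str]:
    """Return what still works when the subscriber has no connectivity at all."""
    zone1 = PATH_ZONE1[path]
    if zone1 is Zone1.NONE:
        return set()                      # every inference needs the network
    functions = set(LOCAL_FUNCTIONS)
    if zone1 is Zone1.DEDICATED:
        functions.add("sensor_hub")       # only the dedicated device has it
    return functions
```

Calling `offline_functions("F")` returns an empty set, which is the architectural cost described above expressed as data.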

Zone 2 regional node down

For Paths A and C (Zone 1 present, Zone 2 normally available), the subscriber’s normal Zone 2 inference path is unavailable. Queries that would have routed to Zone 2 instead route to Zone 3. The subscriber experiences slower responses because Zone 3 inference includes a longer network round-trip compared to the regional Zone 2 node. Service is maintained. The concierge still answers. The answers still draw on the subscriber’s full MoC context (loaded from backup at Zone 3). The latency increase is perceptible but not disabling.

For Path E (No Zone 1, Zone 2 normally available), the same fallback to Zone 3 applies, but with an additional consequence: the privacy-critical processing that was running in Zone 2 also shifts to Zone 3. The subscriber’s cognitive, emotional, and voice data that Zone 2 was processing regionally is now processed at Zone 3 under the same DPA. The privacy posture shifts from regional to cloud for the duration of the outage.

For Paths B, D, and F (no Zone 2 coverage), a Zone 2 outage has no impact. These subscribers were already routing to Zone 3 for all inference that Zone 1 did not handle locally, or for all inference in Path F’s case.
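The fallback rule in this section reduces to two decisions: where upstream queries go, and where privacy-critical data is processed. A minimal sketch follows, assuming each path is described by two booleans; the `Route` type and the `route` function are hypothetical names introduced for illustration.

```python
from typing import NamedTuple


class Route(NamedTuple):
    upstream: str           # where queries beyond Zone 1 are sent
    privacy_critical: str   # where cognitive, emotional, and voice data is processed


def route(has_zone1: bool, has_zone2: bool, zone2_up: bool = True) -> Route:
    """Pick processing locations for a subscriber, given Zone 2 health."""
    # Zone 2 is used only if the path has coverage and the node is healthy.
    upstream = "zone2" if (has_zone2 and zone2_up) else "zone3"
    # With Zone 1 present, privacy-critical work stays local regardless.
    privacy = "zone1" if has_zone1 else upstream
    return Route(upstream, privacy)


# Path E during a Zone 2 outage: privacy posture shifts from regional to cloud.
print(route(has_zone1=False, has_zone2=True, zone2_up=False))
```

For paths without Zone 2 coverage, the `zone2_up` flag has no effect on the result, which matches the "no impact" statement for Paths B, D, and F.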

Zone 3 outage

Zone 3 outages are rare but consequential. The cloud reasoning layer, operated by a commercial API provider under an SLA, has higher redundancy than any single Zone 2 node. But it can go down.

For Paths A and C (Zone 1 present, Zone 2 present), Zone 1 continues operating. Zone 2 continues operating. Queries that needed deep reasoning from Zone 3, the 10 to 15 percent of queries that require cross-domain analysis or exceed Zone 2 capacity, are queued. The subscriber experiences degraded capability: “I can help with your medications and check in on how you are feeling, but I need to wait for the connection to come back before I can analyze your insurance options.” The degradation is specific and disclosed.

For Paths B and D (Zone 1 present, no Zone 2), Zone 1 continues operating for privacy-critical functions. Everything else is queued or degraded until Zone 3 recovers.

For Paths E and F (No Zone 1), the system is unavailable until Zone 3 recovers. This is the same total-outage condition as a network outage for these paths, because all inference depends on upstream connectivity.

The Zone 3 provider’s SLA specifies recovery time objectives. The DPA includes a 99.5 percent monthly uptime guarantee, which translates to a maximum of approximately 3.6 hours of downtime in a 30-day month. Actual historical uptime for major API providers exceeds 99.95 percent. BlueMirror maintains contracts with a secondary API provider for failover routing: if the primary provider is unreachable or breaches latency thresholds, the orchestration layer routes new queries to the secondary provider within minutes. The failover is transparent to the subscriber. She does not know which provider is processing her query, and the response quality is equivalent because BlueMirror’s prompt architecture is provider-portable.
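The downtime figure and the failover rule can both be checked with a short sketch. The arithmetic assumes a 30-day month (720 hours). The `choose_provider` function and its 2000 ms default threshold are hypothetical; the article does not state the actual latency thresholds.

```python
def max_downtime_hours(uptime_percent: float, days_in_month: int = 30) -> float:
    """Maximum downtime per month allowed by an uptime guarantee."""
    return (100.0 - uptime_percent) / 100.0 * days_in_month * 24


def choose_provider(primary_reachable: bool,
                    primary_latency_ms: float,
                    latency_threshold_ms: float = 2000.0) -> str:
    """Route new queries to the secondary provider when the primary is
    unreachable or too slow. The threshold value is an assumed placeholder."""
    if not primary_reachable or primary_latency_ms > latency_threshold_ms:
        return "secondary"
    return "primary"


print(f"{max_downtime_hours(99.5):.2f} hours")          # 3.60 hours
print(f"{max_downtime_hours(99.95) * 60:.1f} minutes")  # 21.6 minutes
```

At 99.95 percent uptime, the same calculation gives roughly 22 minutes per month, an order of magnitude below the contractual ceiling.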

The practical frequency of a full Zone 3 outage affecting subscribers is low. Partial degradation (increased latency, reduced throughput) is more common than total outage. The system handles partial degradation by queuing non-urgent queries and prioritizing safety-critical and time-sensitive requests. A subscriber asking about her afternoon medication gets an immediate response; a subscriber asking for a detailed analysis of her supplemental insurance options gets a brief acknowledgment and a completed response when throughput normalizes.
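One way to picture this prioritization is a priority queue that drains safety-critical and time-sensitive requests first while throughput is limited. The sketch below uses Python’s standard `heapq`; the three priority classes and the `capacity` parameter are assumptions made for illustration, not the platform’s actual scheduling policy.

```python
import heapq
from itertools import count
from typing import List

# Lower number drains first. The class names are illustrative.
PRIORITY = {"safety": 0, "time_sensitive": 1, "non_urgent": 2}


class DegradedQueue:
    """Holds upstream queries while Zone 3 throughput is reduced."""

    def __init__(self) -> None:
        self._heap: list = []
        self._seq = count()  # tie-breaker keeps arrival order within a class

    def submit(self, kind: str, query: str) -> None:
        heapq.heappush(self._heap, (PRIORITY[kind], next(self._seq), query))

    def drain(self, capacity: int) -> List[str]:
        """Release up to `capacity` queries, most urgent first."""
        released = []
        while self._heap and len(released) < capacity:
            released.append(heapq.heappop(self._heap)[2])
        return released

    def __len__(self) -> int:
        return len(self._heap)


queue = DegradedQueue()
queue.submit("non_urgent", "analyze supplemental insurance options")
queue.submit("time_sensitive", "afternoon medication question")
queue.submit("safety", "possible fall detected")
first_batch = queue.drain(capacity=2)
```

With a capacity of two, the safety event and the medication question are released and the insurance analysis waits until throughput normalizes.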

Local Pane device failure

Hardware failure of the Zone 1-Dedicated device (the Local Pane) affects Path A and Path B subscribers. The device stops operating. Zone 1 functions (local safety monitoring, local privacy processing, sensor hub, offline resilience) are lost until the device is repaired or replaced.

Fallback options are immediate and do not require waiting for a replacement device. If the subscriber has a smartphone capable of hosting the Tiny LMs, she can install the BlueMirror app and convert temporarily from Path A to Path C (or from Path B to Path D). The app downloads the model portfolio, and local privacy processing resumes on the phone. She loses the sensor hub and the always-on monitoring but regains Zone 1 inference. If the subscriber does not have a qualifying phone, she operates on Path E (if Zone 2 is available) or Path F (if it is not) until the replacement device arrives.
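The conversion rule above is a small decision table. A minimal sketch, assuming the only inputs are the subscriber’s current path and whether she has a qualifying phone; `fallback_path` is a hypothetical helper, not part of any published BlueMirror interface.

```python
def fallback_path(current: str, has_qualifying_phone: bool) -> str:
    """Temporary path for a subscriber whose Local Pane device has failed."""
    if current not in ("A", "B"):
        raise ValueError("only Paths A and B use the dedicated device")
    # Path A has Zone 2 coverage; Path B does not.
    zone2_available = current == "A"
    if has_qualifying_phone:
        # Zone 1 inference resumes on the phone, without the sensor hub.
        return "C" if zone2_available else "D"
    # No local inference possible until the replacement device arrives.
    return "E" if zone2_available else "F"


print(fallback_path("A", has_qualifying_phone=True))   # C
print(fallback_path("B", has_qualifying_phone=False))  # F
```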

The replacement device ships from inventory. Target replacement time: 3 to 5 business days for PACE-enrolled subscribers (the PACE program maintains spare inventory), 5 to 7 business days for other channels. The MoC follows the subscriber throughout the transition. No interaction history, preference data, or personalization is lost because the MoC is authoritative at Zone 2 (or Zone 3 for paths without Zone 2). The Local Pane device caches MoC Layers 0 and 1 locally, but the authoritative copy is upstream.

Phone failure or phone replacement

Subscribers on Paths C, D, E, and F who lose, break, or replace their phone experience a temporary service interruption limited to the time it takes to install the app on the new or repaired device.

For Path C and D subscribers, the replacement phone may or may not qualify for Zone 1-Phone. If the new phone qualifies, the app downloads the Tiny LM portfolio and the subscriber resumes on her original path. If the new phone does not qualify (a lower-spec replacement, for example), the subscriber temporarily migrates to Path E or F until she acquires a qualifying device. The migration is automatic and logged.

For Path E and F subscribers, the app is reinstalled on the replacement phone without the Tiny LM download. Service resumes immediately after installation.

The MoC follows the subscriber across devices. Phone replacement does not affect the subscriber’s interaction history, preferences, or personalization. The app authenticates the subscriber on the new device, confirms her identity through the same enrollment verification method used at initial onboarding, and reconnects to her MoC at Zone 2 or Zone 3. The transition is designed to feel seamless: the subscriber picks up her new phone, installs the app, and the concierge greets her by name with full knowledge of her history.

Disclosure during degradation

Whatever the failure mode, the system tells the subscriber what is degraded and what still works. The disclosure is in plain language, calibrated to her situation.

A Path A subscriber during a network outage hears: “Your internet connection is down. I can still help with safety monitoring, medication reminders, and basic questions. I will handle complex questions when the connection is back.”

A Path F subscriber during the same outage hears nothing, because the system is offline. When connectivity returns, the system acknowledges the gap: “I was unavailable for the past three hours because your internet was down. I am checking in now. How are you feeling?”

A Path A subscriber during a Zone 2 outage hears: “I am having a slower day because my regional connection is down. I can still help with everything, but responses may take a little longer than usual.”

The disclosure is path-aware. A Path A subscriber and a Path F subscriber experience different degradation during the same incident because different components are affected. The system does not deliver a generic “we are experiencing issues” message. It describes the specific impact on the specific subscriber. The disclosure draws from the subscriber’s deployment path configuration and the current system status to produce an accurate, plain-language description of what works and what does not.
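As a rough illustration of path-aware disclosure, the sketch below builds the network-outage message from a list of functions that still work, and returns nothing for subscribers without Zone 1 because the system cannot speak while offline. The function names and the list of capabilities are assumptions; the wording follows the sample messages quoted in this section.

```python
from typing import List, Optional


def join_items(items: List[str]) -> str:
    """Join items in plain English: 'a', 'a and b', or 'a, b, and c'."""
    if not items:
        return ""
    if len(items) == 1:
        return items[0]
    if len(items) == 2:
        return f"{items[0]} and {items[1]}"
    return ", ".join(items[:-1]) + f", and {items[-1]}"


def network_outage_disclosure(has_zone1: bool) -> Optional[str]:
    """Message spoken during a network outage, or None if the system is offline."""
    if not has_zone1:
        return None  # Paths E and F: nothing can be said until connectivity returns
    still_works = ["safety monitoring", "medication reminders", "basic questions"]
    return (
        "Your internet connection is down. "
        f"I can still help with {join_items(still_works)}. "
        "I will handle complex questions when the connection is back."
    )


def recovery_message(hours_offline: int) -> str:
    """Acknowledgment delivered once connectivity returns."""
    unit = "hour" if hours_offline == 1 else "hours"
    return (
        f"I was unavailable for the past {hours_offline} {unit} because your "
        "internet was down. I am checking in now. How are you feeling?"
    )
```

Keeping the list of working functions as data, rather than hard-coding whole sentences, is one way to keep the message accurate as the set of degraded components changes.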

The system also notifies designated emergency contacts when degradation exceeds a severity threshold. If a Path A subscriber loses Zone 1 and Zone 2 simultaneously (device failure plus regional node failure), the system escalates to her emergency contacts: “The monitoring system for [subscriber name] is currently operating in reduced capacity. Safety monitoring through the home device is temporarily unavailable.” The escalation is automatic, configurable, and logged (BMT-04.04).
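The escalation rule can be sketched as a threshold check over lost zones. The article gives only one example of crossing the threshold (Zone 1 and Zone 2 lost together), so the scoring below, the default threshold of 2, and the `Escalator` class are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Escalator:
    # Hypothetical severity threshold: number of lost zones that triggers
    # escalation. Configurable, as the article says the escalation is.
    threshold: int = 2
    log: List[str] = field(default_factory=list)

    def check(self, subscriber_name: str, contacts: List[str],
              zone1_lost: bool, zone2_lost: bool) -> Optional[str]:
        """Return the message sent to emergency contacts, or None if below threshold."""
        severity = int(zone1_lost) + int(zone2_lost)
        if severity < self.threshold:
            return None
        message = (
            f"The monitoring system for {subscriber_name} is currently operating "
            "in reduced capacity. Safety monitoring through the home device is "
            "temporarily unavailable."
        )
        for contact in contacts:
            # Every notification is recorded, mirroring the "logged" requirement.
            self.log.append(f"notified {contact} about {subscriber_name}")
        return message


escalator = Escalator()
single_failure = escalator.check("Example Subscriber", ["contact-1"], True, False)
double_failure = escalator.check("Example Subscriber", ["contact-1"], True, True)
```

Here `single_failure` is `None`, while `double_failure` carries the message and leaves one entry in the log.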

Cross-References

BMT-09.01 The Three-Zone Architecture. The deployment paths whose failure modes this article describes.

BMT-06.03 Edge Intelligence. The zone characteristics that determine what runs where and therefore what is lost when a zone fails.

BMT-04.06 What the System Must Refuse. The refusal architecture that includes refusing to operate beyond degradation thresholds where the system cannot guarantee safety.

Technical Appendix BMT-09.05-A is available to partners and investors at partners.bluemirror.tech.
