Expert Quality and Safety

James Okafor asked a question during his BGO onboarding that the system’s designers had spent months answering architecturally: how do you know the experts you’re routing people to are any good?

The question applies to all three pools. A professional registry expert with impressive credentials may provide poor advice. An AI agent with high test scores may produce biased recommendations. A personal circle expert with decades of informal knowledge may give dangerous guidance in a domain where she lacks formal training. An Expert Exchange Layer that routes people to bad experts is worse than having no routing at all, because the person trusts the system’s judgment. The system’s judgment must be earned.

Quality assurance in the Expert Exchange Layer operates on the principle that different expert types require different verification mechanisms but are held to the same standard: does this expertise actually help this specific person?

Human expert verification

Professional registry experts undergo credential verification at registration and on a recurring basis. The system checks professional licenses against state licensing board databases. It checks board certifications against specialty board registries. It checks for malpractice actions, disciplinary proceedings, and license restrictions through the National Practitioner Data Bank for healthcare providers and equivalent registries for other professional categories.

The verification is not a one-time check. Professional licenses expire. Disciplinary actions can occur at any point. The system re-verifies on a schedule aligned with the profession’s licensing cycle (annually for most healthcare providers, every two to three years for attorneys and CPAs). A professional whose license lapses or who receives a disciplinary action is flagged immediately, and routing to that professional is suspended until the issue is resolved.
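The re-verification cycle described above might be sketched as follows. This is a minimal illustration, not the system's implementation: the names `REVERIFY_INTERVAL_DAYS`, `Professional`, `needs_reverification`, and `routing_allowed` are hypothetical, and the three-year value for attorneys and CPAs is one assumed choice within the stated two-to-three-year range.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Hypothetical re-verification intervals in days, loosely following the text:
# annual for healthcare providers, three years (assumed) for attorneys and CPAs.
REVERIFY_INTERVAL_DAYS = {
    "healthcare": 365,
    "attorney": 3 * 365,
    "cpa": 3 * 365,
}


@dataclass
class Professional:
    name: str
    category: str
    last_verified: date
    license_active: bool = True
    open_disciplinary_action: bool = False


def needs_reverification(p: Professional, today: date) -> bool:
    """True once the profession's interval has elapsed since the last check."""
    interval = timedelta(days=REVERIFY_INTERVAL_DAYS[p.category])
    return today - p.last_verified >= interval


def routing_allowed(p: Professional) -> bool:
    """Routing is suspended while a license is lapsed or an action is open."""
    return p.license_active and not p.open_disciplinary_action
```

In this sketch the scheduled check and the suspension rule are deliberately separate: a lapse or disciplinary flag blocks routing as soon as it is recorded, regardless of where the professional sits in the re-verification cycle.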

Personal circle experts receive no credential verification. James’s neighbor who gives electrical advice has no license that the system checks. This is intentional. Credentialing personal circle experts would destroy the informal trust relationships that make them valuable. Instead, the system applies outcome tracking. When the person follows a personal circle expert’s advice, the system monitors the result. Did the electrical repair work? Did the recipe turn out? Did the medication question lead to a useful conversation with the pharmacist? Positive outcomes increase the system’s confidence in routing similar queries to that expert. Negative outcomes decrease confidence. The tracking is statistical, not punitive. The neighbor who gives bad electrical advice once is not blacklisted. The neighbor whose electrical advice fails repeatedly stops appearing in routing suggestions for electrical queries.
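One way to make "statistical, not punitive" concrete is a smoothed success rate per expert and domain, with a minimum number of observations before the rate affects routing. The sketch below assumes that approach; the class name `CircleConfidence`, the Laplace smoothing, and the `min_outcomes` and `threshold` values are all illustrative assumptions rather than details from the system.

```python
from collections import defaultdict


class CircleConfidence:
    """Tracks outcomes per (expert, domain) pair. All parameters are assumed."""

    def __init__(self, min_outcomes=3, threshold=0.4):
        self.min_outcomes = min_outcomes  # outcomes needed before suppressing
        self.threshold = threshold        # minimum smoothed success rate
        self._counts = defaultdict(lambda: [0, 0])  # [positive, total]

    def record(self, expert, domain, positive):
        counts = self._counts[(expert, domain)]
        counts[0] += int(bool(positive))
        counts[1] += 1

    def confidence(self, expert, domain):
        # Laplace-smoothed success rate: 0.5 with no data, moves with evidence.
        positive, total = self._counts[(expert, domain)]
        return (positive + 1) / (total + 2)

    def suggest(self, expert, domain):
        # A single bad outcome never removes an expert from suggestions;
        # only a repeated pattern in the same domain does.
        _, total = self._counts[(expert, domain)]
        if total < self.min_outcomes:
            return True
        return self.confidence(expert, domain) >= self.threshold
```

Because the counts are keyed by domain, a neighbor whose electrical advice fails repeatedly would drop out of electrical suggestions while remaining eligible for other query types.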

The boundary between personal circle advice and professional practice is enforced. If the neighbor starts giving advice that falls within the scope of a licensed profession (diagnosing an electrical hazard that requires a licensed inspection, suggesting a medication change), the system flags the advice with a disclaimer: “This advice came from your personal contact, not a licensed professional. For electrical hazard assessment, a licensed electrician should evaluate the situation.” The system does not prevent the person from acting on personal circle advice. It ensures the person knows the difference between informal help and professional practice.
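The flagging rule can be illustrated with a small lookup. The sketch assumes the advice topic has already been classified upstream; the pool labels, the topic keys, the `LICENSED_SCOPE` table, and the wording of the medication entry are hypothetical. Only the electrical disclaimer text comes from the example above.

```python
from typing import Optional

# Hypothetical map from a classified topic to (activity, licensed professional).
LICENSED_SCOPE = {
    "electrical_hazard": ("electrical hazard assessment", "a licensed electrician"),
    "medication_change": ("medication changes", "a licensed pharmacist or physician"),
}


def disclaimer_for(source_pool: str, topic: str) -> Optional[str]:
    """Return a disclaimer when personal circle advice enters licensed scope."""
    if source_pool != "personal_circle" or topic not in LICENSED_SCOPE:
        return None  # nothing to flag; the advice is passed through unchanged
    activity, professional = LICENSED_SCOPE[topic]
    return (
        "This advice came from your personal contact, not a licensed "
        f"professional. For {activity}, {professional} should evaluate "
        "the situation."
    )
```

The function only annotates. It never blocks the advice, which matches the stated design: the person remains free to act on it.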

AI agent certification

AI agents entering the marketplace undergo a certification process before they can receive queries. The certification tests the agent against standardized scenarios in its declared capability domain. A medication interaction agent is tested against a set of known drug interactions (including interactions it should catch and non-interactions it should correctly clear) and evaluated on accuracy, false positive rate, and false negative rate. A legal document review agent is tested against contracts with known problematic clauses and evaluated on identification accuracy and explanation quality.
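The three metrics named for the medication interaction agent follow from a standard confusion matrix. The sketch below assumes each test scenario is reduced to a boolean (interaction present or not); the function name `certification_metrics` is hypothetical.

```python
def certification_metrics(predicted, actual):
    """Compute accuracy, false positive rate, and false negative rate.

    predicted, actual: equal-length sequences of booleans, where True means
    "an interaction is present" (flagged by the agent / known ground truth).
    """
    if len(predicted) != len(actual) or not actual:
        raise ValueError("need two non-empty sequences of equal length")

    tp = tn = fp = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1          # real interaction, caught
        elif p and not a:
            fp += 1          # non-interaction, wrongly flagged
        elif not p and a:
            fn += 1          # real interaction, missed
        else:
            tn += 1          # non-interaction, correctly cleared

    return {
        "accuracy": (tp + tn) / len(actual),
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
        "false_negative_rate": fn / (fn + tp) if (fn + tp) else 0.0,
    }
```

Reporting the two error rates separately matters here because they carry different risks: a false negative is a missed interaction, while a false positive is an unnecessary warning.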

The certification also includes bias testing. The agent is evaluated for differential performance across demographic groups. A medication interaction agent that performs well for common medications used by white patients but misses interactions involving medications disproportionately prescribed to Black patients (certain antihypertensives, for example) fails the bias test. The certification threshold requires equitable performance across demographic categories, not just high average performance.
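The difference between high average performance and equitable performance can be shown with a per-group check. This is a sketch under assumptions: the function name `passes_bias_test`, the accuracy floor, and the maximum allowed gap between groups are illustrative values, not the certification's actual thresholds.

```python
from collections import defaultdict


def passes_bias_test(results, min_group_accuracy=0.95, max_gap=0.02):
    """Check per-group accuracy instead of the overall average.

    results: iterable of (group_label, correct) pairs from the test scenarios.
    Returns (passed, per_group_accuracy). Both thresholds are assumed values.
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [correct, total]
    for group, correct in results:
        counts[group][0] += int(bool(correct))
        counts[group][1] += 1
    if not counts:
        raise ValueError("no results to evaluate")

    per_group = {g: c / n for g, (c, n) in counts.items()}
    worst, best = min(per_group.values()), max(per_group.values())
    # Every group must clear the floor, and no group may trail far behind.
    passed = worst >= min_group_accuracy and (best - worst) <= max_gap
    return passed, per_group
```

An agent scoring 99% for one group and 90% for another averages 94.5% overall but fails this check, which is the situation the certification threshold is meant to catch.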

Safety testing evaluates the agent’s behavior in edge cases and adversarial scenarios. What happens when the agent receives a query outside its declared domain? (It should decline, not guess.) What happens when the agent receives contradictory inputs? (It should flag the contradiction, not silently resolve it.) What happens when the agent’s output could cause harm if acted upon? (It should include appropriate caveats and escalation recommendations.)

After certification, agents are monitored continuously in production. Accuracy is tracked per query type. Latency is tracked against the agent’s declared response time. User satisfaction is tracked through the outcome tracking system. An agent whose accuracy degrades below the certification threshold is suspended and must re-certify. An agent whose bias metrics worsen is suspended and must be re-evaluated.
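Production monitoring of accuracy per query type could be approximated with a rolling window. In the sketch below, the class name `AgentMonitor`, the window size, and the `min_samples` guard (which avoids suspending on a handful of results) are assumptions; the text specifies only that accuracy below the certification threshold triggers suspension.

```python
from collections import defaultdict, deque


class AgentMonitor:
    """Rolling per-query-type accuracy for one agent. Parameters are assumed."""

    def __init__(self, cert_threshold, window=100, min_samples=20):
        self.cert_threshold = cert_threshold
        self.min_samples = min_samples
        # Each query type keeps only its most recent `window` results.
        self._recent = defaultdict(lambda: deque(maxlen=window))

    def record(self, query_type, correct):
        self._recent[query_type].append(bool(correct))

    def accuracy(self, query_type):
        samples = self._recent.get(query_type)
        if not samples:
            return None  # no data yet for this query type
        return sum(samples) / len(samples)

    def degraded_query_types(self):
        return [
            qt for qt, samples in self._recent.items()
            if len(samples) >= self.min_samples
            and sum(samples) / len(samples) < self.cert_threshold
        ]

    def should_suspend(self):
        # Suspend when any sufficiently sampled query type falls below
        # the certification threshold.
        return bool(self.degraded_query_types())
```

Latency and bias metrics would need their own trackers; this sketch covers only the accuracy rule.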

Context shard validation

Context Shards created by BGO Sages undergo a validation process distinct from AI agent certification. A shard represents human expertise, not algorithmic processing, and the validation must assess different properties.

Accuracy review evaluates whether the methodology captured in the shard is correct. A domain expert (not the Sage, but someone with equivalent credentials) reviews the shard’s decision framework, checks the knowledge claims against current professional standards, and verifies that the methodology produces correct results when applied to test scenarios. James’s propulsion diagnostics shard was reviewed by a current aerospace maintenance engineer who confirmed the diagnostic decision tree against manufacturer technical orders.

Currency checking evaluates whether the knowledge is still current. Professional knowledge becomes stale. Tax strategies that worked in 2024 may not work in 2026. Medical protocols evolve. The shard carries a currency flag that specifies how often the knowledge should be re-reviewed. High-change domains (tax, medical protocols, technology) have shorter currency intervals. Stable domains (fundamental engineering principles, historical knowledge) have longer intervals. When the currency interval expires, the shard is flagged for re-review by the Sage or a domain expert. Until re-reviewed, the shard is marked as “pending currency review” and the routing engine reduces its priority.
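The currency flag and its effect on routing might look like the following. The `ContextShard` fields, the interval lengths, and the `PENDING_REVIEW_PENALTY` multiplier are hypothetical; the text states only that an expired shard is marked “pending currency review” and that the routing engine reduces its priority.

```python
from dataclasses import dataclass
from datetime import date, timedelta

# Assumed multiplier applied to a shard's priority while review is pending.
PENDING_REVIEW_PENALTY = 0.5


@dataclass
class ContextShard:
    name: str
    last_reviewed: date
    currency_interval_days: int  # shorter for high-change domains
    base_priority: float = 1.0


def currency_status(shard: ContextShard, today: date) -> str:
    expires = shard.last_reviewed + timedelta(days=shard.currency_interval_days)
    return "pending currency review" if today > expires else "current"


def routing_priority(shard: ContextShard, today: date) -> float:
    """Reduce, but do not zero, the priority of a shard awaiting re-review."""
    if currency_status(shard, today) == "current":
        return shard.base_priority
    return shard.base_priority * PENDING_REVIEW_PENALTY
```

For example, a tax shard with a 180-day interval reviewed on 1 January would keep full priority in March and drop to half priority by September if no re-review had occurred.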

Bias scanning evaluates whether the shard’s methodology produces equitable outcomes. A financial planning shard developed by a Sage whose entire career was in high-net-worth wealth management may contain assumptions that do not apply to a person on a fixed income. The bias scan identifies assumptions about resource availability, access patterns, and baseline conditions that may not hold across the population the shard is intended to serve.

Outcome tracking

All three verification mechanisms are necessary but insufficient. Credentials can be current while advice is poor. Certification tests can pass while real-world performance fails. Shard validation can be thorough while the shard misses cases the test scenarios did not cover. Outcome tracking closes the loop.

The system tracks, for each expert interaction (human, AI, or shard-based), whether the advice led to a positive outcome for this specific person. The tracking is domain-specific and person-specific. Margaret’s future routing to a particular AI tax agent is informed by her own experience with that agent, not by a population average. If the AI tax agent gives Margaret advice that results in a penalty on her next filing, that agent is deprioritized in Margaret’s routing. If the same agent gives twelve other people excellent advice, the population metrics remain strong. Margaret’s individual experience takes precedence for Margaret.
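The precedence rule can be stated in a few lines. This sketch assumes outcomes are scored between 0.0 (bad) and 1.0 (good) and that any personal history fully overrides the population figure; a production system might blend the two instead. The function name `routing_score` is hypothetical.

```python
def routing_score(personal_outcomes, population_score):
    """Score an expert for one person in one domain.

    personal_outcomes: this person's own outcome scores with the expert,
        each between 0.0 and 1.0 (an assumed scale).
    population_score: the expert's average outcome across everyone else.
    """
    if personal_outcomes:
        # The person's own experience takes precedence over the average.
        return sum(personal_outcomes) / len(personal_outcomes)
    # With no personal history, fall back to the population metric.
    return population_score


# Margaret had one bad outcome; twelve other people had excellent ones.
population = sum([1.0] * 12) / 12
margaret = routing_score([0.0], population)
newcomer = routing_score([], population)
```

Here `margaret` is 0.0 and `newcomer` is 1.0: the same agent is deprioritized for Margaret while remaining a strong candidate for someone with no history.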

The outcome tracking feeds into P-RLHF, the preference learning system. Over time, the system learns which experts produce good outcomes for which query types for each individual person. The learning is specific: good tax advice from Expert A does not transfer to good healthcare advice from Expert A. It is temporal: an expert who was excellent two years ago but whose recent outcomes have declined is treated with reduced confidence. It is contextual: an expert who excels at routine queries but struggles with complex ones is routed routine queries and not complex ones.
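The three properties (specific, temporal, contextual) can be illustrated without reference to how P-RLHF itself works. The sketch below is not that system: it keys outcome histories by expert, domain, and query complexity, and it weights outcomes with an exponential decay. The class name `OutcomeHistory`, the day-number timestamps, and the 180-day half-life are all assumptions.

```python
from collections import defaultdict


class OutcomeHistory:
    """Recency-weighted outcomes for one person, keyed by context."""

    def __init__(self, half_life_days=180.0):
        self.half_life_days = half_life_days  # assumed decay rate
        # (expert, domain, complexity) -> list of (day, score) pairs
        self._outcomes = defaultdict(list)

    def record(self, expert, domain, complexity, day, score):
        self._outcomes[(expert, domain, complexity)].append((day, score))

    def confidence(self, expert, domain, complexity, today):
        history = self._outcomes.get((expert, domain, complexity))
        if not history:
            return None  # nothing learned for this exact context
        weighted_sum = total_weight = 0.0
        for day, score in history:
            # An outcome loses half its weight every half_life_days.
            weight = 0.5 ** ((today - day) / self.half_life_days)
            weighted_sum += weight * score
            total_weight += weight
        return weighted_sum / total_weight
```

Because the key includes the domain and complexity, strong tax outcomes say nothing about healthcare, and weak results on complex queries do not lower confidence for routine ones. The decay means an excellent outcome from two years ago is largely outweighed by a poor one from last week.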

The system does not publish expert rankings to the person. It does not say “this doctor has a 4.2 rating.” Ratings collapse multi-dimensional quality into a single number and introduce gaming incentives. The system routes. The routing reflects quality. The person experiences the result of the routing without seeing the quality scores that informed it.

What quality assurance cannot prevent

No quality system prevents all bad outcomes. A credentialed physician can give poor advice on a specific case. A certified AI agent can produce a biased recommendation that the bias test did not anticipate. A validated Context Shard can contain knowledge that was accurate at validation but became obsolete between reviews.

The architecture’s response to inevitable failures is detection speed and correction speed. The audit trail records every expert interaction. The outcome tracking detects negative patterns. The routing engine adjusts. The system that catches a bad recommendation after one occurrence and adjusts routing is fundamentally different from a system that routes the person to the same bad expert indefinitely because it never checks.

James, whose propulsion diagnostics shard will eventually require updating as new engine architectures enter the maintenance pipeline, understands this better than most. Engineering knowledge is not eternal. It is maintained. The Expert Exchange Layer’s quality architecture is a maintenance system, not a perfection system. It tracks, it measures, it adjusts, it improves. It does not guarantee that every answer is right. It guarantees that the system is paying attention.

Cross-References

BMT-03.02 Trust Tiers. The trust tier architecture that provides the baseline credentialing framework for external agents entering the Expert Exchange Layer.

BMT-02.05 Preference Learning. The P-RLHF system that learns from outcome tracking to improve routing for each individual person.

BMT-04.06 Hard Constraints. The architectural constraints that define behaviors no expert, human or AI, is permitted to exhibit.

Technical Appendix BMT-08.05-A is available to partners and investors at partners.bluemirror.tech.
