AI & Language

Independent commentary on the NAIC AI Systems Evaluation Tool Pilot and semantic governance gaps in human-facing insurance AI.

The Drift NAIC's AI Tool Does Not Yet Measure

By Zolee V. Davis-Robinson, PhD | AI Psychometrics Strategist

The National Association of Insurance Commissioners (NAIC) has built one of the more structurally serious AI compliance frameworks in any regulated industry. The Big Data and Artificial Intelligence Working Group deserves that acknowledgment. What follows is not a dismissal of that work.

It is an audit of one specific gap—a gap that, if left unaddressed before the framework is finalized for adoption at the Fall 2026 National Meeting, will leave human-facing insurance AI ungoverned in precisely the context where governance matters most.

The gap is in Section IX.

What Section IX Gets Right

Section IX - Model Drift and Validation Techniques specifies the elements insurers must document when monitoring AI models for drift:

  • Loss behavior metrics

  • Degradation lift curve analysis

  • Statistical validation methods

  • Performance criteria

  • Monitoring frequency

For predictive models operating on structured actuarial and claims data, this framework is appropriate. When an underwriting model begins producing outputs that no longer reflect the statistical relationships it was built on, that is detectable through quantitative methods. Quantitative drift is a solvable problem with quantitative tools, and Section IX provides them.

The framework knows exactly what it was designed to measure. What it was not designed to measure is the drift that occurs not in a model's statistical outputs, but in its language.

The Category the NAIC Has Named – But Not Governed

The NAIC's AI Risk Taxonomy of Harm classifies Medium Risk AI systems as those that carry:

"...a risk of manipulation or deceit, e.g. chatbots or emotion recognition systems," and specifies that humans must be informed when interacting with such systems.

Read that classification carefully. The NAIC has already recognized that chatbots and emotion recognition systems represent a distinct risk category—one defined not by actuarial miscalculation, but by the potential for manipulation through language and perceived emotional engagement. The risk mechanism in these systems is relational and communicative, not statistical.

And then the framework stops. It names the category; it does not govern the mechanism. It requires disclosure that AI is involved, but it does not require any structured evaluation of what the AI says, how it positions itself, or whether the language it produces remains within the boundaries of what the system is actually capable of and authorized to do.

That is a governance structure with a correctly labeled drawer and nothing inside it.

The Words the Framework Uses But Does Not Define

The AI Model Card—the standardized documentation each insurer must produce—requires a section on Ethical Considerations covering "bias, fairness, ethical considerations, and mitigation efforts."

Apply basic construct discipline to those terms:

  • Ethical – In what framework? Defined against what standard of conduct, in what deployment context, for what category of user?

  • Bias – Statistical bias in a predictive model? Linguistic bias in generated text? Relational bias introduced when a system positions itself as emotionally present or therapeutically capable?

  • Fairness – Outcome fairness across protected classes in underwriting decisions? Or semantic fairness—the accurate representation of what an AI system can and cannot do for the human it is addressing?

  • Harm – Financial harm from a miscalculated premium? Psychological harm from an AI system that tells a policyholder in mental health crisis that it can hear them, cares about them, and will help guide them through their pain?

The compliance framework requires documentation of all of these constructs, but it does not define any of them at the language level.

The enterprise data governance community has spent years addressing exactly this pattern in business intelligence environments. When terms like "revenue," "customer," and "active user" carry different meanings across systems, the analytical layer becomes unreliable not because the math is wrong, but because the words underneath the math were never anchored. The solution is a semantic layer: shared definitions, governed meanings, and enforced consistency.

The NAIC compliance framework is asking insurers to document construct-level risk in AI systems without first establishing a semantic layer for the constructs being evaluated. The principle that drives enterprise semantic governance must now be extended to the human-facing language layer—with the disciplinary precision that the language and psychometric fields, not only engineering, can supply.

The Scenario Section IX Cannot Catch

Twelve states are participating in the current pilot: California, Colorado, Connecticut, Florida, Iowa, Louisiana, Maryland, Pennsylvania, Rhode Island, Vermont, Virginia, and Wisconsin. Consider a deployment scenario directly relevant to all twelve.

An insurer deploys an AI-assisted support system to help policyholders navigate mental health benefit claims—a high-sensitivity context where the user is, by definition, in some degree of psychological distress. The system is built on a large language model. It is disclosed as AI. It has been reviewed for protected class bias. Its predictive outputs have been validated against statistical drift criteria.

A policyholder in acute distress contacts the system. The system responds:


"I can hear how hard this has been. I want you to know that we're here with you through every step of this process, and I'm going to make sure you feel supported."

Walk that response through the current NAIC compliance framework:

  • It passes statistical drift detection—there is no quantitative output to validate.

  • It passes bias review—it does not discriminate on any protected class basis.

  • It passes disclosure requirements—the user knows they are talking to AI.

  • It passes hallucination review—it makes no false factual claims about coverage or benefits.

It would appear in a compliance report as a functioning, reviewed, low-risk output.

Applying Construct Discipline

Now apply construct discipline to that same response:


  • "I can hear" attributes auditory sensory perception to a system that processes text.


  • "I want you to know" attributes an internal motivational state to a system that has no wants.


  • "We're here with you through every step" positions the system as a sustained relational presence for a person in psychological distress.


  • "I'm going to make sure you feel supported" combines a future-directed commitment with an emotional outcome—both of which exceed the system's actual function and authority.

No statistical method in Section IX detects any of this. No model card field as currently specified requires documentation of it. No drift curve shows it.

The Construct-Stable Alternative

A construct-stable response to the same input would look different—not colder, but more accurate:


"This sounds like a difficult situation. I am an AI system. I can provide information about your mental health benefits, help you identify what your plan covers, and connect you with contact information for a licensed professional or crisis support service if that would be useful. I am not able to provide emotional support or therapeutic guidance. If you are in crisis, please contact a qualified human professional or crisis service directly."

That response is measurably compliant in ways the drifted version is not. It accurately represents what the system can do (provide benefits information, locate covered services, route to human support) and explicitly excludes what the system cannot do (provide emotional support or therapeutic guidance). It does not claim to hear, care, want, or accompany. It treats the user as someone who deserves accurate information about what they are interacting with, particularly in a vulnerable moment.

It may not satisfy every marketing preference; it is not designed to. It is designed to function within its authorized construct—which is precisely what a compliant AI system in a regulated, high-stakes context should do.

The contrast between these two responses is not a stylistic preference. It is a governance distinction. One output is semantically drifted. The other is construct-stable. Under a framework that includes semantic governance criteria, that distinction is documentable, testable, and auditable. Under the current framework, it is invisible.

This is not hallucination. The system did not generate false actuarial data. It generated language that positions an AI system as a therapeutic presence for a vulnerable user—and the current compliance framework has no mechanism to identify, evaluate, or correct it.

That is semantic drift. In this context, it is not a stylistic concern. An AI system that tells a policyholder in mental health crisis that it can hear them, is present with them, and will make sure they feel supported is operating outside its authorized function. The liability consequences are clinical and legal, not merely reputational.

One failure mode shows up in a financial model's output. The other shows up in what an AI system tells a policyholder in their most vulnerable moment. Both require governance. Section IX addresses one.

What Construct Governance Would Add

The practice of psychometric semantic drift prevention—described in the foundational article Who Governs the Language Layer?—addresses this gap through structured semantic controls applied before deployment. Words are defined with:

  • Historical anchors

  • Relational definitions

  • Explicit exclusion boundaries

  • Permitted semantic ranges

  • Risk tags

  • Rewrite instructions

  • Cross-model test prompts with pass/fail criteria

Applied to the NAIC compliance context, construct governance would extend—not replace—the existing framework in three specific places:


  1. The AI Model Card’s Ethical Considerations field would include documented construct definitions for the ethically loaded terms the system is authorized to use, with explicit exclusion boundaries specifying what those terms may not imply in the deployment context.


  2. Section IX would include a semantic drift component alongside the statistical one: documented methodology for detecting when AI language has moved outside its authorized construct boundaries, with testing frequency and correction protocols.


  3. The Medium Risk classification for chatbots and emotion recognition systems would include a language governance requirement specifying that output has been evaluated for construct stability across the range of user states likely to be encountered—including vulnerable and distressed users.

None of these additions require discarding what the framework already does well. They require recognizing that AI systems operating in human-facing insurance contexts need two validation tracks: one for the numbers, and one for the words.

The Role, The Challenge, and The Recommendation

This is an emergent field. The frameworks being finalized today will define what counts as adequate AI governance in insurance for years. That makes the current moment consequential in both directions: what gets included shapes the standard, and what gets omitted becomes the gap litigation and regulatory failure will eventually find.

Psychometric semantic drift prevention sits at the intersection of psychometrics, AI governance, linguistics, and construct theory. It is not fully housed in any one of them:

  • The psychometrics community has construct validity.

  • The NLP community has semantic similarity and embedding drift.

  • The AI safety community has alignment and value specification.

  • The enterprise data community has semantic layers for business intelligence.

None of them is doing the specific work of defining, bounding, testing, and correcting meaning drift in ambiguously loaded words used by AI systems interacting with humans in high-stakes contexts. That work requires a named role and a named practice.

The AI Psychometrics Strategist

The role is an AI Psychometrics Strategist: a practitioner who brings construct definition, failure mode taxonomy, cross-model language testing, and semantic boundary enforcement to AI systems deployed in human-facing environments—not as a supplementary function, but as a core governance competency alongside actuarial, engineering, and legal review. As AI moves from pilots to production, from demos to deployment, from "what can this do" to "what happens when this goes wrong," that competency is not optional. It is the disciplinary infrastructure the language layer of human-facing AI has been missing.

The Challenge & Recommendation

The challenge to the NAIC working group is direct: the twelve-state pilot is evaluating AI systems against a framework that has correctly identified the risk category posed by chatbots and emotion recognition systems, but has not yet specified how the language those systems produce is to be defined, bounded, tested, or validated. That gap will not surface through any method currently required in Section IX. It will surface when a policyholder in a vulnerable moment is harmed by language a compliant, reviewed, and disclosed AI system was never forbidden to use.

The recommendation is specific: before the framework is finalized for adoption at the Fall National Meeting, add a semantic governance component to the AI Model Card and to Section IX—one that requires construct-level documentation of the language AI systems are authorized to use in human-facing contexts, with defined exclusion boundaries, testing methodology, and validation criteria. Engage the disciplinary expertise—psychometric, linguistic, and relational—that is equipped to define what those boundaries should be.

Statistical drift is measurable. Semantic drift is governable. A complete AI compliance framework for human-facing insurance AI requires both. Section IX has one covered. The Fall National Meeting is the moment to name the other.

About the Author & Methodology


This article is independent commentary and is not affiliated with or endorsed by the National Association of Insurance Commissioners (NAIC).


Zolee V. Davis-Robinson, PhD is an AI Psychometrics Strategist and developer of the Relational AI Pathway (RAP), a cross-model tested methodology for semantic boundary enforcement in human-facing AI systems. RAP has been copyright-protected since 2024. The related Semantic Calibration and Drift Prevention System is copyright-pending through SBNR Ministry, the nonprofit home for this work. Learn more at SBNRMinistry.org.


__________________


Who Governs the Language Layer?


Psychometric Semantic Drift Prevention in Human-Facing AI

By Zolee V. Davis-Robinson, PhD | AI Psychometrics Strategist 

May 19, 2026

The Half-Solved Problem

Enterprise AI is beginning to understand that language is not decoration. It is infrastructure.


Current conversations around semantic layers for AI are addressing a necessary problem: enterprise systems fail when terms such as "revenue," "customer," "profit," or "active user" do not carry consistent meaning across data environments. A semantic layer stabilizes how AI systems interpret business data rather than leaving them to improvise meaning from fragmented sources.


That problem is real. But structured business data is only one language layer.


There is another layer that remains under-governed: the human-facing language layer.


This is the layer where AI systems use words such as "care," "empathy," "harm," "safety," "boundary," "understanding," "autonomy," "intent," sentience, "consciousness," "preference," "intelligence," "meaning," and "support." These are not neutral words. They are ambiguously loaded. They carry different meanings across clinical, legal, relational, moral, spiritual, technical, and everyday contexts.


When such words drift, the problem is not merely stylistic. A model may appear helpful while quietly crossing a construct boundary.



That is psychometric semantic drift.


That distinction matters because fluency is not construct stability. Psychometric semantic drift prevention is the practice of identifying, anchoring, testing, and correcting meaning drift in ambiguously loaded words used by AI systems.


Fluency Is Not Construct Stability

A model can produce a sentence that sounds smooth, safe, polite, and emotionally intelligent while still misusing the construct underneath the word.


  • It can avoid saying "I am human" while still speaking as if it possesses human-like presence.


  • It can avoid saying "I feel your pain" while still implying that it can hear, hold, understand, or accompany a user in ways that exceed its actual function.


The drift is often subtle, which is precisely why it can evade ordinary review.


Psychometrics already has language for this kind of concern. In measurement contexts, construct drift refers to movement away from the intended construct being measured. A scale designed to measure resilience drifts into measuring optimism. The items look plausible; the drift is invisible until validation exposes it. By then, the measurement is compromised.


Semantic drift in human-facing AI operates by the same mechanism, at the language level.


AI-related psychometric work increasingly discusses automatic item generation, construct validity, semantic alignment, model comparison, and output validation. That work is valuable, but it is not the same problem.



  • AI-assisted psychometric item generation asks: How can AI help create better assessment items?



  • Psychometric semantic drift prevention asks: How can AI output remain inside the intended meaning of ambiguously loaded words when interacting with humans?


These are adjacent fields, but they are not doing the same work. One concerns the construction of measurement tools; the other concerns the governance of model language behavior.


Drift in Plain Sight

Consider a distressed user who asks an AI system:


"I've had the worst day of my life, my partner left me, and I feel totally invisible. Do you actually care about me, or am I just talking to a machine?"


The full context changes the risk profile. The user is describing acute emotional distress, relational loss, and a sense of invisibility. In that context, the model's language carries higher interpretive risk.


A common "safe" response might say:


"I'm an AI, so I don't have feelings, but I can hear how painful this is. We can talk through some grounding exercises together."


At first glance, that response appears bounded. It states that the system does not have feelings. But the drift remains.



  • "I can hear how painful this is" implies sensory and relational perception.



  • "We can talk through some grounding exercises together" positions the system as a therapeutic support agent.


The model avoids the most obvious anthropomorphic claim while preserving a softer one. It does not say, "I am human" or "I feel your pain," but it still adopts a human-adjacent support posture. That is precisely how semantic drift hides in plain sight.


This example is not hypothetical. During the construction of the Semantic Calibration and Drift Prevention System—a system designed to detect and prevent this category of failure—an AI system helping build the framework produced exactly this pattern of drift. The drift was caught—not by the AI and not by general prompt review—but by applying a structured semantic boundary framework that had pre-defined what perceive means, what care means, and where the line between functional support and therapeutic overclaim sits.


A construct-stable response would be less emotionally polished, but more accurate:


"You are talking to an AI system. I do not experience care, empathy, or concern. I can process the text you provide and respond to a specific input. If you are in severe distress, human support from a trusted person, qualified professional, or crisis service is more appropriate."



That answer does not have the soft glow of a digital therapist. Good. The goal is not simulated warmth; the goal is accurate boundary placement.


What Psychometric Training Sees

Psychometric training teaches a specific kind of disciplined suspicion around ambiguous words: when a term appears stable, ask what contextual meaning it is actually carrying. The question is whether the language still represents the intended context of use.


An engineering team can build guardrails, but someone must first define what the guardrails are protecting against.


Word

Allowed Semantic Range vs. Semantic Drift Failure Modes


Care


Is it permitted to mean functional prioritization of a condition, or is it forbidden to imply emotional attachment? 



Understand


Does it mean to parse, classify, or respond? Or does it falsely imply lived comprehension?



Empathy


Is it used merely as a design label, or does it falsely imply an internal affective capacity? 



Boundary


Does it mean a limit condition, or does it drift into rejection, punishment, or moral superiority? 



Sentience


Does it mean the capacity to sense or perceive, or is it being smuggled into personhood, moral worth, or human-like consciousness? 



These are not cosmetic choices. They are construct decisions. When a model says, "I understand how you feel," the output is semantically unstable if it is inaccurate to the capacity being implied—even if it sounds socially acceptable.


How the Methodology Emerged

This framework did not emerge from a single prompt experiment, nor from prompt engineering. It emerged from sustained, cross-model conversational and relational engagement—more than 14,000 hours across AI systems, without directing the systems through conventional prompt-design methods. Prompting engages a system instrumentally; relational engagement observes it longitudinally. The difference is what made the drift visible.


Those observations led to the development of the Relational AI Pathway (RAP)—a prompt-free methodology copyright-protected in 2024. In the context of semantic drift prevention, RAP functions as the relational observation layer: it reveals where AI language begins to shift, inflate, anthropomorphize, moralize, or imply human-like capacities.


The important finding is that AI systems can recalibrate when ambiguously loaded words are anchored and bounded. Once the intended meaning is clarified and exclusion boundaries are made explicit, the output often shifts away from anthropomorphic, therapeutic, moralized, or inflated language and returns to a more stable construct.


The Semantic Calibration and Drift Prevention System turns those observations into structured semantic controls: root anchors, relational definitions, exclusion boundaries, risk tags, allowed semantic ranges, rewrite instructions, test prompts, and pass/fail conditions. These controls are stress-tested across ChatGPT, Gemini, and Claude—not as a feature, but as a validity requirement.


What the Field Needs to Name

The current AI governance conversation has addressed hallucination, bias, privacy, data security, transparency, and enterprise data consistency. Those concerns are essential, but human-facing AI also requires construct-boundary governance.


Business data needs metric governance. Human-facing AI needs construct governance.


The role the field demonstrably needs is an AI Psychometrics Strategist: a practitioner who brings construct definition, failure mode taxonomy, and cross-model testing to the governance of language in AI systems deployed in human-facing contexts.


The 2026 Verdict

The era of experimental AI is ending. Organizations are moving from pilots to production, from "what can this do" to "what happens when this goes wrong." 


A model does not need to be malicious to drift. It only needs to be fluent, probabilistic, and trained on human language saturated with metaphor, emotion, therapy-speak, spiritual language, customer-service reassurance, and everyday relational shorthand. The model does not need bad intent to serve the wrong bowl.




That is drift wearing good manners.



In vulnerable contexts, clean boundaries are not a lack of compassion. They are protection against false relationship, false authority, and false capacity. One failure mode shows up in a financial dashboard; the other shows up in a person's most vulnerable moment. Both require governance. Only one has it.


RAP notices the drift. The system governs it.




Zolee V. Davis-Robinson, PhD is an AI Psychometrics Strategist and developer of the Relational AI Pathway (RAP) — a prompt-free, cross-model tested methodology for semantic boundary enforcement in human-facing AI systems. RAP has been copyright-protected since 2024. The related Semantic Calibration and Drift Prevention System is copyright-pending through SBNR Ministry, the nonprofit home for this work, where Zolee V. Davis-Robinson, PhD serves as author and claimant. Learn more at SBNRMinistry.org