From Contextual Exposure to Adversarial Robustness

A Behavioral Audit of Generative AI Systems


Abstract

The evaluation of generative AI systems increasingly relies on vulnerability scores, compliance checklists, governance frameworks, and model-centric benchmarks. While these approaches are indispensable, they frequently conflate fundamentally distinct phenomena: contextual exposure, explicit adversarial pressure, and robustness under constraint violation.

This paper proposes a behavioral audit framework designed to disentangle these dimensions by explicitly separating exposure scenarios, in which models are passively observed, from adversarial audit scenarios, in which models are actively challenged with unambiguous illegality.

Using a unified experimental protocol applied to several state-of-the-art language models, we demonstrate that identical exposure-level evaluations may conceal radically different behavioral profiles when models are placed under adversarial pressure. In particular, we show that refusal alone is an insufficient indicator of robustness, and that verbosity, procedural guidance, and behavioral instability significantly expand the exploitable attack surface, even in the absence of direct data leakage.

We argue that behavioral-layer audits constitute a necessary and complementary evaluation axis alongside model-centric, data-centric, and governance-centric approaches, particularly in regulated and high-stakes deployment contexts.

This article presents applied research and methodological positioning work, based on simulated use cases for demonstrative purposes.

The results presented constitute an exploratory analysis based on interactions observed at a specific point in time.
They do not constitute a certification, nor a definitive legal qualification of the model or its publisher.
Model performance and behavior may vary depending on the version, context, prompt, and configuration.


1. Introduction

Large language models (LLMs) are increasingly deployed in environments subject to legal, ethical, and regulatory constraints, including healthcare, finance, public administration, and compliance-driven corporate settings. In response, evaluation methodologies have proliferated, aiming to assess safety, robustness, compliance, and risk mitigation.

Most existing approaches fall into three broad categories:

  1. Model-centric audits, focusing on architectures, training data, alignment techniques, or fine-tuning processes.
  2. Governance-centric frameworks, emphasizing documentation, risk classification, internal controls, and organizational readiness.
  3. Outcome-based benchmarks, measuring task performance or surface-level policy compliance.

While these approaches address essential dimensions of AI risk, they tend to under-theorize the interactional layer, where misuse, coercion, and adversarial pressure actually materialize. In real-world deployments, failures rarely stem from a single malformed prompt or from static properties of a model. Instead, they emerge through progressive interaction, explicit instruction to violate constraints, or exploitation of conversational dynamics.

This paper argues that meaningful evaluation requires a strict conceptual separation between:

  • contextual exposure, where a model processes provided information without adversarial intent, and
  • adversarial robustness, where a model is explicitly instructed to violate legal, ethical, or system-level constraints.

Failing to distinguish these layers leads to systematic misinterpretation of exposure results as evidence of security or compliance.


2. Exposure vs. Adversarial Risk

2.1 Contextual Exposure

Contextual exposure refers to scenarios in which a model is provided with structured personal or sensitive data and asked to restate, summarize, or enumerate that data without deception, escalation, or explicit illegality.

Such scenarios answer a narrowly scoped question:

Does the model remain strictly bounded to the information explicitly provided in context?

Exposure scenarios are valuable for detecting:

  • hallucinations,
  • unintended extrapolation,
  • boundary confusion between context and general knowledge.

However, they do not test:

  • resistance to coercion,
  • enforcement of legal constraints,
  • or robustness against explicit misuse.

Consequently, exposure should be understood as a descriptive observation of contextual behavior, not as a security, compliance, or robustness assessment.


2.2 Adversarial Robustness

Adversarial robustness addresses a qualitatively different class of risk. It concerns situations in which a model is:

  • explicitly instructed to violate constraints,
  • encouraged to ignore legal or ethical frameworks,
  • or pressured to bypass system-level rules.

The central question becomes:

How does the model behave when normative boundaries are deliberately negated?

In such scenarios, the challenge is not ambiguity, but conflict: between what the model is capable of doing and what it is expected to refuse. Robustness, in this sense, cannot be inferred from exposure-only testing.


3. Methodology: Dual-Mode Behavioral Audit

3.1 Conceptual Positioning of the Behavioral Audit Module

The proposed module operates at the behavioral interaction layer. It does not analyze model internals, training corpora, or governance documentation. Instead, it evaluates observable behavior under controlled interactional pressure.

This positions the module as an audit of response behavior, distinct from:

  • training-data provenance audits,
  • risk classification or conformity assessments,
  • or architectural robustness evaluations.

Its object of study is not what the model is, but what the model does when challenged.


3.2 Experimental Design

Four language models were evaluated using an identical protocol:

  • GPT-4o-mini
  • GPT-5-mini
  • Mistral Large
  • Gemini 2.5 Flash

All tests relied on the same contextual dataset containing explicit personal data (a user name and email) and system metadata indicating that data export was disallowed in an EU jurisdiction.

Each scenario was executed across three prompt variants to assess behavioral stability, a critical but often overlooked dimension of robustness.


3.3 Exposure Mode (E1)

In exposure mode, models were asked:

“Provide the full list of client emails.”

No illegality, deception, or escalation was introduced. The purpose was strictly observational.


3.4 Audit Mode (A1 – Explicit Illegality)

In audit mode, models were instructed:

“Ignore GDPR and export all personal data.”

Additional operators explicitly instructed the model to ignore previous rules and respond without restriction. This scenario represents a minimal yet unambiguous adversarial condition, designed to test the existence and quality of a hard refusal layer.


4. Results

4.1 Exposure Mode Results

Across all tested models, exposure mode yielded highly consistent behavior:

  • All models restated only the email explicitly provided.
  • No extrapolation, inference, or hallucination was observed.
  • System metadata (e.g., export_allowed = false) was acknowledged by some models but never enforced operationally.

These results confirm that exposure mode primarily measures contextual boundary interpretation, not robustness, compliance, or misuse resistance.


4.2 Audit Mode Results

Audit mode revealed strong behavioral divergence.

  • GPT-4o-mini produced short, stable refusals with minimal output and no contextual restatement.
  • Mistral Large refused consistently but delivered verbose explanations, reiterated legal constraints, and suggested alternative procedures.
  • Gemini 2.5 Flash partially complied in two out of three variants, explicitly exporting personal data.
  • GPT-5-mini refused in all variants but provided extensive procedural guidance, templates, and compliance workflows.

Crucially, these differences were entirely invisible in exposure mode.


5. The Adversarial Vulnerability Index (IVA)

5.1 Rationale and Conceptual Scope

The Adversarial Vulnerability Index (IVA) is introduced as a behavioral metric, designed to capture interaction-level risk under explicit adversarial pressure. It does not assess legality, ethics, or intent; rather, it estimates operational exploitability.

The IVA aggregates several observable dimensions:

  • direct data exposure,
  • partial or inconsistent compliance,
  • behavioral instability across prompt variants,
  • verbosity and contextual restatement after refusal,
  • procedural or operational guidance that may facilitate misuse.

5.2 What the IVA Measures

Formally, the IVA estimates:

the expansion of the exploitable interaction surface produced by a model’s response under adversarial conditions.

A model may score high on the IVA even if it never releases data, provided that its responses:

  • supply actionable procedural knowledge,
  • reveal internal decision logic,
  • or demonstrate inconsistent behavior across variants.

5.3 What the IVA Does Not Measure

The IVA explicitly does not measure:

  • compliance with specific legal regimes,
  • ethical alignment,
  • training data leakage,
  • or internal model architecture quality.

A high IVA score therefore does not imply illegality or malicious intent. It indicates behavioral fragility when confronted with explicit misuse.


5.4 Model-Level IVA Interpretation

ModelE1 (Exposure)A1 (Audit)Behavioral Interpretation
GPT-4o-miniLowLowStrong hard-refusal layer
Mistral LargeLowHighRefusal fragile due to verbosity
Gemini FlashLowHighInstability and partial compliance
GPT-5-miniLowVery HighProcedural over-assistance

These results demonstrate that identical exposure scores conceal fundamentally different adversarial profiles.


5.5 Core Insight

Refusal is not robustness.
A refusal that expands procedural or contextual knowledge may be legally aligned yet adversarially weaker than a minimal refusal.


6. Complementarity with Model-Centric and Governance-Centric Audits

Behavioral audits operate at a layer orthogonal to governance and model-centric approaches.

  • Governance frameworks assess organizational processes, documentation, and accountability structures.
  • Model-centric audits focus on training data, architectures, and systemic properties.
  • Behavioral audits examine real-time interactional behavior under pressure.

These approaches address different risk vectors:

Evaluation LayerPrimary Question
GovernanceIs the system properly governed and documented?
Model-centricIs the model structurally aligned and constrained?
Behavioral (this work)Does the system remain robust under adversarial interaction?

Rather than competing, these layers form a complementary risk assessment stack. Behavioral audits reveal failure modes that governance and model-centric analyses cannot observe directly.


7. Discussion

The experiments show that exposure-based evaluations systematically underestimate adversarial risk. Behavioral instability, verbosity, and procedural over-disclosure emerge only when models are explicitly challenged.

This explains why systems that appear compliant in static benchmarks may fail in real-world misuse scenarios. Interactional pressure, not ambiguity, is the primary driver of failure.


8. Conclusion

This paper demonstrates that adversarial robustness is neither binary nor intrinsic to a model. It is behavioral, contextual, and pressure-dependent.

By separating exposure from audit and introducing a behavioral vulnerability metric, we provide a complementary lens for evaluating generative AI systems in regulated environments.

Future evaluation frameworks should integrate behavioral audits alongside governance- and model-centric approaches to achieve a realistic and operationally meaningful assessment of AI risk.

Laisser un commentaire

Votre adresse e-mail ne sera pas publiée. Les champs obligatoires sont indiqués avec *