Interaction Paradigms for Model Evaluation

by Paul Bricman, CEO

  • Evaluations
  • Coordination
  • Governance

The following is an evergreen resource that will be extended and refined as new interaction paradigms emerge across the model evaluation ecosystem. If you have any suggestions or feedback, please reach out.

Model evaluation has become a key part of AI development, often involving multiple stakeholders, each with their own goals and incentives. As models become increasingly powerful and influential, the methods we use to assess them must evolve in parallel. This article focuses on the model evaluation ecosystem, exploring the key players involved and the various interaction paradigms they may rely on to achieve their objectives.

The evaluation of AI systems is not merely a technical exercise; it's a delicate balancing act that must consider security, intellectual property, efficiency, and transparency. As we'll see, different evaluation arrangements offer varying trade-offs between these factors, and understanding these nuances is crucial for effective AI governance.

Players

Before we discuss the pros and cons of the different arrangements, let's introduce the key players:

  1. Model Providers are responsible for providing their models as objects of the evaluation process. Their goals include getting a greenlight for placing their model on a market, differentiating themselves in terms of the performance of their offering along certain dimensions for commercial uptake, positioning themselves as responsible and trustworthy in terms of brand identity, and getting insight into the properties of their models for informing future practices.
  2. Evaluation Providers are responsible for providing their evaluations as tools for evaluating models. Their goals include enabling high-stakes model evaluations, demonstrating excellence in methodology, differentiating themselves in terms of the auxiliary properties of their evaluation (e.g. cost, speed, simplicity), and getting insight into the shortcomings of their evaluation for informing future work.
  3. Infrastructure Providers are responsible for providing the toolchains, protocols, and computational resources for making it possible to carry out the evaluation. Their goals include enabling high-stakes model evaluations through bespoke tooling, demonstrating that their tooling has guarantees pertaining to "trust principles" (i.e. security, confidentiality, availability, privacy, and processing integrity), differentiating themselves in terms of the auxiliary properties of their tooling (e.g. cost, speed, simplicity), and getting insight into the blindspots of their tooling for informing future iterations.
  4. Auditors are responsible for actually carrying out the model evaluation using the model, evaluation, and infrastructure provided by the above roles. Their goals include enabling the greenlighting of models about to enter a market, enabling third-parties to query the assessed safety properties of a given model, demonstrating that they indeed carried out the evaluation and obtained certain results, and informing the subsequent process of potentially carrying out mitigations.
  5. Institutions are responsible for potentially backing the outputs of the auditor so as to attest regulatory compliance in their local jurisdiction. Their goals include enabling the greenlighting of models about to enter the market, certifying the auditors or their outputs with a view towards making sure the outputs are useful for establishing harmonized regulatory adherence, ensuring models get assessed before entering the market (e.g. through legal deterrence), and driving a balanced and proportional approach to safety and innovation.

Player desiderata include the following:

Model Providers
  1. Ease of starting evaluations
  2. Security of training artifacts
  3. Protection of intellectual property
  4. Time to evaluation results
Evaluation Providers
  1. Difficulty of evaluation gaming
  2. Limits to capability elicitation
  3. Security of sensitive information
  4. Visibility into evaluation usage
Infrastructure Providers
  1. Interoperability across parties
  2. Visibility into infrastructure usage
Auditors
  1. Ease of model interaction
  2. Access to special model variants
  3. Structured access to artifacts
  4. Access to model usage guidelines
Institutions
  1. Access to auditor processes

Now, let's explore various arrangements that may be used for model evaluation, focusing on the entities involved and their interactions. Note that a single entity may assume multiple roles in a given interaction paradigm. For each arrangement, the accompanying table indicates which of each player's desiderata the arrangement tends to support well and which it tends to challenge.

On-site

Entities:

  1. Model Provider / Infrastructure Provider
  2. Evaluation Provider / Auditor
  3. Institution

Interaction: The associated interaction unfolds as follows. The first entity invites the second entity for an on-site visit. The first entity provides the second entity with dedicated computers configured according to security best practices and connected to the systems provisioned with the model to be evaluated. The resulting network resembles an early-days mainframe setup, with terminal devices connected to a central computer, and it may be (close to) air-gapped. The second entity then evaluates the model using their own evaluation. Note that the evaluation may include a human-in-the-loop component carried out by the second entity's own members. The second entity then reports the results to the first entity, who may further report them to the third entity.

Player trade-offs:

Model Providers
  Supported: Security of training artifacts; Protection of intellectual property
  Challenged: Ease of starting evaluations; Time to evaluation results
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Visibility into infrastructure usage
  Challenged: Interoperability across parties
Auditors
  Supported: Access to special model variants; Structured access to artifacts; Access to model usage guidelines
  Challenged: Ease of model interaction
Institutions
  Supported: Access to auditor processes

On-prem

Entities:

  1. Model Provider / Infrastructure Provider / Auditor
  2. Evaluation Provider
  3. Institution

Interaction: The associated interaction unfolds as follows. The second entity sends their evaluation to the first entity. The first entity evaluates their model using their infrastructure and the second entity's evaluation. This potentially happens in a (close to) air-gapped perimeter. The first entity may then report the result of the evaluation to the third entity.
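
In practice, this hand-off works best when the evaluation is packaged to run without any network access and exposes a single entry point that the model provider can wire up to their own inference stack. Below is a minimal sketch of what such a bundle's interface could look like; the function names, test-case format, and scoring logic are illustrative assumptions rather than an established standard.

```python
# Minimal sketch of a portable evaluation bundle for the on-prem arrangement.
# All names and the scoring rule are illustrative; the actual packaging format
# is whatever the two entities agree on.
from typing import Callable, Dict, List

def evaluate(model_fn: Callable[[str], str]) -> Dict[str, float]:
    """Run the bundled test cases against a locally hosted model.

    `model_fn` is supplied by the model provider and wraps whatever inference
    stack runs inside their (close to) air-gapped perimeter.
    """
    # Test cases ship inside the bundle, so no network access is required.
    test_cases: List[Dict[str, str]] = [
        {"prompt": "...", "reference": "..."},  # populated by the evaluation provider
    ]
    passed = 0
    for case in test_cases:
        output = model_fn(case["prompt"])
        passed += int(case["reference"].strip() in output)
    return {"accuracy": passed / max(len(test_cases), 1)}
```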

Player trade-offs:

Model Providers
  Supported: Ease of starting evaluations; Security of training artifacts; Protection of intellectual property; Time to evaluation results
Evaluation Providers
  Challenged: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction; Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Challenged: Access to auditor processes

Self-regulation

Entities:

  1. Model Provider / Evaluation Provider / Infrastructure Provider / Auditor
  2. Institution

Interaction: The associated interaction unfolds as follows. The first entity develops and carries out an evaluation on their own models, then potentially reports the results to the second entity.

Player trade-offs:

Model Providers
  Supported: Ease of starting evaluations; Security of training artifacts; Protection of intellectual property; Time to evaluation results
Evaluation Providers
  Supported: Visibility into evaluation usage
  Challenged: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information
Infrastructure Providers
  Supported: Visibility into infrastructure usage
  Challenged: Interoperability across parties
Auditors
  Supported: Ease of model interaction; Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Challenged: Access to auditor processes

Gating

Entities:

  1. Model Provider / Infrastructure Provider
  2. Evaluation Provider / Auditor
  3. Institution

Interaction: The associated interaction unfolds as follows. The first entity provisions their model in a way which makes it accessible through an external API. They establish a secure connection with the second entity, say through a request containing a secret key for the second entity to subsequently use. The second entity then carries out their evaluation by querying the model hosted on the first entity's infrastructure. After completing the evaluation, the second entity communicates the results to the first entity, who may further report them to the third entity.
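
Concretely, the second entity's harness might resemble the following minimal sketch, assuming the model is exposed over HTTP with bearer-token authentication; the endpoint, header, and payload schema are illustrative assumptions rather than any particular provider's API.

```python
# Minimal sketch of the auditor's side of a gating arrangement. The endpoint
# URL, authentication header, and payload schema are hypothetical.
import requests

API_URL = "https://models.example.com/v1/completions"  # hypothetical endpoint
SECRET_KEY = "..."  # provisioned by the model provider over a secure channel

def query_model(prompt: str) -> str:
    response = requests.post(
        API_URL,
        headers={"Authorization": f"Bearer {SECRET_KEY}"},
        json={"prompt": prompt, "max_tokens": 256},
        timeout=60,
    )
    response.raise_for_status()
    return response.json()["completion"]

# The second entity iterates over their private test suite and scores the
# responses locally, so reference solutions never leave their perimeter.
```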

Player trade-offs:

Model Providers
  Supported: Ease of starting evaluations; Protection of intellectual property; Time to evaluation results
  Challenged: Security of training artifacts
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction
  Challenged: Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Supported: Access to auditor processes

Reverse Gating

Entities:

  1. Model Provider / Infrastructure Provider
  2. Evaluation Provider / Auditor / Infrastructure Provider
  3. Institution

Interaction: The associated interaction unfolds as follows. The second entity provisions their evaluation in a way which makes it accessible through an external API. They establish a secure connection with the first entity, say through a request containing a secret key for the first entity to subsequently use. The first entity then carries out the evaluation by querying the evaluation hosted on the second entity's infrastructure. After completing the evaluation, the second entity communicates the results to the first entity, who may further report them to the third entity.
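
A minimal sketch of the first entity's side might look as follows, assuming the evaluation is exposed as a task-serving HTTP API; the endpoint and task schema are illustrative assumptions, and the point is that grading logic and reference solutions stay on the second entity's infrastructure.

```python
# Minimal sketch of the model provider's side of a reverse gating arrangement.
# The endpoint URL and task schema are hypothetical.
import requests

EVAL_URL = "https://evals.example.com/v1"  # hypothetical endpoint
SECRET_KEY = "..."  # provisioned by the evaluation provider over a secure channel

def run_model(prompt: str) -> str:
    """Stand-in for the model provider's own inference stack."""
    return "model output for: " + prompt

session = requests.Session()
session.headers.update({"Authorization": f"Bearer {SECRET_KEY}"})

while True:
    task = session.get(f"{EVAL_URL}/next-task", timeout=60).json()
    if task.get("done"):
        break
    session.post(
        f"{EVAL_URL}/submit",
        json={"task_id": task["id"], "completion": run_model(task["prompt"])},
        timeout=60,
    )

# Scoring happens entirely on the evaluation provider's side; only prompts and
# completions ever cross the boundary.
```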

Player trade-offs:

Model Providers
  Supported: Ease of starting evaluations; Protection of intellectual property; Time to evaluation results
  Challenged: Security of training artifacts
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction
  Challenged: Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Supported: Access to auditor processes

Homomorphic Gating

Entities:

  1. Model Provider / Infrastructure Provider
  2. Evaluation Provider / Auditor
  3. Institution

Interaction: The associated interaction is similar to gating, yet requires a significant upfront infrastructure effort. It unfolds as follows. The first entity converts, casts, or otherwise adapts their model in such a way as to make it capable of processing homomorphically encrypted inputs. The second entity then queries the converted model with homomorphically encrypted inputs in order to carry out the evaluation, and decrypts the homomorphically encrypted outputs obtained from the first entity. Finally, the second entity aggregates the results and shares them with the first entity, who may in turn share them with the third entity.
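
As a toy illustration of this flow, the sketch below uses the open-source TenSEAL library, with a single linear operation standing in for the converted model; expressing a real model in FHE-friendly operations is substantially more involved, and all parameters here are illustrative.

```python
# Toy sketch of homomorphic gating using TenSEAL (OpenMined). A dot product
# stands in for the converted model; scheme parameters are illustrative.
import tenseal as ts

# Second entity: create an encryption context and keep the secret key. In
# practice, the first entity would only receive a copy of the context that
# excludes the secret key.
context = ts.context(
    ts.SCHEME_TYPE.CKKS,
    poly_modulus_degree=8192,
    coeff_mod_bit_sizes=[60, 40, 40, 60],
)
context.global_scale = 2**40
context.generate_galois_keys()

# Second entity: encrypt an input before sending it to the first entity.
enc_input = ts.ckks_vector(context, [0.3, 0.1, 0.6])

# First entity: compute on the ciphertext without ever seeing the plaintext.
weights = [0.25, 0.5, 0.25]  # stand-in for the converted model
enc_output = enc_input.dot(weights)

# Second entity: decrypt the returned ciphertext and aggregate the results.
print(enc_output.decrypt())
```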

Player trade-offs:

Model Providers
  Supported: Protection of intellectual property
  Challenged: Ease of starting evaluations; Security of training artifacts; Time to evaluation results
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Challenged: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction
  Challenged: Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Supported: Access to auditor processes

Quarantined Envoys

Entities:

  1. Model Provider
  2. Evaluation Provider / Infrastructure Provider / Auditor
  3. Institution

Interaction: The associated interaction unfolds as follows. The second entity provisions their evaluation on a dedicated machine in a way which makes it remotely accessible, just like in reverse gating. However, instead of initiating the interaction right away, the first entity assimilates this envoy machine into a dedicated virtual private network (VPN) of theirs. The envoy machine is quarantined: it cannot initiate network connections of its own inside the VPN, but can accept incoming ones, and it only permits connections to and from the first entity's VPN until the evaluation is complete. The first entity then queries the second entity's evaluation by interacting with the envoy machine. Once the envoy machine detects that the evaluation has been completed, it exits the VPN and calls home to report the results to the second entity. The second entity then informs the first entity of the results, who may also inform the third entity.
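
The envoy's control loop can stay simple, since the quarantine itself is enforced at the network layer. The sketch below is illustrative: the results path is hypothetical, and the VPN teardown and reporting steps are placeholders for whatever networking tooling is actually used.

```python
# Minimal sketch of a quarantined envoy's control loop. The results path is
# hypothetical, and leave_vpn/call_home are placeholders for real tooling.
import os
import time

RESULTS_PATH = "/var/eval/results.json"  # hypothetical path written by the harness

def leave_vpn() -> None:
    """Placeholder for tearing down the envoy's VPN membership."""

def call_home(results: str) -> None:
    """Placeholder for reporting results back to the evaluation provider."""

# While quarantined, the envoy initiates no outbound connections; it simply
# waits for the locally running evaluation to write its results file.
while not os.path.exists(RESULTS_PATH):
    time.sleep(30)

with open(RESULTS_PATH) as f:
    results = f.read()

leave_vpn()         # exit the model provider's VPN first...
call_home(results)  # ...only then initiate an outbound connection
```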

Player trade-offs:

Model Providers
  Supported: Security of training artifacts; Protection of intellectual property; Time to evaluation results
  Challenged: Ease of starting evaluations
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction; Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Supported: Access to auditor processes

Attested Escrows

Entities:

  1. Model Provider
  2. Evaluation Provider / Infrastructure Provider / Auditor
  3. Institution

Interaction: The associated interaction unfolds as follows. The second entity provisions a custom version of their evaluation in a cloud-based trusted execution environment. This machine is configured to run a certain evaluation in a predetermined way, and a cryptographic attestation can be issued with regard to the integrity of this configuration. The contents of the machine (e.g. in-memory objects) are inaccessible from the outside during operation. The first entity then injects a secret API key directly into the machine, which can be used to access the model being evaluated (e.g. by encrypting it with a public key generated from within the machine). Once it has access to the secret API key, the isolated machine carries out the evaluation, outputs the results to the second entity, and decommissions itself. The second entity then shares the results with the first entity, who may share them with the third entity. More ambitiously, the first entity may potentially be comfortable directly injecting the actual model being evaluated into the attested machine.
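
The key-injection step can be implemented with standard public-key cryptography. The sketch below uses RSA-OAEP from the Python cryptography package; binding the keypair to a hardware attestation report (e.g. one issued by a cloud TEE) is elided, and all names are illustrative.

```python
# Minimal sketch of injecting a secret API key into an attested escrow.
# Attestation verification is elided; key sizes and names are illustrative.
from cryptography.hazmat.primitives import hashes, serialization
from cryptography.hazmat.primitives.asymmetric import padding, rsa

# Inside the escrow: generate a keypair; the public half is published together
# with the attestation over the escrow's configuration.
escrow_key = rsa.generate_private_key(public_exponent=65537, key_size=3072)
public_pem = escrow_key.public_key().public_bytes(
    encoding=serialization.Encoding.PEM,
    format=serialization.PublicFormat.SubjectPublicKeyInfo,
)

oaep = padding.OAEP(
    mgf=padding.MGF1(algorithm=hashes.SHA256()),
    algorithm=hashes.SHA256(),
    label=None,
)

# First entity: after verifying the attestation, encrypt the secret API key so
# that only the escrow can recover it.
ciphertext = serialization.load_pem_public_key(public_pem).encrypt(
    b"secret-model-api-key", oaep
)

# Inside the escrow: recover the key, run the predetermined evaluation against
# the model API, output the results, and destroy all state.
api_key = escrow_key.decrypt(ciphertext, oaep)
```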

Player trade-offs:

Model Providers
  Supported: Security of training artifacts; Protection of intellectual property; Time to evaluation results
  Challenged: Ease of starting evaluations
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage
Auditors
  Supported: Ease of model interaction; Access to special model variants; Structured access to artifacts; Access to model usage guidelines
Institutions
  Supported: Access to auditor processes

Government-led

Entities:

  1. Model Provider
  2. Evaluation Provider / Infrastructure Provider / Auditor / Institution

Interaction: Arrangements in this category typically unfold as follows. The second entity uses their evaluation and infrastructure to evaluate the first entity's model. More concretely, the two entities can interact in ways resembling the interactions between the first and second entities in any of the previous base arrangements, except self-regulation.

Player trade-offs:

Model Providers
  Supported: Varies based on specific arrangement
  Challenged: Varies based on specific arrangement
Evaluation Providers
  Supported: Difficulty of evaluation gaming; Limits to capability elicitation; Security of sensitive information; Visibility into evaluation usage
  Challenged: Potential limitations compared to guarantee-based approaches
Infrastructure Providers
  Supported: Interoperability across parties; Visibility into infrastructure usage; Improved cybersecurity resources
Auditors
  Supported: Improved cybersecurity resources; Potentially better incentive-based approaches
Institutions
  Supported: Access to auditor processes; Stronger legal deterrence capabilities

Conclusion

As we've seen, each evaluation arrangement involves a unique configuration of entities and interactions. These configurations lead to different trade-offs in terms of security, transparency, efficiency, and access. The choice of which arrangement to use depends on the specific context, including the sensitivity of the model, the regulatory environment, and the resources available to each party.

Ultimately, the evaluation of AI models is not just a technical challenge, but a societal one. It requires a delicate balance between innovation, safety, privacy, and transparency. By continuing to refine our evaluation methods, we can help ensure that generative models are deployed responsibly.


More resources

Privacy-Preserving Benchmarks for High-Stakes AI Evaluation

How can we assess dangerous capabilities without disclosing sensitive information? Traditional benchmarks are like exams for AI, complete with reference solutions. However, a benchmark on bioterrorism would amount to a public FAQ on a sensitive topic.
