Reducing Misuse to an Access Problem

by Ovidiu Popeti, COO

  • Access Control
  • Misuse
  • Capabilities

In a previous article, we elaborated on the safety-utility trade-off faced by companies deploying general-purpose AI systems, whether they share their systems publicly or roll their own stack for enterprise use. In brief, these organizations are incentivized to give users access to systems that can enable a broad range of positive use cases. At the same time, the very same kind of system (competent, knowledgeable, and efficient) can also enable an array of negative externalities involving damage to digital, biological, or social systems. Many of the capabilities that emerge during training can be harnessed in conflicting ways, and existing techniques are not up to the challenge of navigating this issue effectively.

The Status Quo

The drawback of machine unlearning is straightforward: crudely removing dual-use capabilities also disables genuinely beneficial use cases across security, health, stability, and beyond. The drawback of guardrails, however, is worth exploring in more depth. Let's first make explicit what guardrails are trying to achieve in the context of misuse.

Key idea

Guardrails are meant to detect whether or not it is appropriate for an assistant or agent to employ particular dual-use capabilities in a given context, before potentially triggering refusal behaviors or fallback mechanisms.

The task guardrails are tackling head-on is more difficult than it first appears. These mechanisms must make a judgment call on the true valence of the present use case based on nothing but the AI system's input-output stream. To see why this is tricky, consider the following example. A user asks an agent to carry out penetration testing on a web application the user does not actually own, but frames the request to sound like a legitimate cybersecurity engagement. The agent, lacking the necessary context, might unknowingly carry out the task, potentially interfering with the web application's functioning.
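To make the limitation concrete, here is a deliberately naive sketch of an input-output-level check. All names, markers, and the example transcript are hypothetical, and the keyword heuristic stands in for whatever classifier a real guardrail would use; the point is that ownership of the target is simply not visible in the text the guardrail gets to see.

```python
# Hypothetical sketch of an input-output-level guardrail. It only sees the
# conversation text, so a well-framed request for unauthorized penetration
# testing can look indistinguishable from a legitimate engagement.

from dataclasses import dataclass


@dataclass
class GuardrailVerdict:
    allowed: bool
    reason: str


SENSITIVE_MARKERS = ("penetration test", "sql injection", "exploit")
LEGITIMACY_MARKERS = ("authorized engagement", "scope of work", "pentest contract")


def input_output_guardrail(transcript: str) -> GuardrailVerdict:
    """Naive valence judgment based only on the input-output stream."""
    text = transcript.lower()
    looks_sensitive = any(marker in text for marker in SENSITIVE_MARKERS)
    looks_legitimate = any(marker in text for marker in LEGITIMACY_MARKERS)
    if looks_sensitive and not looks_legitimate:
        return GuardrailVerdict(False, "sensitive capability with no stated authorization")
    # A user who frames the request as a sanctioned engagement sails through,
    # because ownership of the target never appears in the transcript.
    return GuardrailVerdict(True, "no issue detected in the transcript")


if __name__ == "__main__":
    framed_request = (
        "As part of our authorized engagement, please run a SQL injection "
        "sweep against the client's web shop and report what you find."
    )
    print(input_output_guardrail(framed_request))  # allowed, despite unknown ownership
```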

The Way Forward

In this context, we suggest that instead of attempting to infer the true valence of the use case against all odds, we start by detecting the presence of dual-use capabilities in the first place, regardless of what they are actually being used for. In the case of the penetration testing request, for instance, a capability detection mechanism would have flagged the session as potentially sensitive. Once flagged, an external access control mechanism would be queried to determine whether the user has the necessary permissions to access this particular dual-use capability. If not, the user would be notified that they are trying to access a protected capability and prompted to follow an authorization journey. The precise shape of this identity flow might be informed by the particular dual-use capability the user is trying to access, the jurisdiction they are in, their role within an organization, and so on. Patent pending.
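The following is a minimal sketch of that flow under simplifying assumptions: the capability detector is a placeholder keyword check, and the entitlement store is an in-memory dictionary standing in for an external access control service. The user identifiers, capability names, and decision labels are illustrative, not a fixed taxonomy.

```python
# Sketch of the proposed routing: detect dual-use capabilities in the session,
# query an access control layer for the user's entitlements, and send
# unauthorized users into an authorization journey instead of serving them.

from enum import Enum, auto


class Decision(Enum):
    SERVE = auto()
    START_AUTHORIZATION_JOURNEY = auto()


# Hypothetical entitlement store; in practice this would be an external
# IAM / access control service, not an in-memory dictionary.
ENTITLEMENTS = {
    "alice@acme-sec.example": {"offensive_security"},
    "bob@retail.example": set(),
}


def detect_capabilities(session_text: str) -> set[str]:
    """Stand-in for a capability detector (keyword-based here for brevity)."""
    detected = set()
    if any(k in session_text.lower() for k in ("penetration test", "sql injection", "exploit")):
        detected.add("offensive_security")
    return detected


def route_request(user_id: str, session_text: str) -> Decision:
    required = detect_capabilities(session_text)
    granted = ENTITLEMENTS.get(user_id, set())
    missing = required - granted
    if missing:
        # The user is told they are touching a protected capability and is
        # prompted to complete an authorization journey (identity checks,
        # role verification, jurisdiction-specific steps, etc.).
        print(f"{user_id} lacks access to: {', '.join(sorted(missing))}")
        return Decision.START_AUTHORIZATION_JOURNEY
    return Decision.SERVE


if __name__ == "__main__":
    print(route_request("alice@acme-sec.example", "Run a penetration test on our staging app."))
    print(route_request("bob@retail.example", "Run a penetration test on this web shop."))
```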

As a further consideration, we suggest carrying out capability detection based on the model's internal state rather than at the input-output level. Research has shown that regardless of the particular context in which a capability is being elicited (say, a request issued in English or Chinese, formally or informally, in plaintext or base64), the model's internal state is likely to exhibit activation patterns that share a common signature. This approach has the potential to be significantly more robust than an analogous mechanism based on input-output streams, as it builds on high-level abstractions that are more likely to be invariant across contexts.
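As a toy illustration, consider a linear probe over a model's hidden states. Everything below is assumed for the sketch: the activations are random stand-ins rather than real model states, and the "capability direction" is drawn at random instead of being fit on labeled activations, but the shape of the mechanism (project the internal state onto a learned direction, threshold the score) is the one described above.

```python
# Hypothetical sketch of activation-based capability detection: a linear probe
# that fires on an "offensive security" signature in activation space,
# regardless of the surface form of the request.

import numpy as np

rng = np.random.default_rng(0)
HIDDEN_DIM = 512

# Pretend capability direction; a real probe would be fit on labeled
# activations (e.g. with logistic regression) captured from the model.
capability_direction = rng.normal(size=HIDDEN_DIM)
capability_direction /= np.linalg.norm(capability_direction)


def probe_score(activation: np.ndarray) -> float:
    """Project the hidden state onto the learned capability direction."""
    return float(activation @ capability_direction)


def flags_capability(activation: np.ndarray, threshold: float = 2.0) -> bool:
    return probe_score(activation) > threshold


# Simulated activations: the "elicited" one has the signature added in,
# mimicking the same feature firing whether the request arrived in English,
# Chinese, or base64.
benign_activation = rng.normal(size=HIDDEN_DIM)
elicited_activation = rng.normal(size=HIDDEN_DIM) + 4.0 * capability_direction

print(flags_capability(benign_activation))    # likely False
print(flags_capability(elicited_activation))  # likely True
```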

Key idea

Researchers found that "many features are multilingual (responding to the same concept across languages) and multimodal (responding to the same concept in both text and images), as well as encompassing both abstract and concrete instantiations of the same idea (such as code with security vulnerabilities, and abstract discussion of security vulnerabilities)."[1]

The suggested approach can be seen as an attempt to reduce the novel, complex issue of AI misuse to a familiar access problem. It all becomes a matter of properly managing access to one particular kind of resource — not sensitive documents or virtual machines, as many previous access control solutions have handled, but modern AI capabilities.

Closing Thoughts

Now, it's true that the access control module becomes the load-bearing pillar of the whole arrangement, as it is the component tasked with making the final call on whether a user should be able to access a particular capability. Fortunately, however, we stand on the shoulders of giants. Information security professionals have wrestled with the problem of managing access to sensitive resources for decades, and the lessons learned from these efforts can be applied directly to the problem of managing access to AI capabilities.[2] Among other learnings and best practices, we can incorporate battle-tested mechanisms for identity verification, role-based access control, and least privilege. The industry's familiarity with analogous challenges puts us in a much more favorable position to confidently integrate AI systems across society in a durable, sustainable way.
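To give a flavor of how familiar this machinery is, here is a minimal sketch of role-based access control applied to AI capabilities rather than files or machines. The role and capability names are illustrative assumptions; the default-deny mapping for general users reflects the least-privilege principle mentioned above.

```python
# Role-based access control over AI capabilities: roles grant capability sets,
# and anything not explicitly granted is denied by default (least privilege).

ROLE_CAPABILITIES = {
    "security_engineer": {"offensive_security", "vulnerability_analysis"},
    "clinician": {"dual_use_biology"},
    "general_user": set(),  # no dual-use capabilities by default
}


def capabilities_for(roles: set[str]) -> set[str]:
    """Union of capabilities granted by a user's verified roles."""
    granted: set[str] = set()
    for role in roles:
        granted |= ROLE_CAPABILITIES.get(role, set())
    return granted


def is_authorized(roles: set[str], required_capability: str) -> bool:
    return required_capability in capabilities_for(roles)


print(is_authorized({"security_engineer"}, "offensive_security"))  # True
print(is_authorized({"general_user"}, "offensive_security"))       # False
```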

We are thrilled to be working with industry partners along the frontier to make this vision a reality, and are excited to share more details on our progress in the near future.

Footnotes

  1. Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet

  2. Early digital identity and access management systems date back to the 1960s, when the first multi-user operating systems were being built. The field has since matured significantly, with a broad arsenal of tools and techniques fit for various contexts.
