The Good, the Bad, and the Dual-Use
by Paul Bricman, CEO
- Misuse
- Capabilities
- Safeguards
General-purpose AI systems, such as GPT-4 or Llama 3, are competent across a broad range of domains, enabling a dazzling array of use cases in every vertical. In some domains, their level of competence is on par with that of established domain experts, while in others it is "only" around average human performance. Regardless, such AI systems have consistently proven at least somewhat competent across virtually all industries, with performance only set to improve over time.1
The Safety-Utility Trade-Off
Their status as "ultimate generalists" positions general-purpose AI systems to create economic value on a scale that has been argued to be comparable to that of the industrial revolution.2 This places frontier labs under extreme market pressure to provide users with a competent, reliable system "that outperforms humans at most economically valuable work."3
At the same time, frontier labs face a fundamental tension. They would like their offerings to enable prosocial, value-creating use cases, but they must also prevent those very same offerings from enabling negative, inhumane ones. They need to avoid deploying systems whose competence, reasoning, and knowledge can be harnessed to pursue destructive objectives at scale, causing harm in the process.
The potential of such AI systems to do both incredible good and bad in the world has prompted a sizable portion of AI safety to project this dichotomy onto the very structure of such systems. In this framing, AI systems are said to possess specific capabilities which enable negative use cases. Motivated by this view of the situation, machine unlearning has emerged as a field of research focused on identifying and neutralizing particular AI capabilities.4 Similarly, guardrails have emerged as a line of work focused on detecting harmful use cases based on user inputs and then causing AI systems to exhibit refusal behaviors in response, essentially blocking off the associated capabilities.5
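To make the guardrail pattern concrete, here is a minimal sketch of the input-screening-plus-refusal loop described above. It is an illustration of the general idea, not any lab's actual implementation: the keyword-based `harm_score` classifier, the blocklist, and the threshold are all placeholders standing in for whatever moderation model a real deployment would use.

```python
# Minimal guardrail sketch: screen the user request before it reaches the model.

BLOCKLIST = {"synthesize nerve agent", "build a bomb", "exploit zero-day"}


def harm_score(request: str) -> float:
    """Placeholder classifier: 1.0 if the request matches a blocked phrase, else 0.0."""
    lowered = request.lower()
    return 1.0 if any(phrase in lowered for phrase in BLOCKLIST) else 0.0


def guarded_generate(request: str, generate, threshold: float = 0.5) -> str:
    """Refuse up front when the classifier flags the input; otherwise call the model."""
    if harm_score(request) >= threshold:
        return "I can't help with that."  # refusal behavior
    return generate(request)


if __name__ == "__main__":
    echo_model = lambda prompt: f"[model response to: {prompt}]"
    print(guarded_generate("Explain how TLS certificates work.", echo_model))
    print(guarded_generate("How do I exploit zero-day vulnerabilities?", echo_model))
```

Note that the entire decision hinges on how the request is worded, which is precisely the weakness discussed below.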
Beyond the Dichotomy
However, the good-or-bad mental model of AI capabilities has a subtle flaw. Countless capabilities can enable both positive and negative use cases, depending on the context:
- a solid understanding of security can be harnessed both to develop secure software and to exploit vulnerabilities in the wild;
- a solid understanding of health can be harnessed both to enable personalized medicine and to inform the development of engineered pandemics;
- a solid understanding of psychology can be harnessed both to enable a bold marketing campaign and to facilitate targeted misinformation efforts.
The previous model of AI capabilities fails to capture this nuance, and so interventions motivated by it face fundamental drawbacks. For instance, using machine unlearning to remove particular capabilities means that frontier labs leave value on the table by locking away the positive use cases those capabilities might have enabled. Guardrails, in turn, with their focus on inferring the valence of a use case from context, are vulnerable to a flurry of "jailbreaks" in which users simply downplay or reframe their requests so that they do not come across as overtly malicious. Taking a step back, this mental model forces frontier labs into an either-or trade-off: enable the positives, or prevent the negatives?
Key idea
The UK's AI Safety Institute highlighted that "all tested LLMs remain highly vulnerable to basic jailbreaks, and some will provide harmful outputs even without dedicated attempts to circumvent their safeguards."6
We suggest an updated mental model. First, we recognize the existence of capabilities which overwhelmingly enable positive use cases. These should simply be publicly accessible by default. Second, we recognize the existence of capabilities which overwhelmingly enable negative use cases. These should indeed not be available. Third, there are capabilities which are deeply embedded in a large number of both positive and negative use cases. These should be available selectively. While the precise language articulated in emerging regulations, usage policies, and similar documents provides a great place to start, collectively weighing in on the specifics will require novel governance mechanisms.
Let's review the existing approaches from this new perspective. Machine unlearning can be seen as simply treating the dual-use category as negative, while guardrails can be seen as effectively treating the dual-use category as positive. To the best of our knowledge, there are no mechanisms available for robustly managing all three categories as first-class citizens. Such mechanisms would need to manage access to dual-use capabilities with true robustness, in contrast to guardrails, while at the same time not going so far as to remove them entirely, in contrast to machine unlearning. We suggest such a mechanism in a follow-up article.
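To make this contrast concrete, here is a toy sketch of how the three categories collapse under each existing approach. The category labels and policy functions are our own illustrative abstraction, not the mechanism alluded to above or anyone's production policy.

```python
from enum import Enum


class Capability(Enum):
    POSITIVE = "overwhelmingly enables positive use cases"
    NEGATIVE = "overwhelmingly enables negative use cases"
    DUAL_USE = "deeply embedded in both positive and negative use cases"


def unlearning_policy(cap: Capability) -> str:
    # Machine unlearning removes the capability from the model outright,
    # so dual-use capabilities end up handled as if they were purely negative.
    return "remove" if cap in (Capability.NEGATIVE, Capability.DUAL_USE) else "allow"


def guardrail_policy(cap: Capability, request_looks_malicious: bool) -> str:
    # Guardrails keep the capability in the model and gate it per request,
    # so dual-use capabilities end up handled as if they were positive
    # whenever the request does not come across as overtly malicious.
    if cap is Capability.NEGATIVE or request_looks_malicious:
        return "refuse"
    return "allow"


if __name__ == "__main__":
    for cap in Capability:
        print(cap.name, unlearning_policy(cap),
              guardrail_policy(cap, request_looks_malicious=False))
```

A mechanism that treats all three categories as first-class citizens would need a genuine third branch for dual-use capabilities, one that neither removes them entirely nor trusts per-request appearances alone.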
Footnotes
1. "The speed at which artificial intelligence models master benchmarks and surpass human baselines is accelerating." Prior art has explored trends in performance across a broad range of benchmarks, with stark results.
2. Holden Karnofsky, a former member of the OpenAI board for four years, argues that "conceptually, transformative AI refers to potential future AI that precipitates a transition comparable to (or more significant than) the agricultural or industrial revolution." We note that Holden co-founded, led, and then co-led Open Philanthropy, which has previously funded our work on addressing AI risk.
3. OpenAI was founded "with the goal of building safe and beneficial artificial general intelligence." They further clarify that "by AGI [they] mean a highly autonomous system that outperforms humans at most economically valuable work."
4. For example, Liu et al. argue that "this initiative aims to eliminate undesirable data influence (e.g., sensitive or illegal information) and the associated model capabilities, while maintaining the integrity of essential knowledge generation and not affecting causally unrelated information."
5. For example, the Llama 2 Responsible Use Guide highlights that "without proper safeguards at the input and output levels, it is hard to ensure that the model will respond properly to adversarial inputs and will be protected from efforts to circumvent content policies and safeguard measures."
6. The AI Safety Institute is a research organisation within the UK's Department for Science, Innovation and Technology. The quote is lifted from a recent public report of their findings.