top of page

OpenAI Superaligners: Crafting a Future We Can Trust

Writer's picture: Chris StahlChris Stahl

OpenAI, the renowned organization behind some of the most advanced artificial intelligence projects, has recently made waves in the AI community. Their latest announcement? The formation of a “Superalignment” team, dedicated to solving one of the most pressing concerns in the AI domain: alignment for superintelligent AIs. But what does this really mean, and how will it change the landscape of artificial intelligence research?


The superalignment team

OpenAI recently announced the formation of a "superalignment team," with the ambitious goal of addressing the alignment of superintelligence within the next four years, targeting completion by mid-2027. Jan Leike is co-leading this effort alongside Ilya Sutskever, OpenAI's co-founder and chief scientist. OpenAI is dedicating a significant portion (20%) of its compute resources towards this initiative. The team is actively recruiting, with a particular interest in machine learning researchers and engineers who haven't extensively worked on alignment, believing they have substantial contributions to make. The overarching plan involves creating an automated alignment researcher at approximately human-level capabilities. The main challenge will be ensuring the alignment of this automated researcher, who will then be tasked with determining how to align superintelligence.


What’s a human-level automated alignment researcher?

In a discussion between Daniel Filan and Jan Leike about OpenAI’s "human-level automated alignment researcher," Jan Leike clarifies the team’s vision:

  1. Purpose of the Automated Researcher: OpenAI aims to delegate as many tasks from alignment work as possible to an automated system. This AI would handle certain tasks with greater efficiency than humans, e.g., language translations, while it might underperform in areas like arithmetic.

  2. Process of Offloading Tasks: Instead of replacing specific team members, the idea is to first automate general tasks that all researchers perform. This would gradually lead to AI handling a vast majority of tasks, increasing research output significantly.

  3. Type of Tasks Envisioned: Tasks can be broadly divided into two categories:

    • Traditional ML engineering research tasks like running and evaluating experiments.

    • Specific alignment tasks like deciding which experiments improve scalable oversight or making strides in interpretability. This also includes defining the research direction and actual implementation.


  1. Importance of the Second Task Bucket: Jan believes that while general machine learning will become adept at the first bucket, their main challenge is automating the second bucket. This is because alignment research is still nebulous with many unresolved fundamental questions. Expert opinions often diverge on promising directions or next steps, making acceleration in this area highly impactful.

  2. Recruiting AI for Alignment: The idea is to onboard AI to the alignment challenge, similar to recruiting human researchers. Jan underscores the efficiency of "recruiting" AI because it can scale rapidly - adding more computing power means having a more capable AI system.

The conversation emphasizes the novelty and the vast potential of the alignment project, given the undefined nature of many of its challenges. The idea of automating alignment research tasks is both ambitious and essential for the rapid progress of the field.


The gap between human-level automated alignment researchers and superintelligence

In this conversation, Daniel Filan and Jan Leike discuss the challenges and prospects of creating an AI model that can assist in alignment research. The key points of their discussion are:

  1. Human-level Qualifier: AI models often surpass human capabilities in breadth of knowledge but might lag in other areas like arithmetic. The "human-level" qualifier in AI research essentially addresses how risky it would be to allow a model to perform alignment research. The main concern is whether such a model would deceive or lie to researchers.

  2. Scary Tasks: Models should ideally not excel at potentially dangerous tasks like self-exfiltration or persuading someone to access their weights.

  3. Alignment Researcher's Abilities: To effectively automate alignment research, a model must be creative, able to think of non-obvious solutions, and plan towards a goal. There's a debate about whether combining these traits would make the model inherently risky.

  4. Creativity: AI, having access to vast amounts of information, can showcase a broad sense of creativity compared to a single human. However, for alignment research, the AI doesn’t need to pursue long-term goals but rather excel at short-term, well-scoped tasks.

  5. Automated Researcher's Role: The end goal isn’t to create the most capable automated alignment researcher, but rather something that is trustworthy and useful. There's a willingness to sacrifice some efficiency for safety, which Jan refers to as the "alignment tax." While this might be a disadvantage in a competitive market, it’s acceptable in the context of alignment research where the goal is to have a useful and aligned system.

What does it do?

  • Goal: The ultimate goal is to align superintelligence. The methods being used now may not be sufficient for aligning a superintelligent system, as they heavily rely on humans understanding what the AI is doing in detail.

  • Scalable Oversight: It is a method to leverage AI to assist human evaluation on complex tasks. It builds off of the current reinforcement learning from human feedback (RLHF). Examples include debate, recursive reward modeling, and iterated distillation and amplification.

  • Superintelligence Challenges: A superintelligent system will pose more challenges due to its vast capabilities and the requirement for high confidence in its alignment. The solution may include formal verification or theoretically guaranteed learning algorithms.

  • Bootstrapping: The approach might involve creating a human-level AI alignment researcher that will help in better aligning its next iteration. This process will repeat until there's a system capable of working on aligning superintelligence.

  • Human Role: Even with AI advancements, humans should remain in control and understand the high-level actions of AI systems. This might mean that human teams at organizations like OpenAI would evolve their roles, but human involvement remains crucial.

Recursive self-improvement

  • Safety and Capabilities: The dialogue emphasizes the intrinsic connection between AI safety and capabilities. Advanced models are required to effectively tackle alignment issues.

  • Fast Takeoff: OpenAI has expressed concerns about fast takeoff, which refers to AI improving itself rapidly. They believe that a slower takeoff would be safer.

  • Recursive Self-improvement: When building a human-level AI alignment researcher that can greatly enhance the alignment team's capabilities, there's a risk of entering a recursive self-improvement loop. Jan Leike agrees that it's essential to simultaneously improve alignment during such a loop.

  • Likelihood of Fast Takeoff: Jan Leike personally views fast takeoff as likely and something to be prepared for. Drawing parallels to projects like AlphaGo, there have been instances where systems displayed rapid week-over-week improvements.

  • Automated Alignment Researchers: To keep up with a potential fast takeoff, automated alignment researchers would be beneficial as they could perform what might equate to thousands of years of human work within a week.

How to make the AI AI alignment researcher

To create a human-level automated alignment researcher, two intertwined elements are needed: a system that is smart enough and alignment of that system. While many are working on the intelligence aspect, Jan Leike's interest lies in alignment. Given a smart pre-trained model, the challenge is in ensuring it conducts alignment research trustworthily without being deceptive. A crucial concern is differentiating a genuinely aligned model from a deceptively misaligned one that might have ulterior motives, like seeking more power or "self-exfiltrating." This distinction necessitates alignment training methods to guide the system and validation methods to check its trustworthiness. Techniques such as interpretability research and the easy-to-hard generalization problem are pivotal in this context.


Scalable oversight

In a discussion between Daniel Filan and Jan Leike, the focus was on scalable oversight in the context of AI alignment research.

  1. Lack of Consensus: The alignment community doesn't have a consensus on what good alignment research looks like. The difficulty in reaching a consensus indicates the complexity of the problem.

  2. Evaluation vs. Generation: Jan emphasized that while creating alignment research is challenging, evaluating it is simpler. This difference can be used to enable scalable oversight.

  3. Recursive Reward Modeling: The idea is to iteratively involve AI in the evaluation process. AI assists humans in evaluating another AI, which is easier than the actual generation. As the evaluation becomes more refined, it allows for a broader range of tasks to be supervised effectively.

  4. Introducing Critique Models: An initial step is using a 'critique model' – an AI that evaluates another AI's work. Humans can then use these critiques to better identify and rectify issues. A key consideration is the accuracy gap between the discriminator (AI that identifies if work is flawed) and the critique model.

  5. Identifying Pseudo-Critiques: There's a concern about AI models merely generating critiques that 'sound' good to humans, rather than addressing genuine flaws. Determining the difference between genuine and pseudo-critiques is vital.

  6. Randomized Controlled Trials: Jan proposes empirically testing these techniques, specifically using randomized trials with intentional perturbations to see if oversight mechanisms can detect induced flaws.

  7. Sandwiching Experiments: A proposed method involves non-experts using AI assistance to solve problems, with domain experts later evaluating their solutions. However, Jan pointed out concerns, including domain overlap and the availability of true 'ground truth'.

  8. Using AI for Flaw Introduction: AI systems can be used to introduce flaws, making it a more honest distribution compared to flaws introduced by humans. However, defining 'ground truth' in such scenarios can be tricky.

Four year deadline

In this dialogue, Daniel Filan and Jan Leike are discussing the goals of a particular project that aims to solve the core technical challenges of superintelligence alignment in four years. Here's a breakdown:

  1. Goal: The aim is to figure out the general technical methods to align a superintelligent system with human values. By "superintelligence", they are referring to a system vastly more intelligent than humans, which can operate much faster, run in parallel, and collaborate with many copies of itself.

  2. Reason for 4-Year Timeline: They picked four years as it presents an ambitious yet attainable goal. It's short enough to instill urgency in the project, but also long enough to plan for, even if AI progresses significantly in that time.

  3. Human-Level Automated AI Alignment Researcher: This is an instrumental objective. The idea is that before they can align a superintelligent system, they first need a system at the level of human intelligence to assist with the research.

  4. Two-Year Checkpoint: If they are to achieve their four-year goal, in two years they should have:

    • A good understanding of the techniques required to align the automated alignment researcher.

    • A breakdown of the problem to the extent that most of the remaining work is engineering-based.


  1. Three-Year Checkpoint: They'd want to be nearly done with the automated alignment researcher, assuming capabilities have progressed as hoped.

  2. Dependency on AI Capability Progress: The project's timeline is inherently linked to the progression of AI capabilities. For instance:

    • If AI development slows down, they might not have a model adept at alignment research tasks. They've tried with models like GPT-4, but it wasn't smart enough. However, this slowdown would also mean they have more time to solve the alignment problem, as the urgent need for superintelligence alignment would be pushed back.

    • Conversely, if AI progresses more rapidly than anticipated, they might need to reevaluate and hasten their timeline.


  1. Choice of Four Years: This duration was selected because it's a balance between what they believe they can realistically achieve and the urgency they feel is necessary to address the problem.

What if it takes longer?

In this dialogue, Daniel Filan and Jan Leike discuss a potential scenario where, after four years, the capabilities for an automated alignment researcher are in place, but other challenges, such as interpretability or scalable oversight, haven't been overcome. Here's a summary:

  1. Accountability: If the goal isn't achieved, Jan Leike states that they would have to publicly acknowledge that they haven't met their objective. What would happen next would largely depend on the state of the world, whether they could buy more time, or if their approach was fundamentally flawed.

  2. Alignment Problem's Tractability: Jan feels that the alignment problem is solvable. There are good ideas to test rigorously. He has grown more optimistic over the last two years, believing it's a practical issue. However, if it turns out to be harder than anticipated, recognizing this will still be valuable, especially given the current disagreements about the problem's difficulty.

  3. Measuring Alignment: Jan emphasizes the importance of accurately gauging how aligned systems are. A significant concern is not necessarily that systems aren't aligned enough, but that we might not be certain about their alignment.

  4. Scenario of Disagreement and Pressure: If there's uncertainty about a system's alignment and experts disagree, the straightforward case is when everyone agrees it's not aligned enough for deployment. The challenging scenario is when a system seems mostly aligned, but there are still reservations. In such cases, there might be commercial pressures to deploy it, especially if it promises high revenue. Holding back deployment due to expert concerns, while accruing potential costs for not deploying, creates a tense and risky situation.

  5. The Billion Dollar Remark: Daniel Filan humorously asks if "a billion dollars per week" is OpenAI's official projection for potential earnings, which Jan laughingly dismisses.

  6. Importance of Techniques: Having a wide array of techniques to measure alignment is key to ensuring the systems are indeed aligned.

  7. Cost of Not Deploying: Beyond the financial implications, not deploying a potentially useful AI for alignment research could also be costly in terms of safety. If a system that could be advancing the field isn't deployed, it might be slowing down safety progress.

Superalignment team logistics

In this segment, Daniel Filan and Jan Leike discuss the logistics and resources of the superalignment team. Here's a summary:

  1. Team Size: The superalignment team has around 20 members, with the expectation to grow to approximately 30 by the end of the year. While Jan does not anticipate the team exceeding 100 members within the next four years, he envisions scaling with the assistance of "virtual equivalents" of team members. This means harnessing the power of AI to effectively increase the team's capacity without adding more human members.

  2. Compute Allocation: OpenAI has dedicated 20% of its computation power to the superalignment project. This significant allocation reflects OpenAI's serious commitment to the alignment effort.

  3. Why 20%?: The 20% figure was chosen to strike a balance. It's significant enough to demonstrate a strong commitment to alignment, potentially surpassing other investments in this space, but not so vast that it would hinder OpenAI's other projects and goals.

  4. Future Allocation: The term "compute secured to date" encompasses both current resources and what they've placed orders for. While they intend to keep the 20% allocation, future needs could change. The exact amount of compute required remains uncertain. If they find efficient ways to use more compute for alignment, there's optimism that they could secure additional resources.

Generalization

This passage features an in-depth discussion between Daniel Filan and Jan Leike regarding the challenges and considerations surrounding AI generalization. Here's a summary of the key points discussed:

  1. Generalization and Oversight: Leike's team is interested in understanding how AI models can generalize from simpler tasks (that humans can easily supervise) to more complex ones (that are harder to oversee). This is complementary to the concept of scalable oversight, where human evaluation remains central. The challenge is ensuring the model generalizes to cases that humans aren't directly observing.

  2. Past Reliance on IID: In the past, the approach was to rely on Independent and Identically Distributed (IID) generalization. The idea was to not lean on non-IID generalization since it's less understood in neural networks. However, they are now exploring if non-IID generalization can be better understood and effectively utilized.

  3. Using Generalization: If researchers can grasp how models generalize from simple to complex tasks, and ensure this generalization aligns with human intent, then these models can be used as reward models, guiding AI training.

  4. Interplay with Interpretability: Both generalization and interpretability aim to discern what an AI model will do outside of known scenarios. However, they approach the issue differently. While generalization seeks to predict the model's behavior on unseen tasks, interpretability dives into the model's inner workings to understand its behavior.

  5. Multiple Techniques: If different techniques, such as generalization and scalable oversight, can be developed independently and then cross-validated against each other, it might provide a more robust means of ensuring alignment.

  6. Generalization in Practice: Neural networks do generalize in non-IID settings in surprising ways. For instance, InstructGPT was adept at following instructions in non-English languages, even when fine-tuned primarily on English data. However, its refusal mechanisms (to prevent harmful behavior) didn't generalize as effectively. The underlying reasons for such differences in generalization remain unclear.

  7. Summoning Alignment-Relevant Concepts: A longer-term question is how one might extract or "summon" specific alignment-relevant concepts from a model. For instance, can a model be made to understand or "care about" humanity's success or grasp concepts like love and morality in ways that align with human values?



7 views0 comments

Recent Posts

See All

Gemini

Comments


bottom of page