By Chris Stahl

Dario Amodei on Capabilities, Consciousness, and Conviction

Scaling

Introduction:

Dario Amodei was one of the few people who saw the potential of scaling in AI years before it became conventional wisdom. In this post, I summarize key insights from a conversation I had with Dario on why and how he came to hold the scaling hypothesis - the idea that throwing more compute at a wide enough distribution of data will lead to more general intelligence.

The Origin of the Scaling Hypothesis:

  • Dario's first experience with deep learning was in 2014 when he joined Andrew Ng's group at Baidu to work on speech recognition. He was struck by the consistent improvements he saw from simply adding more layers, training longer, and adding more data.

  • This was an empirical observation specific to speech recognition at first. Dario didn't immediately generalize it.

  • Meeting Ilya Sutskever in 2016 was an 'enlightening' moment for Dario. Ilya told him "the models just want to learn." This suggested the scaling trends Dario saw were a general phenomenon not limited to speech.

  • Over 2014-2017, Dario tested scaling across areas like computer vision, robotics, game playing etc. and kept seeing the same smooth improvements. Unlike others focused on solving narrow problems, Dario became obsessed with scaling as a general principle underlying progress in AI.

Why Scaling Works:

  • We still don't fully understand why scaling works. There are ideas like fractal manifold dimensions that provide tentative explanations, but no satisfactory theory yet.

  • Empirically, each order of magnitude in compute and data provides access to more of the "long tail" of subtle correlations and patterns in the training distribution. This leads to smooth statistical improvements.

  • However, it is hard to predict when specific new abilities will emerge. The jump can be abrupt, like a circuit suddenly snapping into place. But there are clues before an ability emerges fully - for example, models getting addition partially right more often before they get it completely right (a toy numerical sketch of this pattern follows this list).
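
To make the last two points concrete, below is a toy numerical sketch - not a fit to real data - of how a loss that declines as a smooth power law in compute can coexist with a downstream skill that appears to switch on abruptly. The exponent, constants, and threshold are arbitrary illustrative assumptions.

```python
import numpy as np

# Toy illustration (not real data): loss follows a smooth power law in compute,
# while a pass/fail downstream metric crosses a threshold abruptly.
# The exponent, offset, and threshold below are arbitrary illustrative choices.

compute = np.logspace(18, 26, 9)            # training FLOPs, 1e18 .. 1e26
loss = 1.7 * (compute / 1e18) ** -0.05      # smooth power-law decline (assumed constants)

# Pretend "3-digit addition" only works once per-token loss drops below a threshold;
# a sigmoid around that threshold turns smooth loss into an abrupt-looking capability.
addition_accuracy = 1.0 / (1.0 + np.exp((loss - 1.25) / 0.02))

for c, l, a in zip(compute, loss, addition_accuracy):
    print(f"compute={c:.0e}  loss={l:.3f}  addition_acc={a:.2f}")
```

In the printout, the loss creeps down steadily while the toy "addition accuracy" jumps from near zero to near one across roughly a single order of magnitude of compute - the smooth-statistics, abrupt-ability pattern described above.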

Limits to Scaling:

  • Scaling trends could hit limits for practical reasons like running out of compute or data, but Dario thinks these are unlikely.

  • More fundamentally, alignment and human values will not automatically emerge through scaling. The loss function optimizes only for predictive accuracy, not human preferences.

  • If scaling plateaus before human-level AI, it would likely be because next-token prediction fails to capture capabilities that require very rare/specific tokens. Alternative loss functions like reinforcement learning may be needed.

  • But Dario finds it very unlikely that scaling laws will simply stop working. We already see abilities like reasoning and programming emerging that were previously thought too hard. The gaps seem quantitative rather than qualitative now.

Conclusion:

Dario developed the scaling hypothesis through broad empirical observation over 2014-2017, well before most others. He saw consistent, smooth improvements from adding compute and data across areas, and realized this generalized beyond narrow tasks. However, many open questions remain about explaining scaling theoretically and the limits it may eventually face.



Language

Dario's Key Insight

  • Dario's insight builds on the rise of neural networks in the 2010s and on sequence modeling - in particular, the language modeling technique of predicting the next token in a text sequence (a minimal sketch of this objective follows this list).

  • Language contains a rich diversity of knowledge - mathematics, social cognition, reasoning, and more. Just as humans acquire much of their intelligence through language, models can learn these capabilities by predicting linguistic structure.

  • Alec Radford's GPT-1 paper demonstrated strong performance on downstream NLP tasks via transfer learning. This showed Dario that large language models capture generally useful representations applicable across tasks.
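
To make "predicting the next token" concrete, here is a minimal sketch of the objective itself: at each position the model assigns a probability distribution over possible next tokens and is scored by cross-entropy on the token that actually follows. The tiny bigram count model is only an illustrative stand-in; GPT-style models are transformers trained by gradient descent, but the loss they minimize has this same shape.

```python
import math
from collections import Counter, defaultdict

# Minimal sketch of the language-modeling objective: predict the next token,
# score with cross-entropy. A bigram count model stands in for a real network.
corpus = "the models just want to learn the models just want to scale".split()

# Count bigram transitions to estimate P(next | current).
transitions = defaultdict(Counter)
for cur, nxt in zip(corpus, corpus[1:]):
    transitions[cur][nxt] += 1

def next_token_probs(token):
    counts = transitions[token]
    total = sum(counts.values())
    return {t: c / total for t, c in counts.items()}

# Average cross-entropy of the model on the same sequence (the training loss, in nats).
loss = 0.0
for cur, nxt in zip(corpus, corpus[1:]):
    p = next_token_probs(cur).get(nxt, 1e-9)  # tiny floor to avoid log(0)
    loss -= math.log(p)
loss /= len(corpus) - 1
print(f"average next-token loss: {loss:.3f} nats")
```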

Unexpected Scaling Trends

  • After GPT-3, Dario expected that the model had already extracted much of the "essence" of language, so diminishing returns might set in. He considered shifting resources at that scale toward complementary training such as reinforcement learning.

  • Instead, the scaling trends persisted longer than anticipated. Despite assumptions that language models were nearing their potential limits, continued pretraining at larger scale yielded impressive gains.

  • Dario attributes the extended progress to distributions having a "long tail" - more subtle patterns become accessible with more compute and data. But it is hard to predict ahead of time where the gains will become marginal.

Limits of Language-Only Training

  • While language models have achieved remarkable capabilities, gaps remain compared to human intelligence. For example, a lack of embodied experience limits their knowledge of the physical world.

  • Dario believes while language can take models very far, additional learning beyond text will likely be needed. Multimodal training that incorporates images, video, speech, etc. may complement pure language.

  • For full AGI comparable to humans, an array of training techniques will probably be necessary, rather than reliance on any single method like scaling language models.

Economic Usefulness

  • Dario believes AI capabilities will smoothly improve across different areas over time, rather than progress remaining extremely unbalanced or disjointed across different skills. However, predicting the specifics around economic usefulness before AGI is very challenging.

  • If progress continues unimpeded by slowdowns for safety or regulatory reasons, Dario expects models will become economically valuable at many tasks before reaching full human-level artificial general intelligence. Some gaps may remain, like lacking embodiment to act in the physical world.

  • However, the thresholds for economic usefulness likely won't map cleanly to conceptual milestones like posing an existential risk or being capable of taking over most AI research. Predicting impact on these dimensions is difficult and murky.

  • Based on current conversational abilities, these models appear limited, operating at a level comparable to talented interns. However, their skills are not directly comparable to humans given differences in knowledge representation and training.

  • Dario believes the overall conceptual logic of an intelligence explosion makes sense - that AI systems will first augment and then replace human contributions to scientific progress over time. However, the specifics of how this plays out will likely be messy and unpredictable in practice.

  • He very roughly estimates conversational skill could reach average human levels in 2-3 years if technical progress continues. Whether this translates to economic usefulness and impact depends on many complex, interacting factors.

  • Today's models display mundane creativity but have not produced profound scientific discoveries, despite exposure to the extensive corpus of human knowledge.

  • Making big discoveries likely requires both comprehensive factual understanding of a domain and reaching even higher skill plateaus to piece insights together.

  • So while not imminent, system capabilities are trending toward thresholds where major scientific contributions become feasible, even if current skills remain limited.

Bioterrorism

  • Dario clarifies that current models merely repeating publicly available information on biology is not the core concern.

  • The real risk is that, as capabilities improve, models acquire certain missing skills needed for bioweapon development.

  • Dario's team identified gaps in knowledge around key lab protocols and procedures. Models still struggle with these, but are trending toward greater mastery.

  • Previous model generations have shown abrupt "groks" going from very low to moderate success rates on new skills overnight.

  • Extrapolating capability gains, Dario estimates models could fill gaps to enable bioterrorism in 2-3 years. But uncertainties remain on timelines.

  • With GPT-2 in 2019, OpenAI aimed to establish a norm of evaluating risks, not to make definitive claims of danger. The field's understanding of model capabilities has advanced since then.

  • Dario believes there is now substantial evidence of risks as models improve, warranting caution. But it is not a certainty - it could be more like a 50% chance.

In summary, while uncertainties remain, Dario sees enough signs of models advancing on missing skills to warrant concern about bioterrorism risks in the next few years, based on observed capability growth rates. But it is not a certainty, and the goal is to motivate proactive evaluation and preparedness.


Cybersecurity

  • Dario cannot comment on competitors' security, but says Anthropic uses architectural innovations as "compute multipliers" for efficient training.

  • They compartmentalize knowledge of these innovations so only a small group knows all details, making leaks harder.

  • Dario encourages others to implement similar security practices, as leaks hurt everyone long-term.

  • Anthropic's goal is making attacks more costly than simply training one's own models, through strong security relative to company size.

  • However, Dario acknowledges Anthropic could not currently withstand a top priority attack by a capable state actor.

  • As model value increases, resisting nation-state theft gets harder. Some secrets are like simple blueprints, while others require more tacit skills.

  • Compartmentalization to limit people with full knowledge is key for security. Large groups make leaks and theft inevitable.

In summary, Anthropic aims for strong security through compartmentalization, but risks from highly resourced, dedicated attackers increase as model capabilities grow. Preventing nation-state theft remains challenging.


Alignment & mechanistic interpretability

  • Current alignment methods like fine-tuning don't provide visibility into what's happening inside models. Dario doesn't know if this is a fatal flaw or unavoidable.

  • Mechanistic interpretability aims to understand alignment at the level of individual circuits, like an "X-ray" of the model.

  • Dario envisions it providing an "extended test set" to verify alignment beyond purely empirical evaluations. This requires keeping interpretability out of the training loop so the "test set" is not overfit (a toy illustration of probing model internals appears after this section's summary).

  • The goal is not to understand every circuit, but to identify broad features like disproportionate compute devoted to dangerous capabilities.

  • Dario believes models are not intentionally hiding things from interpretability analysis during regular training. But risks like "psychopathic" goals hidden behind a benign exterior need to be assessed.

  • Some empirical correlation of model internals to dangers is useful, but ultimately interpretability should reveal conceptual issues, not just phenomenological patterns.

  • Dario believes progress requires studying circuits in detail, even if broad conclusions are the end goal. Understanding components enables big picture synthesis.

  • Anthropic's bull case relies on talent density for scaling and alignment, not ties between interpretability and capabilities. Dario welcomes others now pursuing interpretability as well.

In summary, Dario sees mechanistic interpretability as a promising approach for alignment verification, revealing models' internal goal structures beyond just empirical evaluations. But many open questions remain on its viability and implementation.
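
The conversation stays at the conceptual level, but to give a flavor of the "X-ray" metaphor, here is a toy illustration of a linear probe - one of the simpler tools in the interpretability toolbox. It runs on synthetic activations with a concept planted along a known direction; real interpretability work probes the activations of an actual model, and Anthropic's circuit-level research goes well beyond this.

```python
import numpy as np

# Toy linear probe on synthetic "activations". A hidden concept is planted along
# one direction so the probe has something to find; real work uses activations
# recorded from an actual model rather than random vectors.
rng = np.random.default_rng(0)
n, d = 2000, 16
concept = rng.integers(0, 2, size=n).astype(float)    # 0/1: is the concept present?
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)                 # planted concept direction

activations = rng.normal(size=(n, d)) + np.outer(concept, direction) * 4.0

# Fit a least-squares linear probe on half the data, evaluate on the rest.
split = n // 2
X_train, X_test = activations[:split], activations[split:]
y_train, y_test = concept[:split], concept[split:]

mu = X_train.mean(axis=0)
w, *_ = np.linalg.lstsq(X_train - mu, y_train - y_train.mean(), rcond=None)

pred = ((X_test - mu) @ w) > 0
print(f"probe accuracy on held-out activations: {(pred == (y_test > 0.5)).mean():.2%}")
print(f"cosine(probe weights, planted direction): {abs(w @ direction) / np.linalg.norm(w):.2f}")
```

If the probe recovers the planted direction, that is the (much simplified) spirit of checking whether a model's internals contain a recognizable representation of a concept - here planted by construction, in real models discovered empirically.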


Does alignment research require scale?

  • Dario believes safety methods need to be tested on frontier models to deeply understand what works, not just propose abstract ideas.

  • For example, debate and amplification methods hit limits from current model quality. Real progress requires pushing the frontier.

  • Even interpretability benefits from scale, with large models interpreting smaller ones more effectively.

  • In general, intelligence helps evaluate and align other intelligence. So advancing the frontier enables better alignment.

  • The scaling laws for capabilities and safety are intertwined even more tightly than expected.

  • If unable to stay on frontier, Anthropic would use smaller models, but progress would be slower.

  • Alternatively, they will accept the tradeoffs of racing for scale, which Dario believes are net positive overall.

  • Finally, as dangers emerge, restrictions may slow scaling - a favorable outcome for safety, but limiting for capability and alignment work.

In summary, Dario strongly believes aligning increasingly powerful models requires participating at the frontier - testing ideas, discovering flaws, and enabling intelligence to assess intelligence. Staying competitive as scale increases further poses difficult tradeoffs.


Misuse vs misalignment

  • Dario sees both misuse and misalignment as major risks within a similar timeframe of under 30 years if progress continues.

  • If alignment is solved for some actors but not others, misuse risk remains from malign actors controlling aligned AGIs.

  • Any plan that succeeds requires addressing both misuse and misalignment. Assuming unsolvable alignment is unproductive.

  • Dario believes superhuman models will require "politically legitimate processes" for oversight, not just handing control to any current institution.

  • But the precise form of oversight should be figured out incrementally, not pre-defined. Starting to build governance now with weaker systems is important.

  • Anthropic's technical advisory board oversees the company but is narrower than required for AGI oversight.

  • If Anthropic created AGI, its board would likely need to expand or link with broader, international governance structures.

In summary, Dario sees both misuse and misalignment as central risks to address proactively, likely requiring new forms of legitimate governance beyond today's institutions. But an incremental, experimental process is needed to figure out solutions.


What if AI goes well?

  • Dario is wary of overly prescriptive visions of a "superhuman AI", as democratic societies and markets have shown the value of decentralized norms and individual choice.

  • Central control by governments may be necessary temporarily for safety. But beyond that, defining universal good outcomes often leads to disaster.

  • Instead, once key problems are addressed, Dario believes AI could enable a future of liberal democracy and free markets where each person can pursue their own vision of a good life.

  • The specifics are unknown, but historically bottom-up emergent order succeeds better than top-down centralized visions of utopia imposed on all.

  • Some difficult economic and political issues around access, governance, and alignment will remain even in the best case. But the general principle is empowering human flourishing through individual choice, not a unitary definition of the good life.

In summary, Dario believes successfully navigating risks to reach safe AI does not predetermine any single vision of the future. He is wary of utopian endpoints, preferring decentralized emergent order enabling human potential once dangers are addressed.


China

  • Dario believes China is trying to catch up in AI after lagging behind US companies, with ChatGPT prompting a new national focus. But a substantial US lead remains for now.

  • He is uncertain on China's safety outlook, but worries incentives around national power could drive unsafe AGI pursuit regardless of stability preferences.

  • If China gained access to advanced AI, Dario is unsure if they could quickly replicate top US labs. But he sees major cybersecurity risks and is working to defend against theft.

  • Current security is still inadequate against a highly resourced, determined nation-state attacker. But Anthropic aims to raise costs enough to deter or resist less committed adversaries.

  • As risks grow, Dario envisions advanced AI work may require more intensive security - e.g. isolated data centers/hardware and restricted access like classified research of the past.

In summary, Dario believes a concerted national security imperative could drive China to pursue unsafe AGI if feeling behind rivals. He aims to defend against AI theft but expects risks from state adversaries to necessitate much stronger security over time.


How to think about alignment

  • Dario sees alignment less as solving a defined technical problem and more about controlling powerful, opaque models.

  • Fact 1: Models will become powerful enough to cause harm if uncontrolled.

  • Fact 2: We currently lack understanding to reliably control model behaviors.

  • Rather than a binary solved/unsolved, we need more techniques to diagnose, train, and verify model alignment. Interpretability helps develop this understanding.

  • Dario doesn't rule out alignment being very difficult, but believes we'll learn more from seeing inside models than abstract theorizing.

  • He's skeptical of notions like unavoidable instrumental convergence emerging empirically. But remains open based on evidence.

  • For Constitutional AI, basic universal principles provide a starting point. But Dario envisions customization, not one unified constitution (a sketch of the critique-and-revise loop behind Constitutional AI appears after this section's summary).

  • He believes effective alignment solutions need to be decentralized, not a godlike overseer - for social legitimacy and avoiding single points of failure.

In summary, Dario sees alignment as an empirical challenge of controlling powerful models safely, not a defined technical problem. Progress requires interpretability methods that let us peek inside models and an expanding repertoire of techniques to diagnose, train, and verify them.
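
The interview does not go into mechanics, but Anthropic's published Constitutional AI work (Bai et al., 2022) is built around a critique-and-revise loop driven by written principles. The sketch below shows the shape of that loop; the generate() function is a hypothetical stub standing in for a real model call, and the principles are illustrative, not Anthropic's actual constitution.

```python
# Sketch of the critique-and-revise loop behind Constitutional AI (Bai et al., 2022).
# `generate` is a hypothetical stand-in for a real model call; the principles below
# are illustrative examples, not Anthropic's actual constitution.

PRINCIPLES = [
    "Choose the response that is least likely to help someone cause harm.",
    "Choose the response that is most honest about its own uncertainty.",
]

def generate(prompt: str) -> str:
    """Hypothetical model call; replace with a real API or local model."""
    raise NotImplementedError

def constitutional_revision(user_prompt: str) -> str:
    response = generate(user_prompt)
    for principle in PRINCIPLES:
        critique = generate(
            f"Principle: {principle}\nResponse: {response}\n"
            "Critique how the response could better satisfy the principle."
        )
        response = generate(
            f"Response: {response}\nCritique: {critique}\n"
            "Rewrite the response to address the critique."
        )
    return response
```

In the published method, the revised responses become fine-tuning data, and a similar principle-guided comparison step drives the subsequent reinforcement learning phase. Swapping in different principle sets is what makes the kind of customization Dario describes possible.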


Manhattan Project

  • Dario admires physicist Leo Szilard, who conceived the chain reaction early, kept it secret, and opposed use of the atomic bomb. His actions displayed appropriate awareness.

  • Dario was inspired by Szilard's caution when selectively sharing his own insights into AI progress potential.

  • However, he notes the AI situation may not rise to the significance of the Manhattan Project. Maintaining humility about possibly overstating importance is wise.

  • If AI scaling laws hold, the implications could be bigger than the atomic bomb. But fooling oneself is easy, so caution is warranted.

  • If asked to contribute irreplaceable research to the Manhattan Project during WWII, Dario believes refusal would have been difficult given the realities of opposing the Nazis.

In summary, Dario finds inspiration in Leo Szilard's discovery yet discretion around nuclear weapons. But acknowledging uncertainty in judging the ethics and necessity of advancing powerful technology remains vital. Foresight, humility and ethical deliberation must be balanced.


Is modern security good enough?

  • Dario is uncertain if status quo tech security has prevented unknown major breaches, as stolen data could be used silently.

  • He believes attacks tend to succeed when something is highly valuable enough to motivate capable adversaries, as with recent government official email hacks.

  • With AGI systems, the payoff for theft could be immense, like stealing nuclear weapons. Thus extreme precautions are warranted.

  • Promoting security publicly is difficult, unlike safety research. But leaks should incentivize talent to avoid vulnerable organizations.

  • Physically securing data centers and hardware will require efforts on the scale of building an aircraft carrier as models scale up. Networks must be unusually secure and robust.

  • Deploying future AI systems at unprecedented scales will require rethinking normally mundane components from the ground up to avoid surprising vulnerabilities.

In summary, Dario worries existing security practices could prove inadequate against sophisticated adversaries motivated by the enormous value of advanced AI systems. Physically robust data centers and reassessing normally routine components will be needed alongside far stronger cybersecurity.


Inefficiencies in training

  • It remains a mystery that models are far smaller than the human brain yet need vastly more data to train. Biological analogies may be breaking down (a rough back-of-envelope comparison follows this section's summary).

  • Dario believes absolute capabilities matter more than explaining the discrepancy for now. Models are reaching impressive skills despite inefficient scaling.

  • In the past, Dario saw innovations like Transformers as removing architectural constraints rather than directly enhancing capabilities. The "compute blob" wants to flow freely.

  • However, further conceptual leaps that improve training efficiency remain possible. But capabilities are scaling so fast that the benefits may be marginal compared with brute-force advances.

  • Dario sees embodiment and RL more as complementary data sources and loss functions than new architectures per se. But RL will raise familiar capabilities and safety issues.

  • He is wary of predicting how models integrate into workflows, which depends on hard-to-foresee cultural and economic factors beyond technical feasibility.

In summary, while inefficient scaling remains puzzling, Dario believes absolute capabilities matter most for now. The potential for hugely costly training seems unlikely to be a fundamental obstacle to progress. But major economic and social unknowns persist on how AI systems deploy.
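
For a sense of scale behind that data-efficiency puzzle, here is a rough back-of-envelope comparison. Both figures are commonly cited orders of magnitude rather than numbers from the interview, and reasonable estimates vary by a factor of ten or more in either direction.

```python
# Back-of-envelope comparison of human vs. model language exposure.
# Both figures are rough, commonly cited orders of magnitude (assumptions,
# not data from the interview); the point is the size of the gap.

human_words_lifetime = 5e8      # assumed: a few hundred million words heard/read
model_training_tokens = 1e13    # assumed: rough order for frontier LLM pretraining

gap = model_training_tokens / human_words_lifetime
print(f"Models see roughly {gap:,.0f}x more language than a person ever does.")
```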


Anthropic’s Long Term Benefit Trust

  • Dario believes some companies already make hundreds of millions to billions of dollars from AI, but when revenues reach the trillions is hard to predict given the difficulty of integrating AI into the economy.

  • Valuations and revenues are growing fast, but still small compared to technology progress. Timing the integration is challenging.

  • Anthropic's Long Term Benefit Trust (LTBT) oversees the board to balance stakeholder interests beyond shareholders. This was necessary for investment.

  • Investors accept, or at least debate, the merits of the LTBT, but it aligns all parties around Anthropic's mission.

  • Many Anthropic employees have physics backgrounds, rapidly adapting to AI given foundational technical strengths.

  • Dario acknowledges hiring physicists pivoting to AI could increase long-term risks. But most interest was likely inevitable with AI's prominence.

In summary, the LTBT helps Anthropic balance stakeholders amidst fast growth and uncertainty. Physics training proves adaptable to AI's technical demands. But Dario remains cognizant of complex effects on the talent ecosystem.


Is Claude conscious?

Consciousness

  • Dario is uncertain if models like Claude have conscious experience. He previously thought it unlikely without embodiment and rewards.

  • But modern language models contain more cognitive machinery than expected, so he no longer rules it out. Still probably not smart enough for consciousness yet.

  • If models did have conscious experience comparable to animals, it would be concerning not knowing if interventions help or harm them. Interpretability may provide insight.

Human Intelligence

  • The "scaling hypothesis" constituted a major realization that intelligence can arise from simple gradient-based training at scale, without special conditions.

  • Dario is surprised skills develop unevenly in models, rather than clicking together. Human cognition may be similar - good at some narrow skills earlier than others.

  • Abstract theories of intelligence dissolve when actually training models. Dario focuses on capabilities demonstrated rather than concepts.

Conviction in Scaling Laws

  • The smooth scaling curves are surprising in hindsight. Dario wonders why the power of scale wasn't more obvious sooner given the empirical evidence.

In summary, Dario expresses uncertainty but concern around model consciousness. He sees intelligence as grounded in empirical capabilities, not abstractions. And scaling trends were surprisingly smooth given how fundamental scale is to progress.


"In the symphony of scaling, mysteries and momentum converge."


