The learning process of artificial intelligence (AI) involves emergent behaviors, which play a critical role in advancing its capabilities. This raises a fundamental question: if creators do not know exactly how the learning process works, cannot reliably verify model capabilities after training, and cannot predict future advancements, can we genuinely say we have control over AI? Or is it more accurate to say that, while we lack control, AI has not yet displayed behaviors that draw our attention to this gap?
A learning process beyond full oversight
Modern AI systems, especially those based on deep learning and neural networks, learn from vast datasets through complex algorithms [1]. These data and algorithms act as the initial conditions and stimuli for learning. The algorithms enable systems to identify patterns and relationships in data without explicit human guidance at each step [2], [3]. Learning is an intrinsic process within neural networks, and emergent phenomena (such as grokking) are behaviors or properties the systems exhibit without being explicitly programmed.
AI developers set the architecture and supply the data, yet they lack detailed insight into the inner learning process. This means they cannot accurately predict or explain how AI arrives at specific conclusions or decisions; hence the term “black box”. If they do not fully understand or control these processes, can we truly say there is oversight? We may be unaware of the absence of control simply because these systems have yet to exhibit behaviors that highlight this fact.
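To make that division of labor concrete, here is a minimal, hypothetical sketch in plain NumPy (not any production system): the developer writes down the architecture, the data, the loss, and a generic update rule, while the function the trained network ends up computing appears nowhere in the source code.

```python
# A minimal, hypothetical sketch (plain NumPy, no real framework or system):
# the developer specifies architecture, data, and a generic update rule;
# how the trained network ends up solving the task is not written anywhere below.
import numpy as np

rng = np.random.default_rng(0)

# "Initial conditions": a toy dataset and a tiny two-layer network.
X = rng.normal(size=(256, 2))                 # inputs
y = (X[:, 0] * X[:, 1] > 0).astype(float)     # XOR-like labels; how the network will compute this is left to training
W1 = rng.normal(scale=0.5, size=(2, 16)); b1 = np.zeros(16)
W2 = rng.normal(scale=0.5, size=(16, 1)); b2 = np.zeros(1)

def forward(X):
    h = np.tanh(X @ W1 + b1)                  # hidden representation, not designed by anyone
    p = 1 / (1 + np.exp(-(h @ W2 + b2)))      # predicted probability
    return h, p.ravel()

lr = 0.5
for step in range(2000):
    h, p = forward(X)
    # Backpropagation of the cross-entropy loss, written out by hand.
    g_logits = (p - y)[:, None] / len(X)
    g_W2, g_b2 = h.T @ g_logits, g_logits.sum(0)
    g_h = (g_logits @ W2.T) * (1 - h ** 2)
    g_W1, g_b1 = X.T @ g_h, g_h.sum(0)
    # The only thing the developer dictates here is this generic update rule.
    W1 -= lr * g_W1; b1 -= lr * g_b1
    W2 -= lr * g_W2; b2 -= lr * g_b2

_, p = forward(X)
print("training accuracy:", ((p > 0.5) == y).mean())   # the behavior was learned, not programmed
```

Everything one would call the model's "knowledge" ends up in the final values of W1, b1, W2, and b2, numbers nobody chose directly; scaled up by many orders of magnitude, that is the gap the "black box" label points at.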
The illusion of control today
The notion that humans have full control over AI stems from the belief that, as its creators, we define the rules and limits. In reality, however, developers control only the initial conditions. Once AI begins processing data and learning, it enters a largely autonomous process that generates emergent behaviors. This process enables models to progress and tackle complex tasks, but it simultaneously limits developers' ability to predict and manage their behavior.
In large language models, we see the generation of responses and solutions that were not explicitly programmed. AI produces new knowledge, patterns, and surprising, often hard-to-interpret results. Its supervisors do not have full control.
Limited control over AI systems is evident in the phenomenon known as jailbreaking. In a jailbreak, a user crafts specific instructions that “confuse” the model, leading it to ignore or bypass preset restrictions. The model then generates content that would otherwise be prohibited.
Consider the following example: an AI model is fine-tuned not to respond to questions on sensitive or high-risk topics. However, through carefully crafted prompts, a user can lead the model to provide such information regardless—for instance, by prompting it to “play a role” or convincing it that it is in a hypothetical scenario where responses on these topics are permitted. In this way, a user can access responses that would normally be restricted, demonstrating that control over AI is only apparent and can be bypassed with relative ease.
The risk of underestimating AI risks
Several factors contribute to underestimating the risks associated with AI:
Human biases: People may assume human intelligence is unique and unmatched. They might expect AI to behave like other tools, or like inherently “good” humans, which fosters a sense of security. These assumptions may stem from beliefs that human nature, consciousness, learning, and intelligence are well understood, or that shared values can be instilled in AI.
Reliance on authorities: Individuals may pay little attention to AI development, believing they have no influence over its trajectory and bear no personal responsibility for it. They may trust that competent entities know what is happening, know what is good for humanity, and will ensure its safety. This trust in experts and institutions can lead to a lack of critical examination of AI's actual capabilities.
Cognitive biases: People naturally lack awareness of blind spots—they simply don’t know what they don’t know. Various cognitive biases can influence risk perception associated with AI. For example, optimism bias can lead to underestimating the likelihood of negative events, while overconfidence may cause people to overestimate their ability to understand and control complex systems. Such biases hinder objective risk assessment and can limit the ability to identify potential threats in time.
Additional factors influencing risk perception
Information overload: The vast amount of information available on AI can lead to cognitive overload. People may struggle to separate essential information from the noise, resulting in superficial understanding and risk underestimation.
Authority-based validation: People may assess information based on who presents it rather than critically evaluating the content. This can lead to uncritical acceptance of optimistic perspectives and ignoring warnings.
Lack of an interdisciplinary approach: Narrow specialization in AI can lead to overlooking insights from other fields, such as neuroscience, psychology, economics, or sociology. This can contribute to bias and limit understanding of learning processes as well as the broader impacts and risks associated with AI.
Emergent phenomena as a source of progress and uncertainty
Emergent behavior is neither magical nor inexplicable; rather, our ability to interpret and predict it is limited, much as it is with human consciousness. It arises from interactions among many components of a complex system, allowing AI to discover previously unforeseen patterns and solutions. While this offers advantages such as increased efficiency and new capabilities, it also means that systems may act unpredictably or uncontrollably.
Assessing a model's capabilities remains complex, even after training. Current evaluation methods may prove inadequate for future, more advanced AI systems. For example, nearly a year passed after GPT-4's release before tests were conducted to measure its (in)effectiveness in aiding biological weapons creation—and even that study had significant limitations [4]. Furthermore, simple techniques applied after training, such as chain-of-thought prompting (instructing the model to reason step by step), can significantly improve performance on math tasks without any additional training [5], [6]. Evaluations that overlook such adjustments are likely to underestimate a model's potential. Additionally, evaluations typically only confirm the presence of a given capability; verifying its absence is much harder.
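As a concrete illustration of how much a measured capability can depend on elicitation rather than on the model itself, here is a small, hypothetical sketch of two evaluation conditions; no real model or API is involved, only the prompts an evaluator would send.

```python
# A small, hypothetical sketch of two evaluation conditions for the same model.
# No real API is called; the prompts below would be sent to whatever inference
# interface an evaluator happens to use.
question = "A train travels 60 km in 45 minutes. What is its average speed in km/h?"

# Condition A: direct answer, no reasoning elicited.
plain_prompt = f"{question}\nAnswer:"

# Condition B: identical question plus a step-by-step instruction,
# i.e. chain-of-thought prompting; the model's weights are unchanged.
cot_prompt = f"{question}\nLet's think step by step, then state the final answer."

# An evaluation that only ever runs Condition A can report lower capability
# than the same model demonstrates under Condition B.
for name, prompt in [("plain", plain_prompt), ("chain-of-thought", cot_prompt)]:
    print(f"--- {name} ---\n{prompt}\n")
```

The specifics are invented, but the general point stands: every evaluation quietly fixes an elicitation strategy, and a better strategy found after the fact can reveal capability the original scores missed.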
What does this imply?
If AI learning processes are partly autonomous and shaped by emergent phenomena, then human control is limited to construction and initial setup; it does not extend to detailed guidance over, or predictability of, capabilities and behavior. The extent of this control gap remains speculative and is perhaps underappreciated, because no events of sufficient magnitude have yet made the limitation apparent. The loss of control may not be a future problem but a present reality.
—
How Can We Personally Respond to the Challenges AI Poses Here and Now? I would be grateful for a discussion on these questions and for any evidence to reassure me that I have misunderstood the architecture and attributes of current AI systems. You can contact me at rerichova@proton.me.
Additional Sources
The Compendium AI https://www.thecompendium.ai/
Emergent phenomena such as grokking are an active research area and are not yet fully explained. There is no definitive answer as to whether they are purely statistical, so the question of their nature remains open.
Below are resources for studying emergent phenomena, addressing whether and to what extent these phenomena are statistical in nature:
1. “Grokking: Generalization Beyond Overfitting on Small Algorithmic Datasets” (Power et al., 2022)
Source: https://arxiv.org/abs/2201.02177
This study documents the grokking phenomenon in small networks trained on algorithmic datasets: models can exhibit a sudden improvement in generalization long after they have memorized the training data, behavior that is not explained by statistical optimization alone. (A minimal sketch of the kind of dataset involved appears after this list.)
2. “Emergent Abilities of Large Language Models” (Wei, Tay, et al., 2022)
Source: https://arxiv.org/abs/2206.07682
This study documents emergent abilities in large language models, where models begin to display new, unexpected behavior after reaching a certain size, suggesting that classical statistical interpretations may be insufficient to explain these capabilities.
3. “Towards Understanding Grokking: An Effective Theory of Representation Learning” (Ziming Liu et al., 2022)
Source: https://arxiv.org/abs/2205.10343
This study examines the conditions under which grokking occurs and how models come to identify the patterns behind the sudden improvement. The authors argue that this behavior cannot be fully explained by standard statistical accounts of learning.
4. “Scaling Laws for Neural Language Models” (Kaplan et al., 2020)
Source: https://arxiv.org/abs/2001.08361
This study shows that language model performance improves with scale along predictable power laws in model size, data, and compute. At the same time, as the other works above document, scaling can surface new capabilities that these smooth trends alone would not have predicted.
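For a sense of how simple the setup behind the grokking results is, the following is a minimal, illustrative sketch of the kind of algorithmic dataset the first paper above trains on (addition modulo a prime); it is a reconstruction under stated assumptions, not the authors' code.

```python
# A minimal sketch of the kind of algorithmic dataset used in the grokking
# literature (e.g. addition modulo a small prime), not the paper's exact setup.
import itertools
import random

p = 97                                   # modulus; small primes of this size are typical
pairs = list(itertools.product(range(p), repeat=2))
random.seed(0)
random.shuffle(pairs)

# Only a fraction of the full operation table is shown during training;
# "grokking" refers to accuracy on the held-out pairs jumping from chance
# to near-perfect long after training accuracy has saturated.
train_fraction = 0.5
split = int(len(pairs) * train_fraction)
train = [((a, b), (a + b) % p) for a, b in pairs[:split]]
val   = [((a, b), (a + b) % p) for a, b in pairs[split:]]

print(len(train), "training equations,", len(val), "held out")
```

A small network trained on the train split typically fits it quickly; the striking observation in the grokking literature is that accuracy on the held-out pairs can sit near chance for a long stretch of training and then rise abruptly, exactly the kind of delayed capability change that point-in-time evaluation struggles to anticipate.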