Scaling False Peaks

Human beings are notoriously poor at judging distances. We tend to underestimate, whether it's the distance along a straight road with a clear run to the horizon or the distance across a valley. When ascending toward a summit, the estimate is further confounded by false summits. What you thought was your goal and endpoint turns out to be a lower peak, or simply a contour that, from below, looked like a peak. You thought you had made it, or were at least close, but there is still a long way to go.

The history of AI is a history of marked progress, but it is also the history of (many) false summits.



In the 1950s, automatic translation of Russian into English was thought to be no more complex than dictionary lookups and templated sentences. Natural language processing has come a very long way since then, having burnt through a number of paradigms to arrive at something we can use on a daily basis. In the 1960s, Marvin Minsky and Seymour Papert proposed the Summer Vision Project for undergraduate students: connect a TV camera to a computer and identify objects in the field of view. Computer vision is now something that is commodified for specific tasks, but it remains a work in progress and, worldwide, has taken more than a few summers (and AI winters) and many more than a few undergraduates.

We can find many more examples across many more decades that reflect naivety and optimism and, if we are honest, no small amount of ignorance and hubris. The two general lessons to learn here are not that machine translation involves more than lookups and that machine vision involves more than edge detection, but that when we are confronted by complex problems in unfamiliar domains, we should be cautious of anything that looks simple at first sight, and that when we have successful solutions for a specific slice of a complex domain, we should not assume those solutions generalize. This kind of humility is likely to deliver more meaningful progress and a more measured understanding of that progress. It is also likely to reduce the number of pundits in the future mocking past predictions and ambitions, along with the recurring irony of machine-learning experts who seem unable to learn from the past trends in their own field.

All of which brings us to DeepMind's Gato and the claim that the summit of artificial general intelligence (AGI) is within reach. The hard work has been done and reaching AGI is now a simple matter of scaling. At best, this is a false summit on the right path; at worst, it is a local maximum far from AGI, which lies along a very different route in a different range of architectures and thinking.

DeepMind's Gato is an AI model that can be taught to carry out many different kinds of task based on a single transformer neural network. The 604 tasks Gato was trained on vary from playing Atari video games to chat, from navigating simulated 3D environments to following instructions, from captioning images to real-time, real-world robotics. The noteworthy result is that all of these are handled by a single model trained across all tasks, rather than by different models for different tasks and modalities. Learning how to ace Space Invaders does not interfere with or displace the ability to carry out a chat conversation.
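To make that idea concrete, here is a minimal, illustrative sketch in Python (emphatically not DeepMind's code) of the serialization trick a single-transformer, multitask setup relies on: text, image patches, and actions are all mapped into one shared token vocabulary so that a single sequence model can consume any mix of them. The vocabulary sizes, binning, and function names are assumptions chosen for illustration only.

```python
# Conceptual sketch only: every modality is flattened into one token sequence
# so a single transformer can be trained across tasks. All sizes are assumed.
from typing import List

TEXT_VOCAB = 32_000   # assumed text vocabulary size
IMAGE_CODES = 1_024   # assumed number of discrete image-patch codes
ACTION_BINS = 1_024   # assumed discretization of continuous actions

def tokenize_text(words: List[str]) -> List[int]:
    # Stand-in for a real subword tokenizer.
    return [hash(w) % TEXT_VOCAB for w in words]

def tokenize_image(patches: List[float]) -> List[int]:
    # Map normalized patch values into their own offset token range.
    return [TEXT_VOCAB + int(p * (IMAGE_CODES - 1)) for p in patches]

def tokenize_actions(actions: List[float]) -> List[int]:
    # Uniformly bin normalized continuous actions into a third range.
    offset = TEXT_VOCAB + IMAGE_CODES
    return [offset + int(a * (ACTION_BINS - 1)) for a in actions]

def build_sequence(instruction, patches, actions) -> List[int]:
    # One flat sequence: the same weights see every task and modality.
    return tokenize_text(instruction) + tokenize_image(patches) + tokenize_actions(actions)

print(build_sequence(["move", "left"], [0.1, 0.9, 0.5], [0.25, 0.75]))
```

The point is not the detail but the design choice: once everything is a token, another task is just more sequence, which is precisely why scaling looks so tempting.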

Gato was intended to "test the hypothesis that it is possible to train an agent that is generally capable across a large number of tasks; and that this general agent can be adapted with little extra data to succeed at even more tasks." In this it succeeded. But how far can this success be generalized in terms of loftier ambitions? The tweet that provoked a wave of responses (this one included) came from DeepMind's research director, Nando de Freitas: "It's all about scale now! The Game is Over!"

The game in question is the quest for AGI, which is closer to what science fiction and the general public think of as AI than the narrower but applied, task-oriented, statistical approaches that constitute commercial machine learning (ML) in practice.

The claim is that AGI is now simply a matter of improving performance, both in hardware and software, and making models bigger, using more data and more kinds of data across more modalities. Sure, there's research work to be done, but now it's just a question of turning the dials up to 11 and beyond and, voilà, we'll have scaled the north face of AGI to plant a flag on the summit.

It’s easy to get out of breath at high altitude.

When we look at other systems and scales, it's easy to be drawn to superficial similarities in the small and project them onto the large. For example, if we look at water swirling down a plughole and then out into the cosmos at spiral galaxies, we see a similar structure. But these spirals are more closely bound in our desire to see connection than they are in physics. In looking at the scaling of specific AI to AGI, it's easy to focus on tasks as the basic unit of intelligence and ability. What we know of intelligence and learning systems in nature, however, suggests the relationships between tasks, intelligence, systems, and adaptation are more complex and more subtle. Simply scaling up one dimension of ability may simply scale up one dimension of ability without triggering emergent generalization.

If we look closely at software, society, physics, or life, we see that scaling is usually accompanied by fundamental shifts in organizing principle and process. Each scaling of an existing approach succeeds up to a point, beyond which a different approach is needed. You can run a small business using office tools, such as spreadsheets, and a social media page. Reaching Amazon's scale is not a matter of bigger spreadsheets and more pages. Large systems have radically different architectures and properties to the smaller systems they are built from or the simpler systems that came before them.

It may be that artificial general intelligence is a far more significant challenge than taking task-based models and increasing data, speed, and number of tasks. We typically underappreciate how complex such systems are. We divide and simplify, make progress as a result, only to discover, as we push on, that the simplification was just that; a new model, paradigm, architecture, or schedule is needed to make further progress. Rinse and repeat. Put another way, just because you got to base camp, what makes you think you can make the summit using the same approach? And what if you can't even see the summit? If you don't know what you're aiming for, it's difficult to plot a course to it.

Instead of assuming the answer, we need to ask: how do we define AGI? Is AGI simply task-based AI for N tasks and a sufficiently large value of N? And, even if the answer to that question is yes, is the path to AGI necessarily task-centric? How much of AGI is performance? How much of AGI is big/bigger/biggest data?

When we look at life and existing learning systems, we learn that scale matters, but not in the sense suggested by a simple multiplier. It may well be that the trick to cracking AGI is to be found in scaling, but down rather than up.

Doing more with less looks to be more important than doing more with more. For example, the GPT-3 language model is based on a network of 175 billion parameters. The first version of DALL-E, the prompt-based image generator, used a 12-billion-parameter version of GPT-3; the second, improved version used only 3.5 billion parameters. And then there's Gato, which achieves its multitask, multimodal abilities with only 1.2 billion.
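For a rough sense of what those parameter counts mean in practice, the following back-of-envelope sketch compares the models mentioned above; the 2-bytes-per-parameter (fp16) memory figure is an assumption used purely for illustration.

```python
# Back-of-envelope comparison of the parameter counts quoted above.
models = {
    "GPT-3":    175_000_000_000,
    "DALL-E":    12_000_000_000,
    "DALL-E 2":   3_500_000_000,
    "Gato":       1_200_000_000,
}

for name, params in models.items():
    fp16_gb = params * 2 / 1e9          # assumed 2 bytes per parameter
    ratio = params / models["Gato"]
    print(f"{name:9s} {params / 1e9:6.1f}B params  ~{fp16_gb:4.0f} GB (fp16)  {ratio:6.1f}x Gato")
```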

These reductions hint at a direction, but it's not clear that Gato, GPT-3, or any other contemporary architecture is necessarily the right vehicle to reach the destination. For example, how many training examples does it take to learn something? For biological systems, the answer is, in general, not many; for machine learning, the answer is, in general, very many. GPT-3, for example, developed its language model based on 45 TB of text. Over a lifetime, a human reads and hears on the order of a billion words; a child is exposed to tens of millions of words before starting to talk. Mosquitoes can learn to avoid a particular pesticide after a single non-lethal exposure. When you learn a new game, whether video, sport, board, or card game, you generally only need to be told the rules and then play, perhaps with a game or two for practice and rule clarification, to make a reasonable go of it. Mastery, of course, takes far more practice and dedication, but general intelligence is not about mastery.
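A quick calculation shows just how lopsided that data comparison is; the bytes-per-word figure below is an assumption, and the corpus and lifetime numbers are the round figures quoted above.

```python
# Rough data-volume comparison: GPT-3's text corpus vs. lifetime human exposure.
GPT3_CORPUS_BYTES = 45e12   # ~45 TB of text, per the figure above
LIFETIME_WORDS = 1e9        # order-of-magnitude words read and heard in a life
BYTES_PER_WORD = 6          # assumed average, including whitespace

lifetime_bytes = LIFETIME_WORDS * BYTES_PER_WORD
print(f"Lifetime human text exposure: ~{lifetime_bytes / 1e9:.0f} GB")
print(f"GPT-3 corpus is roughly {GPT3_CORPUS_BYTES / lifetime_bytes:,.0f}x larger")
```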

And when we look at hardware and its needs, consider that while the brain is one of the most power-hungry organs of the human body, it still has a modest power consumption of around 12 watts. Over a lifetime the brain will consume up to 10 MWh; training the GPT-3 language model took an estimated 1 GWh.
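The same kind of arithmetic applies to energy; the 80-year lifespan below is an assumption, and the other numbers are the round figures quoted above.

```python
# Rough energy comparison: a ~12 W brain over a lifetime vs. training GPT-3.
BRAIN_WATTS = 12
LIFETIME_HOURS = 80 * 365 * 24   # assumed ~80-year lifespan
GPT3_TRAINING_WH = 1e9           # ~1 GWh

brain_lifetime_wh = BRAIN_WATTS * LIFETIME_HOURS
print(f"Brain over a lifetime: ~{brain_lifetime_wh / 1e6:.1f} MWh")
print(f"GPT-3 training used roughly {GPT3_TRAINING_WH / brain_lifetime_wh:.0f}x that")
```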

When it comes to scaling, the game is just beginning.

While hardware and data matter, the architectures and processes that support general intelligence may be necessarily quite different from the architectures and processes that underpin current machine-learning systems. Throwing faster hardware and all the world's data at the problem is likely to see diminishing returns, although that may well let us scale a false summit from which we can see the real one.