There’s an argument to be made that if we train AI systems to learn the way babies do, we’ll get them closer to human-like intelligence. But how our own learning develops in babyhood is still a mystery that researchers are untangling. We know that the information babies absorb is very different from the data an LLM learns from, and in today’s episode, with guests Linda Smith and Michael Frank, we’ll attempt to look at the world through an infant’s eyes and examine why they’re able to do more with, seemingly, less information.
Guests: Linda Smith (Indiana University) & Michael Frank (Stanford University)
Hosts: Abha Eli Phoboo & Melanie Mitchell
Producer: Katherine Moncure
Podcast theme music by: Mitch Mignano
Follow us on:
Twitter • YouTube • Facebook • Instagram • LinkedIn • Bluesky
Abha Eli Phoboo: The voices you’ll hear were recorded remotely across different countries, cities and work spaces.
Linda Smith: The data for training children has been curated by evolution. This is in sort of contrast to all the large data models, right? They just scrape everything. Would you educate your kid by scraping off the web?
[THEME MUSIC]
Abha: From the Santa Fe Institute, this is Complexity.
Melanie Mitchell: I'm Melanie Mitchell.
Abha: And I'm Abha Eli Phoboo.
[THEME MUSIC FADES OUT]
Abha: So far in this season, we've looked at intelligence from a few different angles, and it's clear that AI systems and humans learn in very different ways. And there's an argument to be made that if we just train AI to learn the way humans do, they'll get closer to human-like intelligence.
Melanie: But the interesting thing is, our own development is still a mystery that researchers are untangling. For an AI system like a large language model, the engineers that create them know, at least in principle, the structure of their learning algorithms and the data that's being fed to them. With babies though, we're still learning about how the raw ingredients come together in the first place.
Abha: Today, we're going to look at the world through an infant's eyes. We know that the information babies are absorbing is very different from an LLM's early development. But how different is it? What are babies experiencing at different stages of their development? How do they learn from their experiences? And how much does the difference between babies and machines matter?
Abha: Part 1: The world through a baby's eyes
Abha: Developmental psychology, the study of how cognition unfolds from birth to adulthood, has been around since the late 19th century. For the first 100 years of its history, this field consisted of psychologists observing babies and children and coming up with theories. After all, babies can't tell us directly what they’re experiencing.
Melanie: But what if scientists could view the world through a baby's own eyes? This has only become possible in the last 20 years or so. Psychologists are now able to put cameras on babies' heads and record everything that they see and hear. And the data collected from these cameras is beginning to change how scientists think about the experiences most important to babies' early learning.
Linda: I'm Linda Smith, and I'm a professor at Indiana University. I'm a developmental psychologist, and what I am interested in and have been for a kind of long career, is how infants break into language. And some people think that means that you just study language, but in fact, what babies can do with their bodies, how well they can control their bodies, determines how well they can control their attention and what the input is, what they do, how they handle objects, whether they emit vocalizations, all those things play a direct role in learning language. And so I take a kind of complex or multimodal system approach to trying to understand the cascades and how all these pieces come together.
Melanie: Linda Smith is the Chancellor's Professor of Psychological and Brain Sciences at Indiana University. She’s one of the pioneers of head-mounted camera research with infants.
Linda: I began putting head cameras on babies because people have throughout my career, major theorists, have at various points made the point that all kinds of things were not learnable. Language wasn't learnable. Chomsky said that basically. All this is not learnable. The only way you could possibly know it was for it to be a form of pre-wired knowledge. It seemed to me even back in the 70s, that my thoughts were, we are way smarter than that. And I should sure hope that if I was put on some mysterious world in some matrix space or whatever, where the physics work differently, that I could figure it out. But we had no idea what the data are. Most people assume that at the scale of daily life, massive experience, the statistics are kind of the same for everybody. But by putting head cameras on babies, we have found out that they are absolutely, and I'm not alone in this, there's a lot of people doing this, we have found out that it is absolutely not the same.
Melanie: Linda’s talking about the statistics of the visual world that humans experience. We perceive correlations — certain objects tend to appear together, for example chairs are next to tables, trees are next to shrubs, shoes are worn on feet. Or at an even more basic, unconscious level, we perceive statistical correlations among edges of objects, colors, certain properties of light, and so on. We perceive correlations in space as well as in time.
Abha: Linda and others discovered that the visual statistics that the youngest babies are exposed to, what they’re learning from in their earliest months, are very different from what we adults tend to see.
Linda: There they are in the world, they're in their little seats, you know, looking, okay, or on somebody's shoulder looking. And the images in front of their face, the input available to the eye, change extraordinarily slowly, and slow is good for extracting information. In the first three months, babies make remarkable progress, both in the tuning of the foundational aspects of vision, edges, contrast sensitivity, chromatic sensitivity. But it's not like they wait till they get all the basic vision worked out before they can do anything else. The first three months also define the period for faces: they recognize parents' faces. They become biased toward faces. If they live in one ethnic group, they can recognize those faces better and discriminate them better than if they live in another. And all this happens by three months. And some measures suggest that the first three to four months, and this is Daphne Maurer's amazing work on babies with cataracts, that if an infantile cataract isn't removed before four months of age, human face perception is disrupted for life. And that's likely in the lower-level neural circuits, although maybe it's in the face ones as well. And babies who are three months old can discriminate dogs from cats. I mean, it's not like they're not learning anything. They are building a very impressive visual system.
Many of our other mammalian friends get born and immediately get up and run around. We don't. Got to believe it's important, right?
Melanie: Linda and her collaborators analyzed the data from head-mounted cameras on infants. And they found that over their first several months of life, these infants are having visual experiences that are driven by their developing motor abilities and by their interactions with parents and other caregivers. And that process unfolds in a way that enables them to efficiently learn about the world. The order in which they experience different aspects of their visual environment actually facilitates learning.
Linda: It’s a principle of learning, not a principle of the human brain. It's a principle of the structure of data. I think what Mother Nature is doing is, it's taking the developing baby who's got to learn everything in language and vision and holding objects and sounds and everything, okay, and social relations and controlling self-regulation. It is taking them on a little walk through the solution space. The data for training children has been curated by evolution. This is in sort of contrast to all the large data models, right? They just scrape everything. Would you educate your kid by scraping off the web? I mean, would you train your child on this? So anyway, I think the data is important.
Abha: Another developmental psychologist who’s focused on babies and the data they experience is Mike Frank.
Mike Frank: I'm Mike Frank. I'm a professor of psychology at Stanford, and I'm generally interested in how children learn. So how they go from being speechless, wordless babies to, just a few years later, kids that can navigate the world. And so the patterns of growth and change that support that is what fascinates me, and I tend to use larger data sets and new methodologies to investigate those questions. When I was back in grad school, people started working with this new method, they started putting cameras on kids' heads. And so Pawan Sinha did it with his newborn and gave us this amazing rich look at what it looked like to be a newborn perceiving the visual world. And then pioneers like Linda Smith and Chen Yu and Karen Adolph and Dick Aslin and others started experimenting with the method and gathering these really exciting data sets that were maybe upending our view of what children's input looked like. And that's really critical because if you're a learning scientist, if you're trying to figure out how learning works, you need to know what the inputs are as well as what the processes of learning are.

So I got really excited about this. And when I started my lab at Stanford, I started learning a little bit of crafting and trying to build little devices. We'd order cameras off the internet and then try to staple them onto camping headlamps or glue on a little aftermarket fisheye lens. We tried all these different little crafty solutions to get something that kids would enjoy wearing. At that time we were in advance of the computer vision technologies by probably about five or seven years, so we thought naively that we could process this flood of video that we were getting from kids, put it through computer vision, and have an answer as to what the kids were seeing. And it turned out the vision algorithms failed completely on these data. They couldn't process it at all, in part because the cameras were bad, and so they would have just a piece of what the child was seeing, and in part because the vision algorithms were bad, because they were trained on Facebook photos, not on children's real input. And so they couldn't process these very different angles and very different orientations and kind of occlusions, cutting off faces and so forth. So that was how I got into it: thinking I could, you know, use computer vision to measure children's input. And then it turned out I had to wait maybe five or seven years until the algorithms got good enough that that was true.

So what are the most interesting things people have learned from this kind of data? Well, as somebody interested in communication and social cognition and little babies, the discovery, which I think belongs to Linda Smith and to her collaborators, the discovery that really floored me was that we'd been talking about gaze following and looking at people's faces for years, that human gaze and human faces were this incredibly rich source of information. And then when we looked at the head-mounted camera videos, babies actually didn't see faces that often, because they're lying there on the floor. They're crawling. They're really living in this world of knees. And so it turned out that when people were excited to, you know, spend time with the baby or to manipulate their attention, they would put their hands right in front of the baby's face and put some object right in the baby's face. And that's how they would be getting the child's attention or directing the child's attention or interacting with them.
It's not that the baby would be looking way up there in the air to where the parent was and figuring out what the parent was looking at. So this idea of sharing attention through hands and through manipulating the baby's position and what's in front of the baby's face, that was really exciting and surprising as a discovery. And I think we've seen that borne out in the videos that we take in kids homes.
Abha: And doing psychological research on babies doesn’t come without its challenges.
Mike: You know, if you want to deal with the baby, you have to recruit that family, make contact with them, get their consent for research. And then the baby has to be in a good mood to be involved in a study or the child has to be willing to participate. And so we work with families online and in person. We also go to local children's museums and local nursery schools. And so often for each of the data points that you see, at least in a traditional empirical study, that's hours of work by a skilled research assistant or a graduate student doing the recruitment, actually delivering the experience to the child.
Melanie: Over the last several years, Mike and his collaborators have created two enormous datasets of videos taken by head-mounted cameras on children from six months to five years old. These datasets are not only being used by psychologists to better understand human cognitive development, but also by AI researchers to try to train machines to learn about the world more like the way babies do. We’ll talk more about this research in Part 2.
Melanie: Part 2: Should AI systems learn the same way babies do?
Melanie: As we discussed in our previous episode, while large language models are able to do a lot of really impressive things, their abilities are still pretty limited when compared to humans. Many people in the AI world believe that if we just keep training large language models on more and more data, they’ll get better and better, and soon they’ll match or surpass human intelligence.
Abha: But other AI researchers think there’s something fundamental missing in the way these systems work, and in how they are currently trained. But what's the missing piece? Can new insights about human cognitive development create a path for AI systems to understand the world in a more robust way?
Linda: I think the big missed factor in understanding human development and human intelligence is understanding the structure, the statistics, of the input. And I think the fail point of current AI definitely lies in the data, the data used for training, and I'd like to make a case that that is the biggest fail point.
Abha: Today's neural networks are typically trained on language and images scraped from the web. Linda and other developmental psychologists have tried something different — they've trained neural networks on image frames from videos collected by head-mounted cameras. The question is whether this kind of data will make a difference in the networks' abilities.
Linda: If you train them, pre-train them with babies' visual inputs, 400 million images, and you order them from birth to 12 months of age, what we call the developmental order, versus ordering them backwards from oldest to youngest, or randomizing them, the developmental order leads to a trained network that is better able to learn names for actions in later training, to learn object names in later training. Not everybody is interested in this. They just, they bought into the view that if you get enough data, any data, everything ever known or said in the world, okay, that you will be smart. You'll be intelligent. It just does not seem to me that that's necessarily true. There's a lot of stuff out there that's not accurate, dead wrong, and odd. Just scraping massive amounts of current knowledge that exists, of everything ever written or every picture ever taken, okay? It's just, it's not ideal.
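To make the experimental setup Linda describes concrete, here is a minimal sketch, in Python, of the three pretraining orderings she compares. The frame records, field names, and the order_frames helper are illustrative assumptions, not her lab's actual pipeline; the point is only that the same images can be presented in developmental, reversed, or randomized order before the same pretraining step.

```python
import random

# Hypothetical frame records: (infant_age_in_months, image_path).
# In the real studies these are egocentric head-camera frames; here
# they are placeholders for illustration only.
frames = [(age, f"frame_{i}.jpg")
          for i, age in enumerate(random.choices(range(0, 13), k=1000))]

def order_frames(frames, condition):
    """Return the frames in one of the three orderings Linda describes."""
    if condition == "developmental":   # birth -> 12 months
        return sorted(frames, key=lambda f: f[0])
    if condition == "reverse":         # oldest -> youngest
        return sorted(frames, key=lambda f: f[0], reverse=True)
    if condition == "random":          # shuffled baseline
        shuffled = list(frames)
        random.shuffle(shuffled)
        return shuffled
    raise ValueError(f"unknown condition: {condition}")

# Each ordering would feed the same self-supervised pretraining loop; the
# claim under test is that the developmental curriculum yields a network
# that later learns object and action names more easily.
for condition in ("developmental", "reverse", "random"):
    curriculum = order_frames(frames, condition)
    # pretrain(model, curriculum)  # placeholder for the actual training step
```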
Melanie: Is it a matter of getting better data or getting better sort of ordering of how you teach these systems or is there something more fundamental missing?
Linda: I don't think it's more fundamental actually, okay. I think it's better data. I think it's multimodal data. I think it's data that is deeply in the real world, not in human interpretations of that real world, but deeply in the real world, data coming through the sensory systems. It's the raw data. It is not data that has gone through your biased, cultish views on who should or should not get funded for a mortgage, not biased by the worst elements on the web's view of what a woman should look like, not biased in all these ways. It's not been filtered through that information. It is raw, okay? It is raw.
Abha: Linda believes that the structure of the data, including its order over time, is the most important factor for learning in both babies and in AI systems. I asked her about the point Alison Gopnik made in our first episode: how important is it that the learning agent, whether it's a child or a machine, is actively interacting in the real world, rather than passively learning from data it’s given? Linda acknowledges that this kind of doing rather than just observing — being able, through one's movements or attention, to actually generate the data that one’s learning from — is also key.
Linda: I think you get a lot by observing, but the doing is clearly important. So this is the multimodal enactive kind of view, which I think, doesn't just get you data from the world at the raw level, although I think that would be a big boon, okay? From the real world, not photographs, okay? And in time. What I do in the next moment, what I say to you, depends on my state of knowledge. Which means that the data that comes in at the next moment is related to what I need to learn or where I am in my learning. Because it is what I know right now is making me do stuff. That means a learning system and the data for learning, because the learning system generates it, are intertwined. It's like the very same brain that's doing the learning is the brain that's generating the data.
Abha: Perhaps if AI researchers focused more on the structure of their training data rather than on sheer quantity, and if they enabled their machines to interact directly with the world rather than passively learning from data that’s been filtered through human interpretation, AI would end up having a better understanding of the world. Mike notes that, for example, the amount of language current LLMs are trained on is orders of magnitude larger than what kids are exposed to.
Mike: So modern AI systems are trained on huge data sets, and that's part of their success. So you get the first glimmerings of this amazing flexible intelligence that we start to see when we see GPT-3 with 500 billion words of training data. It's a trade secret of the companies how much training data they use, but the most recent systems are at least in the 10 trillion plus range of data. A five-year-old has maybe heard 60 million words. That'd be a reasonable estimate. That's kind of a high estimate for what a five-year-old has heard. So that's, you know, six orders of magnitude different in some ways, five to six orders of magnitude different. So the biggest thing that I think about a lot is how huge that difference is between what the child hears and what the language model needs to be trained on. Kids are amazing learners. And I think by drawing attention to the relative differences in the amount of data that kids and LLMs get, that really highlights just how sophisticated their learning is.
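Mike's "five to six orders of magnitude" figure is easy to sanity-check with a quick back-of-the-envelope calculation, using the rough word counts he cites (both are estimates he flags as such, not measurements):

```python
import math

# Rough figures Mike cites; both are estimates, not measurements.
llm_training_words = 10e12    # at least ~10 trillion words for recent LLMs
child_words_by_five = 60e6    # ~60 million words heard by age five (a high estimate)

ratio = llm_training_words / child_words_by_five
print(f"ratio: {ratio:,.0f}x")                          # ~166,667x
print(f"orders of magnitude: {math.log10(ratio):.1f}")  # ~5.2
```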
Melanie: But of course they're getting other sensory modalities like vision and touching things and being able to manipulate objects. Is that gonna make a big difference with the amount of training they're gonna need?
Mike: This is right where the scientific question is for me, which is what part of the child as a system, as a learning system or in their broader data ecosystem makes the difference. And you could think, well, maybe it's the fact that they've got this rich visual input alongside the language. Maybe that's the really important thing. And then you'd have to grapple with the fact that adding, just adding pictures to language models doesn't make them particularly that much smarter. At least in the most recent commercial systems, adding pictures makes them cool and they can do things with pictures now, but they still make the same mistakes about reasoning about the physical world that they did before.
Abha: Mike also points out that even if you train LLMs on the data generated by head-mounted cameras on babies, that doesn't necessarily solve the physical reasoning problems.
Melanie: In fact, sometimes you get the opposite effect, where instead of becoming smarter, this data makes these models perform less well. As Linda pointed out earlier, there's something special about having generated the data oneself, with one's own body and with respect to what one actually wants to — or needs to — learn.
Mike: There are also some other studies that I think are a bit more of a cautionary tale, which is that if you train models on a lot of human data, they still don't get that good. Actually, the data that babies have appears to be more, not less, challenging for language models and for computer vision models. These are pretty new results from my lab, but we find that performance doesn't scale that well when you train on baby data. You go to videos from a child's home, you train models on that. And the video is all of the kid playing with the same truck, or, you know, there's only one dog in the house. And then you try to get that model to recognize all the dogs in the world. And it's like, no, it's not the dog. So that's a very different thing, right? So the data that kids get is both deeper and richer in some ways and also much less diverse in other ways. And yet their visual system is still remarkably good at recognizing a dog, even when they've only seen one or two. So that kind of really quick learning and rapid generalization to the appropriate class, that's something that we're still struggling with in computer vision. And I think the same thing is true in language learning. So doing these kinds of simulations with real data from kids, I think, could be very revealing of the strengths and weaknesses of our models.
Abha: What does Mike think is missing from our current models? Why do they need so many more examples of a dog before they can do the simple generalizations that kids are doing?
Mike: Maybe though it's having a body, maybe it's being able to move through space and intervene on the world to change things in the world. Maybe that's what makes the difference. Or maybe it's being a social creature interacting with other people who are structuring the world for you and teaching you about the world. That could be important. Or maybe it's the system itself. Maybe it's the baby and the baby has built in some concepts of objects and events and the agents, the people around them as social actors. And it's really those factors that make the difference.
Abha: In our first episode, we heard a clip of Alison Gopnik's one-year old grandson experimenting with a xylophone — it's a really interactive kind of learning, where the child is controlling and creating the data, and then they're able to generalize to other instruments and experiences. And when it comes to the stuff that babies care about most, they might only need to experience something once for it to stay with them.
Melanie: But also remember that Alison's grandson was playing music with his grandfather — even though he couldn't talk, he had a strong desire to play with, to communicate with his grandfather. Unlike humans, large language models don't have this intrinsic drive to participate in social interactions.
Mike: A six month old can communicate. They can communicate very well about their basic needs. They can transfer information to other people. There's even some experimental evidence that they can understand a little bit about the intentions of the other people and understand some rudiments of what it means to have a signal to get somebody's attention or to get them to do something. So they actually can be quite good at communication. So communication and language being two different things. Communication enables language and is at the heart of language, but you don't have to know a language in order to be able to communicate.
Melanie: In contrast to babies, LLMs aren't driven to communicate. But they can exhibit what Mike calls "communicative behavior", or what, in the previous episode, Murray Shanahan would have called "role-playing" communication.
Mike: LLMs do not start with communicative ability. LLMs are in the most basic, you know, standard architectures, prediction engines. They are trying to optimize their prediction of the next word. And then of course we layer on lots of other fine-tuning and reinforcement learning with human feedback, these techniques for changing their behavior to match other goals, but they really start basically as predictors. And it is one of the most astonishing parts about the LLM revolution that you get some communicative behaviors out of very large versions of these models. So that's really remarkable and I think it's true. I think you can see pretty good evidence that they are engaging in things that we would call communicative. Does that mean they fundamentally understand human beings? I don't know and I think that's pretty tough to demonstrate. But they engage in the kinds of reasoning about others' goals and intentions that we look for in children. But they only do that when they've got 500 billion words or a trillion words of input. So they don't start with communication and then move to language the way we think babies do. They start with predicting whatever it is that they are given as input, which in the case of LLMs is language. And then astonishingly, they appear to extract some higher level generalizations that help them manifest communicative behaviors.
Abha: In spite of the many differences between LLMs and babies, Mike’s still very excited about what LLMs can contribute to our understanding of human cognition.
Mike: I think it's an amazing time to be a scientist interested in the mind and in language. For 50 years, we've been thinking that the really hard part of learning human language is making grammatical sentences. And from that perspective, I think it is intellectually dishonest not to think that we've learned something big recently, which is that when you train models, relatively unstructured models, on lots of data about language, they can recover the ability to produce grammatical language. And that's just amazing. There were many formal arguments and theoretical arguments that that was impossible, and those arguments were fundamentally wrong, I think. And we have to come to grips with that as a field because it's really a big change. On the other hand, the weaknesses of the LLMs also are really revealing, right? That there are aspects of meaning, often those aspects that are grounded in the physical world that are trickier to reason about and take longer and need much more input than just getting a grammatical sentence. And that's fascinating too. The classic debate in developmental cognitive science has been about nativism versus empiricism, what must be innate to the child for the child to learn. I think my views are changing rapidly on what needs to be built in. And the next step is going to be trying to use those techniques to figure out what actually is built in to the kids and to the human learners. I'm really excited about the fact that these models have not just become interesting artifacts from an engineering or commercial perspective, but that they are also becoming real scientific tools, real scientific models that can be used and explored as part of this broad, open, accessible ecosystem for people to work on the human mind. So just fascinating to see this new generation of models get linked to the brain, get it linked to human behavior and becoming part of the scientific discussion.
Abha: Mike’s not only interested in how LLMs can provide insight into human psychology. He’s also written some influential articles on how experimental practice in developmental psychology can help improve our understanding of LLMs.
Melanie: You've written some articles about how methods from developmental psychology research might be useful in evaluating the capabilities of LLMs. So what do you see as the problems with the way these systems are currently being evaluated? And how can research psychology contribute to this?
Mike: Well, way back in 2023, which is about 15 years ago in AI time, when GPT-4 came out, there was this whole set of really excited responses to it, which is great. It was very exciting technology. It still is. And some of them looked a lot like the following: “I played GPT-4 the transcript of the Moana movie from Disney, and it cried at the end and said it was sad. Oh my god, GPT-4 has human emotions.” Right. And this kind of response, to me as a psychologist, struck me as a kind of classic research methods error, which is you're not doing an experiment. You're just observing this anecdote about a system and then jumping to the conclusion that you can infer what's inside the system's mind. And, you know, if psychology has developed anything, it's a body of knowledge about the methods and the rules of that game of inferring what's inside somebody else's mind. It's by no means a perfect field, but some of these things are pretty, you know, well described, especially in developmental psych. So, classic experiments have a control group and an experimental group, and you compare between those two groups in order to tell if some particular active ingredient makes the difference. And so minimally, you would want to have evaluations with two different sorts of materials and a comparison between them in order to make that kind of inference. And so that's the sort of thing that I have gone around saying and have written about a bit: you just need to take some basic tools from experimental methods, doing controlled experiments, using kind of tightly controlled simple stimuli so that you know why the LLM or why the child gives you a particular response and so forth, so that you don't get these experimental findings that turn out later to be artifacts because you didn't take care of a particular confound in your stimulus materials.
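As a rough illustration of the controlled comparison Mike is describing, here is a hedged sketch of a two-condition evaluation: matched items that differ only in the manipulated ingredient, scored with the same procedure. The query_model stand-in and the toy belief items are assumptions for illustration, not any published benchmark.

```python
# A minimal sketch of a two-condition evaluation. query_model is a
# stand-in for whatever LLM API is under test (an assumption, not a
# real library call), and the toy items are illustrative only; a real
# evaluation would use many matched items per condition.

def query_model(prompt: str) -> str:
    raise NotImplementedError("replace with a call to the model under test")

items = [
    {
        # Experimental condition: Sally does NOT see the move (false belief).
        "condition": "false_belief",
        "prompt": ("Sally puts her ball in the basket and leaves. While she is "
                   "gone, Anne moves the ball to the box. Where will Sally look "
                   "for her ball first?"),
        "expected": "basket",
    },
    {
        # Control condition: identical story except Sally DOES see the move.
        "condition": "true_belief",
        "prompt": ("Sally puts her ball in the basket. While she watches, Anne "
                   "moves the ball to the box. Then Sally briefly leaves and "
                   "returns. Where will Sally look for her ball first?"),
        "expected": "box",
    },
]

def accuracy(condition: str) -> float:
    trials = [it for it in items if it["condition"] == condition]
    correct = sum(it["expected"] in query_model(it["prompt"]).lower()
                  for it in trials)
    return correct / len(trials)

# The inference rests on the difference between conditions, not on a single
# anecdote: a model that only succeeds when belief tracking is unnecessary
# is not demonstrating theory of mind.
# print(accuracy("false_belief"), accuracy("true_belief"))
```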
Melanie: What kind of response have you gotten from the AI community?
Mike: I think there's actually been some openness to this kind of work. There has been a lot of pushback on those initial evaluations of language models. Just to give one kind of concrete example here, I was making fun of people with this human emotions bit, but there were actually a lot of folks that made claims about different ChatGPT versions having what's called theory of mind, that is, being able to reason about the beliefs and desires of other people. So the initial evaluations took essentially stories from the developmental psychology literature that are supposed to diagnose theory of mind. These are things like the Sally-Anne task.
Abha: You might remember the Sally-Anne test from our last episode. Sally puts an object — let’s say a ball, or a book, or some other thing — in one place and then leaves. And then while Sally’s away, Anne moves that object to another hiding spot. And then the test asks: Where will Sally look for her object when she returns?
Melanie: And even though you and I know where Anne put the book or the ball, we also know that Sally does not know that, so when she returns she’ll look in the wrong place for it. Theory of mind is understanding that Sally has a false belief about the situation because she has her own separate experience.
Abha: And if you give ChatGPT a description of the Sally-Anne test, it can solve it. But we don’t know whether it solves it because it’s actually reasoning, or just because it absorbed so many similar examples during training. And so researchers started making small changes that initially tripped up the LLMs, like changing the names of Sally and Anne. But LLMs have caught on to those too.
Mike: LLMs are pretty good at those kinds of superficial alterations. So maybe you need to make new materials. Maybe you need to actually make new puzzles about people's beliefs that don't involve changing the location of an item. Right. So people got a lot better at this. And I wouldn't say that the state of the art is perfect now. But the approach that you see in papers that have come out even just a year later is much more sophisticated. They have a lot of different puzzles about reasoning about other people. They're looking at whether the LLM correctly diagnoses why a particular social faux pas was embarrassing or whether a particular way of saying something was awkward. There's a lot more reasoning that is necessary in these new benchmarks. So I think this is actually a case where the discussion, which I was just a small part of, really led to an improvement in the research methods. We still have further to go, but it's only been a year. So I'm quite optimistic that all of this discussion of methods has actually improved our understanding of how to study the models and also actually improved our understanding of the models themselves.
Abha: So, Melanie, from everything Mike just said, it sounds like researchers who study LLMs are still figuring out the best way to understand how they work. And it’s not unlike the long process of trying to understand babies, too.
Melanie: Right. You know, when I first heard about psychologists putting cameras on babies' heads to record, I thought it was hilarious. But it sounds like the data collected from these cameras is actually revolutionizing developmental psychology! We heard from Linda that the data shows that the structure of the baby's visual experiences is quite different from what people had previously thought.
Abha: Right. I mean, it's amazing that, you know, they don't actually see our faces so much. As Mike mentioned, they're in a world of knees, right? And Linda seems to think that the structuring of the data by Mother Nature, as she put it, is what allows babies to learn so much in their first few years of life.
Melanie: Right. Linda talked about the so-called developmental order, which is the temporal order in which babies get different kinds of visual or other experiences as they mature. And what they see and hear is driven by what they can do with their own bodies and their social relationships. And importantly, it's also driven by what they want to learn, what they're curious about. It's completely different from the way large language models learn, which is by humans feeding them huge amounts of text and photos scraped from the web.
Abha: And this developmental order, I mean, it's also conducive to babies learning the right things at the right time. And remember, Mike pointed out that the way babies and children learn allows them to do more with less. They're able to generalize much more easily than LLMs can. But there's still a lot of mystery about all of this. People are still trying to make sense of the development of cognition in humans, right?
Melanie: And interestingly, Mike thinks that large language models are actually going to help psychologists in this, even though they're so different from us. So for example, LLMs can be used as a proof of principle of what can actually be learned versus what has to be built in and of what kinds of behaviors can emerge, like the communication behavior he talked about. I'm also personally very excited about the other direction, using principles from child development in improving AI systems and also using principles from experimental methodology in figuring out what LLMs are and aren't capable of.
Abha: Yeah. Often it seems like trying to compare the intelligence of humans and computers is like trying to compare apples to oranges. They seem so different. And trying to use tests that are typically used in humans, like the theory of mind test that Mike referred to and Tomer talked about in our last episode, they don't seem to always give us the insights we're looking for. So what kinds of approaches should be used to evaluate cognitive abilities in LLMs? I mean, is there something to be learned from the methods used to study intelligence in non-human animals?
Melanie: Well, in our next episode, we'll look more closely at how to assess intelligence, and if we're even asking the right questions.
Ellie Pavlick: There's a lot of repurposing of existing kind of evaluations that we use for humans. So things like the SAT or the MCAT or something like that. And so it's not that those are like completely uncorrelated with the things we care about, but they're not like very deep or thoughtful diagnostics.
Melanie: That's next time, on Complexity. Complexity is the official podcast of the Santa Fe Institute. This episode was produced by Katherine Moncure, and our theme song is by Mitch Mignano. Additional music from Blue Dot Sessions. I'm Melanie, thanks for listening.