Welcome to the 21st edition of Black Box. This is the first of a three-part series that I’m calling Literature Review in which I dive into emerging research related to generative AI.
I have a habit of going down Wikipedia rabbit holes when commuting; the subject on a recent train ride to New York City was the Minoan civilization. Everything about the Minoans is incredibly interesting, but I was especially struck by the life of Arthur Evans, the British archaeologist who discovered the culture and excavated the emblematic Palace of Knossos. The man was born into a family of amateur scientists, adventured across Europe for rare artifacts as an Oxford student, became a war reporter cum revolutionary in the Balkans, and had a famously loving marriage until his wife’s premature death — all before he set foot in Crete.
I kept thinking that there should be a movie or show based on his life. Good stories are hard to come by but always in demand, so the fact that no studio had picked this one up was surprising. Then I realized they probably hadn’t gotten to Evans because they’re already flooded with source material. Their problem was going through all of the novels, comics, video games, etc. that could become productions. This is obviously impossible by hand, but what about with AI? Could a neural network forecast how audiences will react to stories and identify entertaining text at scale?
To simplify my exploration, I limited the scope to stories in text since most source material is written. I also defined an entertaining story as one that induces a mix of enjoyable emotions. Then the question becomes whether a model could predict naïve emotional reactions1 to text, which is part of a field known as affective computing. (Note this is different from sentiment analysis, which is based on the much simpler task of identifying emotions expressed in text.)
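To make that distinction concrete, here is a deliberately toy sketch. The mini-lexicon, words, and labels below are all invented for illustration (real sentiment systems use learned models, not lookups, but the shape of the task is the same): sentiment analysis labels emotions expressed in the text, while affective computing has to predict what a reader feels.

```python
# Toy contrast between the two tasks. The mini-lexicon is
# hypothetical; real sentiment models are learned, not hand-coded.
EXPRESSED = {"grieving": "sadness", "joyful": "joy", "furious": "anger"}

def sentiment(text):
    """Sentiment analysis: label emotions *expressed* in the text."""
    return {EXPRESSED[w] for w in text.lower().split() if w in EXPRESSED}

# "The grieving widow smiled" expresses sadness, and that is all a
# lexicon lookup can surface -- but a reader's *reaction* (the
# affective-computing target) might be poignancy or hope.
print(sentiment("the grieving widow smiled"))  # {'sadness'}
```

Notice that there is no comparably simple function for the reader's side of the exchange; that asymmetry is the whole problem.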
This felt hard, so as a cheeky first pass, I looked into whether it was possible to bypass emotions with attention as a proxy. There is some scientific merit to the idea: In her seminal work establishing affective computing, Rosalind Picard observed that “Good entertainment… holds your attention”. So I was shocked to find nothing on emotional reactions to text. The vast majority of research was instead on emotional reactions to pictures, since these can be measured as physiological responses and eye-tracking technology is widely available, easy to collect data with, and cost-efficient to scale. Unfortunately, this would not transfer to text, as there are many spurious reasons for readers to pause, such as re-reading to understand complex ideas or to decipher poor handwriting. People also react less visibly to text than to other forms of media since written content is processed directly in the mind instead of being perceived through the senses first.
Picard also believed emotion was a perceptual process and was particularly fascinated by synesthesia, since the intermingling of multiple senses made synesthetes explicitly aware of how they process their perceptions2. Taking inspiration3, I wondered if illustrating text using generative AI, or otherwise converting it to something more computationally tractable, might help. This approach was a little better researched. In 2017, researchers from Yonsei University in South Korea showed a proof of concept for predicting valence-arousal plots for photos, using the intuition that objects have base emotional connotations which are then shifted by their background and color mix. Building on this work, a 2021 team led by Stanford researchers trained a model to select the most fitting reaction to visual art from a set of main emotions (or suggest its own), plus a written explanation, using a dataset of such annotations collected with Mechanical Turk.
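The Yonsei intuition can be caricatured in a few lines. This is my paraphrase, with every number invented for illustration; the actual work learns these quantities from data with a neural network. Each object contributes a base point in valence-arousal space, which scene context then shifts.

```python
# Caricature of base-connotation-plus-context-shift. All values are
# made up for illustration, not taken from the paper.
BASE = {"dog": (0.7, 0.5), "spider": (-0.6, 0.7)}           # (valence, arousal)
CONTEXT_SHIFT = {"sunny": (0.2, 0.0), "dark": (-0.3, 0.2)}  # hypothetical shifts

def _clamp(x):
    # keep scores inside the standard [-1, 1] valence-arousal square
    return max(-1.0, min(1.0, x))

def valence_arousal(obj, context):
    v, a = BASE[obj]
    dv, da = CONTEXT_SHIFT[context]
    return round(_clamp(v + dv), 2), round(_clamp(a + da), 2)

print(valence_arousal("dog", "sunny"))    # (0.9, 0.5): pleasant, mildly arousing
print(valence_arousal("spider", "dark"))  # (-0.9, 0.9): unpleasant and intense
```

The appeal of the decomposition is that the hard perceptual work (what is in the image, what surrounds it) is separated from a simple, composable emotional arithmetic.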
Among non-visual techniques, I found this recent MIT study to be the most compelling. They modeled emotion in social situations as a function of the gap between the beliefs a person expected others to hold under observed events (according to an intuitive theory of mind) and those others’ actual actions. This game-theoretic approach seemed to me the most viable for immediate use, as earlier work by their colleagues in the CS department suggested that LLMs can implicitly represent entities, their states, and their relationships to each other as “semantically necessary consequences”. Then, as long as a social situation was grounded in logical expectations, an LLM could predict emotional reactions according to whether those expectations were realized or violated.
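My reading of the expectation-violation idea, boiled down far past the actual model, looks something like this. The labels, probabilities, and threshold below are my invention; the MIT work does full theory-of-mind inference rather than a lookup like this. The point is only that emotion falls out of how surprising an outcome is relative to what was expected.

```python
# Drastically simplified expectation-violation sketch; the real model
# infers beliefs and desires, and these coarse labels and the 0.5
# threshold are invented for illustration.
def reaction(expected_prob, occurred):
    """Map a prior on a desired outcome and its (non-)realization
    to a coarse emotion label."""
    surprise = (1 - expected_prob) if occurred else expected_prob
    if occurred:
        return "joy" if surprise > 0.5 else "contentment"
    return "disappointment" if surprise > 0.5 else "resignation"

# A long-shot that pays off reads as joy; a sure thing that falls
# through reads as disappointment.
print(reaction(0.1, True))   # joy
print(reaction(0.9, False))  # disappointment
```

Swap the hand-set prior for one elicited from an LLM’s representation of the scene, and you have roughly the pipeline the study makes plausible.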
Although these approaches come closer, they are clearly not general solutions. I began to suspect that simulating emotional reactions required a fundamentally different technique and started looking into unconventional ideas. The one that made the most sense to me was Melanie Mitchell’s work on analogical reasoning, since emotion (at least in the context of stories and art) is an empathetic response, which comes from vicariously experiencing something by approximating it with one’s own experiences — an analogy of sorts. Mitchell goes even further and asserts that analogy-making is critical to machine extrapolation in general because truly understanding a concept means abstractly capturing its semantic kernel and applying it to situations that are very different from the kind of data or tasks on which a model was trained4. Take the concept of a bridge:
Humans can easily understand extended and metaphorical notions such as… “the bridge of a song”, “bridging the gender gap”, “a bridge loan”, “burning one’s bridges”, “water under the bridge”, and so on… Moreover, conceptual structures in the mind make it easy for humans to generate “bridges” at different levels of abstraction; for example, imagine yourself forming a bridge from a couch to a coffee table with your leg, or forming a bridge between two notes on your piano with other notes, or bridging differences with your spouse via a conversation.
To me, this sounds exactly like the difference between sentiment analysis and affective computing. No amount of data or computing power seems to turn a model that processes information mostly literally into one that can infer figurative information like emotional reactions. And while Mitchell is almost single-handedly spearheading this research for now, I think that the future of not just affective computing but AI as a whole is in “fringe” ideas like analogical reasoning. Screenwriters may use LLMs to screen books and scripts one day, but that day depends on radically different techniques5. In the meantime, I’d really like to see that Evans movie. Hollywood, have your people call my people. ∎
Have you tried using LLMs to forecast emotional reactions? Let me know how it went @jwang_18 or reach out on LinkedIn.
By naïve, I mean independent of culture, personal experience, current events, etc., such that the prediction is an expected value.
I am also curious whether, given sufficient pairs of synesthetic associations, it could be possible to train a model to predict responses to novel stimuli. An affirmative answer would imply that perception occurs through consistent pathways, and it would mean using neural networks to understand cognitive science, coming full circle, as it were.
Indeed, there is evidence that everyone is ideasthetic to some degree. A common example is the bouba/kiki effect.
This challenge is known as out-of-distribution generalization, which is a major open problem in machine learning and a key hurdle to AGI.
If you're building or researching in this space, I'd love to chat! Reach out at wngjj[dot]61[at]gmail.com.