Facial Expressions and Body Language
Neal Stephenson predicted the problem with VR, way back in "Snow Crash"
Every decade or so, I find myself re-reading “Snow Crash”. This time, at the start of chapter 8, on page 59 of the 1992 first edition (the copy my wife Yvette bought me for my birthday, saying something at the time like “This is a science fiction novel about that thing you are working on.”), I found these few excellent paragraphs:
Absolutely, exactly correct: facial expressions and body language are critical to the metaverse being useful to most people. Juanita was the designer of the most important capability in the system — the capability that made the metaverse go mainstream: the ability to convey non-verbal information.
Today — 32 years after that book came out, 21 years after my company launched Second Life, and 8 years after the launch of the Oculus Rift — we still haven’t solved this critical problem. Neither VR headsets nor webcams are yet able to transmit non-verbal information well enough for broad adoption.
A simple test that will mark when we have gotten there is this: Would you immediately recognize and be able to predict the emotional state of your spouse or parent if they were using an avatar?
Even the Apple Vision Pro ‘Personas’ - by far the most advanced approach yet shipped for conveying non-verbal information - still fall short. Almost everyone finds them deeply uncanny and uncomfortable, because the headset’s cameras cannot fully see the wearer’s face, and the reconstructed expressions are subtly wrong as a result. We are exquisitely sensitive to errors in this information, having evolved to study faces to judge, for example, whether people are telling the truth.
The failure to convey non-verbal cues is what makes the majority of adults uncomfortable participating in virtual worlds, not the lack of better ‘content’ or the need for decentralized frameworks providing privacy and censorship resistance, as some claim. The huge success of multiplayer virtual worlds like Roblox and Minecraft among kids, who have a different and reduced need for non-verbal cues, is a clue. Younger kids don’t care as much about these cues, and the evidence so far suggests that as they reach around 16 years old their use of these worlds for social interaction drops off precipitously.
It is humbling and frustrating to have spent almost my entire career working on this problem and failing - first with Second Life and then with High Fidelity. What I wanted to build was a place where people could come together as avatars to comfortably meet, establish trust, and build strong relationships. There is a brave minority of adults - the several million who use products like Second Life and VRChat - who have learned to survive in these environments with substantially less non-verbal information. They inspire me with their discoveries of new ways to bridge the communication gap, such as mapping gestures and movements to bespoke avatar animations, or extending text with emojis to evolve new languages. But these techniques will not cross the chasm.
My personal take is that the most likely breakthroughs to solve Stephenson’s challenge will not use VR headsets, but will instead use one or more desktop or smartphone cameras together with new machine learning techniques to infer nuanced emotions from the operator and convey them to an avatar. This will enable avatars with the expressiveness of the photorealistic ‘codec avatar’ demos shown by Meta. When those techniques can be brought down to run in real time on typical machines without too painful a training process, we will have a chance at a useful ‘metaverse’ with billions of people.
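To make that pipeline concrete, here is a minimal sketch of the camera-driven approach, not a real solution to the nuance problem: it reads webcam frames, estimates ARKit-style facial blendshape weights with MediaPipe’s FaceLandmarker task, smooths them, and hands them off to an avatar. The `send_to_avatar` function and the smoothing factor are hypothetical placeholders for whatever rendering or networking layer actually drives the avatar, and inferring genuinely nuanced emotion would take far more than this.

```python
# Sketch: webcam -> facial blendshape weights -> avatar morph targets.
# Assumes MediaPipe's FaceLandmarker task (with its downloadable
# face_landmarker.task model) and OpenCV for capture.
import time

import cv2
import mediapipe as mp
from mediapipe.tasks import python as mp_tasks
from mediapipe.tasks.python import vision


def send_to_avatar(weights: dict[str, float]) -> None:
    """Hypothetical stand-in: forward blendshape weights to the avatar."""
    print({name: round(w, 2) for name, w in weights.items() if w > 0.3})


options = vision.FaceLandmarkerOptions(
    base_options=mp_tasks.BaseOptions(model_asset_path="face_landmarker.task"),
    running_mode=vision.RunningMode.VIDEO,
    output_face_blendshapes=True,   # per-frame expression coefficients
    num_faces=1,
)
landmarker = vision.FaceLandmarker.create_from_options(options)

cap = cv2.VideoCapture(0)           # default webcam
smoothed: dict[str, float] = {}
ALPHA = 0.4                         # exponential smoothing factor (a guess)

while cap.isOpened():
    ok, frame_bgr = cap.read()
    if not ok:
        break
    rgb = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
    image = mp.Image(image_format=mp.ImageFormat.SRGB, data=rgb)
    result = landmarker.detect_for_video(image, int(time.monotonic() * 1000))

    if result.face_blendshapes:
        # Each entry has a name like "mouthSmileLeft" and a 0..1 score.
        for c in result.face_blendshapes[0]:
            prev = smoothed.get(c.category_name, c.score)
            smoothed[c.category_name] = ALPHA * c.score + (1 - ALPHA) * prev
        send_to_avatar(smoothed)

cap.release()
```

Even a crude loop like this makes the gap obvious: a few dozen blendshape coefficients are enough to animate a face, but they are a long way from the nuanced emotional signal Stephenson describes.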
The fact that I don't have to worry about such cues, and about them being misread, is one of my favorite things about virtual worlds. The emphasis is on what you SAY, not on how you say it, and I like it that way.
In SL, LeLutka do a pretty good job with expressions. I change mine to match my mood all the time, but I don’t think most people do: it’s awkward to scroll through the tiny text-based menu in their HUD.
Your comment about the uncanny valley feeling with Apple’s system made me think: maybe rather than trying to make realistic 3D faces convey emotion, another approach could be analogous to how a primitive line-drawing animation can move an audience.
Perhaps in VR, audio and face-sensor tech could drive some kind of illustrative emote panel that adjusts in real time. Thinking Squarepusher’s helmet, but less dystopian (or not, depending on the setting).