The Good Stuff
Although we have had very high resolution scans of people as avatars for many years now (check out the High Fidelity YouTube channel for examples), what has been missing is the ability to accurately remap detailed facial expressions from a person to an avatar. As we’ve all experienced, most 3D avatars (even the ones in movies) lack expressiveness - they do not capture the tremendous nuance of non-verbal communication conveyed by the human face.
But these avatars, at least judging from the videos, seem to solidly demonstrate detection and conveyance of the sort of nuanced expressions that make Mark and Lex distinctly themselves. Step back a few feet from the screen and you often can’t even tell it isn’t really them. How did they do it?? They did it by using brand new AI/ML techniques to translate from what the cameras in the headset are seeing (basically their eyes and a closeup of their mouth) to how the avatar is rendered.

This is *very* different from the historical way we have controlled avatars. What we did in the past (and what the best existing avatar systems like VRChat do today) was to find a set of visual ‘control points’ - for example the corners of the mouth, or the position of the cheekbones - and then ‘puppeteer’ the avatar by moving the same control points on the avatar’s face to the locations where we detected them on the human face. There are many limitations to this technique. The big problem is that the avatar’s face isn’t physically built the same way a person’s face is, so pulling a string at the corner of an avatar’s mouth (for example) doesn’t move the rest of its face the way the human’s face is moving. So you end up with something that may sometimes be funny or entertaining, but does not succeed at all in capturing the human’s more detailed expressions.
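To make that old approach concrete, here is a minimal sketch of ‘control point’ puppeteering. The names (TRACKED_POINTS, avatar_rig.set_offset) are hypothetical stand-ins for whatever a real face tracker and avatar rig would actually expose:

```python
# Landmarks we pretend a face tracker reports each frame, expressed as
# offsets from their neutral (resting) positions.
TRACKED_POINTS = ["mouth_corner_l", "mouth_corner_r", "cheek_l", "cheek_r",
                  "brow_l", "brow_r"]

def puppeteer(avatar_rig, detected_offsets):
    """Copy each detected facial landmark offset onto the matching control
    point (bone or blendshape handle) on the avatar's face."""
    for name in TRACKED_POINTS:
        offset = detected_offsets.get(name)
        if offset is None:
            continue  # the tracker lost this point this frame
        # The core (and flawed) assumption: moving the avatar's control point
        # by the same amount reproduces the human's expression, even though
        # the avatar's face is not built the way the human's face is.
        avatar_rig.set_offset(name, offset)
```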
With this new technique (which BTW likely required a substantial period of training that we didn’t see in the video), AI is used to basically ‘deepfake’ the avatar to look exactly like the human. Just as we’ve seen videos where a famous person can be made to say something they didn’t say with AI, in this example the AI makes a 3D avatar look exactly like the human that is operating it. This is a huge step forward in creating believable 3D avatars for the case where the avatar is identical to the human operator.
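For contrast, here is a rough PyTorch-style sketch of the kind of learned pipeline described above: an encoder turns the headset’s eye and mouth camera crops into a compact expression code, and a decoder turns that code into the parameters used to render the avatar. The architecture, layer sizes, and names are illustrative assumptions on my part, not Meta’s actual model:

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Maps eye and mouth camera crops to a low-dimensional expression code."""
    def __init__(self, code_dim: int = 128):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=5, stride=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=5, stride=2), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, code_dim),
        )

    def forward(self, camera_crops):
        # Encode each view (left eye, right eye, mouth), then fuse them.
        codes = [self.backbone(view) for view in camera_crops]
        return torch.stack(codes).mean(dim=0)

class AvatarDecoder(nn.Module):
    """Turns an expression code into avatar parameters for the renderer
    (here just a flat vector standing in for geometry/texture)."""
    def __init__(self, code_dim: int = 128, out_dim: int = 10_000):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(code_dim, 512), nn.ReLU(),
                                 nn.Linear(512, out_dim))

    def forward(self, code):
        return self.net(code)

# Example: three fake 64x64 RGB camera crops in, avatar parameters out.
encoder, decoder = ExpressionEncoder(), AvatarDecoder()
crops = [torch.randn(1, 3, 64, 64) for _ in range(3)]
avatar_params = decoder(encoder(crops))
```

Training a mapping like this is presumably where the substantial (and unseen) setup period comes in: the model has to learn, for this specific person and this specific scanned avatar, how camera pixels correspond to rendered expressions.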
Furthermore, the full body shot of Lex at the start was an amazing demo of a whole avatar being matched to an operator, including detailed hand movements. Although there was no side-by-side comparison to judge how well his posture was matched, the glitches and uncomfortable joint movements that always accompany live full body motion capture seemed completely gone.
Problems that still need solving
Or at least what I think are the most important ones.
Asserting False Dominance
In the frame above (and in much of the overall video) Mark appears to be looking down his nose at Lex. This is very likely due to a mismatch between the elevation angle at which Mark’s head was captured and the angle at which the VR headset is actually sitting on his face. We encountered this problem as well, and it is quite difficult to fix. Mark would need to very carefully adjust the headset to match the avatar calibration before the session started, and then be very careful not to let the headset slip up or down on his face.
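One way this kind of mismatch can be handled (a minimal sketch, assuming you can ask the user to look straight at an eye-level target for a moment at the start of the session; the function names are hypothetical) is to measure the offset once and subtract it on every subsequent frame:

```python
def calibrate_pitch_offset(headset_pitch_samples_deg):
    """Average the headset's reported pitch while the user is known to be
    looking level; whatever remains is how the headset sits on the face."""
    return sum(headset_pitch_samples_deg) / len(headset_pitch_samples_deg)

def avatar_head_pitch(raw_headset_pitch_deg, pitch_offset_deg):
    """Subtract the calibration offset so the avatar looks level when the
    user does, instead of 'down the nose'. Breaks if the headset slips."""
    return raw_headset_pitch_deg - pitch_offset_deg

# Example: the headset reads about +4 degrees while the user looks level,
# so every later frame gets corrected by that amount.
offset = calibrate_pitch_offset([3.8, 4.1, 4.0, 4.2])
print(avatar_head_pitch(9.0, offset))  # ~5 degrees of genuine upward gaze
```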
When humans look down their nose at someone, it is a signal of dominance. When humans tilt their face downward and look upward at someone, they are signaling submission (think about how dogs put their head on the floor and look up at you). Accidentally signaling dominance to your boss when you didn’t intend it will be a showstopper for the use of VR headsets in meetings.
Whose shoulders are these, and where are my hands?
The deep truth underlying the comedy of the whole ‘wait, where are my avatar’s legs?’ meme is that the posture of the whole body (not just the face) is a huge part of non-verbal communication. In this regard, both Lex and Mark are better-than-normal communicators as avatars, because they don’t move their bodies and hands very much when they talk (or at least didn’t in this interview). Consider the following frame:
In this frame, Lex is closing his eyes while leaning slightly backward in his chair (look carefully at the angle his neck makes with his upper torso). This is a common (and I think lovable) behavior of Lex… he often pauses and leans back and thinks carefully about what he is proposing. But the avatar looks different - more aggressive, with chin lifted - because the Quest Pro cameras can’t see his shoulders, so the avatar’s shoulder movements are completely fake: they are just hanging there under the head. Scrub through the video and look more at the shoulders. When Mark or Lex lean forward or back in their chairs, you can’t tell. This sort of non-verbal communication is very important. In a video like this it probably doesn’t matter much, but imagine if this was a conversation with your child, or your boss. You’d feel the missing information right away.
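As a toy illustration of why leaning disappears (everything here is a hypothetical stand-in for a real avatar rig): if the upper torso is simply hung a fixed distance below the head and always rendered upright, leaning back with the whole spine and merely lifting the chin come out looking nearly the same:

```python
from dataclasses import dataclass

@dataclass
class HeadPose:
    height_m: float   # tracked head height
    pitch_deg: float  # + means chin up, relative to vertical

def fake_torso_from_head(head: HeadPose) -> dict:
    """Hang the torso under the head: fixed offset, never tilted."""
    return {"torso_top_m": head.height_m - 0.25, "torso_pitch_deg": 0.0}

leaning_back = HeadPose(height_m=1.18, pitch_deg=10.0)  # whole spine tilted back
chin_lift    = HeadPose(height_m=1.20, pitch_deg=10.0)  # only the neck moved
print(fake_torso_from_head(leaning_back))  # torso comes out upright either way
print(fake_torso_from_head(chin_lift))
```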
And, of course, where are the hands!? (To better make the point, say that last sentence to yourself in a mirror in a highly animated fashion.) Probably the reason they weren’t rendered in the close-up interview (as opposed to the lead-in full body shot, which looks really amazing) is that even though the Quest Pro can detect the hands, the lack of shoulder and elbow tracking means that attempting to render the hands/arms/shoulders at close range would have resulted in awkward and incorrect movements. At a minimum, avatars communicating F2F like this will need to have their shoulders and hands properly tracked.
The delay you can’t feel in the video
When two people talk to each other, if they are comfortable, they start mirroring and synchronizing their body movements. This synchronization is a very important component of how humans (and other mammals) build trust with each other, and is one of the big reasons why face-to-face gatherings are so effective at bonding people. As we all know from Zoom calls, sometimes the delay on the call gets us out of sync with the other person - unable to nod together, and sometimes interrupting each other unintentionally. These delays are still there in the VR experience because the speed of light is a harsh mistress: talking to someone halfway around the world, even in the very best possible network case, is still going to add a very noticeable tenth of a second of delay. Worse yet, AI processing (of the sort that made this demo so amazing) adds even more delay to the experience.
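As a back-of-the-envelope check of that tenth-of-a-second claim: light in optical fiber travels at roughly two thirds of its vacuum speed, so even a perfectly straight fiber halfway around the world adds about 100 ms each way, before any routing detours or processing:

```python
C_KM_PER_S = 299_792       # speed of light in vacuum
FIBER_FRACTION = 2 / 3     # approximate speed of light in fiber vs. vacuum
HALF_EARTH_KM = 20_000     # roughly half of Earth's ~40,000 km circumference

one_way_s = HALF_EARTH_KM / (C_KM_PER_S * FIBER_FRACTION)
print(f"one-way: {one_way_s * 1000:.0f} ms, "
      f"round trip: {2 * one_way_s * 1000:.0f} ms")
# -> roughly 100 ms one-way, 200 ms round trip, before any processing delay
```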
When we watch a video as a bystander - as we do in this demo - we cannot feel the delay. But as a participant we most certainly can. So until you (or I) get a chance to try this experience firsthand, remember that delay is still a big factor in whether one can create a truly intimate experience. Why do you think Lex does almost all his podcasts F2F?
But wait… why use headsets at all?
This amazing video from researcher Johnny Lee shows you how a 2D screen can become a 3D screen if you can detect the location of the observer’s head. If you haven’t seen it, check it out right now. The video is 15 years old, and Johnny detected the head position with a cleverly hacked Nintendo Wii controller. The good news is that now we have much faster computers and machine learning, meaning that a standard webcam on a laptop can detect head position as accurately as is demonstrated here. So this means that a webcam can make the screen into a magic ‘portal’ through which you can see the other person in 3D.
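For anyone curious how the ‘portal’ works, here is a minimal sketch of head-coupled perspective: given the viewer’s head position relative to the screen (from any tracker - a webcam, or Johnny’s Wii hack), build an off-axis (asymmetric) projection frustum so the rendered scene appears fixed behind the glass as the head moves. The units, the tracker, and the function name are my assumptions; only the similar-triangles frustum math is standard:

```python
def off_axis_frustum(head_pos_m, screen_w_m, screen_h_m, near=0.05, far=100.0):
    """head_pos_m: (x, y, z) of the eye relative to the screen center, with
    z > 0 in front of the screen. Returns (left, right, bottom, top, near, far)
    suitable for a glFrustum-style asymmetric projection."""
    x, y, z = head_pos_m
    # Edges of the physical screen as seen from the eye, scaled back to the
    # near plane by similar triangles.
    left   = (-screen_w_m / 2 - x) * near / z
    right  = ( screen_w_m / 2 - x) * near / z
    bottom = (-screen_h_m / 2 - y) * near / z
    top    = ( screen_h_m / 2 - y) * near / z
    return left, right, bottom, top, near, far

# Example: viewer 10 cm left of center, 60 cm from a 60 cm x 34 cm screen.
print(off_axis_frustum((-0.10, 0.0, 0.60), 0.60, 0.34))
```

Re-render with this frustum every frame as the head moves, and the flat screen reads as a window into a 3D space.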
If this technique had been used (instead of wearing headsets) for Mark and Lex, they would actually also have been able to enjoy the eye contact that made their experience such a huge improvement over Zoom. That is because when you render the scene and the avatars exactly in the right place, eye contact works - even on the 2D screen. And, BTW, the ‘false dominance’ problem from above goes away if you are able to actually detect the angle of the user’s face (as compared to only knowing the angle of the user’s headset).
So, in summary, this whole experience could have been done even better and without the pesky uncomfortable headsets by using standard webcams and laptops. I have a hard time understanding why people would want to buy a Quest Pro if they could do the same thing with nothing attached to their face. Regardless of how cool the avatars were, most people can’t comfortably wear a Quest for longer than about 30 minutes, and this interview was over an hour! It seems possible that AI/ML might help us replace the HMD altogether, at least for experiences like this.