Business Aug 01, 2022

Deep Learning, Video Data, and Low Latency Behind the Metaverse

Today, let's talk about the hottest topic of the year: the open digital world - the metaverse.

1. The Metaverse Intertwined with Reality

I want to understand digitalization from a more fundamental point of view, so I want to see what the purest form of digitalization would look like. At present, the purest digital scenario is probably the metaverse.

A relatively native digital world could be an online game. It could run on solar energy. Its players could be AI or human. The game itself is part of the digital world, and its economy could be designed with blockchain. Such a system could be completely independent of human beings.

Many digital transformations, by contrast, integrate the physical world and the digital world, which become intertwined with each other. Some key events are completed in the physical world and others in the digital world.

2. Deep learning based on video data

Personally, I am a fan of the TV series "Westworld". The story provides a methodology to move towards the metaverse - by observing a human being through everything the person sees, hears, touches, and feels, an AI learns and simulates human behavior again and again until the deviation approaches zero.

Early on, some self-driving systems were trained inside games such as GTA, while others learned by observing the surrounding environment and human driving operations through camera recordings. The AI performs deep learning on the video data and then compares its own judgments with those of human drivers. Eventually, the AI can drive with limited information, as human beings do.
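The observe, act, and compare loop described above resembles what is commonly called behavior cloning. The following is a minimal sketch of that idea, not a description of how any real driving system works. It assumes a toy one-dimensional steering task, a made-up "human driver" function, and a linear policy with a single weight; all names and numbers are hypothetical.

```python
# Toy behavior cloning: fit a one-weight linear policy to recorded human actions.
# Everything here (task, driver, numbers) is an illustrative assumption.

def human_driver(lane_offset):
    """Hypothetical human: steers against the lane offset."""
    return -0.5 * lane_offset

def mean_squared_deviation(weight, observations, human_actions):
    """Average squared gap between the policy's action and the human's."""
    total = 0.0
    for obs, act in zip(observations, human_actions):
        total += (weight * obs - act) ** 2
    return total / len(observations)

def train_policy(observations, human_actions, lr=0.1, epochs=200):
    """Plain gradient descent on the mean squared deviation."""
    weight = 0.0
    n = len(observations)
    for _ in range(epochs):
        grad = 0.0
        for obs, act in zip(observations, human_actions):
            deviation = weight * obs - act      # policy action minus human action
            grad += 2.0 * deviation * obs / n   # derivative of the squared deviation
        weight -= lr * grad
    return weight

# "Recorded" data: what the human saw and what the human did.
observations = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
human_actions = [human_driver(o) for o in observations]

weight = train_policy(observations, human_actions)
deviation = mean_squared_deviation(weight, observations, human_actions)
print(f"learned weight: {weight:.4f}, remaining deviation: {deviation:.2e}")
```

In this toy setting the deviation shrinks toward zero because the human's behavior is exactly linear. Real driving data is noisy and high-dimensional, so the same loop would need a far richer model and far more data.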

If every driver in the real world taught AI how to drive, autonomous driving would progress remarkably fast, even if it relied only on camera recordings as its big data. Some new smart cars already work this way in practice. Training on big data lets the AI iterate very quickly.

So what kind of technology is needed to move towards the metaverse?

3. Training an AI to outperform human beings

Let's use video data as the deep learning data. If an AI needs all the data about you for training, what kind of data does it essentially need? How much data is needed? What does it cost? If the data cannot be processed locally, can it be computed in the cloud?

The resolution of human eyes

Matching the human eye takes about 500 million pixels, which is not out of reach. Mobile phones already have 100-megapixel cameras, which can evolve to meet the requirement very soon.

The sampling frequency of human eyes

The human eye can hardly detect refreshing when the sampling frequency is above 120 Hz. You won't notice the refreshing in a 24 FPS movie. Shooting games with a sampling frequency of 240 Hz meet the requirement very well too. The human eye does not demand continuous sampling; human senses are well satisfied by limited data.
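To get a rough feel for "how much data is needed", the resolution and sampling figures above can be combined into a raw data rate. This is only a back-of-the-envelope sketch: it assumes uncompressed video at 3 bytes per pixel (8-bit RGB), which is an assumption added here and not a figure from the article.

```python
# Raw (uncompressed) data rate for eye-like video at several refresh rates.

PIXELS_PER_FRAME = 500_000_000  # from the text: about 500 million pixels
BYTES_PER_PIXEL = 3             # assumption: 8-bit RGB, no compression

def frame_interval_ms(rate_hz):
    """Time between two consecutive frames, in milliseconds."""
    return 1000.0 / rate_hz

def raw_rate_gb_per_s(rate_hz):
    """Uncompressed data rate in gigabytes (10^9 bytes) per second."""
    return PIXELS_PER_FRAME * BYTES_PER_PIXEL * rate_hz / 1e9

for rate in (24, 120, 240):
    print(f"{rate:>3} Hz: {frame_interval_ms(rate):6.2f} ms per frame, "
          f"{raw_rate_gb_per_s(rate):6.1f} GB/s raw")
```

Under these assumptions, 120 Hz works out to about 180 GB/s before compression, which is why any practical system would compress or downsample heavily rather than ship raw frames.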

The latency of human reactions

How fast do athletes react when they hear the starting gun? It usually takes at least 100 milliseconds for a person to hear a sound, transmit the signal along the nerves, and finally react.

How fast are human nerves? When a person hears something and reacts, it usually takes about 100 milliseconds. The brain reacts quite slowly. If the signal is handled by the cerebellum instead, the reaction can be faster. This is also called subconscious action, but it still has latency.

You can measure your own reaction time with a simple reaction test.

Generally, it takes about 250 milliseconds for an average adult to react, which includes the time for the brain to respond and then move the limbs.

If an AI can make judgments and carry out operations within 100 milliseconds in autonomous driving, then its reaction speed is better than that of most human beings.
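One way to reason about the 100-millisecond target is as a latency budget split across the stages of one perception-to-action cycle. The sketch below compares a pipeline total against the two human figures quoted above. The stage names and per-stage timings are hypothetical placeholders, not measurements of any real system.

```python
# Latency budget check for one perception-to-action cycle.

FAST_REACTION_MS = 100     # from the text: roughly the fastest human reaction
AVERAGE_REACTION_MS = 250  # from the text: average adult

# Hypothetical per-stage latencies in milliseconds (illustrative only).
pipeline_ms = {
    "capture": 8,
    "transfer": 30,
    "inference": 25,
    "actuation": 20,
}

def total_latency_ms(stages):
    """Sum of all stage latencies, assuming the stages run one after another."""
    return sum(stages.values())

def within_budget(stages, budget_ms):
    """True if the whole cycle fits inside the given budget."""
    return total_latency_ms(stages) <= budget_ms

total = total_latency_ms(pipeline_ms)
print(f"total latency: {total} ms")
print("within 100 ms budget:", within_budget(pipeline_ms, FAST_REACTION_MS))
print("faster than an average adult:", total < AVERAGE_REACTION_MS)
```

Treating the stages as strictly sequential is a simplification. Pipelined systems can overlap stages, which improves throughput but does not shorten the end-to-end latency of a single frame.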

4. Video recording your whole life

If you recorded your whole life as iPhone video, how much would it cost? Let's do a simple calculation. The video volume per minute is 375 MB, the storage cost per TB is about 111 USD, and the total recording cost for 100 years is about 0.57 million USD, which is not an astronomical figure.

The volume of video needed to record your whole life is really not that large.
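The calculation can be written out so that each input is easy to vary. The sketch below uses the per-minute volume, per-TB price, and 100-year span from the text, and assumes decimal units (1 TB = 1,000,000 MB) and 365-day years. With continuous 24-hour recording, these inputs give about 19,710 TB and roughly 2.19 million USD, which is higher than the figure quoted above. The article does not state how many hours per day it assumed, so the parameters are adjustable inputs for exploration and not a reconstruction of the author's calculation.

```python
# Lifetime video storage estimate from the figures in the text.

MB_PER_MINUTE = 375    # from the text
USD_PER_TB = 111       # from the text
YEARS = 100            # from the text
MB_PER_TB = 1_000_000  # assumption: decimal units

def lifetime_volume_tb(hours_per_day=24.0):
    """Total recorded volume in TB for the given recording hours per day."""
    minutes = YEARS * 365 * hours_per_day * 60
    return minutes * MB_PER_MINUTE / MB_PER_TB

def lifetime_cost_usd(hours_per_day=24.0, usd_per_tb=USD_PER_TB):
    """Storage cost in USD at a flat price per TB."""
    return lifetime_volume_tb(hours_per_day) * usd_per_tb

def price_per_tb_for_budget(budget_usd, hours_per_day=24.0):
    """Per-TB price at which the lifetime recording would fit the budget."""
    return budget_usd / lifetime_volume_tb(hours_per_day)

print(f"volume: {lifetime_volume_tb():,.0f} TB")
print(f"cost at {USD_PER_TB} USD/TB: {lifetime_cost_usd():,.0f} USD")
print(f"cost at 16 h/day: {lifetime_cost_usd(hours_per_day=16):,.0f} USD")
```

The last helper inverts the calculation: given a target budget, it returns the storage price per TB that would be required, which is one way to think about when lifetime recording becomes affordable.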

If the recording cost can be reduced to less than 15 thousand USD, I believe many people will be interested. After all, a lifetime recorded as a video epitaph would be far more remarkable. Based on your lifetime video, an AI can learn from you, imitate you, and constantly compare its judgments and actions with yours. Eventually, the AI's simulation of your behavior becomes very realistic.

5. The thresholds for the metaverse age

When the technology and its economics reach certain thresholds, will the metaverse become prevalent? Let's take the development of Bluetooth headphones as an example. Their key pain point used to be the latency of voice calls.

At that time, the latency of Bluetooth headphones was far beyond 100 milliseconds. When using Bluetooth headphones for calls, we had to pause and wait to hear the other party. The high latency spoiled the phone call experience.

Therefore, before 2015, the Bluetooth headphone market was worth less than 1 billion USD. Once Apple's AirPods brought latency below 100 milliseconds, the market grew exponentially, increasing dozens of times over.

When key parameters such as cost and latency reach the thresholds that satisfy people's needs, a market may expand exponentially, as the Bluetooth headphone market did after 2015.

6. Shared metaverse to reduce video recording costs

From a technical point of view, video is not the optimal data structure for a metaverse. It is not easy to analyze, nor is it suitable for data sharing. Compared with video, a digital twin model built in a virtual world engine such as UE5 (Unreal Engine 5) may be more suitable.

For example, when tourists shoot videos in the park, tens of thousands of tourists will produce tens of thousands of videos. The video files are very large, and the production and storage costs are very high, which is a burden for the metaverse.

If the park is modeled and a virtual-world version of it is created, tourists can share this virtual park and shoot videos in it. Just as in "Westworld", everyone can live different stories in the same metaverse. Tourists can produce videos from various virtual camera angles, which greatly reduces production costs.

If the technology improves further and costs fall further, making videos in the metaverse will probably become cheaper than shooting them in the real world. Recording your whole life will not need as much data either, because most of the backgrounds will be public metaverse scenes.
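The storage argument can be illustrated with a back-of-the-envelope comparison between storing one full video per tourist and storing one shared scene model plus a small per-tourist record of camera positions and poses. Every size below is a hypothetical placeholder chosen only to show the shape of the trade-off; none of these numbers come from the article or from any real engine.

```python
# Storage comparison: one video per tourist vs. one shared scene model.

def per_video_storage_gb(num_tourists, video_gb):
    """Every tourist stores a full video that includes the park background."""
    return num_tourists * video_gb

def shared_scene_storage_gb(num_tourists, scene_model_gb, track_gb):
    """One shared scene model plus a small camera/pose track per tourist."""
    return scene_model_gb + num_tourists * track_gb

tourists = 10_000      # the text says "tens of thousands"; 10,000 as an example
video_gb = 2.0         # hypothetical size of one tourist's video
scene_model_gb = 50.0  # hypothetical size of the shared park model
track_gb = 0.01        # hypothetical size of one camera/pose track

videos_total = per_video_storage_gb(tourists, video_gb)
shared_total = shared_scene_storage_gb(tourists, scene_model_gb, track_gb)

print(f"separate videos: {videos_total:,.0f} GB")
print(f"shared scene:    {shared_total:,.0f} GB")
```

The shared model is a fixed cost that is paid once, so the advantage grows with the number of tourists. For a scene visited by only a handful of people, separate videos could still be cheaper.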

7. Last but not least

Today, video produced by a game engine can replicate footage shot in the real world.

At SIGGRAPH 2021, a top-level computer graphics conference, NVIDIA revealed in a documentary that the Jensen Huang who appeared as a speaker in a video played at the GTC conference in April 2021 was, for a 14-second clip, a digital figure created with AI technology. Jensen Huang is a co-founder of NVIDIA.

Now, think again: when your scene data (the scene and yourself) can be transferred to the cloud and processed within 100 milliseconds, will the age of the metaverse still be far away?

If you are working on a metaverse-based product, you may want to contact us to speak with a solution architect about the ZEGOCLOUD metaverse solution.



