How Nvidia’s Maxine uses AI to improve video calls

One of the things that caught my eye at Nvidia’s flagship event, the GPU Technology Conference (GTC), was Maxine, a platform that leverages artificial intelligence to improve the quality and experience of video-conferencing applications in real time.

Maxine uses deep learning for resolution improvement, background noise reduction, video compression, face alignment, and real-time translation and transcription.

In this post, which marks the first installment of our “deconstructing artificial intelligence” series, we will take a look at how some of these features work and how they tie in with AI research done at Nvidia. We’ll also explore the pending issues and the possible business model for Nvidia’s AI-powered video-conferencing platform.

Super-resolution with neural networks

The first feature shown in the Maxine presentation is “super resolution,” which, according to Nvidia, “can convert lower resolutions to higher resolution videos in real time.” Super resolution enables video-conference callers to send low-resolution video streams and have them upscaled at the server. This reduces the bandwidth requirement of video-conference applications and can make their performance more stable in areas where network connectivity is not very reliable.

The big challenge of upscaling visual data is filling in the missing information. You have a limited set of pixels that represent an image, and you want to expand it to a larger canvas that contains many more pixels. How do you decide what color values those new pixels get?


Old upscaling techniques use different interpolation methods (bicubic, Lanczos, etc.) to fill the space between pixels. These techniques are too general and might produce mixed results on different types of images and backgrounds.
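
As an illustration of what classical interpolation does, each new pixel is a weighted mix of its nearest known neighbors. Here is a minimal bilinear-upscaling sketch on a toy grayscale grid; it is illustrative pure Python, not a claim about how any real codec or Maxine implements upscaling:

```python
def bilinear_upscale(img, factor):
    """Upscale a 2D grayscale image (list of lists) by interpolating
    each new pixel from its four nearest source pixels."""
    h, w = len(img), len(img[0])
    new_h, new_w = h * factor, w * factor
    out = []
    for y in range(new_h):
        # Map the output pixel back into source coordinates.
        sy = min(y / factor, h - 1)
        y0, y1 = int(sy), min(int(sy) + 1, h - 1)
        fy = sy - y0
        row = []
        for x in range(new_w):
            sx = min(x / factor, w - 1)
            x0, x1 = int(sx), min(int(sx) + 1, w - 1)
            fx = sx - x0
            # Weighted average of the four surrounding pixels.
            top = img[y0][x0] * (1 - fx) + img[y0][x1] * fx
            bottom = img[y1][x0] * (1 - fx) + img[y1][x1] * fx
            row.append(top * (1 - fy) + bottom * fy)
        out.append(row)
    return out

# A 2x2 gradient upscaled to 4x4: new pixels blend their neighbors.
small = [[0.0, 1.0],
         [1.0, 2.0]]
big = bilinear_upscale(small, 2)
```

The weakness the article describes is visible here: the blending rule is the same everywhere, whether the pixels belong to a face, text, or a busy background.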

One of the benefits of machine learning algorithms is that they can be tuned to perform very specific tasks. For instance, a deep neural network can be trained on scaled-down video frames taken from video-conference streams and their corresponding hi-res original images. With enough examples, the neural network will tune its parameters to the common features found in video-conference visual data (mostly faces) and will be able to provide a better low- to hi-res conversion than general-purpose upscaling algorithms. In general, the narrower the domain, the better the chances of the neural network converging on very high-accuracy performance.
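
One convenient property of this training setup is that it needs no manual labeling: each hi-res frame is its own ground truth, and the model input is produced by downscaling it. A hypothetical sketch of that pair construction, where simple 2x2 block averaging stands in for whatever downsampling a real pipeline would use:

```python
def downscale_2x(frame):
    """Downscale a 2D grayscale frame by averaging each 2x2 block."""
    h, w = len(frame), len(frame[0])
    return [[(frame[y][x] + frame[y][x + 1] +
              frame[y + 1][x] + frame[y + 1][x + 1]) / 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def make_training_pairs(frames):
    """Each hi-res frame is its own label; the model input is its
    downscaled copy. Returns (low_res, hi_res) tuples."""
    return [(downscale_2x(f), f) for f in frames]

# A toy 4x4 gradient "frame" yields one (2x2 input, 4x4 target) pair.
frames = [[[float(x + y) for x in range(4)] for y in range(4)]]
pairs = make_training_pairs(frames)
low, high = pairs[0]
```

A super-resolution network would then be trained to map `low` back to `high`; because the pairs come from video-call footage, the learned mapping specializes in faces.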

There’s already a solid body of research on using artificial neural networks for upscaling visual data, including a 2017 Nvidia paper that discusses general super resolution with deep neural networks. With video conferencing being a very specialized case, a trained neural network is bound to perform even better than it does on more general tasks. Aside from video conferencing, there are applications for this technology in other areas, such as the film industry, which uses deep learning to remaster old videos at higher quality.

Video compression with neural networks

One of the more interesting parts of the Maxine presentation was the AI video compression feature. The video posted on Nvidia’s YouTube channel shows that using neural networks to compress video streams reduces bandwidth from ~97 KB/frame to ~0.12 KB/frame, which is a bit exaggerated, as users have pointed out on Reddit. Nvidia’s website states developers can reduce bandwidth use down to “one-tenth of the bandwidth needed for the H.264 video compression standard,” which is a much more reasonable—and still impressive—figure.

How does Nvidia’s AI achieve such impressive compression rates? A blog post on Nvidia’s website provides more detail on how the technology works. A neural network extracts and encodes the locations of key facial features of the user for each frame, which is much more efficient than compressing pixel and color data. The encoded data is then passed on to a generative adversarial network along with a reference video frame captured at the beginning of the session. The GAN is trained to reconstruct the new image by projecting the facial features onto the reference frame.

AI video compression
Deep neural networks extract and encode key facial features. Generative adversarial networks then project those encodings onto a reference frame with the user’s face.
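
To get a feel for why sending landmark encodings saves so much bandwidth, compare the payload sizes. All figures below are back-of-the-envelope assumptions for illustration (frame size, landmark count, bytes per coordinate), not Nvidia’s actual numbers:

```python
# Assumed figures, for illustration only.
FRAME_W, FRAME_H = 640, 480   # a modest video-call resolution
BYTES_PER_PIXEL = 3           # raw RGB, before any codec compression
NUM_LANDMARKS = 70            # a typical count for facial keypoint models
BYTES_PER_COORD = 2           # 16-bit x and y per landmark

raw_frame_bytes = FRAME_W * FRAME_H * BYTES_PER_PIXEL
landmark_bytes = NUM_LANDMARKS * 2 * BYTES_PER_COORD  # x and y per point

ratio = raw_frame_bytes / landmark_bytes
print(f"raw frame: {raw_frame_bytes / 1024:.0f} KB")
print(f"landmarks: {landmark_bytes} bytes")
print(f"reduction: ~{ratio:,.0f}x vs. uncompressed pixels")
```

Real codecs such as H.264 already compress the pixel stream heavily, which is why Nvidia’s own “one-tenth of H.264” figure is the meaningful comparison, not the raw-pixel ratio above.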

The work builds on previous GAN research done at Nvidia, which mapped rough sketches to rich, detailed images and drawings.

The AI video compression feature shows once again how narrow domains provide excellent settings for the use of deep learning algorithms.

Face alignment with deep learning

The face alignment feature readjusts the angle of users’ faces to make it appear as if they’re looking directly at the camera. This is a problem that is very common in video conferencing because people tend to look at the faces of others on the screen rather than gaze at the camera.

Nvidia AI face alignment

Although there isn’t much detail about how this works, the blog post mentions that it uses GANs. It’s not hard to see how this feature can be combined with the AI compression/decompression technology. Nvidia has already done extensive research on landmark detection and encoding, including the extraction of facial features and gaze direction at different angles. These encodings can be fed to the same GAN that projects the facial features onto the reference image, which does the rest.
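
A toy illustration of the idea: if a face’s pose is captured as landmark coordinates, “re-aligning” it amounts to transforming those points before the GAN renders them onto the reference frame. The sketch below rotates 2D landmarks; a real system would work on richer 3D pose and gaze estimates, so treat this purely as an analogy:

```python
import math

def rotate_landmarks(points, degrees, center=(0.0, 0.0)):
    """Rotate 2D landmark points around a center, e.g. to simulate
    turning a face back toward the camera."""
    rad = math.radians(degrees)
    cx, cy = center
    out = []
    for x, y in points:
        dx, dy = x - cx, y - cy
        out.append((cx + dx * math.cos(rad) - dy * math.sin(rad),
                    cy + dx * math.sin(rad) + dy * math.cos(rad)))
    return out

# Two "eye" landmarks tilted 90 degrees, straightened back to level.
tilted = [(0.0, 1.0), (0.0, -1.0)]
level = rotate_landmarks(tilted, -90)
```

The key point is that the correction happens in the compact landmark space, not on pixels, so it composes naturally with the compression pipeline described above.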

Where does Maxine run its deep learning models?

There are a lot of other neat features in Maxine, including the integration with Jarvis, Nvidia’s conversational AI platform. Getting into all of that would be beyond the scope of this article.

But some technical issues remain to be resolved. For instance, one issue is how much of Maxine’s functionality will run on cloud servers and how much of it on user devices. In response to a query, a spokesperson for Nvidia said, “NVIDIA Maxine is designed to execute the AI features in the cloud so that every user can access them, regardless of the device they’re using.”

This makes sense for some of the features, such as super resolution, virtual background, auto-frame, and noise reduction. But it seems impractical for others. Take, for example, the AI video compression feature. Ideally, the neural network doing the facial expression encoding must run on the sender’s device, and the GAN that reconstructs the video frame must run on the receiver’s device. If all these functions are carried out on servers, there would be no bandwidth savings, because users would send and receive full frames instead of the much lighter facial expression encodings.

Ideally, there should be some sort of arrangement that allows users to choose the right balance between local and on-cloud AI inference, depending on their network and compute resources. For instance, a user who has a workstation with a strong GPU might want to run all deep learning models on their computer in exchange for lower bandwidth usage or cost savings. On the other hand, a user joining a conference from a mobile device with low processing power would forgo local AI compression and defer virtual background and noise reduction to the Maxine server.
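
One way such an arrangement could look is a client-side policy that decides, per feature, where inference runs. Everything in this sketch, the feature names and the policy itself, is hypothetical and not part of any published Maxine API:

```python
def place_models(has_strong_gpu):
    """Decide, per feature, whether inference runs locally or in the
    cloud. Feature names and policy are invented for illustration."""
    # The bandwidth-saving landmark encoder only helps if it runs
    # on-device, so it follows the hardware; purely cosmetic features
    # can always stay server-side.
    return {
        "landmark_encoding": "local" if has_strong_gpu else "cloud",
        "super_resolution": "cloud",
        "virtual_background": "cloud",
        "noise_reduction": "cloud",
    }

workstation = place_models(has_strong_gpu=True)   # trades compute for bandwidth
phone = place_models(has_strong_gpu=False)        # defers everything to the server
```

A production version would also weigh link quality, battery, and cost, but the basic trade-off between network and compute stays the same.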

What is Maxine’s business model?

Nvidia AI stack

With the COVID-19 pandemic pushing companies to implement remote-working protocols, it seems as good a time as any to market video-conferencing apps. And with AI still at the height of its hype season, companies have a tendency to rebrand their products as “AI-powered” to boost sales. So, I’m generally a bit skeptical about anything that has “video conferencing” and “AI” in its name these days, and I think many of them will not live up to the promise.

But I have a few reasons to believe Nvidia’s Maxine will succeed where others fail. First, Nvidia has a track record of doing reliable deep learning research, especially in computer vision and more recently in natural language processing. The company also has the infrastructure and financial means to continue to develop and improve its AI models and make them available to its customers. Nvidia’s GPU servers and its partnerships with cloud providers will enable it to scale as its customer base grows. And its recent acquisition of mobile chipmaker Arm will put it in a good position to move some of these AI capabilities to the edge (maybe a Maxine-powered video-conferencing camera in the future?).

Finally, Maxine is an ideal example of narrow AI being put to good use. As opposed to computer vision applications that try to address a wide range of issues, all of Maxine’s features are tailored for a specific setting: a person talking to a camera. As various experiments have shown, even the most advanced deep learning algorithms lose their accuracy and stability as their problem domain expands. Conversely, neural networks are more likely to capture the real data distribution as their problem domain becomes narrower.

But as we’ve seen on these pages before, there’s a huge difference between an impressive piece of technology that works and one that has a successful business model.

Maxine is currently in early access mode, so a lot of things might change in the future. For the moment, Nvidia plans to make it available as an SDK and a set of APIs hosted on Nvidia’s servers that developers can integrate into their video-conferencing applications. Corporate video conferencing already has two big players, Teams and Zoom. Teams already has plenty of AI-powered features, and it wouldn’t be hard for Microsoft to add some of the functionality Maxine offers.

What will be the final pricing model for Maxine? Will the benefits provided by the bandwidth savings be enough to justify those costs? Will there be incentives for large players such as Zoom and Microsoft Teams to partner with Nvidia, or will they add their own versions of the same features? Will Nvidia stick with the SDK/API model or develop its own standalone video-conferencing platform? Nvidia will have to answer these and many other questions as developers explore its new AI-powered video-conferencing platform.

This article was originally published by Ben Dickson on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new tech, and what we need to look out for. You can read the original article here.

Published October 30, 2020 — 08:25 UTC
