Apple’s new GAUDI AI turns text prompts into 3D scenes | Hot Mobile Press

Image: Apple


Apple has shown its latest AI system, GAUDI. It can generate 3D interior scenes and forms the basis for a new generation of generative AI built on NeRFs.

So-called neural rendering brings artificial intelligence into computer graphics: AI researchers at Nvidia, for example, show how 3D objects can be created from photos, while Google relies on Neural Radiance Fields (NeRFs) for Immersive View and develops NeRFs for rendering people.

So far, NeRFs are mainly used as a type of neural storage medium for 3D models and 3D scenes, which can then be rendered from different camera perspectives. This is how the frequently shown tracking shots through a room or around an object are created. The first experiments with NeRFs for virtual reality experiences are also underway.
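The rendering step described above can be sketched in a few lines: a trained NeRF is queried at sample points along each camera ray, and the returned densities and colors are alpha-composited into a pixel. The following is a minimal illustrative sketch in NumPy; the `toy_field` function is a hypothetical stand-in for a real trained network, not Apple's or anyone's actual model.

```python
import numpy as np

def toy_field(points):
    """Hypothetical stand-in for a trained NeRF MLP: returns (density, RGB)
    per 3D point. Here: a dense sphere of radius 0.5 at the origin, colored red."""
    dist = np.linalg.norm(points, axis=-1)
    density = np.where(dist < 0.5, 10.0, 0.0)           # high density inside the sphere
    color = np.tile([1.0, 0.0, 0.0], (len(points), 1))  # constant red
    return density, color

def render_ray(origin, direction, near=0.0, far=2.0, n_samples=64):
    """Alpha-composite samples along one camera ray (discrete volume rendering)."""
    t = np.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction
    density, color = toy_field(points)
    delta = t[1] - t[0]
    alpha = 1.0 - np.exp(-density * delta)   # opacity of each ray segment
    # transmittance: probability the ray reaches each segment unoccluded
    trans = np.cumprod(np.concatenate([[1.0], 1.0 - alpha[:-1]]))
    weights = alpha * trans
    return (weights[:, None] * color).sum(axis=0)  # final pixel color

pixel = render_ray(np.array([0.0, 0.0, -1.5]), np.array([0.0, 0.0, 1.0]))
print(pixel)  # the ray hits the sphere, so the pixel comes out saturated red
```

Moving `origin` and `direction` is all it takes to render the stored scene from a different camera perspective, which is what makes the tracking shots mentioned above possible.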

NeRFs could become the next level of generative artificial intelligence

But what if NeRF’s ability to render photorealistic images from multiple angles could be leveraged for generative AI? AI systems such as OpenAI’s DALL-E 2 or Google’s Imagen and Parti show the potential of controllable generative AI, but only for 2D images and graphics.

Google gave a first glimpse of 3D AI generation at the end of 2021 with Dream Fields, an AI system that combines NeRF’s ability to generate 3D views with OpenAI CLIP’s ability to evaluate image content. The result: Dream Fields generates NeRFs that match textual descriptions.

Now Apple’s AI team introduces GAUDI, a neural architecture for generating immersive 3D scenes. The AI system can create 3D scenes based on text prompts.

Apple GAUDI is a specialist in 3D interiors

While Google’s Dream Fields, for example, is dedicated to generating individual objects, extending generative AI to fully unconstrained 3D scenes remains an unsolved problem.

One reason for this is the limitation of possible camera positions: while for a single object every possible reasonable camera position can be mapped onto a dome, in 3D scenes these camera positions are limited by obstacles such as objects and walls. If these are not taken into account when generating the scene, the generated 3D scene cannot be used.
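For a single object, that dome parameterization can be written down directly: sample a point on an upper hemisphere around the object and point the camera at the center, and the pose is valid by construction. A hypothetical sketch (function name and parameters are illustrative, not from the paper):

```python
import numpy as np

def dome_poses(n=8, radius=2.0, seed=0):
    """Sample camera positions on an upper hemisphere ('dome') around an
    object at the origin, each looking at the center."""
    rng = np.random.default_rng(seed)
    azimuth = rng.uniform(0.0, 2.0 * np.pi, n)
    elevation = rng.uniform(0.1, 0.5 * np.pi, n)  # stay above the ground plane
    x = radius * np.cos(elevation) * np.cos(azimuth)
    y = radius * np.cos(elevation) * np.sin(azimuth)
    z = radius * np.sin(elevation)
    positions = np.stack([x, y, z], axis=-1)
    # view direction: unit vector from each camera toward the object center
    directions = -positions / np.linalg.norm(positions, axis=-1, keepdims=True)
    return positions, directions

pos, dirs = dome_poses()
print(pos.shape, dirs.shape)  # (8, 3) (8, 3)
```

In a full 3D scene with walls and furniture, no such closed-form rule exists; every sampled pose could land inside an obstacle, which is exactly the problem GAUDI addresses by learning valid camera poses instead.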

Apple’s GAUDI model solves this problem with three specialized networks: a camera pose decoder predicts possible camera positions and ensures that the output is a valid position for the given 3D scene architecture.

The scene decoder provides a tri-plane representation of the scene, a kind of 3D canvas onto which the radiance field decoder draws the image using the volumetric rendering equation.
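The volumetric rendering equation mentioned here is the standard NeRF formulation: for a camera ray $\mathbf{r}(t) = \mathbf{o} + t\mathbf{d}$, the pixel color is the transmittance-weighted integral of density and view-dependent color along the ray.

```latex
C(\mathbf{r}) = \int_{t_n}^{t_f} T(t)\,\sigma(\mathbf{r}(t))\,\mathbf{c}(\mathbf{r}(t), \mathbf{d})\,dt,
\qquad
T(t) = \exp\!\left(-\int_{t_n}^{t} \sigma(\mathbf{r}(s))\,ds\right)
```

Here $\sigma$ is the volume density, $\mathbf{c}$ the emitted color, and $T(t)$ the accumulated transmittance, i.e. the probability that the ray travels from the near bound $t_n$ to $t$ without being absorbed.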

In experiments with four different datasets, including ARKitScenes, a dataset of interior scans, the researchers show that GAUDI can reconstruct learned views and approaches the quality of existing methods.

Video: Miguel Angel Bautista via Twitter

Apple also demonstrates that GAUDI can create new camera movements through 3D interior scenes. Generation can be random, based on an image, or driven by text input with a text encoder – for example, “Walk down the hallway” or “Go up the stairs.”

The quality of the video generated by GAUDI is still low and full of artifacts. But with its AI system, Apple is laying another foundation for generative AI systems capable of rendering 3D objects and scenes. One possible application: generating digital locations for Apple’s XR headset.

Sources: GitHub, arXiv (paper)
